[GitHub] [spark] cloud-fan commented on pull request #38358: [SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering
cloud-fan commented on PR #38358: URL: https://github.com/apache/spark/pull/38358#issuecomment-1308354718 thanks, merging to 3.3/3.2! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
amaliujia commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017553189

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
```

Review Comment:
```
IDENTIFIER
    : (LETTER | DIGIT | '_')+
    ;
```
The passed-in view name might contain more than the grammar defined above allows.
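The concern in the review above — that a view name may not fit the quoted `IDENTIFIER` rule — can be illustrated with a small pure-Python sketch. The regex below is a hypothetical stand-in for the ANTLR rule (assuming `LETTER` means ASCII letters), not Spark's actual parser:

```python
import re

# Regex equivalent of the quoted ANTLR rule: IDENTIFIER : (LETTER | DIGIT | '_')+ ;
# Assumes LETTER = [A-Za-z]; the real grammar may be broader.
IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")

def is_plain_identifier(name: str) -> bool:
    """Return True if `name` matches the simple IDENTIFIER grammar."""
    return IDENTIFIER.match(name) is not None

print(is_plain_identifier("my_view1"))    # a plain name matches
print(is_plain_identifier("db.my_view"))  # a dotted name does not
print(is_plain_identifier("a view"))      # a name with a space does not
```

Names like `db.my_view` are why the handler runs the name through `parseTableIdentifier` rather than accepting it verbatim.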
[GitHub] [spark] amaliujia commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
amaliujia commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017551706

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
```

Review Comment: This is from the Dataset API implementation, and I guess it was there to pre-validate that the view name is in a legal format. If there is a later place that validates it, we can skip this.
[GitHub] [spark] cloud-fan commented on pull request #38573: [SPARK-41061][CONNECT] Support SelectExpr which applies Projection by expressions in Strings in Connect DSL
cloud-fan commented on PR #38573: URL: https://github.com/apache/spark/pull/38573#issuecomment-1308348049 shall we add the python client API in this PR as well?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
cloud-fan commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017549065

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
```

Review Comment: do we need to parse it? It should be the plain view name.
[GitHub] [spark] cloud-fan closed pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL
cloud-fan closed pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL URL: https://github.com/apache/spark/pull/38475
[GitHub] [spark] cloud-fan commented on pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL
cloud-fan commented on PR #38475: URL: https://github.com/apache/spark/pull/38475#issuecomment-1308344423 thanks, merging to master!
[GitHub] [spark] wangyum commented on a diff in pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
wangyum commented on code in PR #38577: URL: https://github.com/apache/spark/pull/38577#discussion_r1017537591

## dev/make-distribution.sh:

```
@@ -161,7 +161,7 @@ fi

 # Build uber fat JAR
 cd "$SPARK_HOME"

-export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -Xmx4g -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=128m}"
+export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -Xmx4g -XX:ReservedCodeCacheSize=128m}"
```

Review Comment: Thanks for the fix. I can reproduce this issue through github action: https://github.com/wangyum/make-distribution.sh/actions/runs/3425964379/jobs/5707275344
[GitHub] [spark] HyukjinKwon closed pull request #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2
HyukjinKwon closed pull request #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2 URL: https://github.com/apache/spark/pull/38570
[GitHub] [spark] HyukjinKwon commented on pull request #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2
HyukjinKwon commented on PR #38570: URL: https://github.com/apache/spark/pull/38570#issuecomment-1308338748 Merged to master.
[GitHub] [spark] HyukjinKwon commented on pull request #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2
HyukjinKwon commented on PR #38570: URL: https://github.com/apache/spark/pull/38570#issuecomment-1308337531 Merged to master.
[GitHub] [spark] zhengchenyu commented on pull request #37949: [SPARK-40504][YARN] Make yarn appmaster load config from client
zhengchenyu commented on PR #37949: URL: https://github.com/apache/spark/pull/37949#issuecomment-1308320736 @dongjoon-hyun @srowen Can you please review this PR?
[GitHub] [spark] cloud-fan commented on pull request #38511: [SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters
cloud-fan commented on PR #38511: URL: https://github.com/apache/spark/pull/38511#issuecomment-1308318576 cc @viirya @sigmod @hvanhovell
[GitHub] [spark] amaliujia commented on pull request #38573: [SPARK-41061][CONNECT] Support SelectExpr which applies Projection by expressions in Strings in Connect DSL
amaliujia commented on PR #38573: URL: https://github.com/apache/spark/pull/38573#issuecomment-1308314938 R: @cloud-fan
[GitHub] [spark] LuciferYang commented on a diff in pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
LuciferYang commented on code in PR #38577: URL: https://github.com/apache/spark/pull/38577#discussion_r1017502280

## dev/make-distribution.sh:

```
@@ -161,7 +161,7 @@ fi

 # Build uber fat JAR
 cd "$SPARK_HOME"

-export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -Xmx4g -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=128m}"
```

Review Comment: I see, but there still seems to be a build failure.
[GitHub] [spark] LuciferYang commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
LuciferYang commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308298746

> As you mentioned in the PR description, did you hit this only at master branch issue?

Let me double check this

> Which java version are you using now?

test on linux:
```
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (Zulu 8.66.0.15-CA-linux64) (build 1.8.0_352-b08)
OpenJDK 64-Bit Server VM (Zulu 8.66.0.15-CA-linux64) (build 25.352-b08, mixed mode)
```
test on MacOS:
```
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (Zulu 8.66.0.15-CA-linux64) (build 1.8.0_352-b08)
OpenJDK 64-Bit Server VM (Zulu 8.66.0.15-CA-linux64) (build 25.352-b08, mixed mode)
```
Let me do more tests with Java 11/17.
[GitHub] [spark] viirya commented on a diff in pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
viirya commented on code in PR #38577: URL: https://github.com/apache/spark/pull/38577#discussion_r1017499034

## dev/make-distribution.sh:

```
@@ -161,7 +161,7 @@ fi

 # Build uber fat JAR
 cd "$SPARK_HOME"

-export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -Xmx4g -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=128m}"
```

Review Comment: It was added by https://github.com/apache/spark/pull/37510. Seems it was also added because `make-distribution.sh` failed to build.
[GitHub] [spark] pan3793 commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
pan3793 commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308297255 I reproduced the issue w/ openjdk 1.8.0_332 macos-aarch64
[GitHub] [spark] dongjoon-hyun commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
dongjoon-hyun commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308293543

To @LuciferYang ,
- As you mentioned in the PR description, did you hit this issue only on the `master` branch?
- Which java version are you using now?
[GitHub] [spark] amaliujia commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
amaliujia commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017494208

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
+      } catch {
+        case _: ParseException =>
+          throw QueryCompilationErrors.invalidViewNameError(createView.getName)
+      }
+
+    val plan = CreateViewCommand(
+      name = tableIdentifier,
+      userSpecifiedColumns = Nil,
+      comment = None,
+      properties = Map.empty,
+      originalText = None,
+      plan = new SparkConnectPlanner(createView.getInput, session).transform(),
```

Review Comment: In fact it is not... The CommandPlanner and the LogicalPlan planner are separated, but a Command's input is usually a logical plan. I am thinking of refactoring this later so this logic is better contained together.
[GitHub] [spark] dongjoon-hyun commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
dongjoon-hyun commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308290786 cc @viirya too
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38506: [SPARK-41010][CONNECT][PYTHON] Complete Support for Except and Intersect in Python client
zhengruifeng commented on code in PR #38506: URL: https://github.com/apache/spark/pull/38506#discussion_r1017492795

## python/pyspark/sql/connect/dataframe.py:

```
@@ -317,7 +319,83 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
         if other._plan is None:
             raise ValueError("Argument to UnionByName does not contain a valid plan.")
         return DataFrame.withPlan(
-            plan.UnionAll(self._plan, other._plan, allowMissingColumns), session=self._session
+            plan.SetOperation(
+                self._plan, other._plan, "union", is_all=True, by_name=allowMissingColumns
+            ),
+            session=self._session,
+        )
+
+    def exceptAll(self, other: "DataFrame") -> "DataFrame":
+        """Return a new :class:`DataFrame` containing rows in this :class:`DataFrame` but
+        not in another :class:`DataFrame` while preserving duplicates.
+
+        This is equivalent to `EXCEPT ALL` in SQL.
+        As standard in SQL, this function resolves columns by position (not by name).
+
+        .. versionadded:: 2.4.0
+
+        Parameters
+        ----------
+        other : :class:`DataFrame`
+            The other :class:`DataFrame` to compare to.
+
+        Returns
+        -------
+        :class:`DataFrame`
+        """
+        return DataFrame.withPlan(
+            plan.SetOperation(self._plan, other._plan, "except", is_all=True), session=self._session
+        )
+
+    def intersect(self, other: "DataFrame") -> "DataFrame":
+        """Return a new :class:`DataFrame` containing rows only in
+        both this :class:`DataFrame` and another :class:`DataFrame`.
+        Note that any duplicates are removed. To preserve duplicates
+        use :func:`intersectAll`.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        other : :class:`DataFrame`
+            Another :class:`DataFrame` that needs to be combined.
+
+        Returns
+        -------
+        :class:`DataFrame`
+            Combined DataFrame.
+
+        Notes
+        -----
+        This is equivalent to `INTERSECT` in SQL.
+        """
+        return DataFrame.withPlan(
+            plan.SetOperation(self._plan, other._plan, "intersect", is_all=False),
+            session=self._session,
+        )
+
+    def intersectAll(self, other: "DataFrame") -> "DataFrame":
+        """Return a new :class:`DataFrame` containing rows in both this :class:`DataFrame`
+        and another :class:`DataFrame` while preserving duplicates.
+
+        This is equivalent to `INTERSECT ALL` in SQL. As standard in SQL, this function
+        resolves columns by position (not by name).
+
+        .. versionadded:: 2.4.0
```

Review Comment:
```suggestion
        .. versionadded:: 3.4.0
```

## python/pyspark/sql/connect/dataframe.py:

```
@@ -317,7 +319,83 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
+    def exceptAll(self, other: "DataFrame") -> "DataFrame":
+        """Return a new :class:`DataFrame` containing rows in this :class:`DataFrame` but
+        not in another :class:`DataFrame` while preserving duplicates.
+
+        This is equivalent to `EXCEPT ALL` in SQL.
+        As standard in SQL, this function resolves columns by position (not by name).
+
+        .. versionadded:: 2.4.0
```

Review Comment:
```suggestion
        .. versionadded:: 3.4.0
```
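The duplicate-preserving, positional semantics that these docstrings describe for `EXCEPT ALL` and `INTERSECT ALL` can be sketched in plain Python with `collections.Counter`. This illustrates the SQL multiset semantics only, not the Connect implementation:

```python
from collections import Counter

def except_all(left, right):
    """Multiset difference (SQL EXCEPT ALL): keep each row as many extra
    times as it appears in `left` beyond its count in `right`."""
    return list((Counter(left) - Counter(right)).elements())

def intersect_all(left, right):
    """Multiset intersection (SQL INTERSECT ALL): keep each row
    min(count in left, count in right) times."""
    return list((Counter(left) & Counter(right)).elements())

a = [("x",), ("x",), ("y",)]
b = [("x",), ("z",)]
print(except_all(a, b))     # one ("x",) cancels out: [('x',), ('y',)]
print(intersect_all(a, b))  # [('x',)]
```

Rows are modeled as tuples because the comparison is positional: two rows match only if every column matches in order, regardless of column names.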
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
zhengruifeng commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017490854

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
+      } catch {
+        case _: ParseException =>
+          throw QueryCompilationErrors.invalidViewNameError(createView.getName)
+      }
+
+    val plan = CreateViewCommand(
+      name = tableIdentifier,
+      userSpecifiedColumns = Nil,
+      comment = None,
+      properties = Map.empty,
+      originalText = None,
+      plan = new SparkConnectPlanner(createView.getInput, session).transform(),
```

Review Comment: `this.transform()` or `transformRelation(createView.getInput)` ?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
zhengruifeng commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017490854

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
+      } catch {
+        case _: ParseException =>
+          throw QueryCompilationErrors.invalidViewNameError(createView.getName)
+      }
+
+    val plan = CreateViewCommand(
+      name = tableIdentifier,
+      userSpecifiedColumns = Nil,
+      comment = None,
+      properties = Map.empty,
+      originalText = None,
+      plan = new SparkConnectPlanner(createView.getInput, session).transform(),
```

Review Comment: `this.transform()` ?
[GitHub] [spark] cloud-fan commented on pull request #38491: [SPARK-41058][CONNECT] Remove unused import in commands.proto
cloud-fan commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1308284185 thanks, merging to master!
[GitHub] [spark] cloud-fan closed pull request #38491: [SPARK-41058][CONNECT] Remove unused import in commands.proto
cloud-fan closed pull request #38491: [SPARK-41058][CONNECT] Remove unused import in commands.proto URL: https://github.com/apache/spark/pull/38491
[GitHub] [spark] cloud-fan commented on a diff in pull request #38490: [SPARK-41009][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1070` to `LOCATION_ALREADY_EXISTS`
cloud-fan commented on code in PR #38490: URL: https://github.com/apache/spark/pull/38490#discussion_r101748

## core/src/main/resources/error/error-classes.json:

```
@@ -668,6 +668,23 @@
       }
     }
   },
+  "LOCATION_ALREADY_EXISTS" : {
+    "message" : [
+      "Cannot create the location because it already exists."
```

Review Comment: that's for a specific legacy behavior and will go away eventually. Can we give it a dedicated error class?
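For context, entries in `error-classes.json` map an error class name to a message template. A minimal pure-Python sketch of how such a registry is typically consumed is below; the `<location>` placeholder and `format_error` helper are hypothetical illustrations, not Spark's actual error framework:

```python
import json

# A miniature registry in the same shape as the JSON snippet above.
# The <location> placeholder is hypothetical, added for illustration.
ERROR_CLASSES = json.loads("""
{
  "LOCATION_ALREADY_EXISTS": {
    "message": ["Cannot create the location <location> because it already exists."]
  }
}
""")

def format_error(error_class: str, **params: str) -> str:
    """Join the template's message lines and substitute <name> placeholders."""
    template = " ".join(ERROR_CLASSES[error_class]["message"])
    for name, value in params.items():
        template = template.replace(f"<{name}>", value)
    return template

print(format_error("LOCATION_ALREADY_EXISTS", location="/tmp/warehouse/t"))
```

A dedicated error class, as requested in the review, gives callers a stable key like `LOCATION_ALREADY_EXISTS` to match on instead of a legacy template number.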
[GitHub] [spark] amaliujia commented on a diff in pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching
amaliujia commented on code in PR #38546: URL: https://github.com/apache/spark/pull/38546#discussion_r1017484449

## python/pyspark/sql/connect/dataframe.py:

```
@@ -139,11 +139,9 @@ def columns(self) -> List[str]:
         if self._plan is None:
             return []
         if "columns" not in self._cache and self._plan is not None:
-            pdd = self.limit(0).toPandas()
```

Review Comment: I would prefer not to depend on the underlying API when doing caching... E.g. what if someday the cache on the schema is gone but this API is not aware of it, etc.
[GitHub] [spark] amaliujia commented on a diff in pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching
amaliujia commented on code in PR #38546: URL: https://github.com/apache/spark/pull/38546#discussion_r1017484449

## python/pyspark/sql/connect/dataframe.py:

```
@@ -139,11 +139,9 @@ def columns(self) -> List[str]:
         if self._plan is None:
             return []
         if "columns" not in self._cache and self._plan is not None:
-            pdd = self.limit(0).toPandas()
```

Review Comment: I would prefer not to depend on the underlying API when doing caching... E.g. what if someday the cache on the schema is gone but this API is not aware of it, etc. Basically do not make assumptions :)
[GitHub] [spark] LuciferYang commented on a diff in pull request #38575: [WIP][SPARK-40948][SQL][FOLLOWUP] Restore PATH_NOT_FOUND
LuciferYang commented on code in PR #38575: URL: https://github.com/apache/spark/pull/38575#discussion_r1017484528

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:

```
@@ -2330,7 +2330,7 @@ class DataFrameSuite extends QueryTest
             new File(uuid, "file3").getAbsolutePath,
             uuid).rdd
         }
-        assert(e.getMessage.startsWith("Path does not exist"))
+        assert(e.getErrorClass == "PATH_NOT_FOUND")
```

Review Comment: nit: Can the empty `finally` block in this case be cleaned up together in this PR?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching
zhengruifeng commented on code in PR #38546: URL: https://github.com/apache/spark/pull/38546#discussion_r1017482919 ## python/pyspark/sql/connect/dataframe.py: ## @@ -139,11 +139,9 @@ def columns(self) -> List[str]: if self._plan is None: return [] if "columns" not in self._cache and self._plan is not None: -pdd = self.limit(0).toPandas() Review Comment: maybe we don't need to cache `columns`, just return `self.schema.names`?
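The suggestion above, deriving `columns` from the schema rather than keeping a separate cache, can be sketched without Spark. The `FakeSchema`/`FakeDataFrame` classes below are hypothetical stand-ins for the Connect DataFrame, not the actual pyspark classes:

```python
# Hypothetical stand-in illustrating the suggestion: derive `columns`
# from the schema on each access instead of keeping a second cache
# that can drift out of sync with it.
from typing import List


class FakeSchema:
    def __init__(self, names: List[str]) -> None:
        self.names = names


class FakeDataFrame:
    def __init__(self, schema: FakeSchema) -> None:
        self._schema = schema

    @property
    def schema(self) -> FakeSchema:
        # In Spark Connect this would be backed by an analyze RPC;
        # here it just returns the stored schema.
        return self._schema

    @property
    def columns(self) -> List[str]:
        # No separate `columns` cache: the schema is the single source of truth.
        return self.schema.names


df = FakeDataFrame(FakeSchema(["id", "name"]))
print(df.columns)  # ['id', 'name']
```

If the schema ever changes (or its cache is invalidated), `columns` stays consistent for free, which is the point of the review comment.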
[GitHub] [spark] LuciferYang commented on pull request #38091: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case
LuciferYang commented on PR #38091: URL: https://github.com/apache/spark/pull/38091#issuecomment-1308279804 finally fixed, thanks @wankunde
[GitHub] [spark] zhengruifeng opened a new pull request, #38578: [SPARK-41064][CONNECT][PYTHON] Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab`
zhengruifeng opened a new pull request, #38578: URL: https://github.com/apache/spark/pull/38578

### What changes were proposed in this pull request?
Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab`.

### Why are the changes needed?
For API coverage.

### Does this PR introduce _any_ user-facing change?
Yes, new API.

### How was this patch tested?
Added UT.
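`crosstab` computes a pair-wise frequency (contingency) table of two columns: one row per distinct value of the first column, one column per distinct value of the second. A plain-Python sketch of that computation (an illustration of the semantics, not the Connect implementation) looks like:

```python
# Toy contingency-table computation mirroring what DataFrame.crosstab
# returns: count each (col1, col2) pair, then lay the counts out as a grid.
from collections import Counter

rows = [("a", 1), ("a", 2), ("b", 1), ("a", 1)]

pair_counts = Counter(rows)
col1_values = sorted({r[0] for r in rows})
col2_values = sorted({r[1] for r in rows})

table = {
    v1: {v2: pair_counts[(v1, v2)] for v2 in col2_values}
    for v1 in col1_values
}
print(table)  # {'a': {1: 2, 2: 1}, 'b': {1: 1, 2: 0}}
```

Pairs that never co-occur get an explicit zero, matching the dense table Spark produces.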
[GitHub] [spark] LuciferYang commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
LuciferYang commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308270119 cc @HyukjinKwon @wangyum @dongjoon-hyun @srowen @pan3793 @panbingkun Can you reproduce the issue?
[GitHub] [spark] LuciferYang opened a new pull request, #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
LuciferYang opened a new pull request, #38577: URL: https://github.com/apache/spark/pull/38577

### What changes were proposed in this pull request?
Running

```
dev/make-distribution.sh --tgz -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
```

on the master branch fails as follows:

```
[ERROR] ## Exception when compiling 19 sources to /spark-source/connector/avro/target/scala-2.12/classes
java.lang.OutOfMemoryError: Metaspace
java.lang.ClassLoader.defineClass1(Native Method)
java.lang.ClassLoader.defineClass(ClassLoader.java:757)
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
java.net.URLClassLoader.access$100(URLClassLoader.java:74)
java.net.URLClassLoader$1.run(URLClassLoader.java:369)
java.net.URLClassLoader$1.run(URLClassLoader.java:363)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:362)
java.lang.ClassLoader.loadClass(ClassLoader.java:419)
scala_maven.ScalaCompilerLoader.loadClass(ScalaCompilerLoader.java:44)
java.lang.ClassLoader.loadClass(ClassLoader.java:352)
scala.collection.immutable.Set$Set2.$plus(Set.scala:170)
scala.collection.immutable.Set$Set2.$plus(Set.scala:164)
scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:28)
scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:24)
scala.collection.TraversableLike$grouper$1$.apply(TraversableLike.scala:467)
scala.collection.TraversableLike$grouper$1$.apply(TraversableLike.scala:455)
scala.collection.mutable.HashMap$$anon$1.$anonfun$foreach$2(HashMap.scala:153)
scala.collection.mutable.HashMap$$anon$1$$Lambda$68804/294594059.apply(Unknown Source)
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:153)
scala.collection.TraversableLike.groupBy(TraversableLike.scala:524)
scala.collection.TraversableLike.groupBy$(TraversableLike.scala:454)
scala.collection.AbstractTraversable.groupBy(Traversable.scala:108)
scala.tools.nsc.settings.Warnings.$init$(Warnings.scala:91)
scala.tools.nsc.settings.MutableSettings.(MutableSettings.scala:28)
scala.tools.nsc.Settings.(Settings.scala:19)
scala.tools.nsc.Settings.(Settings.scala:20)
xsbt.CachedCompiler0.(CompilerBridge.scala:79)
```

So this PR removes the `MaxMetaspaceSize` option from `make-distribution.sh`, letting the Metaspace use more memory and allowing `make-distribution.sh` to run successfully.

### Why are the changes needed?
Make `make-distribution.sh` run successfully.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test:

```
dev/make-distribution.sh --tgz -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
```

- Before: java.lang.OutOfMemoryError: Metaspace
- After: runs successfully
[GitHub] [spark] panbingkun commented on pull request #38545: [MINOR][DOCS] Fix links in the sql-pyspark-pandas-with-arrow
panbingkun commented on PR #38545: URL: https://github.com/apache/spark/pull/38545#issuecomment-1308245910 cc @srowen
[GitHub] [spark] itholic opened a new pull request, #38576: [SPARK-41062][SQL] Rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`
itholic opened a new pull request, #38576: URL: https://github.com/apache/spark/pull/38576

### What changes were proposed in this pull request?
This PR proposes to rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`.

### Why are the changes needed?
The sub-error class name duplicates its main class, `UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY`. We should make all error class names clear and brief.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"
```
[GitHub] [spark] itholic commented on a diff in pull request #38575: [WIP][SPARK-40948][SQL][FOLLOWUP] Restore PATH_NOT_FOUND
itholic commented on code in PR #38575: URL: https://github.com/apache/spark/pull/38575#discussion_r1017450594 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -2330,7 +2330,7 @@ class DataFrameSuite extends QueryTest new File(uuid, "file3").getAbsolutePath, uuid).rdd } - assert(e.getMessage.startsWith("Path does not exist")) + assert(e.getErrorClass == "PATH_NOT_FOUND") Review Comment: The current `checkError` does not seem to allow pattern matching for `parameter` checking. Let me just check the error class for now. Otherwise, it fails on different test envs: https://github.com/apache/spark/pull/38422#discussion_r1016116865
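The point being made here, that asserting on a stable error class is more portable than matching a message prefix, can be illustrated with a toy exception type (a hypothetical sketch, not Spark's `SparkThrowable` API):

```python
# Toy exception carrying a stable error class alongside an
# environment-dependent message, showing why asserting on the class
# survives test environments where the embedded path differs.
class AnalysisError(Exception):
    def __init__(self, error_class: str, message: str) -> None:
        super().__init__(message)
        self.error_class = error_class


def read_missing_path(path: str) -> None:
    # The message embeds an absolute path that differs per test machine.
    raise AnalysisError("PATH_NOT_FOUND", f"Path does not exist: {path}")


try:
    read_missing_path("/tmp/ci-worker-42/file3")
except AnalysisError as e:
    # Stable across environments, unlike the full message text:
    assert e.error_class == "PATH_NOT_FOUND"
```

Matching `e.getMessage.startsWith(...)` style checks break as soon as the rendered message changes, which is exactly what the error-class framework is meant to avoid.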
[GitHub] [spark] WweiL commented on a diff in pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.
WweiL commented on code in PR #38503: URL: https://github.com/apache/spark/pull/38503#discussion_r1017447511 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala: ## @@ -240,25 +240,30 @@ class FlatMapGroupsInPandasWithStateSuite extends StateStoreMetricsTest { .groupBy("key") .count() -testStream(result, Complete)( - AddData(inputData, "a"), - CheckNewAnswer(("a", 1)), - AddData(inputData, "a", "b"), - // mapGroups generates ("a", "2"), ("b", "1"); so increases counts of a and b by 1 - CheckNewAnswer(("a", 2), ("b", 1)), - StopStream, - StartStream(), - AddData(inputData, "a", "b"), - // mapGroups should remove state for "a" and generate ("a", "-1"), ("b", "2") ; - // so increment a and b by 1 - CheckNewAnswer(("a", 3), ("b", 2)), - StopStream, - StartStream(), - AddData(inputData, "a", "c"), - // mapGroups should recreate state for "a" and generate ("a", "1"), ("c", "1") ; - // so increment a and c by 1 - CheckNewAnswer(("a", 4), ("b", 2), ("c", 1)) -) +// As of [SPARK-40940], multiple state operator with Complete mode is disabled by default +val exp = intercept[AnalysisException] { Review Comment: This is the test I was mentioning, @alex-balikov. Just to confirm: in the above suite, sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala, there is a query which is flatMapGroupsWithState followed by an agg in Complete mode, and it also seems to have been supported before. But after this change it throws because there are two stateful ops in Complete mode. Should we also allow this case? I'm not sure, because I remember you mentioned flatMapGroupsWithState could change the event time, so we should disallow it... So is the test wrong?
[GitHub] [spark] WweiL commented on a diff in pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.
WweiL commented on code in PR #38503: URL: https://github.com/apache/spark/pull/38503#discussion_r1017447320 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -190,20 +190,25 @@ class StreamingDeduplicationSuite extends StateStoreMetricsTest { .agg(sum("num")) .as[(String, Long)] -testStream(result, Update)( - AddData(inputData, "a" -> 1), - CheckLastBatch("a" -> 1L), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(1L, 1L)), - AddData(inputData, "a" -> 1), // Dropped - CheckLastBatch(), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(0L, 0L)), - AddData(inputData, "a" -> 2), - CheckLastBatch("a" -> 3L), - assertNumStateRows(total = Seq(1L, 2L), updated = Seq(1L, 1L)), - AddData(inputData, "b" -> 1), - CheckLastBatch("b" -> 1L), - assertNumStateRows(total = Seq(2L, 3L), updated = Seq(1L, 1L)) -) +// As of [SPARK-40940], multiple state operator with Complete mode is disabled by default Review Comment: I see. No problem at all! Just to confirm: in the above suite, sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala, there is a query which is flatMapGroupsWithState followed by an agg in Complete mode, and it also seems to have been supported before. But after this change it throws because there are two stateful ops in Complete mode. Should we also allow this case? I'm not sure, because I remember you mentioned flatMapGroupsWithState could change the event time, so we should disallow it... So is the test wrong?
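The check under discussion, rejecting certain combinations of multiple stateful operators per output mode, can be sketched as a plain function over a list of operator names. This is a deliberate simplification for illustration; Spark's actual analyzer rule distinguishes many more cases (e.g. dropDuplicates and inner equality joins do not delay output, as noted above):

```python
# Simplified sketch of an unsupported-operation check: more than one
# stateful operator is rejected when the output mode is Update or Complete.
# The operator names and the rule itself are illustrative, not Spark's code.
STATEFUL_OPS = {"aggregate", "dropDuplicates", "flatMapGroupsWithState", "join"}


def check_streaming_plan(operators, output_mode):
    stateful = [op for op in operators if op in STATEFUL_OPS]
    if len(stateful) > 1 and output_mode in ("Update", "Complete"):
        raise ValueError(
            f"Multiple stateful operators not supported in {output_mode} mode: "
            f"{stateful}"
        )


check_streaming_plan(["dropDuplicates", "aggregate"], "Append")  # passes
try:
    check_streaming_plan(["flatMapGroupsWithState", "aggregate"], "Complete")
except ValueError as e:
    print(e)
```

The debate in these comments is precisely about which rows of this decision table should be allowed, e.g. whether dropDuplicates followed by aggregation should keep working in Update/Complete mode.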
[GitHub] [spark] itholic opened a new pull request, #38575: [WIP][SPARK-40948][SQL][FOLLOWUP] Restore PATH_NOT_FOUND
itholic opened a new pull request, #38575: URL: https://github.com/apache/spark/pull/38575

### What changes were proposed in this pull request?
The original PR introducing the error class `PATH_NOT_FOUND` was reverted since it broke the tests in a different test environment. This PR proposes to restore it.

### Why are the changes needed?
Restoring the reverted changes with a proper fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The existing CI should pass.
[GitHub] [spark] AmplabJenkins commented on pull request #38535: [SPARK-41001] [CONNECT] Make `user_id` optional in SparkRemoteSession.
AmplabJenkins commented on PR #38535: URL: https://github.com/apache/spark/pull/38535#issuecomment-1308229533 Can one of the admins verify this patch?
[GitHub] [spark] gengliangwang commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
gengliangwang commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308227478

> Or part of a larger set of changes

Yes, I created Jira https://issues.apache.org/jira/browse/SPARK-41053, and this one is just the beginning.

> Maintaining state at driver on disk backed store and copying that to dfs has a few things which impact it - particularly for larger applications.

Yes, but it is worth starting the effort. We can keep developing it until it is ready. I ran a benchmark with the Protobuf serializer, and the results are promising. For large applications, the current in-memory live UI can create memory pressure that affects driver stability as well.
[GitHub] [spark] gengliangwang commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
gengliangwang commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308222974

> It seems to not match what is described in the pr description or add equivalent functionality which was reverted

I believe it covers the reverted PR https://github.com/apache/spark/pull/38542. The previous approach was to create an optional disk store for storing a subset of the `SQLExecutionUIData`. It also had two shortcomings:
* There are no evictions for the UI data on disk
* It requires extra memory for reading/writing LevelDB/RocksDB
[GitHub] [spark] WweiL commented on a diff in pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.
WweiL commented on code in PR #38503: URL: https://github.com/apache/spark/pull/38503#discussion_r1017435393 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala: ## @@ -188,17 +194,26 @@ class UnsupportedOperationsSuite extends SparkFunSuite with SQLHelper { expectedMsgs = Seq("Complete")) // FlatMapGroupsWithState(Update) in streaming with aggregation - for (outputMode <- Seq(Append, Update, Complete)) { + for (outputMode <- Seq(Update, Complete)) { assertNotSupportedInStreamingPlan( "flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation " + s"with aggregation in $outputMode mode", TestFlatMapGroupsWithState( null, att, att, Seq(att), Seq(att), att, null, Update, isMapGroupsWithState = false, null, Aggregate(Seq(attributeWithWatermark), aggExprs("c"), streamRelation)), outputMode = outputMode, - expectedMsgs = Seq("flatMapGroupsWithState in update mode", "with aggregation")) + expectedMsgs = Seq("Multiple stateful operators", "Update", "Complete")) } + assertNotSupportedInStreamingPlan( +"flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation " + + s"with aggregation in Append mode", +TestFlatMapGroupsWithState( + null, att, att, Seq(att), Seq(att), att, null, Update, isMapGroupsWithState = false, null, Review Comment: Sorry, I didn't really get it; I lack the context here. Do you mean flatMapGroupsWithState(Update) with aggregation and output mode Append doesn't make sense? I think there was previously a test on each output mode: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala#L191 The reason I separate (Update, Complete) from Append is that, with the change, (Update, Complete) are caught before the original check, so their error message changes. But in Append mode the query is still blocked by the original error message.
[GitHub] [spark] alex-balikov commented on a diff in pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.
alex-balikov commented on code in PR #38503: URL: https://github.com/apache/spark/pull/38503#discussion_r1017413562 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -190,20 +190,25 @@ class StreamingDeduplicationSuite extends StateStoreMetricsTest { .agg(sum("num")) .as[(String, Long)] -testStream(result, Update)( - AddData(inputData, "a" -> 1), - CheckLastBatch("a" -> 1L), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(1L, 1L)), - AddData(inputData, "a" -> 1), // Dropped - CheckLastBatch(), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(0L, 0L)), - AddData(inputData, "a" -> 2), - CheckLastBatch("a" -> 3L), - assertNumStateRows(total = Seq(1L, 2L), updated = Seq(1L, 1L)), - AddData(inputData, "b" -> 1), - CheckLastBatch("b" -> 1L), - assertNumStateRows(total = Seq(2L, 3L), updated = Seq(1L, 1L)) -) +// As of [SPARK-40940], multiple state operator with Complete mode is disabled by default Review Comment: hmm, the above scenario dropDuplicates -> aggregation was supported before. So I was wrong - dropDuplicates and also inner join with a timestamp equality condition can be followed by a stateful operator in any mode; these operators (dropDuplicates and inner equality join) do not delay the output records. I apologize for the randomization; I think the above scenario is important to continue supporting.
## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -215,20 +220,25 @@ .agg(sum("num")) .as[(String, Long)] -testStream(result, Complete)( - AddData(inputData, "a" -> 1), - CheckLastBatch("a" -> 1L), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(1L, 1L)), - AddData(inputData, "a" -> 1), // Dropped - CheckLastBatch("a" -> 1L), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(0L, 0L)), - AddData(inputData, "a" -> 2), - CheckLastBatch("a" -> 3L), - assertNumStateRows(total = Seq(1L, 2L), updated = Seq(1L, 1L)), - AddData(inputData, "b" -> 1), - CheckLastBatch("a" -> 3L, "b" -> 1L), - assertNumStateRows(total = Seq(2L, 3L), updated = Seq(1L, 1L)) -) +// As of [SPARK-40940], multiple state operator with Complete mode is disabled by default +val exp = intercept[AnalysisException] { Review Comment: same here - the above scenario should work ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala: ## @@ -188,17 +194,26 @@ class UnsupportedOperationsSuite extends SparkFunSuite with SQLHelper { expectedMsgs = Seq("Complete")) // FlatMapGroupsWithState(Update) in streaming with aggregation - for (outputMode <- Seq(Append, Update, Complete)) { + for (outputMode <- Seq(Update, Complete)) { assertNotSupportedInStreamingPlan( "flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation " + s"with aggregation in $outputMode mode", TestFlatMapGroupsWithState( null, att, att, Seq(att), Seq(att), att, null, Update, isMapGroupsWithState = false, null, Aggregate(Seq(attributeWithWatermark), aggExprs("c"), streamRelation)), outputMode = outputMode, - expectedMsgs = Seq("flatMapGroupsWithState in update mode", "with aggregation")) + expectedMsgs = Seq("Multiple stateful operators", "Update", "Complete")) } + assertNotSupportedInStreamingPlan( +"flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation " + + s"with aggregation in Append mode", +TestFlatMapGroupsWithState( + null, att, att, Seq(att), Seq(att), att, null, Update, isMapGroupsWithState = false, null, Review Comment: I am not sure what the semantics of setting output mode Update on flatMapGroupsWithState are, but it does not match the Append output mode below.
[GitHub] [spark] 19Serhii99 opened a new pull request, #38574: [SPARK-41060] [K8S] Made the spark submitter generate new names for driver and executor config maps
19Serhii99 opened a new pull request, #38574: URL: https://github.com/apache/spark/pull/38574

### What changes were proposed in this pull request?
There is a problem with submitting Spark jobs to a K8s cluster: the library generates and reuses the same name for config maps (for drivers and executors). Ideally, two config maps should be created for each job: one for the driver and one for the executor. However, the library creates only one driver config map for all jobs (in some cases it generates only one executor map for all jobs). So, if I run 5 jobs, only one driver config map is generated and used for every job. During those runs we experience issues when deleting pods from the cluster: executor pods are endlessly created and immediately terminated, overloading cluster resources. This problem occurs because of the **KubernetesClientUtils** class, in which **configMapNameExecutor** and **configMapNameDriver** are constants. This seems incorrect and should be fixed urgently. I've prepared some changes for review to fix the issue (tested in the cluster of our project).

### Why are the changes needed?
To make the Spark submitter generate new names for driver and executor config maps so jobs complete successfully without overloading cluster resources.

### Does this PR introduce _any_ user-facing change?
Executor pods should stop being generated and terminated endlessly.

### How was this patch tested?
Modified the unit tests in the module. Tested with the unit tests and by submitting jobs to the K8s cluster of our project.
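The fix described above amounts to deriving the config map name from a per-job prefix plus a fresh suffix, instead of reusing a shared constant. A hypothetical sketch (the function name, prefix, and UUID suffix are illustrative; the real change lives in Scala in `KubernetesClientUtils`):

```python
# Hypothetical sketch of per-job config map naming: a unique suffix
# (here a short UUID) replaces the former shared constant, so two
# submissions never collide on the same Kubernetes ConfigMap name.
import uuid


def driver_config_map_name(resource_name_prefix: str) -> str:
    # resource_name_prefix is already unique-ish per submission in Spark on K8s;
    # the extra suffix guards against reuse across submissions.
    return f"{resource_name_prefix}-driver-conf-map-{uuid.uuid4().hex[:8]}"


name_a = driver_config_map_name("spark-pi")
name_b = driver_config_map_name("spark-pi")
print(name_a, name_b)  # two distinct names for two submissions
```

With a constant name, a second submission silently mounts (or overwrites) the first job's config map, which matches the pod-churn symptom described in the PR.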
[GitHub] [spark] mridulm commented on pull request #38064: [SPARK-40622][SQL][CORE]Result of a single task in collect() must fit in 2GB
mridulm commented on PR #38064: URL: https://github.com/apache/spark/pull/38064#issuecomment-1308159592 The test failures appear to be due to the memory requirements of the test 'collect data with single partition larger than 2GB bytes array limit' being too high, causing OOM.
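The 2 GB ceiling referenced in the PR title comes from JVM byte arrays being indexed by a signed 32-bit `int`, so a single `byte[]` tops out at `Integer.MAX_VALUE` bytes. A quick sketch of the arithmetic and a chunk-count helper (illustrative only, not Spark's implementation):

```python
# JVM arrays are indexed by a signed 32-bit int, so a single byte[]
# holds at most 2**31 - 1 bytes (just under 2 GiB). A serialized task
# result must fit under this limit or be split into multiple chunks.
MAX_JVM_ARRAY_BYTES = 2**31 - 1  # Integer.MAX_VALUE

def num_chunks(result_size_bytes: int, chunk_size: int = MAX_JVM_ARRAY_BYTES) -> int:
    # Ceiling division: how many byte arrays are needed to hold the result.
    return -(-result_size_bytes // chunk_size)

print(num_chunks(2**31 - 1))  # 1: exactly at the limit
print(num_chunks(2**31))      # 2: one byte over needs a second chunk
```

This is also why a test that materializes a >2 GB single-partition result is inherently memory-hungry: the driver must hold the whole result (plus serialization overhead) in memory at once.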
[GitHub] [spark] amaliujia opened a new pull request, #38573: [SPARK-41061][CONNECT] Support SelectExpr which applies Projection by expressions in Strings in Connect DSL
amaliujia opened a new pull request, #38573: URL: https://github.com/apache/spark/pull/38573

### What changes were proposed in this pull request?
1. Support `def selectExpr(exprs: String*)` in the Connect DSL.
2. The server side supports translating expressions given as strings.

### Why are the changes needed?
Improve API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT
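`selectExpr` takes projection expressions as strings and parses them server-side. A toy version over list-of-dict rows, supporting only `col` and `col as alias` (far simpler than Spark's SQL expression parser), conveys the shape of the API:

```python
# Toy selectExpr: each string is either "name" or "name as alias".
# Spark's real implementation parses full SQL expressions; this sketch
# only illustrates string-driven projection.
def select_expr(rows, *exprs):
    projections = []
    for e in exprs:
        parts = e.split(" as ")
        source = parts[0].strip()
        alias = parts[1].strip() if len(parts) == 2 else source
        projections.append((source, alias))
    return [{alias: row[src] for src, alias in projections} for row in rows]


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(select_expr(rows, "id as key", "name"))
# [{'key': 1, 'name': 'a'}, {'key': 2, 'name': 'b'}]
```

The appeal of the string form is ergonomic: the client ships opaque strings and the server's parser gives them meaning, so the client DSL stays thin.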
[GitHub] [spark] itholic opened a new pull request, #38572: [SPARK-41059][SQL] Rename `_LEGACY_ERROR_TEMP_2420` to `NESTED_AGGREGATE_FUNCTION`
itholic opened a new pull request, #38572: URL: https://github.com/apache/spark/pull/38572

### What changes were proposed in this pull request?
This PR proposes to rename `_LEGACY_ERROR_TEMP_2420` to `NESTED_AGGREGATE_FUNCTION`.

### Why are the changes needed?
We should rename the LEGACY errors properly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"
```
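A nested aggregate function, the condition the renamed error class describes, is an aggregate applied to another aggregate's result, e.g. `SUM(AVG(x))` in SQL, which analyzers reject. A toy recursive check over a tuple-based expression tree (not Spark's Catalyst representation) shows the idea:

```python
# Expressions are nested tuples: ("sum", ("avg", "x")) means SUM(AVG(x)).
# A nested aggregate is an aggregate call appearing inside another one.
AGGREGATES = {"sum", "avg", "min", "max", "count"}


def has_nested_aggregate(expr, inside_aggregate=False):
    if not isinstance(expr, tuple):
        return False  # a bare column reference
    func, *args = expr
    if func in AGGREGATES and inside_aggregate:
        return True
    nested = inside_aggregate or func in AGGREGATES
    return any(has_nested_aggregate(a, nested) for a in args)


print(has_nested_aggregate(("sum", ("avg", "x"))))  # True: would raise the error
print(has_nested_aggregate(("sum", "x")))           # False: plain aggregate is fine
```

Naming the class `NESTED_AGGREGATE_FUNCTION` makes the failing condition self-describing, which is the goal of retiring the `_LEGACY_ERROR_TEMP_*` placeholders.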
[GitHub] [spark] ulysses-you commented on a diff in pull request #38558: [SPARK-41048][SQL] Improve output partitioning and ordering with AQE cache
ulysses-you commented on code in PR #38558: URL: https://github.com/apache/spark/pull/38558#discussion_r1017369252 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -209,6 +209,19 @@ case class AdaptiveSparkPlanExec( override def output: Seq[Attribute] = inputPlan.output + // Try our best to give a stable output partitioning and ordering. Review Comment: Unfortunately we hit this: in my experience, users often cache an arbitrary df and use the cached df to build another arbitrary df. So why can't we preserve the partitioning/ordering of the cached plan? If you really feel this is inconsistent in AdaptiveSparkPlanExec, we can probably move it to InMemoryRelationExec. My original idea was to do both, but that felt like a bit of overkill (requiredOrdering should be inferred separately, like https://github.com/apache/spark/pull/35924):

```scala
requiredDistribution.map(_.createPartitioning(conf.shufflePartitions)).getOrElse {
  if (isFinalPlan) {
    executedPlan.outputPartitioning
  } else {
    super.outputPartitioning
  }
}
```

A useful distribution before caching is rare in production, since `repartition(col)` will introduce skew.
[GitHub] [spark] srielau commented on a diff in pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
srielau commented on code in PR #38569: URL: https://github.com/apache/spark/pull/38569#discussion_r1017369097 ## core/src/main/resources/error/error-classes.json: ## @@ -469,6 +469,11 @@ "Grouping sets size cannot be greater than " ] }, + "GROUP_BY_AGGREGATE" : { Review Comment: I like it that they are distinct. GROUP_BY_POS_... might be fat-fingering; GROUP_BY_AGGREGATE is: you don't know how GROUP BY works... Have we reviewed GROUP_BY_POS_REFERS_AGG_EXPR? That's not very pretty.
[GitHub] [spark] LuciferYang commented on pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
LuciferYang commented on PR #38550: URL: https://github.com/apache/spark/pull/38550#issuecomment-1308146770 Thanks @srowen
[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #34815: [SPARK-37555][SQL] spark-sql should pass last unclosed comment to backend
AngersZh commented on code in PR #34815: URL: https://github.com/apache/spark/pull/34815#discussion_r1017365581 ## sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala: ## @@ -620,4 +620,17 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging { |""".stripMargin -> "SELECT 1" ) } + + test("SPARK-37555: spark-sql should pass last unclosed comment to backend") { Review Comment: > @AngersZh shall we increase the timeout? The github action machines are not stable. Yea, the logic should be ok... I added the PR below, so we can trigger it many times to see if it's stable: https://github.com/apache/spark/pull/38571
[GitHub] [spark] srowen commented on pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
srowen commented on PR #38550: URL: https://github.com/apache/spark/pull/38550#issuecomment-1308144543 Merged to master
[GitHub] [spark] srowen closed pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
srowen closed pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13 URL: https://github.com/apache/spark/pull/38550
[GitHub] [spark] AngersZhuuuu opened a new pull request, #38571: [SPARK-37555][TEST][FOLLOWUP] Increase timeout of CLI test `spark-sql should pass last unclosed comment to backend`
AngersZh opened a new pull request, #38571: URL: https://github.com/apache/spark/pull/38571 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
cloud-fan commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017361159 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala: ## @@ -89,10 +88,8 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic buildKeys: Seq[Attribute], pruningKeys: Seq[Attribute]): Expression = { -val buildQuery = Project(buildKeys, matchingRowsPlan) -val dynamicPruningSubqueries = pruningKeys.zipWithIndex.map { case (key, index) => - DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false) -} -dynamicPruningSubqueries.reduce(And) +val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan) Review Comment: My rationale is, what we really need is a subquery here. This is completely different from dynamic partition pruning. One limitation is DS v2 runtime filter pushdown only applies to `DynamicPruningExpression`. We can probably fix that and accept normal non-correlated subqueries as well.
[GitHub] [spark] amaliujia commented on pull request #38491: [SPARK-41058][CONNECT] Remove unused import in commands.proto
amaliujia commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1308133299 LGTM
[GitHub] [spark] HyukjinKwon opened a new pull request, #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2
HyukjinKwon opened a new pull request, #38570: URL: https://github.com/apache/spark/pull/38570 ### What changes were proposed in this pull request? This PR proposes to keep `R_LIBS_SITE` as it was; its behavior changed in R 4.2. ### Why are the changes needed? To keep the behaviour the same as before. This especially affects external libraries installed on SparkR worker sides, and can break user-installed libraries. See the paths below:

**R 4.2**
```
# R
> Sys.getenv("R_LIBS_SITE")
[1] "/usr/local/lib/R/site-library/:/usr/lib/R/site-library:/usr/lib/R/library'"

# R --vanilla
> Sys.getenv("R_LIBS_SITE")
[1] "/usr/lib/R/site-library"
```

**R 4.1**
```
# R
> Sys.getenv("R_LIBS_SITE")
[1] "/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library"

# R --vanilla
> Sys.getenv("R_LIBS_SITE")
[1] "/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library"
```

### Does this PR introduce _any_ user-facing change? Yes. With R 4.2, user-installed libraries won't be found in SparkR workers. ### How was this patch tested? Manually tested; a unit test was added. It's difficult to add an e2e test. I also manually tested `getROptions` in the Scala shell.
[GitHub] [spark] itholic commented on a diff in pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
itholic commented on code in PR #38569: URL: https://github.com/apache/spark/pull/38569#discussion_r1017351734 ## core/src/main/resources/error/error-classes.json: ## @@ -469,6 +469,11 @@ "Grouping sets size cannot be greater than " ] }, + "GROUP_BY_AGGREGATE" : { Review Comment: I'm not quite sure, since `GROUP_BY_POS_REFERS_AGG_EXPR` requires ``, but `GROUP_BY_AGGREGATE` only cares whether the aggregate function is used in the GROUP BY clause or not. Maybe we can manually pass the missing value for `` in `GROUP_BY_AGGREGATE`, but I'm not sure how much that makes sense. WDYT, @srielau @MaxGekk ?
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
aokolnychyi commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017352098 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala: ## @@ -89,10 +88,8 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic buildKeys: Seq[Attribute], pruningKeys: Seq[Attribute]): Expression = { -val buildQuery = Project(buildKeys, matchingRowsPlan) -val dynamicPruningSubqueries = pruningKeys.zipWithIndex.map { case (key, index) => - DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false) -} -dynamicPruningSubqueries.reduce(And) +val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan) Review Comment: Got it. I was originally worried we could miss some future optimizations given that dynamic pruning for row-level operations would go through a different route compared to the normal DPP. One alternative could be to extend `DynamicPruningSubquery` with a flag whether it should be optimized or not. Up to you, though.
[GitHub] [spark] LuciferYang commented on a diff in pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
LuciferYang commented on code in PR #38569: URL: https://github.com/apache/spark/pull/38569#discussion_r1017347010 ## core/src/main/resources/error/error-classes.json: ## @@ -469,6 +469,11 @@ "Grouping sets size cannot be greater than " ] }, + "GROUP_BY_AGGREGATE" : { Review Comment: Could we re-use `GROUP_BY_POS_REFERS_AGG_EXPR`?
[GitHub] [spark] LuciferYang commented on a diff in pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
LuciferYang commented on code in PR #38569: URL: https://github.com/apache/spark/pull/38569#discussion_r1017346760 ## core/src/main/resources/error/error-classes.json: ## @@ -469,6 +469,11 @@ "Grouping sets size cannot be greater than " ] }, + "GROUP_BY_AGGREGATE" : { Review Comment: looks similar to `GROUP_BY_POS_REFERS_AGG_EXPR` https://github.com/apache/spark/blob/0add57a1c0290a158666027afb3e035728d2dcee/core/src/main/resources/error/error-classes.json#L478-L483
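To make the comparison concrete, here is a hedged sketch of how the two entries under discussion might sit side by side in `error-classes.json`; the message texts are illustrative placeholders based on this thread, not the actual file contents:

```json
{
  "GROUP_BY_AGGREGATE" : {
    "message" : [
      "Aggregate functions are not allowed in GROUP BY, but found <sqlExpr>."
    ]
  },
  "GROUP_BY_POS_REFERS_AGG_EXPR" : {
    "message" : [
      "GROUP BY <index> refers to an expression that contains an aggregate function. Aggregate functions are not allowed in GROUP BY."
    ]
  }
}
```

The `<index>` parameter is exactly what a plain `GROUP_BY_AGGREGATE` error has no value for, which is the reuse concern itholic raises above.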
[GitHub] [spark] amaliujia commented on pull request #38491: [MINOR][CONNECT] Remove unused import in commands.proto
amaliujia commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1308120960 @dengziming I missed this PR because there was no JIRA created under https://issues.apache.org/jira/browse/SPARK-39375 (I was monitoring the work happening there). Since you are doing this change, can you create a JIRA for removing unused code (then update the PR title to include the JIRA) and, while at it, remove some other unused code in the connect project? I see one example at https://github.com/apache/spark/blob/0add57a1c0290a158666027afb3e035728d2dcee/connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala#L61. There are probably other unused imports that we can further clean up.
[GitHub] [spark] mridulm commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
mridulm commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308118224 Also note that we cannot avoid parsing event files at the history server with a db generated at the driver, unless the configs match for both (retained stages, tasks, queries, etc.), particularly as these can be specified by users (for example, to aggressively clean up for perf-sensitive apps).
[GitHub] [spark] cloud-fan commented on pull request #38491: [MINOR][CONNECT] Remove unused import in commands.proto
cloud-fan commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1308116015 @amaliujia can you take a look?
[GitHub] [spark] mridulm commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
mridulm commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308113372 To comment on the proposal in the description, based on past prototypes I have worked on/seen: maintaining state at the driver in a disk-backed store and copying that to DFS has a few issues that impact it, particularly for larger applications. It is not very robust to application crashes, interacts in nontrivial ways with the shutdown hook (HDFS failures), and increases application termination time during graceful shutdown. Depending on application characteristics, a disk-backed store can positively or negatively impact driver performance (positively, as updates are faster due to indexing, which the in-memory store lacked (when I added an index, memory requirements increased :-( ); negatively, due to increased disk activity): it was difficult to predict.
[GitHub] [spark] itholic commented on pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
itholic commented on PR #38569: URL: https://github.com/apache/spark/pull/38569#issuecomment-1308111274 cc @MaxGekk @srielau FYI
[GitHub] [spark] itholic opened a new pull request, #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
itholic opened a new pull request, #38569: URL: https://github.com/apache/spark/pull/38569 ### What changes were proposed in this pull request? This PR proposes to rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`. ### Why are the changes needed? To use a proper error class name. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` ./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*" ```
[GitHub] [spark] mridulm commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
mridulm commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308103370 This PR mostly adds the ability to use a disk-backed store in addition to the in-memory store. It does not seem to match what is described in the PR description: is this WIP?
[GitHub] [spark] LuciferYang commented on pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes
LuciferYang commented on PR #38507: URL: https://github.com/apache/spark/pull/38507#issuecomment-130810 GA passed. @MaxGekk, could you help review this PR again? Thanks.
[GitHub] [spark] LuciferYang commented on pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
LuciferYang commented on PR #38550: URL: https://github.com/apache/spark/pull/38550#issuecomment-1308100945 Maven-tested all UTs with Scala 2.13 and this PR; they passed.
[GitHub] [spark] maryannxue commented on a diff in pull request #38558: [SPARK-41048][SQL] Improve output partitioning and ordering with AQE cache
maryannxue commented on code in PR #38558: URL: https://github.com/apache/spark/pull/38558#discussion_r1017328200 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -209,6 +209,19 @@ case class AdaptiveSparkPlanExec( override def output: Seq[Attribute] = inputPlan.output + // Try our best to give a stable output partitioning and ordering. Review Comment: This would be of super limited use... and cause inconsistency. I'd only return the output partitioning if there is a user repartition op at the end. In other words, only if the AQE plan is required to preserve user-specified partitioning.
[GitHub] [spark] cloud-fan commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
cloud-fan commented on code in PR #38263: URL: https://github.com/apache/spark/pull/38263#discussion_r1017320620

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:

```scala
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.commons.codec.digest.DigestUtils
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(str) - Returns hash of the given value. The hash is consistent and can be used to join masked values together across tables",
```

Review Comment: Since we already have `sha2`, this PR is just adding an alias. We need to prove that this alias is common in the industry. Except for Hive, do other systems/products use these names?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
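For context on the alias discussion above: a `mask_hash`-style function is only useful for masking because the underlying SHA-2 digest is deterministic, so equal inputs always produce equal tokens and masked columns remain joinable across tables. A minimal sketch of that property using Python's standard `hashlib` (an illustration of the semantics only, not Spark's implementation; the dictionaries are made-up sample data):

```python
import hashlib

def sha256_hex(value: str) -> str:
    """Hex-encoded SHA-256 digest, analogous to what sha2(col, 256) returns for a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# The digest is deterministic: masking the same key in two tables yields the
# same token, so an equi-join on the masked column still matches correctly.
left = {sha256_hex(k): v for k, v in {"alice": 1, "bob": 2}.items()}
right = {sha256_hex(k): v for k, v in {"alice": 10, "carol": 30}.items()}
joined = {h: (left[h], right[h]) for h in left.keys() & right.keys()}  # only "alice" matches
```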
[GitHub] [spark] cloud-fan commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
cloud-fan commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017318193

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:

```scala
@@ -320,6 +320,9 @@ abstract class Optimizer(catalogManager: CatalogManager)
   }
   def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressionsWithPruning(
     _.containsPattern(PLAN_EXPRESSION), ruleId) {
+    // Do not optimize DPP subquery, as it was created from optimized plan and we should not
+    // optimize it again, to save optimization time and avoid breaking broadcast/subquery reuse.
+    case d: DynamicPruningSubquery => d
```

Review Comment: Yes, because this PR adds `OptimizeSubqueries` to the batch `PartitionPruning` and we should not break https://github.com/apache/spark/pull/33664
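The shape of the fix discussed above is a tree-transform rule that returns already-optimized subtrees unchanged instead of descending into them. A hedged, Spark-free sketch of that pattern (all names here are illustrative, not Spark's actual classes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    already_optimized: bool = False  # stands in for "is a DynamicPruningSubquery"
    children: tuple = ()

def optimize(node: Node, rule) -> Node:
    # Mirror of `case d: DynamicPruningSubquery => d`: return the subtree as-is,
    # skipping both the rule and the recursion into its children.
    if node.already_optimized:
        return node
    node = rule(node)
    return Node(node.name, node.already_optimized,
                tuple(optimize(c, rule) for c in node.children))

# Hypothetical usage: a rule that uppercases every node name. The marked
# subtree (and everything under it) must come back untouched.
tree = Node("root", children=(
    Node("dpp", already_optimized=True, children=(Node("inner"),)),
    Node("plain"),
))
out = optimize(tree, lambda n: Node(n.name.upper(), n.already_optimized, n.children))
```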
[GitHub] [spark] bersprockets commented on pull request #38565: [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates`
bersprockets commented on PR #38565: URL: https://github.com/apache/spark/pull/38565#issuecomment-1308083533 @HyukjinKwon Thanks!
[GitHub] [spark] cloud-fan commented on a diff in pull request #34815: [SPARK-37555][SQL] spark-sql should pass last unclosed comment to backend
cloud-fan commented on code in PR #34815: URL: https://github.com/apache/spark/pull/34815#discussion_r1017317387

## sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala:

```scala
@@ -620,4 +620,17 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging {
         |""".stripMargin -> "SELECT 1"
     )
   }
+
+  test("SPARK-37555: spark-sql should pass last unclosed comment to backend") {
```

Review Comment: @AngersZh shall we increase the timeout? The GitHub Actions machines are not stable.
[GitHub] [spark] cloud-fan commented on pull request #38419: [SPARK-40945][SQL] Support built-in function to truncate numbers
cloud-fan commented on PR #38419: URL: https://github.com/apache/spark/pull/38419#issuecomment-1308082603 ceil/floor also takes a second parameter for num of digits to retain.
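For readers following the API discussion: truncation to a given number of digits, like `floor`/`ceil` with a scale argument, reduces to scaling by a power of ten, applying the integer operation, and scaling back. A rough float-based sketch of that semantics (an illustration only; Spark's built-ins operate on Decimal, so exact behavior on edge cases differs from naive floats):

```python
import math

def trunc_number(x: float, scale: int = 0) -> float:
    """Truncate x toward zero, keeping `scale` decimal digits."""
    factor = 10 ** scale
    return math.trunc(x * factor) / factor

def ceil_scale(x: float, scale: int = 0) -> float:
    """ceil with a digits-to-retain parameter, same scale-then-round shape."""
    factor = 10 ** scale
    return math.ceil(x * factor) / factor
```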
[GitHub] [spark] viirya commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
viirya commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017317077

## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala:

```scala
@@ -66,7 +65,7 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic
     }
     // optimize subqueries to rewrite them as joins and trigger job planning
```

Review Comment: This comment can be removed.
[GitHub] [spark] viirya commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
viirya commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017316276

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:

```scala
@@ -320,6 +320,9 @@ abstract class Optimizer(catalogManager: CatalogManager)
   }
   def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressionsWithPruning(
     _.containsPattern(PLAN_EXPRESSION), ruleId) {
+    // Do not optimize DPP subquery, as it was created from optimized plan and we should not
+    // optimize it again, to save optimization time and avoid breaking broadcast/subquery reuse.
+    case d: DynamicPruningSubquery => d
```

Review Comment: This makes sense. Just wondering, is this particularly related to SPARK-38959?
[GitHub] [spark] HyukjinKwon commented on pull request #38565: [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates`
HyukjinKwon commented on PR #38565: URL: https://github.com/apache/spark/pull/38565#issuecomment-1308081567 Merged to master, branch-3.3, and branch-3.2.
[GitHub] [spark-docker] Yikun closed pull request #21: [SPARK-40569][TESTS] Add smoke test in standalone cluster for spark-docker
Yikun closed pull request #21: [SPARK-40569][TESTS] Add smoke test in standalone cluster for spark-docker URL: https://github.com/apache/spark-docker/pull/21
[GitHub] [spark-docker] Yikun commented on pull request #21: [SPARK-40569][TESTS] Add smoke test in standalone cluster for spark-docker
Yikun commented on PR #21: URL: https://github.com/apache/spark-docker/pull/21#issuecomment-1308072699 Merge to master
[GitHub] [spark-docker] Yikun commented on pull request #21: [SPARK-40569][TESTS] Add smoke test in standalone cluster for spark-docker
Yikun commented on PR #21: URL: https://github.com/apache/spark-docker/pull/21#issuecomment-1308072327

```
testing/run_tests.sh --image-url ghcr.io/yikun/spark-docker/spark:python3 --scala-version 2.12 --spark-version 3.3.0
===> Smoke test for ghcr.io/yikun/spark-docker/spark:python3 as root
===> Starting spark-master
===> Starting spark-worker
===> Waiting for spark-master to be ready...
===> spark-master is ready.
===> Waiting for spark-worker to be ready...
===> spark-worker is ready.
===> Starting spark-submit
// ...
Pi is roughly 3.140891570445785
==> Killing 2 orphaned container(s)... done.
===> Smoke test for ghcr.io/yikun/spark-docker/spark:python3 as non-root
===> Starting spark-master
===> Starting spark-worker
===> Waiting for spark-master to be ready...
===> spark-master is ready.
===> Waiting for spark-worker to be ready...
===> spark-worker is ready.
===> Starting spark-submit
// ...
Pi is roughly 3.140739570369785
// ...
==> Killing 2 orphaned container(s)... done.
Test successfully finished
```

Also retested it in my arm env.
[GitHub] [spark] AmplabJenkins commented on pull request #38541: [SPARK-41034][CONNECT][PYTHON] Connect DataFrame should require a RemoteSparkSession
AmplabJenkins commented on PR #38541: URL: https://github.com/apache/spark/pull/38541#issuecomment-1308071934 Can one of the admins verify this patch?
[GitHub] [spark] HyukjinKwon closed pull request #38565: [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates`
HyukjinKwon closed pull request #38565: [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates` URL: https://github.com/apache/spark/pull/38565
[GitHub] [spark] Narcasserun opened a new pull request, #38568: [SPARK-41051][CORE] Optimize ProcfsMetrics file acquisition
Narcasserun opened a new pull request, #38568: URL: https://github.com/apache/spark/pull/38568

What changes were proposed in this pull request?
Reuse variables from the declared procfs files instead of duplicating code.

Why are the changes needed?
The cost of looking up the config is often insignificant, but this removes some duplicate code.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Existing unit tests.

Closes #41051 from sus/[SPARK-41051](https://issues.apache.org/jira/browse/SPARK-41051).
[GitHub] [spark] cloud-fan commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
cloud-fan commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017300121

## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala:

```scala
@@ -89,10 +88,8 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic
       buildKeys: Seq[Attribute],
       pruningKeys: Seq[Attribute]): Expression = {
-    val buildQuery = Project(buildKeys, matchingRowsPlan)
-    val dynamicPruningSubqueries = pruningKeys.zipWithIndex.map { case (key, index) =>
-      DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false)
-    }
-    dynamicPruningSubqueries.reduce(And)
+    val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan)
```

Review Comment: I don't see any downside. We can only reuse broadcast if the DPP filter is derived from a join, which doesn't apply here. What's better, now this is a normal subquery and we can trigger subquery reuse, which is not possible for DPP subqueries.
[GitHub] [spark] zhengruifeng commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
zhengruifeng commented on PR #38318: URL: https://github.com/apache/spark/pull/38318#issuecomment-1308066292 thanks @cloud-fan @HyukjinKwon @amaliujia for reviews!
[GitHub] [spark] ulysses-you commented on a diff in pull request #38558: [SPARK-41048][SQL] Improve output partitioning and ordering with AQE cache
ulysses-you commented on code in PR #38558: URL: https://github.com/apache/spark/pull/38558#discussion_r1017299064

## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:

```scala
@@ -209,6 +209,19 @@ case class AdaptiveSparkPlanExec(
   override def output: Seq[Attribute] = inputPlan.output

+  // Try our best to give a stable output partitioning and ordering.
```

Review Comment: yes, in general the first action for a cached plan is `count`, e.g. `CacheTableAsSelectExec`, so I think it is not a big issue that we cannot optimize the shuffle/sort for the first action. The usage of the cache is: the user wants to reference it multiple times, and then this optimization will help a lot.
[GitHub] [spark] cloud-fan closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
cloud-fan closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary` URL: https://github.com/apache/spark/pull/38318
[GitHub] [spark] gengliangwang closed pull request #38542: Revert "[SPARK-38550][SQL][CORE] Use a disk-based store to save more debug information for live UI"
gengliangwang closed pull request #38542: Revert "[SPARK-38550][SQL][CORE] Use a disk-based store to save more debug information for live UI" URL: https://github.com/apache/spark/pull/38542
[GitHub] [spark] cloud-fan commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
cloud-fan commented on PR #38318: URL: https://github.com/apache/spark/pull/38318#issuecomment-1308063610 thanks, merging to master!