[GitHub] [spark] dongjoon-hyun commented on pull request #36386: [SPARK-38918][SQL][3.2] Nested column pruning should filter out attributes that do not belong to the current relation
dongjoon-hyun commented on PR #36386: URL: https://github.com/apache/spark/pull/36386#issuecomment-1135428525 Thank you for rechecking. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36628: [SPARK-39248][SQL] Improve divide performance for decimal type
cloud-fan commented on code in PR #36628: URL: https://github.com/apache/spark/pull/36628#discussion_r880068159 ## sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out: ## @@ -142,6 +142,108 @@ struct<(12345678912345.123456789123 / 1.2345678E-8):decimal(38,9)> 100073899961059796.725866332 +-- !query +select 1.0123456789012345678901234567890123456e36BD / 0.1 +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkArithmeticException +[CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded, 10123456789012345678901234567890123456.00, 76, 38) cannot be represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. +== SQL(line 1, position 7) == +select 1.0123456789012345678901234567890123456e36BD / 0.1 + ^^ + + +-- !query +select 1.01234567890123456789012345678901234567e36BD / 0.1 +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.catalyst.parser.ParseException + +decimal can only support precision up to 38(line 1, pos 7) + +== SQL == +select 1.01234567890123456789012345678901234567e36BD / 0.1 Review Comment: I don't think this test is useful. It fails at the parser side.
[GitHub] [spark] cloud-fan commented on a diff in pull request #36628: [SPARK-39248][SQL] Improve divide performance for decimal type
cloud-fan commented on code in PR #36628: URL: https://github.com/apache/spark/pull/36628#discussion_r880067844
## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala:
@@ -502,7 +502,8 @@ final class Decimal extends Ordered[Decimal] with Serializable {
     Decimal(toJavaBigDecimal.multiply(that.toJavaBigDecimal, MATH_CONTEXT))
   def / (that: Decimal): Decimal =
-    if (that.isZero) null else Decimal(toJavaBigDecimal.divide(that.toJavaBigDecimal, MATH_CONTEXT))
+    if (that.isZero) null else Decimal(toJavaBigDecimal.divide(that.toJavaBigDecimal,
+      DecimalType.MAX_SCALE, MATH_CONTEXT.getRoundingMode))
Review Comment: what does Hive use after 2.2?
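The distinction the diff above turns on — dividing under a precision-limited `MathContext` versus dividing to a fixed scale with an explicit rounding mode — can be sketched with Python's `decimal` module. This is an analogy only; the PR itself changes `java.math.BigDecimal` calls in Scala, and the constants here are illustrative:

```python
from decimal import Context, Decimal, ROUND_HALF_UP

a, b = Decimal("10"), Decimal("3")

# Analogue of BigDecimal.divide(that, MathContext): the result is
# limited to a total number of significant digits (precision),
# wherever the decimal point falls.
by_precision = Context(prec=6, rounding=ROUND_HALF_UP).divide(a, b)
print(by_precision)  # 3.33333  (6 significant digits)

# Analogue of BigDecimal.divide(that, scale, roundingMode): the result
# is fixed to a given number of digits *after* the decimal point.
by_scale = (a / b).quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)
print(by_scale)  # 3.333333  (scale 6)
```

The two overloads agree for small results but diverge once the integer part grows, which is why pinning the scale changes behavior for large quotients like the ones in the test file above.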
[GitHub] [spark] itholic commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
itholic commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880065419 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: > Oh got it. What about we check if it is `collections.abc.Sized`, and if the size is `0` here instead? +1 for this comment. Otherwise looks good.
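The `collections.abc.Sized` suggestion quoted above can be sketched as follows (the helper name is illustrative, not the actual PySpark code): only a sized container of length 0 counts as "empty", so falsy scalars such as `0`, `0.0`, or `False` no longer trip the empty-RDD error path:

```python
import collections.abc

def first_is_empty(first):
    # Only sized containers of length 0 are "empty"; falsy scalars like
    # 0, 0.0, or False are valid first elements and must pass through.
    return isinstance(first, collections.abc.Sized) and len(first) == 0

print(first_is_empty(0))      # False: 0 is a valid first element
print(first_is_empty(False))  # False
print(first_is_empty([]))     # True: a genuinely empty row
print(first_is_empty(""))     # True
```

Unlike the `first != 0 and not first` check in the diff, this also handles other falsy non-containers (e.g. `0.0`) without enumerating them.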
[GitHub] [spark] itholic commented on pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on PR #36647: URL: https://github.com/apache/spark/pull/36647#issuecomment-1135414414 > Does the URL change? We might need to keep the mapping between old URL and new URL. If that's complicated, we can do it in a followup. Yes, the URL has changed, let me find a way to keep the URL as before.
[GitHub] [spark] itholic commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880061918 ## python/docs/source/reference/pyspark.sql/catalog.rst: ## @@ -0,0 +1,48 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + +=== +Catalog +=== + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +Catalog.cacheTable Review Comment: ditto as https://github.com/apache/spark/pull/36647#discussion_r880061516. Let me remove `DataFrame`, in `dataframe.rst` too.
[GitHub] [spark] itholic commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880061516 ## python/docs/source/reference/pyspark.sql/spark_session.rst: ## @@ -0,0 +1,54 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Spark Session += +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +The entry point to programming Spark with the Dataset and DataFrame API. +To create a Spark session, you should use ``SparkSession.builder`` attribute. +See also :class:`SparkSession`. + +.. autosummary:: +:toctree: api/ + +SparkSession Review Comment: Thanks. I think we can just remove `SparkSession` here, it's duplicated with `SparkSession` under "Core Classes"
[GitHub] [spark] itholic commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880061219 ## python/docs/source/reference/pyspark.sql/functions.rst: ## @@ -0,0 +1,272 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Functions += + +.. currentmodule:: pyspark.sql.functions + +.. autosummary:: +:toctree: api/ + +abs Review Comment: Let me make a sub-group
[GitHub] [spark] cloud-fan closed pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
cloud-fan closed pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up URL: https://github.com/apache/spark/pull/36503
[GitHub] [spark] cloud-fan commented on pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
cloud-fan commented on PR #36503: URL: https://github.com/apache/spark/pull/36503#issuecomment-1135412773 thanks, merging to master/3.3!
[GitHub] [spark] cloud-fan commented on a diff in pull request #36586: [SPARK-39236][SQL] Make CreateTable and ListTables be compatible with 3 layer namespace
cloud-fan commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r880060129 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -125,6 +135,31 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { isTemporary = isTemp) } + private def makeTable(ident: Seq[String]): Table = { +val plan = UnresolvedTableOrView(ident, "Catalog.listTables", true) +val node = sparkSession.sessionState.executePlan(plan).analyzed +node match { + case t: ResolvedTable => +val isExternal = t.table.properties().getOrDefault("external", "false").equals("true") +new Table( + name = t.identifier.name(), + database = t.identifier.namespace().head, + description = t.table.properties().get("comment"), + tableType = +if (isExternal) CatalogTableType.EXTERNAL.name +else CatalogTableType.MANAGED.name, + isTemporary = false) + case v: ResolvedView => +new Table( Review Comment: add a new constructor to match the old class definition
[GitHub] [spark] cloud-fan commented on a diff in pull request #36586: [SPARK-39236][SQL] Make CreateTable and ListTables be compatible with 3 layer namespace
cloud-fan commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r880059787 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -117,14 +128,42 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { case NonFatal(_) => None } val isTemp = sessionCatalog.isTempView(tableIdent) +val qualifier = + Array(metadata.map(_.identifier.database).getOrElse(tableIdent.database).orNull) Review Comment: shall we include catalog name (`spark_catalog`) as well?
[GitHub] [spark] HyukjinKwon commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values
HyukjinKwon commented on PR #36583: URL: https://github.com/apache/spark/pull/36583#issuecomment-1135409978 I will defer to @gengliangwang
[GitHub] [spark] dongjoon-hyun commented on pull request #36645: [SPARK-39266][CORE] Cleanup unused `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs
dongjoon-hyun commented on PR #36645: URL: https://github.com/apache/spark/pull/36645#issuecomment-1135409291 Merged to master. Thank you all!
[GitHub] [spark] dongjoon-hyun closed pull request #36645: [SPARK-39266][CORE] Cleanup unused `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs
dongjoon-hyun closed pull request #36645: [SPARK-39266][CORE] Cleanup unused `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs URL: https://github.com/apache/spark/pull/36645
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880057867 ## python/docs/source/reference/pyspark.sql/catalog.rst: ## @@ -0,0 +1,48 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + +=== +Catalog +=== + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +Catalog.cacheTable Review Comment: Here it doesn't have `Catalog` class but others like DataFrame or SparkSession have them.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880057519 ## python/docs/source/reference/pyspark.sql/spark_session.rst: ## @@ -0,0 +1,54 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Spark Session += +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +The entry point to programming Spark with the Dataset and DataFrame API. +To create a Spark session, you should use ``SparkSession.builder`` attribute. +See also :class:`SparkSession`. + +.. autosummary:: +:toctree: api/ + +SparkSession Review Comment: Here it contains `SparkSession` class but `python/docs/source/reference/pyspark.sql/window.rst` doesn't have `Window` class.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880057050 ## python/docs/source/reference/pyspark.sql/core_classes.rst: ## @@ -0,0 +1,37 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + + +Core Classes + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +SparkSession +Catalog +DataFrame +Column +Observation +Row +GroupedData +PandasCogroupedOps +DataFrameNaFunctions +DataFrameStatFunctions +Window Review Comment: I meant the class name itself. The list here only contains methods.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880056609 ## python/docs/source/reference/pyspark.sql/functions.rst: ## @@ -0,0 +1,272 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Functions += + +.. currentmodule:: pyspark.sql.functions + +.. autosummary:: +:toctree: api/ + +abs Review Comment: Should we maybe have a sub-group like we do in Pandas API on Spark? e.g., Aggregate Function, Higher Order functions. See also https://github.com/apache/spark/blob/master/R/pkg/R/functions.R.
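One possible shape for such sub-groups in `functions.rst`, using the `autosummary` directive the file already relies on (section names and function picks here are illustrative, loosely following the grouping in SparkR's `functions.R`, not the layout the PR ultimately adopted):

```rst
Normal Functions
----------------

.. autosummary::
    :toctree: api/

    col
    lit
    broadcast
    coalesce

Aggregate Functions
-------------------

.. autosummary::
    :toctree: api/

    avg
    count
    max
    min
```

Each sub-group gets its own heading on the page while every entry still renders into the shared ``api/`` toctree.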
[GitHub] [spark] itholic commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880055945 ## python/docs/source/reference/pyspark.sql/core_classes.rst: ## @@ -0,0 +1,37 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + + +Core Classes + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +SparkSession +Catalog +DataFrame +Column +Observation +Row +GroupedData +PandasCogroupedOps +DataFrameNaFunctions +DataFrameStatFunctions +Window Review Comment: It's included in Input/Output as below (screenshot: https://user-images.githubusercontent.com/44108233/169951885-97d7bf9f-7f75-4a63-9c72-5fcdb6bd3c86.png) — do we want to have a separated category?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880055802 ## python/docs/source/reference/pyspark.sql/core_classes.rst: ## @@ -0,0 +1,37 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + + +Core Classes + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +SparkSession +Catalog +DataFrame +Column +Observation +Row +GroupedData +PandasCogroupedOps +DataFrameNaFunctions +DataFrameStatFunctions +Window Review Comment: Can we have DataFrameReader DataFrameWriter?
[GitHub] [spark] itholic opened a new pull request, #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic opened a new pull request, #36647: URL: https://github.com/apache/spark/pull/36647 ### What changes were proposed in this pull request? This PR proposes to improve the PySpark API reference page to be more readable. So far, the PySpark documentation, especially the ["Spark SQL" part](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#), is not well organized, so it is a bit hard to read: https://user-images.githubusercontent.com/44108233/169951148-f77ba1d1-3e0f-411e-81be-65a5d669f75d.png By contrast, the [pandas API on Spark reference page](https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html) is relatively well organized: https://user-images.githubusercontent.com/44108233/169951169-938e13c1-36ba-4d5c-b02e-7b7e80366e2c.png ### Why are the changes needed? Improving document readability will also improve the usability of PySpark. ### Does this PR introduce _any_ user-facing change? Yes, the documentation is now categorized by class or purpose more clearly: https://user-images.githubusercontent.com/44108233/169951517-f8b9cb72-7408-46d6-8cd7-15ae890a7a7f.png ### How was this patch tested? The CI already includes the doc build, so the existing tests should cover this.
[GitHub] [spark] amaliujia commented on pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
amaliujia commented on PR #36503: URL: https://github.com/apache/spark/pull/36503#issuecomment-1135403198 @cloud-fan tests are green now.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #36599: [SPARK-39228][PYTHON][PS] Implement `skipna` of `Series.argmax`
zhengruifeng commented on code in PR #36599: URL: https://github.com/apache/spark/pull/36599#discussion_r880050644 ## python/pyspark/pandas/series.py: ## @@ -6255,36 +6261,47 @@ def argmax(self) -> int: Consider dataset containing cereal calories ->>> s = ps.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0, +>>> s = ps.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0, 'Unknown': np.nan, ...'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0}) ->>> s # doctest: +SKIP +>>> s Corn Flakes 100.0 Almond Delight 110.0 +UnknownNaN Cinnamon Toast Crunch120.0 Cocoa Puff 110.0 dtype: float64 ->>> s.argmax() # doctest: +SKIP -2 +>>> s.argmax() +3 + +>>> s.argmax(skipna=False) +-1 """ sdf = self._internal.spark_frame.select(self.spark.column, NATURAL_ORDER_COLUMN_NAME) +seq_col_name = verify_temp_column_name(sdf, "__distributed_sequence_column__") +sdf = InternalFrame.attach_distributed_sequence_column( +sdf, +seq_col_name, +) +scol = scol_for(sdf, self._internal.data_spark_column_names[0]) + +if skipna: +sdf = sdf.orderBy(scol.desc_nulls_last(), NATURAL_ORDER_COLUMN_NAME) +else: +sdf = sdf.orderBy(scol.desc_nulls_first(), NATURAL_ORDER_COLUMN_NAME) + max_value = sdf.select( -F.max(scol_for(sdf, self._internal.data_spark_column_names[0])), +F.first(scol), F.first(NATURAL_ORDER_COLUMN_NAME), ).head() + if max_value[1] is None: raise ValueError("attempt to get argmax of an empty sequence") elif max_value[0] is None: return -1 -# We should remember the natural sequence started from 0 -seq_col_name = verify_temp_column_name(sdf, "__distributed_sequence_column__") -sdf = InternalFrame.attach_distributed_sequence_column( -sdf.drop(NATURAL_ORDER_COLUMN_NAME), seq_col_name -) + # If the maximum is achieved in multiple locations, the first row position is returned. 
-return sdf.filter( -scol_for(sdf, self._internal.data_spark_column_names[0]) == max_value[0] -).head()[0] +return sdf.filter(scol == max_value[0]).head()[0] Review Comment: I tried applying `max_by` here but found it cannot guarantee that "If the maximum is achieved in multiple locations, the first row position is returned." Let's keep the current code. I'll take another look at `max_by`.
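The tie-breaking guarantee discussed above can be modeled in plain Python. This is a hypothetical sketch of the intended semantics only, not the distributed implementation; the `argmax` helper name and its list-based input are illustrative assumptions:

```python
def argmax(values, skipna=True):
    # Hypothetical model of the Series.argmax semantics in this PR:
    # - skipna=True: ignore missing values; on ties, return the first
    #   position (the natural-order tiebreak that max_by cannot guarantee).
    # - skipna=False: a missing value wins the descending nulls-first
    #   ordering, so the result is -1.
    if all(v is None for v in values):
        raise ValueError("attempt to get argmax of an empty sequence")
    if not skipna and any(v is None for v in values):
        return -1
    present = [(v, i) for i, v in enumerate(values) if v is not None]
    # max() with key (value, -position) picks the earliest maximal position.
    _, pos = max(present, key=lambda vi: (vi[0], -vi[1]))
    return pos
```

With the doctest's data `[100.0, 110.0, None, 120.0, 110.0]` this returns `3` by default and `-1` with `skipna=False`, matching the PR's expected output.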
[GitHub] [spark] wangyum commented on pull request #36628: [SPARK-39248][SQL] Improve divide performance for decimal type
wangyum commented on PR #36628: URL: https://github.com/apache/spark/pull/36628#issuecomment-1135393260 cc @cloud-fan
[GitHub] [spark] HeartSaVioR closed pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
HeartSaVioR closed pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix URL: https://github.com/apache/spark/pull/36642
[GitHub] [spark] HeartSaVioR commented on pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
HeartSaVioR commented on PR #36642: URL: https://github.com/apache/spark/pull/36642#issuecomment-1135391746 Thanks! Merging to master.
[GitHub] [spark] srielau commented on a diff in pull request #36639: [SPARK-39261][CORE] Improve newline formatting for error messages
srielau commented on code in PR #36639: URL: https://github.com/apache/spark/pull/36639#discussion_r880044226 ## core/src/main/resources/error/error-classes.json: ## @@ -1,319 +1,519 @@ { "AMBIGUOUS_FIELD_NAME" : { -"message" : [ "Field name is ambiguous and has matching fields in the struct." ], +"message" : [ + "Field name is ambiguous and has matching fields in the struct." +], "sqlState" : "42000" }, "ARITHMETIC_OVERFLOW" : { -"message" : [ ". If necessary set to \"false\" (except for ANSI interval type) to bypass this error." ], +"message" : [ + ". If necessary set to \"false\" (except for ANSI interval type) to bypass this error." +], "sqlState" : "22003" }, "CANNOT_CAST_DATATYPE" : { -"message" : [ "Cannot cast to ." ], +"message" : [ + "Cannot cast to ." +], "sqlState" : "22005" }, "CANNOT_CHANGE_DECIMAL_PRECISION" : { -"message" : [ " cannot be represented as Decimal(, ). If necessary set to \"false\" to bypass this error." ], +"message" : [ + " cannot be represented as Decimal(, ). If necessary set to \"false\" to bypass this error." +], "sqlState" : "22005" }, "CANNOT_PARSE_DECIMAL" : { -"message" : [ "Cannot parse decimal" ], +"message" : [ + "Cannot parse decimal" +], "sqlState" : "42000" }, "CANNOT_UP_CAST_DATATYPE" : { -"message" : [ "Cannot up cast from to .\n" ] +"message" : [ + "Cannot up cast from to .", + "" +] }, "CAST_INVALID_INPUT" : { -"message" : [ "The value of the type cannot be cast to because it is malformed. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." ], +"message" : [ + "The value of the type cannot be cast to because it is malformed. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." +], "sqlState" : "42000" }, "CAST_OVERFLOW" : { -"message" : [ "The value of the type cannot be cast to due to an overflow. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." 
], +"message" : [ + "The value of the type cannot be cast to due to an overflow. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." +], "sqlState" : "22005" }, "CONCURRENT_QUERY" : { -"message" : [ "Another instance of this query was just started by a concurrent session." ] +"message" : [ + "Another instance of this query was just started by a concurrent session." +] }, "DATETIME_OVERFLOW" : { -"message" : [ "Datetime operation overflow: ." ], +"message" : [ + "Datetime operation overflow: ." +], "sqlState" : "22008" }, "DIVIDE_BY_ZERO" : { -"message" : [ "Division by zero. To return NULL instead, use `try_divide`. If necessary set to \"false\" (except for ANSI interval type) to bypass this error." ], +"message" : [ + "Division by zero. To return NULL instead, use `try_divide`. If necessary set to \"false\" (except for ANSI interval type) to bypass this error." +], "sqlState" : "22012" }, "DUPLICATE_KEY" : { -"message" : [ "Found duplicate keys " ], +"message" : [ + "Found duplicate keys " +], "sqlState" : "23000" }, "FAILED_EXECUTE_UDF" : { -"message" : [ "Failed to execute user defined function (: () => )" ] +"message" : [ + "Failed to execute user defined function (: () => )" +] }, "FAILED_RENAME_PATH" : { -"message" : [ "Failed to rename to as destination already exists" ], +"message" : [ + "Failed to rename to as destination already exists" +], "sqlState" : "22023" }, "FAILED_SET_ORIGINAL_PERMISSION_BACK" : { -"message" : [ "Failed to set original permission back to the created path: . Exception: " ] +"message" : [ + "Failed to set original permission back to the created path: . 
Exception: " +] }, "FORBIDDEN_OPERATION" : { -"message" : [ "The operation is not allowed on : " ] +"message" : [ + "The operation is not allowed on : " +] }, "GRAPHITE_SINK_INVALID_PROTOCOL" : { -"message" : [ "Invalid Graphite protocol: " ] +"message" : [ + "Invalid Graphite protocol: " +] }, "GRAPHITE_SINK_PROPERTY_MISSING" : { -"message" : [ "Graphite sink requires '' property." ] +"message" : [ + "Graphite sink requires '' property." +] }, "GROUPING_COLUMN_MISMATCH" : { -"message" : [ "Column of grouping () can't be found in grouping columns " ], +"message" : [ + "Column of grouping () can't be found in grouping columns " +], "sqlState" : "42000" }, "GROUPING_ID_COLUMN_MISMATCH" : { -"message" : [ "Columns of grouping_id () does not match grouping columns ()" ], +"message"
[GitHub] [spark] zhengruifeng commented on pull request #36555: [SPARK-39189][PS] Support limit_area parameter in pandas API on Spark
zhengruifeng commented on PR #36555: URL: https://github.com/apache/spark/pull/36555#issuecomment-1135383195 thanks all!
[GitHub] [spark] Yikun commented on pull request #36353: [SPARK-38946][PYTHON] Fix different behavior in setitem
Yikun commented on PR #36353: URL: https://github.com/apache/spark/pull/36353#issuecomment-1135376632 @itholic We might need some special handling when calling `_update_internal_frame` for pandas 1.4.x. Will update today.
[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
xinrong-databricks commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880035876 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: Even if we let it pass this if-else check, PySpark returns an empty DataFrame as below
```py
>>> spark.createDataFrame(spark._sc.parallelize([[],[]])).show()
++
||
++
||
||
++
```
whereas Spark SQL returns a DataFrame where each row is an empty array. Let me check the code paths further and adjust that. Thanks!
[GitHub] [spark] Yikun commented on pull request #36355: [SPARK-38982][PYTHON] Skip categories setter test
Yikun commented on PR #36355: URL: https://github.com/apache/spark/pull/36355#issuecomment-1135372411 @itholic Now we'd better only fix the test, because `categories.setter` will not be completely supported in pandas, and some in-place update methods like `set_categories` are deprecated. So I suggest we only fix the test and keep the current behavior, then remove this setter support when we remove all the deprecated methods.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
HyukjinKwon commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880031452 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: Oh got it. What about checking whether it is `collections.abc.Sized`, and whether the size is `0`, here instead?
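The falsy-vs-empty distinction behind this suggestion can be sketched as follows. The helper name `is_empty_first` is hypothetical, assuming the goal is to flag only zero-length first elements rather than every falsy value:

```python
import collections.abc

def is_empty_first(first):
    # The original check `if not first:` also fires for falsy scalars such
    # as 0, 0.0 and False, which are valid first elements for schema
    # inference. Testing for a zero-length Sized value avoids that pitfall.
    return isinstance(first, collections.abc.Sized) and len(first) == 0
```

Under this check, `is_empty_first(0)` is `False` while `is_empty_first([])` is `True`, which separates "falsy scalar" from "genuinely empty row".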
[GitHub] [spark] ulysses-you opened a new pull request, #36646: [SPARK-39267][SQL] Clean up dsl unnecessary symbol
ulysses-you opened a new pull request, #36646: URL: https://github.com/apache/spark/pull/36646 ### What changes were proposed in this pull request? Remove two methods in dsl: - subquery(alias: Symbol) - as(alias: Symbol) ### Why are the changes needed? dsl is a test helper file which provides easy-to-use functions, but some of them are unnecessary. For example, `def subquery(alias: Symbol): LogicalPlan`: for a subquery we only need the name, so a string parameter is enough. Besides, Symbol is not a very stable interface in Scala, so it is better to clean it up where it is unnecessary. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passes existing tests
[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
xinrong-databricks commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880030149 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: Sorry, I meant an RDD like `spark._sc.parallelize([[],[]])`: Spark SQL is able to create a DataFrame with an array column, but PySpark raises an exception.
```py
>>> spark._sc.parallelize([[],[]]).first()
[]
```
[GitHub] [spark] cloud-fan commented on a diff in pull request #36639: [SPARK-39261][CORE] Improve newline formatting for error messages
cloud-fan commented on code in PR #36639: URL: https://github.com/apache/spark/pull/36639#discussion_r880027150 ## core/src/main/resources/error/error-classes.json: ## @@ -1,319 +1,519 @@ { "AMBIGUOUS_FIELD_NAME" : { -"message" : [ "Field name is ambiguous and has matching fields in the struct." ], +"message" : [ + "Field name is ambiguous and has matching fields in the struct." +], "sqlState" : "42000" }, "ARITHMETIC_OVERFLOW" : { -"message" : [ ". If necessary set to \"false\" (except for ANSI interval type) to bypass this error." ], +"message" : [ + ". If necessary set to \"false\" (except for ANSI interval type) to bypass this error." +], "sqlState" : "22003" }, "CANNOT_CAST_DATATYPE" : { -"message" : [ "Cannot cast to ." ], +"message" : [ + "Cannot cast to ." +], "sqlState" : "22005" }, "CANNOT_CHANGE_DECIMAL_PRECISION" : { -"message" : [ " cannot be represented as Decimal(, ). If necessary set to \"false\" to bypass this error." ], +"message" : [ + " cannot be represented as Decimal(, ). If necessary set to \"false\" to bypass this error." +], "sqlState" : "22005" }, "CANNOT_PARSE_DECIMAL" : { -"message" : [ "Cannot parse decimal" ], +"message" : [ + "Cannot parse decimal" +], "sqlState" : "42000" }, "CANNOT_UP_CAST_DATATYPE" : { -"message" : [ "Cannot up cast from to .\n" ] +"message" : [ + "Cannot up cast from to .", + "" +] }, "CAST_INVALID_INPUT" : { -"message" : [ "The value of the type cannot be cast to because it is malformed. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." ], +"message" : [ + "The value of the type cannot be cast to because it is malformed. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." +], "sqlState" : "42000" }, "CAST_OVERFLOW" : { -"message" : [ "The value of the type cannot be cast to due to an overflow. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." 
], +"message" : [ + "The value of the type cannot be cast to due to an overflow. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." +], "sqlState" : "22005" }, "CONCURRENT_QUERY" : { -"message" : [ "Another instance of this query was just started by a concurrent session." ] +"message" : [ + "Another instance of this query was just started by a concurrent session." +] }, "DATETIME_OVERFLOW" : { -"message" : [ "Datetime operation overflow: ." ], +"message" : [ + "Datetime operation overflow: ." +], "sqlState" : "22008" }, "DIVIDE_BY_ZERO" : { -"message" : [ "Division by zero. To return NULL instead, use `try_divide`. If necessary set to \"false\" (except for ANSI interval type) to bypass this error." ], +"message" : [ + "Division by zero. To return NULL instead, use `try_divide`. If necessary set to \"false\" (except for ANSI interval type) to bypass this error." +], "sqlState" : "22012" }, "DUPLICATE_KEY" : { -"message" : [ "Found duplicate keys " ], +"message" : [ + "Found duplicate keys " +], "sqlState" : "23000" }, "FAILED_EXECUTE_UDF" : { -"message" : [ "Failed to execute user defined function (: () => )" ] +"message" : [ + "Failed to execute user defined function (: () => )" +] }, "FAILED_RENAME_PATH" : { -"message" : [ "Failed to rename to as destination already exists" ], +"message" : [ + "Failed to rename to as destination already exists" +], "sqlState" : "22023" }, "FAILED_SET_ORIGINAL_PERMISSION_BACK" : { -"message" : [ "Failed to set original permission back to the created path: . Exception: " ] +"message" : [ + "Failed to set original permission back to the created path: . 
Exception: " +] }, "FORBIDDEN_OPERATION" : { -"message" : [ "The operation is not allowed on : " ] +"message" : [ + "The operation is not allowed on : " +] }, "GRAPHITE_SINK_INVALID_PROTOCOL" : { -"message" : [ "Invalid Graphite protocol: " ] +"message" : [ + "Invalid Graphite protocol: " +] }, "GRAPHITE_SINK_PROPERTY_MISSING" : { -"message" : [ "Graphite sink requires '' property." ] +"message" : [ + "Graphite sink requires '' property." +] }, "GROUPING_COLUMN_MISMATCH" : { -"message" : [ "Column of grouping () can't be found in grouping columns " ], +"message" : [ + "Column of grouping () can't be found in grouping columns " +], "sqlState" : "42000" }, "GROUPING_ID_COLUMN_MISMATCH" : { -"message" : [ "Columns of grouping_id () does not match grouping columns ()" ], +
[GitHub] [spark] itholic commented on pull request #34212: [SPARK-36402][PYTHON] Implement Series.combine
itholic commented on PR #34212: URL: https://github.com/apache/spark/pull/34212#issuecomment-1135358555 The GitHub CI bug was fixed; can you rebase?
[GitHub] [spark] itholic commented on pull request #36353: [SPARK-38946][PYTHON] Fix different behavior in setitem
itholic commented on PR #36353: URL: https://github.com/apache/spark/pull/36353#issuecomment-1135358115 Just confirming: any update on this?
[GitHub] [spark] itholic commented on pull request #36355: [SPARK-38982][PYTHON] Skip categories setter test
itholic commented on PR #36355: URL: https://github.com/apache/spark/pull/36355#issuecomment-1135357556 Seems like they replied in https://github.com/pandas-dev/pandas/issues/46820#issuecomment-1113889315. Maybe we can just close this?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
HyukjinKwon commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880021145 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: I think we can just ignore this condition. If the RDD is empty, `RDD.first` seems to throw an exception.
[GitHub] [spark] AmplabJenkins commented on pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
AmplabJenkins commented on PR #36642: URL: https://github.com/apache/spark/pull/36642#issuecomment-1135352468 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #36643: [SPARK-39265][SQL] Support non-vectorized Parquet scans with DEFAULT values
AmplabJenkins commented on PR #36643: URL: https://github.com/apache/spark/pull/36643#issuecomment-1135352445 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #36641: [SPARK-39263] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
AmplabJenkins commented on PR #36641: URL: https://github.com/apache/spark/pull/36641#issuecomment-1135352493 Can one of the admins verify this patch?
[GitHub] [spark] HyukjinKwon closed pull request #36555: [SPARK-39189][PS] Support limit_area parameter in pandas API on Spark
HyukjinKwon closed pull request #36555: [SPARK-39189][PS] Support limit_area parameter in pandas API on Spark URL: https://github.com/apache/spark/pull/36555
[GitHub] [spark] HyukjinKwon commented on pull request #36555: [SPARK-39189][PS] Support limit_area parameter in pandas API on Spark
HyukjinKwon commented on PR #36555: URL: https://github.com/apache/spark/pull/36555#issuecomment-1135351855 Merged to master.
[GitHub] [spark] allisonwang-db commented on pull request #36386: [SPARK-38918][SQL][3.2] Nested column pruning should filter out attributes that do not belong to the current relation
allisonwang-db commented on PR #36386: URL: https://github.com/apache/spark/pull/36386#issuecomment-1135332088 @dongjoon-hyun Looks like it still has the same issue:

```bash
# Test passed
build/sbt "sql/testOnly *PlanStabilitySuite"
# Test failed for q4 and q5
build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q5)"
```
[GitHub] [spark] srielau commented on a diff in pull request #36639: [SPARK-39261][CORE] Improve newline formatting for error messages
srielau commented on code in PR #36639: URL: https://github.com/apache/spark/pull/36639#discussion_r880001283

## core/src/main/resources/error/error-classes.json:

(Large diff, `@@ -1,319 +1,519 @@`, reformatting every `message` entry in error-classes.json from a single-line string array to one array element per line, e.g. `"message" : [ "Field name <name> is ambiguous..." ]` becomes a multi-line array; the diff and the review comment are truncated in this archive.)
[GitHub] [spark] huaxingao commented on pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified
huaxingao commented on PR #36644: URL: https://github.com/apache/spark/pull/36644#issuecomment-1135317504 cc @cloud-fan @aokolnychyi @ulysses-you This PR is ready for review. Thanks in advance!
[GitHub] [spark] cloud-fan commented on a diff in pull request #36639: [SPARK-39261][CORE] Improve newline formatting for error messages
cloud-fan commented on code in PR #36639: URL: https://github.com/apache/spark/pull/36639#discussion_r879992776

## core/src/main/resources/error/error-classes.json:

(Same error-classes.json reformatting diff as in srielau's comment above; the diff and the review comment are truncated in this archive.)
[GitHub] [spark] anishshri-db commented on a diff in pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db commented on code in PR #36642: URL: https://github.com/apache/spark/pull/36642#discussion_r879991863

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala:

```
@@ -451,8 +451,10 @@ abstract class StreamExecution(
 // Offset has been reached and the equality condition might never be met.
 if (!localCommittedOffsets.contains(source)) {
   true
-} else if (newOffset.isInstanceOf[LongOffset]) {
-  localCommittedOffsets(source).toString.toLong < newOffset.asInstanceOf[LongOffset].offset
+} else if (localCommittedOffsets(source).isInstanceOf[LongOffset] &&
```

Review Comment: Simplified this further, based on a conversation with @HeartSaVioR.
[GitHub] [spark] anishshri-db commented on a diff in pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db commented on code in PR #36642: URL: https://github.com/apache/spark/pull/36642#discussion_r879982515

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala:

```
@@ -451,8 +451,10 @@ abstract class StreamExecution(
 // Offset has been reached and the equality condition might never be met.
 if (!localCommittedOffsets.contains(source)) {
   true
-} else if (newOffset.isInstanceOf[LongOffset]) {
-  localCommittedOffsets(source).toString.toLong < newOffset.asInstanceOf[LongOffset].offset
+} else if (localCommittedOffsets(source).isInstanceOf[LongOffset] &&
```

Review Comment: Verified, and it does seem to work in this case. Made the change.
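The guard under discussion only does a numeric comparison when both the committed and target offsets are `LongOffset`s, instead of unconditionally casting. A minimal sketch of that pattern, using simplified stand-ins for Spark's streaming `Offset` hierarchy (these case classes and the `notYetReached` helper are illustrative, not the real Spark API):

```scala
// Simplified stand-ins for Spark's streaming Offset types (hypothetical).
sealed trait Offset
final case class LongOffset(offset: Long) extends Offset
final case class JsonOffset(json: String) extends Offset

// Compare numerically only when BOTH sides are LongOffset; otherwise fall
// back to an equality check, mirroring the type guard in the review.
def notYetReached(committed: Offset, target: Offset): Boolean =
  (committed, target) match {
    case (LongOffset(c), LongOffset(t)) => c < t
    case (c, t)                         => c != t
  }

println(notYetReached(LongOffset(5), LongOffset(7)))  // still behind the target
println(notYetReached(LongOffset(7), LongOffset(7)))  // target reached
```

A pattern match like this avoids the `isInstanceOf`/`asInstanceOf` pair and cannot throw a `ClassCastException` when a source uses a non-numeric offset representation.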
[GitHub] [spark] LuciferYang commented on pull request #36636: [SPARK-39256][CORE] Reduce multiple file attribute calls of `JavaUtils#deleteRecursivelyUsingJavaIO`
LuciferYang commented on PR #36636: URL: https://github.com/apache/spark/pull/36636#issuecomment-1135303165 [127a515](https://github.com/apache/spark/pull/36636/commits/127a51581da94995b51ece11caf181200416d3bf) merges with master and reruns all tests.
[GitHub] [spark] jerqi commented on a diff in pull request #36519: [SPARK-39159][SQL] Add new Dataset API for Offset
jerqi commented on code in PR #36519: URL: https://github.com/apache/spark/pull/36519#discussion_r879981480

## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

```
@@ -2102,6 +2102,16 @@ class Dataset[T] private[sql](
   Limit(Literal(n), logicalPlan)
 }

+/**
+ * Returns a new Dataset by skipping the first `m` rows.
```

Review Comment: A typo? Should this be `n`?
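The `offset(n)` API this PR adds skips the first `n` rows of a Dataset. On a plain Scala collection the analogous semantics are just `drop`, and combining it with a limit (`take`) gives pagination. A hypothetical illustration only; none of these names are Spark API:

```scala
// offset(n) on a Dataset skips the first n rows; drop is the Seq analogue.
val rows = List.range(1, 11)     // rows 1..10
val afterOffset = rows.drop(3)   // offset(3): skip the first 3 rows
val page = rows.drop(3).take(2)  // offset(3) followed by limit(2)

println(afterOffset)  // List(4, 5, 6, 7, 8, 9, 10)
println(page)         // List(4, 5)
```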
[GitHub] [spark] amaliujia commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
amaliujia commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879980347

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
```

Review Comment: Sure, done.
[GitHub] [spark] amaliujia commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
amaliujia commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879978761

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
+  if (!attrs.exists(_.exprId == outer.exprId)) {
+    failAnalysis("outer attribute not found")
+  }
+}
+
+def checkOuterReference(p: LogicalPlan, expr: SubqueryExpression): Unit = p match {
+  case f: Filter =>
```

Review Comment: Filter is the case where we saw this issue (a Filter contains an EXISTS or NOT EXISTS and a subplan as its child, and both contain the same relation that will be deduplicated). A SQL UDF is a case that is not a Filter but also contains outer references; that is a case this fix does not aim to address.
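The check under review verifies that each outer reference in a subquery still resolves, by `exprId`, to an attribute of the outer plan after relation deduplication has rewritten expression IDs. A minimal sketch of that exprId-based lookup, using simplified stand-ins for Catalyst's `Attribute` and `OuterReference` (not the real Spark classes):

```scala
// Simplified stand-ins for Catalyst classes (hypothetical).
final case class Attribute(name: String, exprId: Long)
final case class OuterReference(name: String, exprId: Long)

// An outer reference is valid only if some outer attribute has the same exprId.
// Deduplicating relations re-assigns exprIds, which is how a reference can go stale.
def resolveOuter(outerAttrs: Seq[Attribute], ref: OuterReference): Either[String, Attribute] =
  outerAttrs.find(_.exprId == ref.exprId) match {
    case Some(attr) => Right(attr)
    case None       => Left(s"outer attribute ${ref.name}#${ref.exprId} not found")
  }

val outerOutput = Seq(Attribute("id", 1L), Attribute("name", 2L))
println(resolveOuter(outerOutput, OuterReference("id", 1L)))   // resolves
println(resolveOuter(outerOutput, OuterReference("id", 99L)))  // stale exprId after dedup
```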
[GitHub] [spark] amaliujia commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
amaliujia commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879979107

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
+  if (!attrs.exists(_.exprId == outer.exprId)) {
+    failAnalysis("outer attribute not found")
+  }
+}
+
+def checkOuterReference(p: LogicalPlan, expr: SubqueryExpression): Unit = p match {
+  case f: Filter =>
```

Review Comment: Generally speaking, if we want to improve the outer reference check in general, that goes beyond SPARK-39144 and is worth a dedicated JIRA to track.
[GitHub] [spark] LuciferYang commented on pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table`
LuciferYang commented on PR #36637: URL: https://github.com/apache/spark/pull/36637#issuecomment-1135297918 thanks all ~
[GitHub] [spark] cloud-fan commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
cloud-fan commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879972900

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
+  if (!attrs.exists(_.exprId == outer.exprId)) {
+    failAnalysis("outer attribute not found")
+  }
+}
+
+def checkOuterReference(p: LogicalPlan, expr: SubqueryExpression): Unit = p match {
+  case f: Filter =>
```

Review Comment: Why do we only apply the check to Filter?
[GitHub] [spark] cloud-fan commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
cloud-fan commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879972792

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
```

Review Comment: This is only called once, can we inline it?
[GitHub] [spark] github-actions[bot] closed pull request #35425: [SPARK-38129][SQL] Adaptively enable timeout for BroadcastQueryStageExec in AQE
github-actions[bot] closed pull request #35425: [SPARK-38129][SQL] Adaptively enable timeout for BroadcastQueryStageExec in AQE URL: https://github.com/apache/spark/pull/35425
[GitHub] [spark] github-actions[bot] closed pull request #35484: [SPARK-38181][SS] Update comments in KafkaDataConsumer.scala
github-actions[bot] closed pull request #35484: [SPARK-38181][SS] Update comments in KafkaDataConsumer.scala URL: https://github.com/apache/spark/pull/35484
[GitHub] [spark] dongjoon-hyun commented on pull request #36627: [SPARK-39250][BUILD] Upgrade Jackson to 2.13.3
dongjoon-hyun commented on PR #36627: URL: https://github.com/apache/spark/pull/36627#issuecomment-1135262491 Yes, that failure was unrelated to this PR, @MaxGekk.
- It was caused by https://github.com/apache/spark/pull/36632
- It was fixed by https://github.com/apache/spark/pull/36637
[GitHub] [spark] JoshRosen opened a new pull request, #36645: [SPARK-39266][CORE] Cleanup unused `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs
JoshRosen opened a new pull request, #36645: URL: https://github.com/apache/spark/pull/36645

### What changes were proposed in this pull request?
This PR cleans up the `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs, both of which are unused.

### Why are the changes needed?
Since SPARK-19450 / #16790 in Spark 2.2.0, both of these configurations are unused and setting them has no effect. Marking the configs as deprecated and cleaning them up from the docs helps to avoid user confusion. In addition, this cleanup slightly improves the performance of constructing `RpcEndpointRef`s because it removes two unused fields that were initialized by reading the deprecated configs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
n/a
[GitHub] [spark] huaxingao closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
huaxingao closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified URL: https://github.com/apache/spark/pull/34785
[GitHub] [spark] huaxingao commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
huaxingao commented on PR #34785: URL: https://github.com/apache/spark/pull/34785#issuecomment-1135252175 It's a bit too hard to rebase. I will close this PR and the new one is here https://github.com/apache/spark/pull/36644. Thank you all very much for reviewing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao opened a new pull request, #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified
huaxingao opened a new pull request, #36644: URL: https://github.com/apache/spark/pull/36644 ### What changes were proposed in this pull request? Re-optimize partitions in Distribution and Ordering if numPartitions is not specified. Support both strictly required distribution and best effort distribution. For best effort distribution, if the user doesn't request a specific number of partitions, Spark will split skewed partitions and coalesce small partitions (`RebalancePartitions` will be used and both `OptimizeSkewInRebalancePartitions` and `CoalesceShufflePartitions` will be applied). For strictly required distribution, if the user doesn't request a specific number of partitions, Spark will coalesce small partitions but will NOT split skewed partitions (`RepartitionByExpression(distribution, query, None)` will be used and `CoalesceShufflePartitions` will be applied). ### Why are the changes needed? Optimization. ### Does this PR introduce _any_ user-facing change? In interface `RequiresDistributionAndOrdering`, a new API `distributionStrictlyRequired()` will be added to tell whether the distribution required by this write is strictly required or best effort only: ``` default boolean distributionStrictlyRequired() { return true; } ``` ### How was this patch tested? Existing and new tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
HeartSaVioR commented on code in PR #36642: URL: https://github.com/apache/spark/pull/36642#discussion_r879934124 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala: ## @@ -451,8 +451,10 @@ abstract class StreamExecution( // Offset has been reached and the equality condition might never be met. if (!localCommittedOffsets.contains(source)) { true -} else if (newOffset.isInstanceOf[LongOffset]) { - localCommittedOffsets(source).toString.toLong < newOffset.asInstanceOf[LongOffset].offset +} else if (localCommittedOffsets(source).isInstanceOf[LongOffset] && Review Comment: This is going to be verbose on checking the type and casting, so could you please check whether we can enjoy the benefit of pattern matching or not? ``` (localCommittedOffsets.get(source), newOffset) match { case (None, newOff) => true case (Some(localOff: LongOffset), newOff: LongOffset) => localOff.offset < newOff.offset case (Some(localOff), newOff) => localOff != newOff } ``` Worth noting that awaitOffset is not on the critical path. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] anishshri-db commented on pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db commented on PR #36642: URL: https://github.com/apache/spark/pull/36642#issuecomment-1135224215 @HeartSaVioR - Could you please take a look and merge ? Thx -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor opened a new pull request, #36643: [SPARK-39245][SQL] Support non-vectorized Parquet scans with DEFAULT values
dtenedor opened a new pull request, #36643: URL: https://github.com/apache/spark/pull/36643 ### What changes were proposed in this pull request? Support non-vectorized Parquet scans when the table schema has associated DEFAULT column values. (Note: this PR depends on https://github.com/apache/spark/pull/36583) Example: ``` create table t(i int) using parquet; insert into t values(42); alter table t add column s string default concat('abc', 'def'); select * from t; > 42, 'abcdef' ``` ### Why are the changes needed? This change makes it easier to build, query, and maintain tables backed by Parquet data. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? This PR includes new test coverage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] anishshri-db commented on pull request #36642: [SPARK-39264] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db commented on PR #36642: URL: https://github.com/apache/spark/pull/36642#issuecomment-1135208135 @alex-balikov @HeartSaVioR @viirya - Could you folks please take a look ? Thx -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] anishshri-db opened a new pull request, #36642: [SPARK-39264] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db opened a new pull request, #36642: URL: https://github.com/apache/spark/pull/36642 … ### What changes were proposed in this pull request? Fix type check and conversion to longOffset for awaitOffset fix. Based on discussion with comments from @alex-balikov ### Why are the changes needed? To ensure type safety while doing comparisons and avoid type mismatch related bugs/issues. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Change only for making type checks explicit. Ran existing tests and verified that they pass: ``` [info] RateStreamProviderSuite: 15:37:58.700 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable [info] - RateStreamProvider in registry (438 milliseconds) [info] - compatible with old path in registry (1 millisecond) 15:38:00.958 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - microbatch - basic (2 seconds, 346 milliseconds) 15:38:02.074 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. 15:38:04.391 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - microbatch - restart (4 seconds, 294 milliseconds) 15:38:06.450 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. 
[info] - microbatch - uniform distribution of event timestamps (547 milliseconds) [info] - microbatch - infer offsets (90 milliseconds) [info] - microbatch - predetermined batch size (74 milliseconds) [info] - microbatch - data read (73 milliseconds) [info] - valueAtSecond (0 milliseconds) 15:38:07.243 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - rampUpTime (1 second, 633 milliseconds) 15:38:08.806 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - numPartitions (924 milliseconds) 15:38:09.702 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - overflow (241 milliseconds) [info] - illegal option values (3 milliseconds) [info] - user-specified schema given (8 milliseconds) [info] - continuous data (1 second, 12 milliseconds) 15:38:11.035 WARN org.apache.spark.sql.execution.streaming.sources.RateStreamProviderSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.sources.RateStreamProviderSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true), state-store-maintenance-task (daemon=true) = [info] Run completed in 13 seconds, 606 milliseconds. [info] Total number of tests run: 15 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [WIP][SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r879901024 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -210,8 +252,23 @@ private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo( File metaFile = appShuffleInfo.getMergedShuffleMetaFile(shuffleId, shuffleMergeId, reduceId); try { -return newAppShufflePartitionInfo(appShuffleInfo.appId, shuffleId, shuffleMergeId, - reduceId, dataFile, indexFile, metaFile); +AppShufflePartitionInfo appShufflePartitionInfo = +newAppShufflePartitionInfo(appShuffleInfo.appId, appShuffleInfo.attemptId, +shuffleId, shuffleMergeId, reduceId, dataFile, indexFile, metaFile); +if (db != null) { + try{ +byte[] dbKey = +dbAppAttemptShufflePartitionKeyWithCustomPrefix( Review Comment: Created two methods for two different types of DB writes. It is not that straightforward to get it generalized in one method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [WIP][SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r879899830 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -655,6 +776,238 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { } } + + @Override + public void close() { +if (db != null) { + try { +db.close(); + } catch (IOException e) { +logger.error("Exception closing leveldb with registered app paths info and " ++ "shuffle partition info", e); + } +} + } + + private static byte[] dbAppAttemptShufflePartitionKeyWithCustomPrefix( + AppShufflePartitionId id, + String prefix) throws IOException { +// we stick a common prefix on all the keys so we can find them in the DB +String appAttemptShufflePartitionJson = mapper.writeValueAsString(id); +String key = prefix + ";" + appAttemptShufflePartitionJson; +return key.getBytes(StandardCharsets.UTF_8); + } + + private static AppShufflePartitionId parseDbAppAttemptShufflePartitionKeyWithCustomPrefix( + String s, + String prefix) throws IOException { +if (!s.startsWith(prefix)) { + throw new IllegalArgumentException("expected a string starting with " + prefix); +} +String json = s.substring(prefix.length() + 1); +return mapper.readValue(json, AppShufflePartitionId.class); + } + + private static byte[] dbAppAttemptPathsKey(String appId, int attemptId) throws IOException { +// we stick a common prefix on all the keys so we can find them in the DB +String appAttemptIdJson = mapper.writeValueAsString(new AppAttemptId(appId, attemptId)); +String key = APP_ATTEMPT_PATH_KEY_PREFIX + ";" + appAttemptIdJson; +return key.getBytes(StandardCharsets.UTF_8); + } + + private static AppAttemptId parseDbAppAttemptPathsKey(String s) throws IOException { +if (!s.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) { + throw new IllegalArgumentException("expected a string starting with " ++ APP_ATTEMPT_PATH_KEY_PREFIX); +} +String json = 
s.substring(APP_ATTEMPT_PATH_KEY_PREFIX.length() + 1); +return mapper.readValue(json, AppAttemptId.class); + } + + public void reloadAppShuffleInfo(DB db) throws IOException { +logger.debug("Reload applications merged shuffle information from DB"); +reloadActiveAppAttemptsPathInfo(db); +reloadFinalizedAppAttemptsShuffleInfo(db); +reloadActiveAppAttemptsShufflePartitionInfo(db); + } + + @VisibleForTesting + public void reloadActiveAppAttemptsPathInfo(DB db) throws IOException { +if (db != null) { + DBIterator itr = db.iterator(); + itr.seek(APP_ATTEMPT_PATH_KEY_PREFIX.getBytes(StandardCharsets.UTF_8)); + while (itr.hasNext()) { +Map.Entry e = itr.next(); +String key = new String(e.getKey(), StandardCharsets.UTF_8); +if (!key.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) { + break; +} +AppAttemptId appAttemptId = parseDbAppAttemptPathsKey(key); +try{ + AppPathsInfo appPathsInfo = mapper.readValue(e.getValue(), AppPathsInfo.class); + logger.debug("Reloading active application {}_{} merged shuffle files paths", +appAttemptId.appId, appAttemptId.attemptId); + appsShuffleInfo.computeIfAbsent(appAttemptId.appId, id -> +new AppShuffleInfo(appAttemptId.appId, appAttemptId.attemptId, appPathsInfo)); +} catch (Exception exception) { + logger.error("Parsing exception is {}", exception); +} + } +} + } + + @VisibleForTesting + public void reloadFinalizedAppAttemptsShuffleInfo(DB db) throws IOException { +if (db != null) { + DBIterator itr = db.iterator(); + itr.seek(APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX.getBytes(StandardCharsets.UTF_8)); + while (itr.hasNext()) { +Map.Entry e = itr.next(); +String key = new String(e.getKey(), StandardCharsets.UTF_8); +if (!key.startsWith(APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX)) { + break; +} +AppShufflePartitionId partitionId = + parseDbAppAttemptShufflePartitionKeyWithCustomPrefix( +key, APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX); +if (partitionId.reduceId != -1) { Review Comment: Removed. 
It was for the former implementation, where we could reuse the AppShufflePartitionId for both finalized and on-going shuffle merges. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail:
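The prefix-plus-JSON key scheme in the diff above (a common string prefix, a `;` separator, then a JSON payload, so related keys can be range-scanned in the DB) can be sketched outside Spark as a minimal Python example. The prefix value and field names here are illustrative, not the actual RemoteBlockPushResolver constants:

```python
import json

PREFIX = "AppAttemptShufflePartition"

def make_db_key(app_id, attempt_id, shuffle_id, prefix=PREFIX):
    # A common prefix is prepended so all related keys sort together
    # and can be found with a single iterator seek in the DB.
    payload = json.dumps({"appId": app_id, "attemptId": attempt_id, "shuffleId": shuffle_id})
    return (prefix + ";" + payload).encode("utf-8")

def parse_db_key(key, prefix=PREFIX):
    s = key.decode("utf-8")
    if not s.startswith(prefix):
        raise ValueError("expected a string starting with " + prefix)
    # Skip the prefix and the ';' separator, then decode the JSON payload.
    return json.loads(s[len(prefix) + 1:])
```

A reload loop like `reloadActiveAppAttemptsPathInfo` then seeks the iterator to the prefix bytes and stops as soon as a key no longer starts with the prefix.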
[GitHub] [spark] dtenedor commented on pull request #36623: [WIP][SPARK-39245][SQL] Support Avro scans with DEFAULT values
dtenedor commented on PR #36623: URL: https://github.com/apache/spark/pull/36623#issuecomment-1135171752 It appears Spark does not allow ALTER ADD COLUMN DDL with Avro tables. Closing this for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor closed pull request #36623: [WIP][SPARK-39245][SQL] Support Avro scans with DEFAULT values
dtenedor closed pull request #36623: [WIP][SPARK-39245][SQL] Support Avro scans with DEFAULT values URL: https://github.com/apache/spark/pull/36623 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #36641: [SPARK-39263] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
amaliujia opened a new pull request, #36641: URL: https://github.com/apache/spark/pull/36641 ### What changes were proposed in this pull request? Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace ### Why are the changes needed? This is a part of effort to make catalog API be compatible with 3 layer namespace ### Does this PR introduce _any_ user-facing change? Yes. The API change here is backward compatible and it extends the API to further support 3 layer namespace (e.g. catalog.database.table). ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] akpatnam25 commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is se
akpatnam25 commented on code in PR #36601: URL: https://github.com/apache/spark/pull/36601#discussion_r879883905 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -2449,7 +2459,12 @@ private[spark] class DAGScheduler( val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch) logDebug(s"Considering removal of executor $execId; " + s"fileLost: $fileLost, currentEpoch: $currentEpoch") -if (!executorFailureEpoch.contains(execId) || executorFailureEpoch(execId) < currentEpoch) { +// Check if the execId is a shuffle push merger +// We do not remove the executor if it is, +// and only remove the outputs on that host/executor. Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36638: [SPARK-39260][SQL] Use `Reader.getSchema` instead of `Reader.getTypes` in `SparkOrcNewRecordReader`
dongjoon-hyun commented on PR #36638: URL: https://github.com/apache/spark/pull/36638#issuecomment-1135155332 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36638: [SPARK-39260][SQL] Use `Reader.getSchema` instead of `Reader.getTypes` in `SparkOrcNewRecordReader`
dongjoon-hyun closed pull request #36638: [SPARK-39260][SQL] Use `Reader.getSchema` instead of `Reader.getTypes` in `SparkOrcNewRecordReader` URL: https://github.com/apache/spark/pull/36638 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table`
dongjoon-hyun commented on PR #36637: URL: https://github.com/apache/spark/pull/36637#issuecomment-1135154660 Merged to master/3.3/3.2. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table`
dongjoon-hyun closed pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table` URL: https://github.com/apache/spark/pull/36637 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table`
dongjoon-hyun commented on PR #36637: URL: https://github.com/apache/spark/pull/36637#issuecomment-1135153418 The PySpark GitHub Action job has been running for over 5 hours 40 minutes. It's irrelevant to this test PR. I'll merge this PR. https://user-images.githubusercontent.com/9700541/169908543-d94e0bc8-5489-42c4-bac5-fdc5e80b7767.png -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36638: [SPARK-39260][SQL] Use `Reader.getSchema` instead of `Reader.getTypes` in `SparkOrcNewRecordReader`
dongjoon-hyun commented on PR #36638: URL: https://github.com/apache/spark/pull/36638#issuecomment-1135152042 The PySpark GitHub Action jobs are still running exceptionally, but they are not related to this ORC patch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vli-databricks closed pull request #36527: [SPARK-39169][SQL] Optimize FIRST when used as a single aggregate fun…
vli-databricks closed pull request #36527: [SPARK-39169][SQL] Optimize FIRST when used as a single aggregate fun… URL: https://github.com/apache/spark/pull/36527 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #36586: [SPARK-39236][SQL] Make CreateTable and ListTables be compatible with 3 layer namespace
amaliujia commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r879816903 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -125,6 +135,31 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { isTemporary = isTemp) } + private def makeTable(ident: Seq[String]): Table = { +val plan = UnresolvedTableOrView(ident, "Catalog.listTables", true) +val node = sparkSession.sessionState.executePlan(plan).analyzed +node match { + case t: ResolvedTable => +val isExternal = t.table.properties().getOrDefault("external", "false").equals("true") +new Table( + name = t.identifier.name(), + database = t.identifier.namespace().head, + description = t.table.properties().get("comment"), + tableType = +if (isExternal) CatalogTableType.EXTERNAL.name +else CatalogTableType.MANAGED.name, + isTemporary = false) + case v: ResolvedView => +new Table( Review Comment: BTW this idea triggered a compatibility check issue ``` spark-sql: Failed binary compatibility check against org.apache.spark:spark-sql_2.12:3.2.0! Found 1 potential problems (filtered 586) [error] * method this(java.lang.String,java.lang.String,java.lang.String,java.lang.String,Boolean)Unit in class org.apache.spark.sql.catalog.Table's type is different in current version, where it is (java.lang.String,Array[java.lang.String],java.lang.String,java.lang.String,Boolean)Unit instead of (java.lang.String,java.lang.String,java.lang.String,java.lang.String,Boolean)Unit ``` How do we usually solve such issues? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to
otterc commented on code in PR #36601: URL: https://github.com/apache/spark/pull/36601#discussion_r879790477 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -2449,7 +2459,12 @@ private[spark] class DAGScheduler( val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch) logDebug(s"Considering removal of executor $execId; " + s"fileLost: $fileLost, currentEpoch: $currentEpoch") -if (!executorFailureEpoch.contains(execId) || executorFailureEpoch(execId) < currentEpoch) { +// Check if the execId is a shuffle push merger +// We do not remove the executor if it is, +// and only remove the outputs on that host/executor. Review Comment: Nit: `and only remove the outputs on the host` ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -2461,7 +2476,7 @@ private[spark] class DAGScheduler( clearCacheLocs() } if (fileLost) { - val remove = if (ignoreShuffleFileLostEpoch) { + val remove = if (ignoreShuffleFileLostEpoch || isShuffleMerger) { Review Comment: `removeExecutorAndUnregisterOutputs` is being called from 2 places: 1. Fetch failure handling: So here `ignoreShuffleFileLostEpoch` is going to be `true`. Then it doesn't matter if the execId is a shuffleMergerId or not. 2. Executor Lost handling: When it is called from here, then it is never going to be a `shuffleMergerId`. Given this, I think it might be better to remove this `isShuffleMerger` check here and just add a comment that when the fetch failure is for a merged shuffle chunk, `ignoreShuffleFileLostEpoch` is true, and so all the files will be removed. WDYT? cc. @mridulm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
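otterc's argument above is essentially a claim about which flag combinations are reachable at the two call sites. A hedged Python sketch (not Spark code; the reachable combinations are taken from the review comment, not verified against the Spark source) models just the short-circuit condition. When the condition is false, both versions fall through to the same epoch check, so comparing the condition alone is enough to show the `isShuffleMerger` disjunct never changes the outcome:

```python
def should_remove(ignore_epoch: bool, is_shuffle_merger: bool) -> bool:
    """The condition as written in the PR under review."""
    return ignore_epoch or is_shuffle_merger

def should_remove_simplified(ignore_epoch: bool) -> bool:
    """The reviewer's proposed simplification."""
    return ignore_epoch

# Reachable flag combinations per the review comment (assumed):
#   fetch-failure path:  ignore_epoch=True,  is_shuffle_merger in {True, False}
#   executor-lost path:  ignore_epoch in {True, False}, is_shuffle_merger=False
reachable = [(True, True), (True, False), (False, False)]
for ignore_epoch, is_merger in reachable:
    # The extra disjunct is dead code on every reachable combination.
    assert should_remove(ignore_epoch, is_merger) == should_remove_simplified(ignore_epoch)
```

Under these assumptions the `|| isShuffleMerger` branch can only matter for `(ignore_epoch=False, is_merger=True)`, which neither call site produces.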
[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
xinrong-databricks commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r879786294

## python/pyspark/sql/session.py: ##

@@ -611,7 +611,7 @@ def _inferSchema(
         :class:`pyspark.sql.types.StructType`
         """
         first = rdd.first()
-        if not first:
+        if first != 0 and not first:

Review Comment:
When `first` is a nested list whose first element is an empty list, PySpark diverges from the Scala DataFrame API. A follow-up PR will be filed to adjust that.
[GitHub] [spark] xinrong-databricks opened a new pull request, #36640: Correct error messages when creating DataFrame
xinrong-databricks opened a new pull request, #36640: URL: https://github.com/apache/spark/pull/36640

### What changes were proposed in this pull request?
Correct error messages when creating a DataFrame. Previously, when the first element of the RDD is 0, we raise a ValueError "The first row in RDD is empty, can not infer schema". However, a TypeError "Can not infer schema for type: " should be raised instead.

### Why are the changes needed?
Improve error messages and debuggability.

### Does this PR introduce _any_ user-facing change?
Error message change.

Before:
```py
>>> spark.createDataFrame(spark._sc.parallelize([0, 1]))
Traceback (most recent call last):
...
ValueError: The first row in RDD is empty, can not infer schema
```

After:
```py
>>> spark.createDataFrame(spark._sc.parallelize([0, 1]))
Traceback (most recent call last):
...
TypeError: Can not infer schema for type:
```

### How was this patch tested?
Existing tests.
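The root cause is Python truthiness: `0`, `[]`, and `""` are all falsy, so a bare `if not first:` cannot distinguish a legitimate first row of `0` from a genuinely empty row. A minimal standalone sketch of the patched check (the `infer_kind` helper is hypothetical, for illustration only — it is not the actual `_inferSchema` code):

```python
def infer_kind(first):
    # Patched check from the PR: let 0 through, but still reject
    # genuinely empty first rows such as [] or "".
    if first != 0 and not first:
        raise ValueError("The first row in RDD is empty, can not infer schema")
    return type(first).__name__

assert infer_kind(0) == "int"       # previously mis-reported as an empty row
assert infer_kind([1, 2]) == "list"
try:
    infer_kind([])                  # an empty first row still raises
except ValueError:
    pass
```

Note the sketch only demonstrates the falsiness fix; the real `_inferSchema` goes on to infer a schema from the row rather than return a type name.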
[GitHub] [spark] MaxGekk commented on a diff in pull request #36635: [SPARK-39255][SQL] Improve error messages
MaxGekk commented on code in PR #36635: URL: https://github.com/apache/spark/pull/36635#discussion_r879761080

## core/src/main/resources/error/error-classes.json: ##

@@ -48,13 +48,13 @@
   "FAILED_EXECUTE_UDF" : {
     "message" : [ "Failed to execute user defined function (: () => )" ]
   },
+  "FAILED_PERMISSION_RESET_ORIGINAL" : {

Review Comment:
SET_PERMISSION_BACK is shorter than RESET_PERMISSION_TO_ORIGINAL ;-)
[GitHub] [spark] MaxGekk commented on a diff in pull request #36635: [SPARK-39255][SQL] Improve error messages
MaxGekk commented on code in PR #36635: URL: https://github.com/apache/spark/pull/36635#discussion_r879750969

## core/src/main/resources/error/error-classes.json: ##

@@ -48,13 +48,13 @@
   "FAILED_EXECUTE_UDF" : {
     "message" : [ "Failed to execute user defined function (: () => )" ]
   },
+  "FAILED_PERMISSION_RESET_ORIGINAL" : {

Review Comment:
> How about FAILED_TO_SET_ORIGINAL_PERMISSION_BACK?

It would be nice to have shorter names for error classes since we are going to use them as tags in docs headers.
[GitHub] [spark] srielau commented on a diff in pull request #36635: [SPARK-39255][SQL] Improve error messages
srielau commented on code in PR #36635: URL: https://github.com/apache/spark/pull/36635#discussion_r879757479

## core/src/main/resources/error/error-classes.json: ##

@@ -48,13 +48,13 @@
   "FAILED_EXECUTE_UDF" : {
     "message" : [ "Failed to execute user defined function (: () => )" ]
   },
+  "FAILED_PERMISSION_RESET_ORIGINAL" : {

Review Comment:
Yeah, it's hard to come up with a pattern that really works. Some thoughts:
- The error class is typically followed by the message, so there's no need for a full sentence; that's what the message does.
- Short is beautiful.
- Try to avoid stating that an error is an error in the name; it provides no new information. It's an error class! ;-)

How about RESET_PERMISSION_TO_ORIGINAL (again, we KNOW it failed)?