[GitHub] [spark] dongjoon-hyun commented on pull request #36386: [SPARK-38918][SQL][3.2] Nested column pruning should filter out attributes that do not belong to the current relation
dongjoon-hyun commented on PR #36386: URL: https://github.com/apache/spark/pull/36386#issuecomment-1135428525 Thank you for rechecking. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36628: [SPARK-39248][SQL] Improve divide performance for decimal type
cloud-fan commented on code in PR #36628: URL: https://github.com/apache/spark/pull/36628#discussion_r880068159 ## sql/core/src/test/resources/sql-tests/results/ansi/decimalArithmeticOperations.sql.out: ## @@ -142,6 +142,108 @@ struct<(12345678912345.123456789123 / 1.2345678E-8):decimal(38,9)> 100073899961059796.725866332 +-- !query +select 1.0123456789012345678901234567890123456e36BD / 0.1 +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkArithmeticException +[CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded, 10123456789012345678901234567890123456.00, 76, 38) cannot be represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. +== SQL(line 1, position 7) == +select 1.0123456789012345678901234567890123456e36BD / 0.1 + ^^ + + +-- !query +select 1.01234567890123456789012345678901234567e36BD / 0.1 +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.catalyst.parser.ParseException + +decimal can only support precision up to 38(line 1, pos 7) + +== SQL == +select 1.01234567890123456789012345678901234567e36BD / 0.1 Review Comment: I don't think this test is useful. It fails at the parser side.
[GitHub] [spark] cloud-fan commented on a diff in pull request #36628: [SPARK-39248][SQL] Improve divide performance for decimal type
cloud-fan commented on code in PR #36628: URL: https://github.com/apache/spark/pull/36628#discussion_r880067844
## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala:
@@ -502,7 +502,8 @@ final class Decimal extends Ordered[Decimal] with Serializable {
     Decimal(toJavaBigDecimal.multiply(that.toJavaBigDecimal, MATH_CONTEXT))
   def / (that: Decimal): Decimal =
-    if (that.isZero) null else Decimal(toJavaBigDecimal.divide(that.toJavaBigDecimal, MATH_CONTEXT))
+    if (that.isZero) null else Decimal(toJavaBigDecimal.divide(that.toJavaBigDecimal,
+      DecimalType.MAX_SCALE, MATH_CONTEXT.getRoundingMode))
Review Comment: what does Hive use after 2.2?
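The distinction the diff above turns on — dividing under a precision-limited `MathContext` versus dividing to a fixed scale with an explicit rounding mode — can be sketched with Python's `decimal` module. This is an analogy only; the PR itself changes `java.math.BigDecimal` calls in Scala, and the constants here are illustrative:

```python
from decimal import Context, Decimal, ROUND_HALF_UP

a, b = Decimal("10"), Decimal("3")

# Analogue of BigDecimal.divide(that, MathContext): the result is
# limited to a total number of significant digits (precision),
# wherever the decimal point falls.
by_precision = Context(prec=6, rounding=ROUND_HALF_UP).divide(a, b)
print(by_precision)  # 3.33333  (6 significant digits)

# Analogue of BigDecimal.divide(that, scale, roundingMode): the result
# is fixed to a given number of digits *after* the decimal point.
by_scale = (a / b).quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)
print(by_scale)  # 3.333333  (scale 6)
```

The two overloads agree for small results but diverge once the integer part grows, which is why pinning the scale changes behavior for large quotients like the ones in the test file above.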
[GitHub] [spark] itholic commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
itholic commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880065419 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: > Oh got it. What about we check if it is `collections.abc.Sized`, and if the size is `0` here instead? +1 for this comment. Otherwise looks good.
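The `collections.abc.Sized` suggestion quoted above can be sketched as follows (the helper name is illustrative, not the actual PySpark code): only a sized container of length 0 counts as "empty", so falsy scalars such as `0`, `0.0`, or `False` no longer trip the empty-RDD error path:

```python
import collections.abc

def first_is_empty(first):
    # Only sized containers of length 0 are "empty"; falsy scalars like
    # 0, 0.0, or False are valid first elements and must pass through.
    return isinstance(first, collections.abc.Sized) and len(first) == 0

print(first_is_empty(0))      # False: 0 is a valid first element
print(first_is_empty(False))  # False
print(first_is_empty([]))     # True: a genuinely empty row
print(first_is_empty(""))     # True
```

Unlike the `first != 0 and not first` check in the diff, this also handles other falsy non-containers (e.g. `0.0`) without enumerating them.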
[GitHub] [spark] itholic commented on pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on PR #36647: URL: https://github.com/apache/spark/pull/36647#issuecomment-1135414414 > Does the URL change? We might need to keep the mapping between old URL and new URL. If that's complicated, we can do it in a followup. Yes, the URL has changed, let me find a way to keep the URL as before.
[GitHub] [spark] itholic commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880061918 ## python/docs/source/reference/pyspark.sql/catalog.rst: ## @@ -0,0 +1,48 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + +=== +Catalog +=== + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +Catalog.cacheTable Review Comment: ditto as https://github.com/apache/spark/pull/36647#discussion_r880061516. Let me remove `DataFrame`, in `dataframe.rst` too.
[GitHub] [spark] itholic commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880061516 ## python/docs/source/reference/pyspark.sql/spark_session.rst: ## @@ -0,0 +1,54 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Spark Session += +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +The entry point to programming Spark with the Dataset and DataFrame API. +To create a Spark session, you should use ``SparkSession.builder`` attribute. +See also :class:`SparkSession`. + +.. autosummary:: +:toctree: api/ + +SparkSession Review Comment: Thanks. I think we can just remove `SparkSession` here, it's duplicated with `SparkSession` under "Core Classes"
[GitHub] [spark] itholic commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880061219 ## python/docs/source/reference/pyspark.sql/functions.rst: ## @@ -0,0 +1,272 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Functions += + +.. currentmodule:: pyspark.sql.functions + +.. autosummary:: +:toctree: api/ + +abs Review Comment: Let me make a sub-group
[GitHub] [spark] cloud-fan closed pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
cloud-fan closed pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up URL: https://github.com/apache/spark/pull/36503
[GitHub] [spark] cloud-fan commented on pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
cloud-fan commented on PR #36503: URL: https://github.com/apache/spark/pull/36503#issuecomment-1135412773 thanks, merging to master/3.3!
[GitHub] [spark] cloud-fan commented on a diff in pull request #36586: [SPARK-39236][SQL] Make CreateTable and ListTables be compatible with 3 layer namespace
cloud-fan commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r880060129 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -125,6 +135,31 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { isTemporary = isTemp) } + private def makeTable(ident: Seq[String]): Table = { +val plan = UnresolvedTableOrView(ident, "Catalog.listTables", true) +val node = sparkSession.sessionState.executePlan(plan).analyzed +node match { + case t: ResolvedTable => +val isExternal = t.table.properties().getOrDefault("external", "false").equals("true") +new Table( + name = t.identifier.name(), + database = t.identifier.namespace().head, + description = t.table.properties().get("comment"), + tableType = +if (isExternal) CatalogTableType.EXTERNAL.name +else CatalogTableType.MANAGED.name, + isTemporary = false) + case v: ResolvedView => +new Table( Review Comment: add a new constructor to match the old class definition
[GitHub] [spark] cloud-fan commented on a diff in pull request #36586: [SPARK-39236][SQL] Make CreateTable and ListTables be compatible with 3 layer namespace
cloud-fan commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r880059787 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -117,14 +128,42 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { case NonFatal(_) => None } val isTemp = sessionCatalog.isTempView(tableIdent) +val qualifier = + Array(metadata.map(_.identifier.database).getOrElse(tableIdent.database).orNull) Review Comment: shall we include catalog name (`spark_catalog`) as well?
[GitHub] [spark] HyukjinKwon commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values
HyukjinKwon commented on PR #36583: URL: https://github.com/apache/spark/pull/36583#issuecomment-1135409978 I will defer to @gengliangwang
[GitHub] [spark] dongjoon-hyun commented on pull request #36645: [SPARK-39266][CORE] Cleanup unused `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs
dongjoon-hyun commented on PR #36645: URL: https://github.com/apache/spark/pull/36645#issuecomment-1135409291 Merged to master. Thank you all!
[GitHub] [spark] dongjoon-hyun closed pull request #36645: [SPARK-39266][CORE] Cleanup unused `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs
dongjoon-hyun closed pull request #36645: [SPARK-39266][CORE] Cleanup unused `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs URL: https://github.com/apache/spark/pull/36645
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880057867 ## python/docs/source/reference/pyspark.sql/catalog.rst: ## @@ -0,0 +1,48 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + +=== +Catalog +=== + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +Catalog.cacheTable Review Comment: Here it doesn't have `Catalog` class but others like DataFrame or SparkSession have them.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880057519 ## python/docs/source/reference/pyspark.sql/spark_session.rst: ## @@ -0,0 +1,54 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Spark Session += +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +The entry point to programming Spark with the Dataset and DataFrame API. +To create a Spark session, you should use ``SparkSession.builder`` attribute. +See also :class:`SparkSession`. + +.. autosummary:: +:toctree: api/ + +SparkSession Review Comment: Here it contains `SparkSession` class but `python/docs/source/reference/pyspark.sql/window.rst` doesn't have `Window` class.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880057050 ## python/docs/source/reference/pyspark.sql/core_classes.rst: ## @@ -0,0 +1,37 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + + +Core Classes + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +SparkSession +Catalog +DataFrame +Column +Observation +Row +GroupedData +PandasCogroupedOps +DataFrameNaFunctions +DataFrameStatFunctions +Window Review Comment: I meant the class name itself. The list here only contains methods.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880056609 ## python/docs/source/reference/pyspark.sql/functions.rst: ## @@ -0,0 +1,272 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + += +Functions += + +.. currentmodule:: pyspark.sql.functions + +.. autosummary:: +:toctree: api/ + +abs Review Comment: Should we maybe have a sub-group like we do in Pandas API on Spark? e.g., Aggregate Function, Higher Order functions. See also https://github.com/apache/spark/blob/master/R/pkg/R/functions.R.
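One possible shape for such sub-groups in `functions.rst`, using the `autosummary` directive the file already relies on (section names and function picks here are illustrative, loosely following the grouping in SparkR's `functions.R`, not the layout the PR ultimately adopted):

```rst
Normal Functions
----------------

.. autosummary::
    :toctree: api/

    col
    lit
    broadcast
    coalesce

Aggregate Functions
-------------------

.. autosummary::
    :toctree: api/

    avg
    count
    max
    min
```

Each sub-group gets its own heading on the page while every entry still renders into the shared ``api/`` toctree.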
[GitHub] [spark] itholic commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880055945 ## python/docs/source/reference/pyspark.sql/core_classes.rst: ## @@ -0,0 +1,37 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + + +Core Classes + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +SparkSession +Catalog +DataFrame +Column +Observation +Row +GroupedData +PandasCogroupedOps +DataFrameNaFunctions +DataFrameStatFunctions +Window Review Comment: It's included in Input/Output as below (screenshot: https://user-images.githubusercontent.com/44108233/169951885-97d7bf9f-7f75-4a63-9c72-5fcdb6bd3c86.png) — do we want to have a separated category?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
HyukjinKwon commented on code in PR #36647: URL: https://github.com/apache/spark/pull/36647#discussion_r880055802 ## python/docs/source/reference/pyspark.sql/core_classes.rst: ## @@ -0,0 +1,37 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + + + +Core Classes + +.. currentmodule:: pyspark.sql + +.. autosummary:: +:toctree: api/ + +SparkSession +Catalog +DataFrame +Column +Observation +Row +GroupedData +PandasCogroupedOps +DataFrameNaFunctions +DataFrameStatFunctions +Window Review Comment: Can we have DataFrameReader DataFrameWriter?
[GitHub] [spark] itholic opened a new pull request, #36647: [SPARK-39253][DOCS][PYTHON] Improve PySpark API reference to be more readable
itholic opened a new pull request, #36647: URL: https://github.com/apache/spark/pull/36647 ### What changes were proposed in this pull request? This PR proposes to improve the PySpark API reference page to be more readable. So far, the PySpark documentation, especially the ["Spark SQL" part](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#), is not well organized, so it is a bit hard to read: https://user-images.githubusercontent.com/44108233/169951148-f77ba1d1-3e0f-411e-81be-65a5d669f75d.png By contrast, the [pandas API on Spark reference page](https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html) is relatively well organized: https://user-images.githubusercontent.com/44108233/169951169-938e13c1-36ba-4d5c-b02e-7b7e80366e2c.png ### Why are the changes needed? Improving document readability will also improve the usability of PySpark. ### Does this PR introduce _any_ user-facing change? Yes, the documentation is now categorized by class or purpose more clearly: https://user-images.githubusercontent.com/44108233/169951517-f8b9cb72-7408-46d6-8cd7-15ae890a7a7f.png ### How was this patch tested? The CI already includes the doc build, so the existing tests should cover this.
[GitHub] [spark] amaliujia commented on pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
amaliujia commented on PR #36503: URL: https://github.com/apache/spark/pull/36503#issuecomment-1135403198 @cloud-fan tests are green now.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #36599: [SPARK-39228][PYTHON][PS] Implement `skipna` of `Series.argmax`
zhengruifeng commented on code in PR #36599: URL: https://github.com/apache/spark/pull/36599#discussion_r880050644 ## python/pyspark/pandas/series.py: ## @@ -6255,36 +6261,47 @@ def argmax(self) -> int: Consider dataset containing cereal calories ->>> s = ps.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0, +>>> s = ps.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0, 'Unknown': np.nan, ...'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0}) ->>> s # doctest: +SKIP +>>> s Corn Flakes 100.0 Almond Delight 110.0 +UnknownNaN Cinnamon Toast Crunch120.0 Cocoa Puff 110.0 dtype: float64 ->>> s.argmax() # doctest: +SKIP -2 +>>> s.argmax() +3 + +>>> s.argmax(skipna=False) +-1 """ sdf = self._internal.spark_frame.select(self.spark.column, NATURAL_ORDER_COLUMN_NAME) +seq_col_name = verify_temp_column_name(sdf, "__distributed_sequence_column__") +sdf = InternalFrame.attach_distributed_sequence_column( +sdf, +seq_col_name, +) +scol = scol_for(sdf, self._internal.data_spark_column_names[0]) + +if skipna: +sdf = sdf.orderBy(scol.desc_nulls_last(), NATURAL_ORDER_COLUMN_NAME) +else: +sdf = sdf.orderBy(scol.desc_nulls_first(), NATURAL_ORDER_COLUMN_NAME) + max_value = sdf.select( -F.max(scol_for(sdf, self._internal.data_spark_column_names[0])), +F.first(scol), F.first(NATURAL_ORDER_COLUMN_NAME), ).head() + if max_value[1] is None: raise ValueError("attempt to get argmax of an empty sequence") elif max_value[0] is None: return -1 -# We should remember the natural sequence started from 0 -seq_col_name = verify_temp_column_name(sdf, "__distributed_sequence_column__") -sdf = InternalFrame.attach_distributed_sequence_column( -sdf.drop(NATURAL_ORDER_COLUMN_NAME), seq_col_name -) + # If the maximum is achieved in multiple locations, the first row position is returned. 
-return sdf.filter( -scol_for(sdf, self._internal.data_spark_column_names[0]) == max_value[0] -).head()[0] +return sdf.filter(scol == max_value[0]).head()[0] Review Comment: I tried applying `max_by` here but found it cannot guarantee that "If the maximum is achieved in multiple locations, the first row position is returned." Let's keep the current code. I'll take another look at `max_by`.
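The tie-breaking guarantee discussed above can be modeled in plain Python. This is a hypothetical sketch of the intended semantics only, not the distributed implementation; the `argmax` helper name and its list-based input are illustrative assumptions:

```python
def argmax(values, skipna=True):
    # Hypothetical model of the Series.argmax semantics in this PR:
    # - skipna=True: ignore missing values; on ties, return the first
    #   position (the natural-order tiebreak that max_by cannot guarantee).
    # - skipna=False: a missing value wins the descending nulls-first
    #   ordering, so the result is -1.
    if all(v is None for v in values):
        raise ValueError("attempt to get argmax of an empty sequence")
    if not skipna and any(v is None for v in values):
        return -1
    present = [(v, i) for i, v in enumerate(values) if v is not None]
    # max() with key (value, -position) picks the earliest maximal position.
    _, pos = max(present, key=lambda vi: (vi[0], -vi[1]))
    return pos
```

With the doctest's data `[100.0, 110.0, None, 120.0, 110.0]` this returns `3` by default and `-1` with `skipna=False`, matching the PR's expected output.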
[GitHub] [spark] wangyum commented on pull request #36628: [SPARK-39248][SQL] Improve divide performance for decimal type
wangyum commented on PR #36628: URL: https://github.com/apache/spark/pull/36628#issuecomment-1135393260 cc @cloud-fan
[GitHub] [spark] HeartSaVioR closed pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
HeartSaVioR closed pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix URL: https://github.com/apache/spark/pull/36642
[GitHub] [spark] HeartSaVioR commented on pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
HeartSaVioR commented on PR #36642: URL: https://github.com/apache/spark/pull/36642#issuecomment-1135391746 Thanks! Merging to master.
[GitHub] [spark] srielau commented on a diff in pull request #36639: [SPARK-39261][CORE] Improve newline formatting for error messages
srielau commented on code in PR #36639: URL: https://github.com/apache/spark/pull/36639#discussion_r880044226 ## core/src/main/resources/error/error-classes.json: ## @@ -1,319 +1,519 @@ { "AMBIGUOUS_FIELD_NAME" : { -"message" : [ "Field name is ambiguous and has matching fields in the struct." ], +"message" : [ + "Field name is ambiguous and has matching fields in the struct." +], "sqlState" : "42000" }, "ARITHMETIC_OVERFLOW" : { -"message" : [ ". If necessary set to \"false\" (except for ANSI interval type) to bypass this error." ], +"message" : [ + ". If necessary set to \"false\" (except for ANSI interval type) to bypass this error." +], "sqlState" : "22003" }, "CANNOT_CAST_DATATYPE" : { -"message" : [ "Cannot cast to ." ], +"message" : [ + "Cannot cast to ." +], "sqlState" : "22005" }, "CANNOT_CHANGE_DECIMAL_PRECISION" : { -"message" : [ " cannot be represented as Decimal(, ). If necessary set to \"false\" to bypass this error." ], +"message" : [ + " cannot be represented as Decimal(, ). If necessary set to \"false\" to bypass this error." +], "sqlState" : "22005" }, "CANNOT_PARSE_DECIMAL" : { -"message" : [ "Cannot parse decimal" ], +"message" : [ + "Cannot parse decimal" +], "sqlState" : "42000" }, "CANNOT_UP_CAST_DATATYPE" : { -"message" : [ "Cannot up cast from to .\n" ] +"message" : [ + "Cannot up cast from to .", + "" +] }, "CAST_INVALID_INPUT" : { -"message" : [ "The value of the type cannot be cast to because it is malformed. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." ], +"message" : [ + "The value of the type cannot be cast to because it is malformed. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." +], "sqlState" : "42000" }, "CAST_OVERFLOW" : { -"message" : [ "The value of the type cannot be cast to due to an overflow. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." 
], +"message" : [ + "The value of the type cannot be cast to due to an overflow. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." +], "sqlState" : "22005" }, "CONCURRENT_QUERY" : { -"message" : [ "Another instance of this query was just started by a concurrent session." ] +"message" : [ + "Another instance of this query was just started by a concurrent session." +] }, "DATETIME_OVERFLOW" : { -"message" : [ "Datetime operation overflow: ." ], +"message" : [ + "Datetime operation overflow: ." +], "sqlState" : "22008" }, "DIVIDE_BY_ZERO" : { -"message" : [ "Division by zero. To return NULL instead, use `try_divide`. If necessary set to \"false\" (except for ANSI interval type) to bypass this error." ], +"message" : [ + "Division by zero. To return NULL instead, use `try_divide`. If necessary set to \"false\" (except for ANSI interval type) to bypass this error." +], "sqlState" : "22012" }, "DUPLICATE_KEY" : { -"message" : [ "Found duplicate keys " ], +"message" : [ + "Found duplicate keys " +], "sqlState" : "23000" }, "FAILED_EXECUTE_UDF" : { -"message" : [ "Failed to execute user defined function (: () => )" ] +"message" : [ + "Failed to execute user defined function (: () => )" +] }, "FAILED_RENAME_PATH" : { -"message" : [ "Failed to rename to as destination already exists" ], +"message" : [ + "Failed to rename to as destination already exists" +], "sqlState" : "22023" }, "FAILED_SET_ORIGINAL_PERMISSION_BACK" : { -"message" : [ "Failed to set original permission back to the created path: . Exception: " ] +"message" : [ + "Failed to set original permission back to the created path: . 
Exception: " +] }, "FORBIDDEN_OPERATION" : { -"message" : [ "The operation is not allowed on : " ] +"message" : [ + "The operation is not allowed on : " +] }, "GRAPHITE_SINK_INVALID_PROTOCOL" : { -"message" : [ "Invalid Graphite protocol: " ] +"message" : [ + "Invalid Graphite protocol: " +] }, "GRAPHITE_SINK_PROPERTY_MISSING" : { -"message" : [ "Graphite sink requires '' property." ] +"message" : [ + "Graphite sink requires '' property." +] }, "GROUPING_COLUMN_MISMATCH" : { -"message" : [ "Column of grouping () can't be found in grouping columns " ], +"message" : [ + "Column of grouping () can't be found in grouping columns " +], "sqlState" : "42000" }, "GROUPING_ID_COLUMN_MISMATCH" : { -"message" : [ "Columns of grouping_id () does not match grouping columns ()" ], +"message"
[GitHub] [spark] zhengruifeng commented on pull request #36555: [SPARK-39189][PS] Support limit_area parameter in pandas API on Spark
zhengruifeng commented on PR #36555: URL: https://github.com/apache/spark/pull/36555#issuecomment-1135383195 thanks all!
[GitHub] [spark] Yikun commented on pull request #36353: [SPARK-38946][PYTHON] Fix different behavior in setitem
Yikun commented on PR #36353: URL: https://github.com/apache/spark/pull/36353#issuecomment-1135376632 @itholic We might need some special handling when calling `_update_internal_frame` for pandas 1.4.x. Will update today.
[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
xinrong-databricks commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880035876 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: Even if we let it pass this if-else check, PySpark returns an empty DataFrame as below
```py
>>> spark.createDataFrame(spark._sc.parallelize([[],[]])).show()
++
||
++
||
||
++
```
whereas Spark SQL returns a DataFrame where each row is an empty array. Let me check the code paths further and adjust that. Thanks!
[GitHub] [spark] Yikun commented on pull request #36355: [SPARK-38982][PYTHON] Skip categories setter test
Yikun commented on PR #36355: URL: https://github.com/apache/spark/pull/36355#issuecomment-1135372411 @itholic Now we'd better only fix the test, because `categories.setter` will not be completely supported in pandas, and some in-place update methods like `set_categories` are deprecated. So I suggest we only fix the test and keep the current behavior, then remove this setter support when we remove all the deprecated methods.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
HyukjinKwon commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880031452 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: Oh got it. What about checking whether it is `collections.abc.Sized`, and whether the size is `0`, here instead?
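The falsy-vs-empty distinction behind this suggestion can be sketched as follows. The helper name `is_empty_first` is hypothetical, assuming the goal is to flag only zero-length first elements rather than every falsy value:

```python
import collections.abc

def is_empty_first(first):
    # The original check `if not first:` also fires for falsy scalars such
    # as 0, 0.0 and False, which are valid first elements for schema
    # inference. Testing for a zero-length Sized value avoids that pitfall.
    return isinstance(first, collections.abc.Sized) and len(first) == 0
```

Under this check, `is_empty_first(0)` is `False` while `is_empty_first([])` is `True`, which separates "falsy scalar" from "genuinely empty row".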
[GitHub] [spark] ulysses-you opened a new pull request, #36646: [SPARK-39267][SQL] Clean up dsl unnecessary symbol
ulysses-you opened a new pull request, #36646: URL: https://github.com/apache/spark/pull/36646 ### What changes were proposed in this pull request? Remove two methods in dsl: - subquery(alias: Symbol) - as(alias: Symbol) ### Why are the changes needed? dsl is a test helper file which provides easy-to-use functions, but some of them are unnecessary. For example, `def subquery(alias: Symbol): LogicalPlan`: for a subquery we only need the name, so a string parameter is enough. Besides, Symbol is not a very stable interface in Scala, so it is better to clean it up where it is unnecessary. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passes existing tests
[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
xinrong-databricks commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880030149 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: Sorry, I meant an RDD like `spark._sc.parallelize([[],[]])`: Spark SQL is able to create a DataFrame with an array column, but PySpark raises an exception.
```py
>>> spark._sc.parallelize([[],[]]).first()
[]
```
[GitHub] [spark] cloud-fan commented on a diff in pull request #36639: [SPARK-39261][CORE] Improve newline formatting for error messages
cloud-fan commented on code in PR #36639: URL: https://github.com/apache/spark/pull/36639#discussion_r880027150 ## core/src/main/resources/error/error-classes.json: ## @@ -1,319 +1,519 @@ { "AMBIGUOUS_FIELD_NAME" : { -"message" : [ "Field name is ambiguous and has matching fields in the struct." ], +"message" : [ + "Field name is ambiguous and has matching fields in the struct." +], "sqlState" : "42000" }, "ARITHMETIC_OVERFLOW" : { -"message" : [ ". If necessary set to \"false\" (except for ANSI interval type) to bypass this error." ], +"message" : [ + ". If necessary set to \"false\" (except for ANSI interval type) to bypass this error." +], "sqlState" : "22003" }, "CANNOT_CAST_DATATYPE" : { -"message" : [ "Cannot cast to ." ], +"message" : [ + "Cannot cast to ." +], "sqlState" : "22005" }, "CANNOT_CHANGE_DECIMAL_PRECISION" : { -"message" : [ " cannot be represented as Decimal(, ). If necessary set to \"false\" to bypass this error." ], +"message" : [ + " cannot be represented as Decimal(, ). If necessary set to \"false\" to bypass this error." +], "sqlState" : "22005" }, "CANNOT_PARSE_DECIMAL" : { -"message" : [ "Cannot parse decimal" ], +"message" : [ + "Cannot parse decimal" +], "sqlState" : "42000" }, "CANNOT_UP_CAST_DATATYPE" : { -"message" : [ "Cannot up cast from to .\n" ] +"message" : [ + "Cannot up cast from to .", + "" +] }, "CAST_INVALID_INPUT" : { -"message" : [ "The value of the type cannot be cast to because it is malformed. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." ], +"message" : [ + "The value of the type cannot be cast to because it is malformed. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." +], "sqlState" : "42000" }, "CAST_OVERFLOW" : { -"message" : [ "The value of the type cannot be cast to due to an overflow. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." 
], +"message" : [ + "The value of the type cannot be cast to due to an overflow. To return NULL instead, use `try_cast`. If necessary set to \"false\" to bypass this error." +], "sqlState" : "22005" }, "CONCURRENT_QUERY" : { -"message" : [ "Another instance of this query was just started by a concurrent session." ] +"message" : [ + "Another instance of this query was just started by a concurrent session." +] }, "DATETIME_OVERFLOW" : { -"message" : [ "Datetime operation overflow: ." ], +"message" : [ + "Datetime operation overflow: ." +], "sqlState" : "22008" }, "DIVIDE_BY_ZERO" : { -"message" : [ "Division by zero. To return NULL instead, use `try_divide`. If necessary set to \"false\" (except for ANSI interval type) to bypass this error." ], +"message" : [ + "Division by zero. To return NULL instead, use `try_divide`. If necessary set to \"false\" (except for ANSI interval type) to bypass this error." +], "sqlState" : "22012" }, "DUPLICATE_KEY" : { -"message" : [ "Found duplicate keys " ], +"message" : [ + "Found duplicate keys " +], "sqlState" : "23000" }, "FAILED_EXECUTE_UDF" : { -"message" : [ "Failed to execute user defined function (: () => )" ] +"message" : [ + "Failed to execute user defined function (: () => )" +] }, "FAILED_RENAME_PATH" : { -"message" : [ "Failed to rename to as destination already exists" ], +"message" : [ + "Failed to rename to as destination already exists" +], "sqlState" : "22023" }, "FAILED_SET_ORIGINAL_PERMISSION_BACK" : { -"message" : [ "Failed to set original permission back to the created path: . Exception: " ] +"message" : [ + "Failed to set original permission back to the created path: . 
Exception: " +] }, "FORBIDDEN_OPERATION" : { -"message" : [ "The operation is not allowed on : " ] +"message" : [ + "The operation is not allowed on : " +] }, "GRAPHITE_SINK_INVALID_PROTOCOL" : { -"message" : [ "Invalid Graphite protocol: " ] +"message" : [ + "Invalid Graphite protocol: " +] }, "GRAPHITE_SINK_PROPERTY_MISSING" : { -"message" : [ "Graphite sink requires '' property." ] +"message" : [ + "Graphite sink requires '' property." +] }, "GROUPING_COLUMN_MISMATCH" : { -"message" : [ "Column of grouping () can't be found in grouping columns " ], +"message" : [ + "Column of grouping () can't be found in grouping columns " +], "sqlState" : "42000" }, "GROUPING_ID_COLUMN_MISMATCH" : { -"message" : [ "Columns of grouping_id () does not match grouping columns ()" ], +
[GitHub] [spark] itholic commented on pull request #34212: [SPARK-36402][PYTHON] Implement Series.combine
itholic commented on PR #34212: URL: https://github.com/apache/spark/pull/34212#issuecomment-1135358555 The GitHub CI bug was fixed; can you rebase?
[GitHub] [spark] itholic commented on pull request #36353: [SPARK-38946][PYTHON] Fix different behavior in setitem
itholic commented on PR #36353: URL: https://github.com/apache/spark/pull/36353#issuecomment-1135358115 Just confirming: any update on this?
[GitHub] [spark] itholic commented on pull request #36355: [SPARK-38982][PYTHON] Skip categories setter test
itholic commented on PR #36355: URL: https://github.com/apache/spark/pull/36355#issuecomment-1135357556 Seems like they replied in https://github.com/pandas-dev/pandas/issues/46820#issuecomment-1113889315. Maybe we can just close this?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
HyukjinKwon commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r880021145 ## python/pyspark/sql/session.py: ## @@ -611,7 +611,7 @@ def _inferSchema( :class:`pyspark.sql.types.StructType` """ first = rdd.first() -if not first: +if first != 0 and not first: Review Comment: I think we can just ignore this condition. If the RDD is empty, `RDD.first` seems to throw an exception.
[GitHub] [spark] AmplabJenkins commented on pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
AmplabJenkins commented on PR #36642: URL: https://github.com/apache/spark/pull/36642#issuecomment-1135352468 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #36643: [SPARK-39265][SQL] Support non-vectorized Parquet scans with DEFAULT values
AmplabJenkins commented on PR #36643: URL: https://github.com/apache/spark/pull/36643#issuecomment-1135352445 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #36641: [SPARK-39263] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
AmplabJenkins commented on PR #36641: URL: https://github.com/apache/spark/pull/36641#issuecomment-1135352493 Can one of the admins verify this patch?
[GitHub] [spark] HyukjinKwon closed pull request #36555: [SPARK-39189][PS] Support limit_area parameter in pandas API on Spark
HyukjinKwon closed pull request #36555: [SPARK-39189][PS] Support limit_area parameter in pandas API on Spark URL: https://github.com/apache/spark/pull/36555
[GitHub] [spark] HyukjinKwon commented on pull request #36555: [SPARK-39189][PS] Support limit_area parameter in pandas API on Spark
HyukjinKwon commented on PR #36555: URL: https://github.com/apache/spark/pull/36555#issuecomment-1135351855 Merged to master.
[GitHub] [spark] allisonwang-db commented on pull request #36386: [SPARK-38918][SQL][3.2] Nested column pruning should filter out attributes that do not belong to the current relation
allisonwang-db commented on PR #36386: URL: https://github.com/apache/spark/pull/36386#issuecomment-1135332088 @dongjoon-hyun Looks like it still has the same issue:

```bash
# Test passed
build/sbt "sql/testOnly *PlanStabilitySuite"
# Test failed for q4 and q5
build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q5)"
```
[GitHub] [spark] srielau commented on a diff in pull request #36639: [SPARK-39261][CORE] Improve newline formatting for error messages
srielau commented on code in PR #36639: URL: https://github.com/apache/spark/pull/36639#discussion_r880001283

## core/src/main/resources/error/error-classes.json:

(Large diff, `@@ -1,319 +1,519 @@`, reformatting every `message` entry in error-classes.json from a single-line string array to one array element per line, e.g. `"message" : [ "Field name <name> is ambiguous..." ]` becomes a multi-line array; the diff and the review comment are truncated in this archive.)
[GitHub] [spark] huaxingao commented on pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified
huaxingao commented on PR #36644: URL: https://github.com/apache/spark/pull/36644#issuecomment-1135317504 cc @cloud-fan @aokolnychyi @ulysses-you This PR is ready for review. Thanks in advance!
[GitHub] [spark] cloud-fan commented on a diff in pull request #36639: [SPARK-39261][CORE] Improve newline formatting for error messages
cloud-fan commented on code in PR #36639: URL: https://github.com/apache/spark/pull/36639#discussion_r879992776

## core/src/main/resources/error/error-classes.json:

(Same error-classes.json reformatting diff as in srielau's comment above; the diff and the review comment are truncated in this archive.)
[GitHub] [spark] anishshri-db commented on a diff in pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db commented on code in PR #36642: URL: https://github.com/apache/spark/pull/36642#discussion_r879991863

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala:

```
@@ -451,8 +451,10 @@ abstract class StreamExecution(
 // Offset has been reached and the equality condition might never be met.
 if (!localCommittedOffsets.contains(source)) {
   true
-} else if (newOffset.isInstanceOf[LongOffset]) {
-  localCommittedOffsets(source).toString.toLong < newOffset.asInstanceOf[LongOffset].offset
+} else if (localCommittedOffsets(source).isInstanceOf[LongOffset] &&
```

Review Comment: Simplified this further, based on a conversation with @HeartSaVioR.
[GitHub] [spark] anishshri-db commented on a diff in pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db commented on code in PR #36642: URL: https://github.com/apache/spark/pull/36642#discussion_r879982515

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala:

```
@@ -451,8 +451,10 @@ abstract class StreamExecution(
 // Offset has been reached and the equality condition might never be met.
 if (!localCommittedOffsets.contains(source)) {
   true
-} else if (newOffset.isInstanceOf[LongOffset]) {
-  localCommittedOffsets(source).toString.toLong < newOffset.asInstanceOf[LongOffset].offset
+} else if (localCommittedOffsets(source).isInstanceOf[LongOffset] &&
```

Review Comment: Verified, and it does seem to work in this case. Made the change.
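The guard under discussion only does a numeric comparison when both the committed and target offsets are `LongOffset`s, instead of unconditionally casting. A minimal sketch of that pattern, using simplified stand-ins for Spark's streaming `Offset` hierarchy (these case classes and the `notYetReached` helper are illustrative, not the real Spark API):

```scala
// Simplified stand-ins for Spark's streaming Offset types (hypothetical).
sealed trait Offset
final case class LongOffset(offset: Long) extends Offset
final case class JsonOffset(json: String) extends Offset

// Compare numerically only when BOTH sides are LongOffset; otherwise fall
// back to an equality check, mirroring the type guard in the review.
def notYetReached(committed: Offset, target: Offset): Boolean =
  (committed, target) match {
    case (LongOffset(c), LongOffset(t)) => c < t
    case (c, t)                         => c != t
  }

println(notYetReached(LongOffset(5), LongOffset(7)))  // still behind the target
println(notYetReached(LongOffset(7), LongOffset(7)))  // target reached
```

A pattern match like this avoids the `isInstanceOf`/`asInstanceOf` pair and cannot throw a `ClassCastException` when a source uses a non-numeric offset representation.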
[GitHub] [spark] LuciferYang commented on pull request #36636: [SPARK-39256][CORE] Reduce multiple file attribute calls of `JavaUtils#deleteRecursivelyUsingJavaIO`
LuciferYang commented on PR #36636: URL: https://github.com/apache/spark/pull/36636#issuecomment-1135303165 [127a515](https://github.com/apache/spark/pull/36636/commits/127a51581da94995b51ece11caf181200416d3bf) merges with master and reruns all tests.
[GitHub] [spark] jerqi commented on a diff in pull request #36519: [SPARK-39159][SQL] Add new Dataset API for Offset
jerqi commented on code in PR #36519: URL: https://github.com/apache/spark/pull/36519#discussion_r879981480

## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

```
@@ -2102,6 +2102,16 @@ class Dataset[T] private[sql](
   Limit(Literal(n), logicalPlan)
 }

+/**
+ * Returns a new Dataset by skipping the first `m` rows.
```

Review Comment: A typo? Should this be `n`?
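The `offset(n)` API this PR adds skips the first `n` rows of a Dataset. On a plain Scala collection the analogous semantics are just `drop`, and combining it with a limit (`take`) gives pagination. A hypothetical illustration only; none of these names are Spark API:

```scala
// offset(n) on a Dataset skips the first n rows; drop is the Seq analogue.
val rows = List.range(1, 11)     // rows 1..10
val afterOffset = rows.drop(3)   // offset(3): skip the first 3 rows
val page = rows.drop(3).take(2)  // offset(3) followed by limit(2)

println(afterOffset)  // List(4, 5, 6, 7, 8, 9, 10)
println(page)         // List(4, 5)
```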
[GitHub] [spark] amaliujia commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
amaliujia commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879980347

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
```

Review Comment: Sure, done.
[GitHub] [spark] amaliujia commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
amaliujia commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879978761

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
+  if (!attrs.exists(_.exprId == outer.exprId)) {
+    failAnalysis("outer attribute not found")
+  }
+}
+
+def checkOuterReference(p: LogicalPlan, expr: SubqueryExpression): Unit = p match {
+  case f: Filter =>
```

Review Comment: Filter is the case where we saw this issue (a Filter contains an EXISTS or NOT EXISTS and a subplan as its child, and both contain the same relation that will be deduplicated). A SQL UDF is a case that is not a Filter but also contains outer references; that is a case this fix does not aim to address.
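The check under review verifies that each outer reference in a subquery still resolves, by `exprId`, to an attribute of the outer plan after relation deduplication has rewritten expression IDs. A minimal sketch of that exprId-based lookup, using simplified stand-ins for Catalyst's `Attribute` and `OuterReference` (not the real Spark classes):

```scala
// Simplified stand-ins for Catalyst classes (hypothetical).
final case class Attribute(name: String, exprId: Long)
final case class OuterReference(name: String, exprId: Long)

// An outer reference is valid only if some outer attribute has the same exprId.
// Deduplicating relations re-assigns exprIds, which is how a reference can go stale.
def resolveOuter(outerAttrs: Seq[Attribute], ref: OuterReference): Either[String, Attribute] =
  outerAttrs.find(_.exprId == ref.exprId) match {
    case Some(attr) => Right(attr)
    case None       => Left(s"outer attribute ${ref.name}#${ref.exprId} not found")
  }

val outerOutput = Seq(Attribute("id", 1L), Attribute("name", 2L))
println(resolveOuter(outerOutput, OuterReference("id", 1L)))   // resolves
println(resolveOuter(outerOutput, OuterReference("id", 99L)))  // stale exprId after dedup
```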
[GitHub] [spark] amaliujia commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
amaliujia commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879979107

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
+  if (!attrs.exists(_.exprId == outer.exprId)) {
+    failAnalysis("outer attribute not found")
+  }
+}
+
+def checkOuterReference(p: LogicalPlan, expr: SubqueryExpression): Unit = p match {
+  case f: Filter =>
```

Review Comment: Generally speaking, if we want to improve the outer reference check in general, that goes beyond SPARK-39144 and is worth a dedicated JIRA to track.
[GitHub] [spark] LuciferYang commented on pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table`
LuciferYang commented on PR #36637: URL: https://github.com/apache/spark/pull/36637#issuecomment-1135297918 thanks all ~
[GitHub] [spark] cloud-fan commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
cloud-fan commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879972900

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
+  if (!attrs.exists(_.exprId == outer.exprId)) {
+    failAnalysis("outer attribute not found")
+  }
+}
+
+def checkOuterReference(p: LogicalPlan, expr: SubqueryExpression): Unit = p match {
+  case f: Filter =>
```

Review Comment: Why do we only apply the check to Filter?
[GitHub] [spark] cloud-fan commented on a diff in pull request #36503: [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up
cloud-fan commented on code in PR #36503: URL: https://github.com/apache/spark/pull/36503#discussion_r879972792

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```
@@ -752,9 +752,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   expressions.exists(_.exists(_.semanticEquals(expr)))
 }

+def checkOuterReferenceUtil(attrs: Seq[Attribute], outer: OuterReference): Unit = {
```

Review Comment: This is only called once, can we inline it?
[GitHub] [spark] github-actions[bot] closed pull request #35425: [SPARK-38129][SQL] Adaptively enable timeout for BroadcastQueryStageExec in AQE
github-actions[bot] closed pull request #35425: [SPARK-38129][SQL] Adaptively enable timeout for BroadcastQueryStageExec in AQE URL: https://github.com/apache/spark/pull/35425
[GitHub] [spark] github-actions[bot] closed pull request #35484: [SPARK-38181][SS] Update comments in KafkaDataConsumer.scala
github-actions[bot] closed pull request #35484: [SPARK-38181][SS] Update comments in KafkaDataConsumer.scala URL: https://github.com/apache/spark/pull/35484
[GitHub] [spark] dongjoon-hyun commented on pull request #36627: [SPARK-39250][BUILD] Upgrade Jackson to 2.13.3
dongjoon-hyun commented on PR #36627: URL: https://github.com/apache/spark/pull/36627#issuecomment-1135262491 Yes, that failure was unrelated to this PR, @MaxGekk.
- It was caused by https://github.com/apache/spark/pull/36632
- It was fixed by https://github.com/apache/spark/pull/36637
[GitHub] [spark] JoshRosen opened a new pull request, #36645: [SPARK-39266][CORE] Cleanup unused `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs
JoshRosen opened a new pull request, #36645: URL: https://github.com/apache/spark/pull/36645

### What changes were proposed in this pull request?
This PR cleans up the `spark.rpc.numRetries` and `spark.rpc.retry.wait` configs, both of which are unused.

### Why are the changes needed?
Since SPARK-19450 / #16790 in Spark 2.2.0, both of these configurations are unused and setting them has no effect. Marking the configs as deprecated and cleaning them up from the docs helps to avoid user confusion. In addition, this cleanup slightly improves the performance of constructing `RpcEndpointRef`s because it removes two unused fields that were initialized by reading the deprecated configs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
n/a
[GitHub] [spark] huaxingao closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
huaxingao closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified URL: https://github.com/apache/spark/pull/34785
[GitHub] [spark] huaxingao commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
huaxingao commented on PR #34785: URL: https://github.com/apache/spark/pull/34785#issuecomment-1135252175 It's a bit too hard to rebase. I will close this PR and the new one is here https://github.com/apache/spark/pull/36644. Thank you all very much for reviewing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao opened a new pull request, #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified
huaxingao opened a new pull request, #36644: URL: https://github.com/apache/spark/pull/36644 ### What changes were proposed in this pull request? Re-optimize partitions in Distribution and Ordering if numPartitions is not specified. Support both strictly required distribution and best effort distribution. For best effort distribution, if the user doesn't request a specific number of partitions, Spark will split skewed partitions and coalesce small partitions (`RebalancePartitions` will be used and both `OptimizeSkewInRebalancePartitions` and `CoalesceShufflePartitions` will be applied). For strictly required distribution, if the user doesn't request a specific number of partitions, Spark will coalesce small partitions but will NOT split skewed partitions (`RepartitionByExpression(distribution, query, None)` will be used and `CoalesceShufflePartitions` will be applied). ### Why are the changes needed? Optimization. ### Does this PR introduce _any_ user-facing change? In interface `RequiresDistributionAndOrdering`, a new API `distributionStrictlyRequired()` will be added to tell whether the distribution required by this write is strictly required or best effort only: ``` default boolean distributionStrictlyRequired() { return true; } ``` ### How was this patch tested? Existing and new tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
HeartSaVioR commented on code in PR #36642: URL: https://github.com/apache/spark/pull/36642#discussion_r879934124 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala: ## @@ -451,8 +451,10 @@ abstract class StreamExecution( // Offset has been reached and the equality condition might never be met. if (!localCommittedOffsets.contains(source)) { true -} else if (newOffset.isInstanceOf[LongOffset]) { - localCommittedOffsets(source).toString.toLong < newOffset.asInstanceOf[LongOffset].offset +} else if (localCommittedOffsets(source).isInstanceOf[LongOffset] && Review Comment: This is going to be verbose on checking the type and casting, so could you please check whether we can enjoy the benefit of pattern matching or not? ``` (localCommittedOffsets.get(source), newOffset) match { case (None, newOff) => true case (Some(localOff: LongOffset), newOff: LongOffset) => localOff.offset < newOff.offset case (Some(localOff), newOff) => localOff != newOff } ``` Worth noting that awaitOffset is not on the critical path. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] anishshri-db commented on pull request #36642: [SPARK-39264][SS] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db commented on PR #36642: URL: https://github.com/apache/spark/pull/36642#issuecomment-1135224215 @HeartSaVioR - Could you please take a look and merge ? Thx -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor opened a new pull request, #36643: [SPARK-39245][SQL] Support non-vectorized Parquet scans with DEFAULT values
dtenedor opened a new pull request, #36643: URL: https://github.com/apache/spark/pull/36643 ### What changes were proposed in this pull request? Support non-vectorized Parquet scans when the table schema has associated DEFAULT column values. (Note: this PR depends on https://github.com/apache/spark/pull/36583) Example: ``` create table t(i int) using parquet; insert into t values(42); alter table t add column s string default concat('abc', 'def'); select * from t; > 42, 'abcdef' ``` ### Why are the changes needed? This change makes it easier to build, query, and maintain tables backed by Parquet data. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? This PR includes new test coverage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] anishshri-db commented on pull request #36642: [SPARK-39264] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db commented on PR #36642: URL: https://github.com/apache/spark/pull/36642#issuecomment-1135208135 @alex-balikov @HeartSaVioR @viirya - Could you folks please take a look ? Thx -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] anishshri-db opened a new pull request, #36642: [SPARK-39264] Fix type check and conversion to longOffset for awaitOffset fix
anishshri-db opened a new pull request, #36642: URL: https://github.com/apache/spark/pull/36642 … ### What changes were proposed in this pull request? Fix type check and conversion to longOffset for awaitOffset fix. Based on discussion with comments from @alex-balikov ### Why are the changes needed? To ensure type safety while doing comparisons and avoid type mismatch related bugs/issues. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Change only for making type checks explicit. Ran existing tests and verified that they pass: ``` [info] RateStreamProviderSuite: 15:37:58.700 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable [info] - RateStreamProvider in registry (438 milliseconds) [info] - compatible with old path in registry (1 millisecond) 15:38:00.958 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - microbatch - basic (2 seconds, 346 milliseconds) 15:38:02.074 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. 15:38:04.391 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - microbatch - restart (4 seconds, 294 milliseconds) 15:38:06.450 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. 
[info] - microbatch - uniform distribution of event timestamps (547 milliseconds) [info] - microbatch - infer offsets (90 milliseconds) [info] - microbatch - predetermined batch size (74 milliseconds) [info] - microbatch - data read (73 milliseconds) [info] - valueAtSecond (0 milliseconds) 15:38:07.243 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - rampUpTime (1 second, 633 milliseconds) 15:38:08.806 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - numPartitions (924 milliseconds) 15:38:09.702 WARN org.apache.spark.sql.execution.streaming.ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled. [info] - overflow (241 milliseconds) [info] - illegal option values (3 milliseconds) [info] - user-specified schema given (8 milliseconds) [info] - continuous data (1 second, 12 milliseconds) 15:38:11.035 WARN org.apache.spark.sql.execution.streaming.sources.RateStreamProviderSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.sources.RateStreamProviderSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true), state-store-maintenance-task (daemon=true) = [info] Run completed in 13 seconds, 606 milliseconds. [info] Total number of tests run: 15 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [WIP][SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r879901024 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -210,8 +252,23 @@ private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo( File metaFile = appShuffleInfo.getMergedShuffleMetaFile(shuffleId, shuffleMergeId, reduceId); try { -return newAppShufflePartitionInfo(appShuffleInfo.appId, shuffleId, shuffleMergeId, - reduceId, dataFile, indexFile, metaFile); +AppShufflePartitionInfo appShufflePartitionInfo = +newAppShufflePartitionInfo(appShuffleInfo.appId, appShuffleInfo.attemptId, +shuffleId, shuffleMergeId, reduceId, dataFile, indexFile, metaFile); +if (db != null) { + try{ +byte[] dbKey = +dbAppAttemptShufflePartitionKeyWithCustomPrefix( Review Comment: Created two methods for two different types of DB writes. It is not that straightforward to get it generalized in one method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [WIP][SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r879899830 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -655,6 +776,238 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { } } + + @Override + public void close() { +if (db != null) { + try { +db.close(); + } catch (IOException e) { +logger.error("Exception closing leveldb with registered app paths info and " ++ "shuffle partition info", e); + } +} + } + + private static byte[] dbAppAttemptShufflePartitionKeyWithCustomPrefix( + AppShufflePartitionId id, + String prefix) throws IOException { +// we stick a common prefix on all the keys so we can find them in the DB +String appAttemptShufflePartitionJson = mapper.writeValueAsString(id); +String key = prefix + ";" + appAttemptShufflePartitionJson; +return key.getBytes(StandardCharsets.UTF_8); + } + + private static AppShufflePartitionId parseDbAppAttemptShufflePartitionKeyWithCustomPrefix( + String s, + String prefix) throws IOException { +if (!s.startsWith(prefix)) { + throw new IllegalArgumentException("expected a string starting with " + prefix); +} +String json = s.substring(prefix.length() + 1); +return mapper.readValue(json, AppShufflePartitionId.class); + } + + private static byte[] dbAppAttemptPathsKey(String appId, int attemptId) throws IOException { +// we stick a common prefix on all the keys so we can find them in the DB +String appAttemptIdJson = mapper.writeValueAsString(new AppAttemptId(appId, attemptId)); +String key = APP_ATTEMPT_PATH_KEY_PREFIX + ";" + appAttemptIdJson; +return key.getBytes(StandardCharsets.UTF_8); + } + + private static AppAttemptId parseDbAppAttemptPathsKey(String s) throws IOException { +if (!s.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) { + throw new IllegalArgumentException("expected a string starting with " ++ APP_ATTEMPT_PATH_KEY_PREFIX); +} +String json = 
s.substring(APP_ATTEMPT_PATH_KEY_PREFIX.length() + 1); +return mapper.readValue(json, AppAttemptId.class); + } + + public void reloadAppShuffleInfo(DB db) throws IOException { +logger.debug("Reload applications merged shuffle information from DB"); +reloadActiveAppAttemptsPathInfo(db); +reloadFinalizedAppAttemptsShuffleInfo(db); +reloadActiveAppAttemptsShufflePartitionInfo(db); + } + + @VisibleForTesting + public void reloadActiveAppAttemptsPathInfo(DB db) throws IOException { +if (db != null) { + DBIterator itr = db.iterator(); + itr.seek(APP_ATTEMPT_PATH_KEY_PREFIX.getBytes(StandardCharsets.UTF_8)); + while (itr.hasNext()) { +Map.Entry e = itr.next(); +String key = new String(e.getKey(), StandardCharsets.UTF_8); +if (!key.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) { + break; +} +AppAttemptId appAttemptId = parseDbAppAttemptPathsKey(key); +try{ + AppPathsInfo appPathsInfo = mapper.readValue(e.getValue(), AppPathsInfo.class); + logger.debug("Reloading active application {}_{} merged shuffle files paths", +appAttemptId.appId, appAttemptId.attemptId); + appsShuffleInfo.computeIfAbsent(appAttemptId.appId, id -> +new AppShuffleInfo(appAttemptId.appId, appAttemptId.attemptId, appPathsInfo)); +} catch (Exception exception) { + logger.error("Parsing exception is {}", exception); +} + } +} + } + + @VisibleForTesting + public void reloadFinalizedAppAttemptsShuffleInfo(DB db) throws IOException { +if (db != null) { + DBIterator itr = db.iterator(); + itr.seek(APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX.getBytes(StandardCharsets.UTF_8)); + while (itr.hasNext()) { +Map.Entry e = itr.next(); +String key = new String(e.getKey(), StandardCharsets.UTF_8); +if (!key.startsWith(APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX)) { + break; +} +AppShufflePartitionId partitionId = + parseDbAppAttemptShufflePartitionKeyWithCustomPrefix( +key, APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX); +if (partitionId.reduceId != -1) { Review Comment: Removed. 
It was for the former implementation, where we could reuse the AppShufflePartitionId for both finalized and on-going shuffle merges. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail:
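The prefix-plus-JSON key scheme in the diff above (a common string prefix, a `;` separator, then a JSON payload, so related keys can be range-scanned in the DB) can be sketched outside Spark as a minimal Python example. The prefix value and field names here are illustrative, not the actual RemoteBlockPushResolver constants:

```python
import json

PREFIX = "AppAttemptShufflePartition"

def make_db_key(app_id, attempt_id, shuffle_id, prefix=PREFIX):
    # A common prefix is prepended so all related keys sort together
    # and can be found with a single iterator seek in the DB.
    payload = json.dumps({"appId": app_id, "attemptId": attempt_id, "shuffleId": shuffle_id})
    return (prefix + ";" + payload).encode("utf-8")

def parse_db_key(key, prefix=PREFIX):
    s = key.decode("utf-8")
    if not s.startswith(prefix):
        raise ValueError("expected a string starting with " + prefix)
    # Skip the prefix and the ';' separator, then decode the JSON payload.
    return json.loads(s[len(prefix) + 1:])
```

A reload loop like `reloadActiveAppAttemptsPathInfo` then seeks the iterator to the prefix bytes and stops as soon as a key no longer starts with the prefix.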
[GitHub] [spark] dtenedor commented on pull request #36623: [WIP][SPARK-39245][SQL] Support Avro scans with DEFAULT values
dtenedor commented on PR #36623: URL: https://github.com/apache/spark/pull/36623#issuecomment-1135171752 It appears Spark does not allow ALTER ADD COLUMN DDL with Avro tables. Closing this for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor closed pull request #36623: [WIP][SPARK-39245][SQL] Support Avro scans with DEFAULT values
dtenedor closed pull request #36623: [WIP][SPARK-39245][SQL] Support Avro scans with DEFAULT values URL: https://github.com/apache/spark/pull/36623 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #36641: [SPARK-39263] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
amaliujia opened a new pull request, #36641: URL: https://github.com/apache/spark/pull/36641 ### What changes were proposed in this pull request? Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace ### Why are the changes needed? This is a part of effort to make catalog API be compatible with 3 layer namespace ### Does this PR introduce _any_ user-facing change? Yes. The API change here is backward compatible and it extends the API to further support 3 layer namespace (e.g. catalog.database.table). ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] akpatnam25 commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is se
akpatnam25 commented on code in PR #36601: URL: https://github.com/apache/spark/pull/36601#discussion_r879883905 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -2449,7 +2459,12 @@ private[spark] class DAGScheduler( val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch) logDebug(s"Considering removal of executor $execId; " + s"fileLost: $fileLost, currentEpoch: $currentEpoch") -if (!executorFailureEpoch.contains(execId) || executorFailureEpoch(execId) < currentEpoch) { +// Check if the execId is a shuffle push merger +// We do not remove the executor if it is, +// and only remove the outputs on that host/executor. Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36638: [SPARK-39260][SQL] Use `Reader.getSchema` instead of `Reader.getTypes` in `SparkOrcNewRecordReader`
dongjoon-hyun commented on PR #36638: URL: https://github.com/apache/spark/pull/36638#issuecomment-1135155332 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36638: [SPARK-39260][SQL] Use `Reader.getSchema` instead of `Reader.getTypes` in `SparkOrcNewRecordReader`
dongjoon-hyun closed pull request #36638: [SPARK-39260][SQL] Use `Reader.getSchema` instead of `Reader.getTypes` in `SparkOrcNewRecordReader` URL: https://github.com/apache/spark/pull/36638 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table`
dongjoon-hyun commented on PR #36637: URL: https://github.com/apache/spark/pull/36637#issuecomment-1135154660 Merged to master/3.3/3.2. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table`
dongjoon-hyun closed pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table` URL: https://github.com/apache/spark/pull/36637 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36637: [SPARK-39258][TESTS] Fix `Hide credentials in show create table`
dongjoon-hyun commented on PR #36637: URL: https://github.com/apache/spark/pull/36637#issuecomment-1135153418 The PySpark GitHub Action job has been running for over 5 hours 40 minutes. It's irrelevant to this test PR. I'll merge this PR. https://user-images.githubusercontent.com/9700541/169908543-d94e0bc8-5489-42c4-bac5-fdc5e80b7767.png -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36638: [SPARK-39260][SQL] Use `Reader.getSchema` instead of `Reader.getTypes` in `SparkOrcNewRecordReader`
dongjoon-hyun commented on PR #36638: URL: https://github.com/apache/spark/pull/36638#issuecomment-1135152042 The PySpark GitHub Action jobs are still running exceptionally, but they are not related to this ORC patch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vli-databricks closed pull request #36527: [SPARK-39169][SQL] Optimize FIRST when used as a single aggregate fun…
vli-databricks closed pull request #36527: [SPARK-39169][SQL] Optimize FIRST when used as a single aggregate fun… URL: https://github.com/apache/spark/pull/36527 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #36586: [SPARK-39236][SQL] Make CreateTable and ListTables be compatible with 3 layer namespace
amaliujia commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r879816903 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -125,6 +135,31 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { isTemporary = isTemp) } + private def makeTable(ident: Seq[String]): Table = { +val plan = UnresolvedTableOrView(ident, "Catalog.listTables", true) +val node = sparkSession.sessionState.executePlan(plan).analyzed +node match { + case t: ResolvedTable => +val isExternal = t.table.properties().getOrDefault("external", "false").equals("true") +new Table( + name = t.identifier.name(), + database = t.identifier.namespace().head, + description = t.table.properties().get("comment"), + tableType = +if (isExternal) CatalogTableType.EXTERNAL.name +else CatalogTableType.MANAGED.name, + isTemporary = false) + case v: ResolvedView => +new Table( Review Comment: BTW this idea triggered a compatibility check issue ``` spark-sql: Failed binary compatibility check against org.apache.spark:spark-sql_2.12:3.2.0! Found 1 potential problems (filtered 586) [error] * method this(java.lang.String,java.lang.String,java.lang.String,java.lang.String,Boolean)Unit in class org.apache.spark.sql.catalog.Table's type is different in current version, where it is (java.lang.String,Array[java.lang.String],java.lang.String,java.lang.String,Boolean)Unit instead of (java.lang.String,java.lang.String,java.lang.String,java.lang.String,Boolean)Unit ``` How do we usually solve such issues? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to
otterc commented on code in PR #36601: URL: https://github.com/apache/spark/pull/36601#discussion_r879790477 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -2449,7 +2459,12 @@ private[spark] class DAGScheduler( val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch) logDebug(s"Considering removal of executor $execId; " + s"fileLost: $fileLost, currentEpoch: $currentEpoch") -if (!executorFailureEpoch.contains(execId) || executorFailureEpoch(execId) < currentEpoch) { +// Check if the execId is a shuffle push merger +// We do not remove the executor if it is, +// and only remove the outputs on that host/executor. Review Comment: Nit: `and only remove the outputs on the host` ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -2461,7 +2476,7 @@ private[spark] class DAGScheduler( clearCacheLocs() } if (fileLost) { - val remove = if (ignoreShuffleFileLostEpoch) { + val remove = if (ignoreShuffleFileLostEpoch || isShuffleMerger) { Review Comment: `removeExecutorAndUnregisterOutputs` is being called from 2 places: 1. Fetch failure handling: So here `ignoreShuffleFileLostEpoch` is going to be `true`. Then it doesn't matter if the execId is a shuffleMergerId or not. 2. Executor Lost handling: When it is called from here, then it is never going to be a `shuffleMergerId`. Given this, I think it might be better to remove this `isShuffleMerger` check here and just add a comment that when the fetch failure is for a merged shuffle chunk, `ignoreShuffleFileLostEpoch` is true, and so all the files will be removed. WDYT? cc. @mridulm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
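otterc's argument above is essentially a claim about which flag combinations are reachable at the two call sites. A hedged Python sketch (not Spark code; the reachable combinations are taken from the review comment, not verified against the Spark source) models just the short-circuit condition. When the condition is false, both versions fall through to the same epoch check, so comparing the condition alone is enough to show the `isShuffleMerger` disjunct never changes the outcome:

```python
def should_remove(ignore_epoch: bool, is_shuffle_merger: bool) -> bool:
    """The condition as written in the PR under review."""
    return ignore_epoch or is_shuffle_merger

def should_remove_simplified(ignore_epoch: bool) -> bool:
    """The reviewer's proposed simplification."""
    return ignore_epoch

# Reachable flag combinations per the review comment (assumed):
#   fetch-failure path:  ignore_epoch=True,  is_shuffle_merger in {True, False}
#   executor-lost path:  ignore_epoch in {True, False}, is_shuffle_merger=False
reachable = [(True, True), (True, False), (False, False)]
for ignore_epoch, is_merger in reachable:
    # The extra disjunct is dead code on every reachable combination.
    assert should_remove(ignore_epoch, is_merger) == should_remove_simplified(ignore_epoch)
```

Under these assumptions the `|| isShuffleMerger` branch can only matter for `(ignore_epoch=False, is_merger=True)`, which neither call site produces.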
[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36640: [SPARK-39262][PYTHON] Correct error messages when creating DataFrame from an RDD with the first element `0`
xinrong-databricks commented on code in PR #36640: URL: https://github.com/apache/spark/pull/36640#discussion_r879786294

## python/pyspark/sql/session.py: ##

@@ -611,7 +611,7 @@ def _inferSchema(
         :class:`pyspark.sql.types.StructType`
         """
         first = rdd.first()
-        if not first:
+        if first != 0 and not first:

Review Comment:
When `first` is a nested list whose first element is an empty list, PySpark diverges from the Scala DataFrame API. A follow-up PR will be filed to adjust that.
[GitHub] [spark] xinrong-databricks opened a new pull request, #36640: Correct error messages when creating DataFrame
xinrong-databricks opened a new pull request, #36640: URL: https://github.com/apache/spark/pull/36640

### What changes were proposed in this pull request?
Correct error messages when creating a DataFrame. Previously, when the first element of the RDD is 0, we raise a ValueError "The first row in RDD is empty, can not infer schema". However, a TypeError "Can not infer schema for type: " should be raised instead.

### Why are the changes needed?
Improve error messages and debuggability.

### Does this PR introduce _any_ user-facing change?
Error message change.

Before:
```py
>>> spark.createDataFrame(spark._sc.parallelize([0, 1]))
Traceback (most recent call last):
...
ValueError: The first row in RDD is empty, can not infer schema
```

After:
```py
>>> spark.createDataFrame(spark._sc.parallelize([0, 1]))
Traceback (most recent call last):
...
TypeError: Can not infer schema for type:
```

### How was this patch tested?
Existing tests.
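The root cause is Python truthiness: `0`, `[]`, and `""` are all falsy, so a bare `if not first:` cannot distinguish a legitimate first row of `0` from a genuinely empty row. A minimal standalone sketch of the patched check (the `infer_kind` helper is hypothetical, for illustration only — it is not the actual `_inferSchema` code):

```python
def infer_kind(first):
    # Patched check from the PR: let 0 through, but still reject
    # genuinely empty first rows such as [] or "".
    if first != 0 and not first:
        raise ValueError("The first row in RDD is empty, can not infer schema")
    return type(first).__name__

assert infer_kind(0) == "int"       # previously mis-reported as an empty row
assert infer_kind([1, 2]) == "list"
try:
    infer_kind([])                  # an empty first row still raises
except ValueError:
    pass
```

Note the sketch only demonstrates the falsiness fix; the real `_inferSchema` goes on to infer a schema from the row rather than return a type name.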
[GitHub] [spark] MaxGekk commented on a diff in pull request #36635: [SPARK-39255][SQL] Improve error messages
MaxGekk commented on code in PR #36635: URL: https://github.com/apache/spark/pull/36635#discussion_r879761080

## core/src/main/resources/error/error-classes.json: ##

@@ -48,13 +48,13 @@
   "FAILED_EXECUTE_UDF" : {
     "message" : [ "Failed to execute user defined function (: () => )" ]
   },
+  "FAILED_PERMISSION_RESET_ORIGINAL" : {

Review Comment:
SET_PERMISSION_BACK is shorter than RESET_PERMISSION_TO_ORIGINAL ;-)
[GitHub] [spark] MaxGekk commented on a diff in pull request #36635: [SPARK-39255][SQL] Improve error messages
MaxGekk commented on code in PR #36635: URL: https://github.com/apache/spark/pull/36635#discussion_r879750969

## core/src/main/resources/error/error-classes.json: ##

@@ -48,13 +48,13 @@
   "FAILED_EXECUTE_UDF" : {
     "message" : [ "Failed to execute user defined function (: () => )" ]
   },
+  "FAILED_PERMISSION_RESET_ORIGINAL" : {

Review Comment:
> How about FAILED_TO_SET_ORIGINAL_PERMISSION_BACK?

It would be nice to have shorter names for error classes since we are going to use them as tags in docs headers.
[GitHub] [spark] srielau commented on a diff in pull request #36635: [SPARK-39255][SQL] Improve error messages
srielau commented on code in PR #36635: URL: https://github.com/apache/spark/pull/36635#discussion_r879757479

## core/src/main/resources/error/error-classes.json: ##

@@ -48,13 +48,13 @@
   "FAILED_EXECUTE_UDF" : {
     "message" : [ "Failed to execute user defined function (: () => )" ]
   },
+  "FAILED_PERMISSION_RESET_ORIGINAL" : {

Review Comment:
Yeah, it's hard to come up with a pattern that really works. Some thoughts:
- The error class is typically followed by the message, so there's no need for a full sentence; that's what the message does.
- Short is beautiful.
- Try to avoid stating that an error is an error in the name; it provides no new information. It's an error class! ;-)

How about RESET_PERMISSION_TO_ORIGINAL (again, we KNOW it failed)?