[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37964: [SPARK-40434][SS][PYTHON][FOLLOWUP] Address review comments
dongjoon-hyun commented on code in PR #37964: URL: https://github.com/apache/spark/pull/37964#discussion_r977211125 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -257,10 +257,10 @@ def applyInPandasWithState( user-defined state. The value of the state will be presented as a tuple, as well as the update should be performed with the tuple. The corresponding Python types for :class:DataType are supported. Please refer to the page -https://spark.apache.org/docs/latest/sql-ref-datatypes.html (python tab). +https://spark.apache.org/docs/latest/sql-ref-datatypes.html (Python tab). -The size of each DataFrame in both the input and output can be arbitrary. The number of -DataFrames in both the input and output can also be arbitrary. +The size of each `pandas.DataFrame` in both the input and output can be arbitrary. The +number `pandas.DataFrame` in both the input and output can also be arbitrary. Review Comment: `number` -> `number of`?
[GitHub] [spark] zhengruifeng commented on pull request #37953: [SPARK-40510][PS] Implement `ddof` in `Series.cov`
zhengruifeng commented on PR #37953: URL: https://github.com/apache/spark/pull/37953#issuecomment-1254537682 also cc @HyukjinKwon
[GitHub] [spark] beliefer commented on a diff in pull request #37915: [SPARK-40465][SQL] Refactor Decimal so as we can use other underlying implementation
beliefer commented on code in PR #37915: URL: https://github.com/apache/spark/pull/37915#discussion_r977203717 ## sql/catalyst/src/test/scala/org/apache/spark/sql/DecimalBenchmark.scala: ## @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.encoders.RowEncoder +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection +import org.apache.spark.sql.types.{Decimal, DecimalType, StructType} + +/** + * Benchmark for the previous Decimal (without proxy mode) vs refactored Decimal (proxy mode). + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class --jars + * 2. build/sbt "catalyst/Test/runMain " + * 3. generate result: + * SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "catalyst/Test/runMain " + * Results will be written to "benchmarks/DecimalBenchmark-results.txt". Review Comment: OK
[GitHub] [spark] dongjoon-hyun closed pull request #37958: [SPARK-40522][BUILD] Upgrade `kafka` to 3.2.3
dongjoon-hyun closed pull request #37958: [SPARK-40522][BUILD] Upgrade `kafka` to 3.2.3 URL: https://github.com/apache/spark/pull/37958
[GitHub] [spark] LuciferYang commented on pull request #37941: [SPARK-40501][SQL] Enhance 'SpecialLimits' to support project(..., limit(...))
LuciferYang commented on PR #37941: URL: https://github.com/apache/spark/pull/37941#issuecomment-1254535754 cc @wangyum @cloud-fan FYI
[GitHub] [spark] zhengruifeng commented on pull request #37948: [SPARK-40327][PS][DOCS] Add resampling to API references
zhengruifeng commented on PR #37948: URL: https://github.com/apache/spark/pull/37948#issuecomment-1254535505 Merged into master, thank you guys!
[GitHub] [spark] zhengruifeng closed pull request #37948: [SPARK-40327][PS][DOCS] Add resampling to API references
zhengruifeng closed pull request #37948: [SPARK-40327][PS][DOCS] Add resampling to API references URL: https://github.com/apache/spark/pull/37948
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
dongjoon-hyun commented on code in PR #37960: URL: https://github.com/apache/spark/pull/37960#discussion_r977201737 ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -107,6 +106,51 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext { FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) } + test("fallback storage APIs - readFully") { Review Comment: +1 for @yaooqinn 's comment.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
dongjoon-hyun commented on code in PR #37960: URL: https://github.com/apache/spark/pull/37960#discussion_r977201645 ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -18,14 +18,11 @@ package org.apache.spark.storage import java.io.{DataOutputStream, File, FileOutputStream, IOException} import java.nio.file.Files - import scala.concurrent.duration._ - import org.apache.hadoop.conf.Configuration import org.mockito.{ArgumentMatchers => mc} import org.mockito.Mockito.{mock, never, verify, when} import org.scalatest.concurrent.Eventually.{eventually, interval, timeout} - import org.apache.spark.{LocalSparkContext, SparkConf, SparkContext, SparkFunSuite, TestUtils} Review Comment: +1 for @mridulm 's comment.
[GitHub] [spark] dongjoon-hyun commented on pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
dongjoon-hyun commented on PR #37960: URL: https://github.com/apache/spark/pull/37960#issuecomment-1254533780 Thank you, @ukby1234 and @mridulm .
[GitHub] [spark] dongjoon-hyun commented on pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
dongjoon-hyun commented on PR #37963: URL: https://github.com/apache/spark/pull/37963#issuecomment-1254532169 cc @HyukjinKwon for the pycodestyle question.
[GitHub] [spark] HeartSaVioR commented on pull request #37863: [WIP][DO-NOT-MERGE] Reference PR for flatMapGroupsWithState in PySpark
HeartSaVioR commented on PR #37863: URL: https://github.com/apache/spark/pull/37863#issuecomment-1254527601 Closing this since only the test suite PR remains.
[GitHub] [spark] HeartSaVioR closed pull request #37863: [WIP][DO-NOT-MERGE] Reference PR for flatMapGroupsWithState in PySpark
HeartSaVioR closed pull request #37863: [WIP][DO-NOT-MERGE] Reference PR for flatMapGroupsWithState in PySpark URL: https://github.com/apache/spark/pull/37863
[GitHub] [spark] HeartSaVioR commented on pull request #37964: [SPARK-40434][SS][PYTHON][FOLLOWUP] Address review comments
HeartSaVioR commented on PR #37964: URL: https://github.com/apache/spark/pull/37964#issuecomment-1254495026 cc. @HyukjinKwon
[GitHub] [spark] HeartSaVioR opened a new pull request, #37964: [SPARK-40434][SS][PYTHON][FOLLOWUP] Address review comments
HeartSaVioR opened a new pull request, #37964: URL: https://github.com/apache/spark/pull/37964 ### What changes were proposed in this pull request? This PR addresses the review comments from the last round of review from @HyukjinKwon in #37893. ### Why are the changes needed? Better documentation and removing unnecessary code. ### Does this PR introduce _any_ user-facing change? Slight documentation change. ### How was this patch tested? N/A
[GitHub] [spark] cloud-fan commented on a diff in pull request #37915: [SPARK-40465][SQL] Refactor Decimal so as we can use other underlying implementation
cloud-fan commented on code in PR #37915: URL: https://github.com/apache/spark/pull/37915#discussion_r977165972 ## sql/catalyst/src/test/scala/org/apache/spark/sql/DecimalBenchmark.scala: ## @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.encoders.RowEncoder +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection +import org.apache.spark.sql.types.{Decimal, DecimalType, StructType} + +/** + * Benchmark for the previous Decimal (without proxy mode) vs refactored Decimal (proxy mode). + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class --jars + * 2. build/sbt "catalyst/Test/runMain " + * 3. generate result: + * SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "catalyst/Test/runMain " + * Results will be written to "benchmarks/DecimalBenchmark-results.txt". Review Comment: can you run it to generate result and commit the result?
[GitHub] [spark] panbingkun closed pull request #37961: [WIP][SPARK-40526][BUILD] Upgrade Scala to 2.13.9
panbingkun closed pull request #37961: [WIP][SPARK-40526][BUILD] Upgrade Scala to 2.13.9 URL: https://github.com/apache/spark/pull/37961
[GitHub] [spark] panbingkun commented on pull request #37961: [WIP][SPARK-40526][BUILD] Upgrade Scala to 2.13.9
panbingkun commented on PR #37961: URL: https://github.com/apache/spark/pull/37961#issuecomment-1254484226 > @panbingkun there is an incompatible [issue](https://github.com/scala/bug/issues/12641#issuecomment-1252344400), so we already decided to skip Scala 2.13.9 > > [#37943 (comment)](https://github.com/apache/spark/pull/37943#discussion_r975658409) OK, I will close it!
[GitHub] [spark] LuciferYang commented on pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on PR #37963: URL: https://github.com/apache/spark/pull/37963#issuecomment-1254484021 > Python linter failed, re-running it > > ``` > The python3.9 -m black command was not found. Skipping black checks for now. > > starting pycodestyle test... > pycodestyle checks failed: > ./python/pyspark/broadcast.py:102:23: E275 missing whitespace after keyword > ./python/pyspark/ml/tests/test_tuning.py:203:23: E275 missing whitespace after keyword > ./python/pyspark/sql/readwriter.py:662:60: E275 missing whitespace after keyword > ./python/pyspark/sql/readwriter.py:694:60: E275 missing whitespace after keyword > ./python/pyspark/sql/tests/test_context.py:166:19: E275 missing whitespace after keyword > ./python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py:512:15: E275 missing whitespace after keyword > ./python/pyspark/sql/pandas/conversion.py:174:20: E275 missing whitespace after keyw > ``` Hmm... this seems like a known issue? https://github.com/apache/spark/actions/runs/3069718011/jobs/4958686170
[GitHub] [spark] LuciferYang commented on pull request #37961: [WIP][SPARK-40526][BUILD] Upgrade Scala to 2.13.9
LuciferYang commented on PR #37961: URL: https://github.com/apache/spark/pull/37961#issuecomment-1254482266 @panbingkun there is an incompatible [issue](https://github.com/scala/bug/issues/12641#issuecomment-1252344400), so we have already decided to skip Scala 2.13.9: https://github.com/apache/spark/pull/37943#discussion_r975658409
[GitHub] [spark] LuciferYang commented on pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on PR #37963: URL: https://github.com/apache/spark/pull/37963#issuecomment-1254480643 Python linter failed, re-running it:
```
The python3.9 -m black command was not found. Skipping black checks for now.

starting pycodestyle test...
pycodestyle checks failed:
./python/pyspark/broadcast.py:102:23: E275 missing whitespace after keyword
./python/pyspark/ml/tests/test_tuning.py:203:23: E275 missing whitespace after keyword
./python/pyspark/sql/readwriter.py:662:60: E275 missing whitespace after keyword
./python/pyspark/sql/readwriter.py:694:60: E275 missing whitespace after keyword
./python/pyspark/sql/tests/test_context.py:166:19: E275 missing whitespace after keyword
./python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py:512:15: E275 missing whitespace after keyword
./python/pyspark/sql/pandas/conversion.py:174:20: E275 missing whitespace after keyw
```
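For context, E275 is pycodestyle's "missing whitespace after keyword" check. A minimal, hypothetical reproduction (not the flagged Spark code) looks like this:

```python
# E275 is reported when a keyword is immediately followed by an opening parenthesis.
flag = False

if not(flag):   # pycodestyle: E275 missing whitespace after keyword
    print("flagged")

if not flag:    # fixed: whitespace separates the keyword from its operand
    print("clean")
```

Each entry in the log above points to one such occurrence at the listed file, line, and column.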
[GitHub] [spark] LuciferYang commented on pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios
LuciferYang commented on PR #37938: URL: https://github.com/apache/spark/pull/37938#issuecomment-1254479116 > Thank you, @LuciferYang . Please make two backporting PRs. Done: - branch 3.3: https://github.com/apache/spark/pull/37962 - branch 3.2: https://github.com/apache/spark/pull/37963
[GitHub] [spark] HeartSaVioR commented on pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HeartSaVioR commented on PR #37893: URL: https://github.com/apache/spark/pull/37893#issuecomment-1254478824 Thanks @HyukjinKwon and @alex-balikov for the thoughtful reviews and for merging! I'll handle the latest comments as a follow-up PR.
[GitHub] [spark] HyukjinKwon closed pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HyukjinKwon closed pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark URL: https://github.com/apache/spark/pull/37893
[GitHub] [spark] HyukjinKwon commented on pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HyukjinKwon commented on PR #37893: URL: https://github.com/apache/spark/pull/37893#issuecomment-1254477957 My comments are just nits. I will merge this in first to move forward. Merged to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HyukjinKwon commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r977150660 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -216,6 +218,125 @@ def applyInPandas( jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr()) return DataFrame(jdf, self.session) +def applyInPandasWithState( +self, +func: "PandasGroupedMapFunctionWithState", +outputStructType: Union[StructType, str], +stateStructType: Union[StructType, str], +outputMode: str, +timeoutConf: str, +) -> DataFrame: +""" +Applies the given function to each group of data, while maintaining a user-defined +per-group state. The result Dataset will represent the flattened record returned by the +function. + +For a streaming Dataset, the function will be invoked first for all input groups and then +for all timed out states where the input data is set to be empty. Updates to each group's +state will be saved across invocations. Review Comment: ```suggestion For a streaming :class:`DataFrame`, the function will be invoked first for all input groups and then for all timed out states where the input data is set to be empty. Updates to each group's state will be saved across invocations. ``` ## sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala: ## @@ -98,6 +98,16 @@ class ArrowWriter(val root: VectorSchemaRoot, fields: Array[ArrowFieldWriter]) { count += 1 } + def sizeInBytes(): Int = { Review Comment: I think we don't need `sizeInBytes` and `getSizeInBytes ` anymore ## python/pyspark/sql/pandas/group_ops.py: ## @@ -216,6 +218,125 @@ def applyInPandas( jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr()) return DataFrame(jdf, self.session) +def applyInPandasWithState( +self, +func: "PandasGroupedMapFunctionWithState", +outputStructType: Union[StructType, str], +stateStructType: Union[StructType, str], +outputMode: str, +timeoutConf: str, +) -> DataFrame: +""" +Applies the given function to each group of data, while maintaining a user-defined +per-group state. The result Dataset will represent the flattened record returned by the +function. + +For a streaming Dataset, the function will be invoked first for all input groups and then +for all timed out states where the input data is set to be empty. Updates to each group's +state will be saved across invocations. + +The function should take parameters (key, Iterator[`pandas.DataFrame`], state) and +return another Iterator[`pandas.DataFrame`]. The grouping key(s) will be passed as a tuple +of numpy data types, e.g., `numpy.int32` and `numpy.float64`. The state will be passed as +:class:`pyspark.sql.streaming.state.GroupState`. + +For each group, all columns are passed together as `pandas.DataFrame` to the user-function, +and the returned `pandas.DataFrame` across all invocations are combined as a +:class:`DataFrame`. Note that the user function should not make a guess of the number of +elements in the iterator. To process all data, the user function needs to iterate all +elements and process them. On the other hand, the user function is not strictly required to +iterate through all elements in the iterator if it intends to read a part of data. + +The `outputStructType` should be a :class:`StructType` describing the schema of all +elements in the returned value, `pandas.DataFrame`. The column labels of all elements in +returned `pandas.DataFrame` must either match the field names in the defined schema if +specified as strings, or match the field data types by position if not strings, +e.g. integer indices. 
+ +The `stateStructType` should be :class:`StructType` describing the schema of the +user-defined state. The value of the state will be presented as a tuple, as well as the +update should be performed with the tuple. The corresponding Python types for +:class:DataType are supported. Please refer to the page +https://spark.apache.org/docs/latest/sql-ref-datatypes.html (python tab). + +The size of each DataFrame in both the input and output can be arbitrary. The number of +DataFrames in both the input and output can also be arbitrary. + +.. versionadded:: 3.4.0 + +Parameters +-- +func : function +a Python native function to be called on every group. It should take parameters +(key, Iterator[`pandas.DataFrame`], state) and return Iterator[`pandas.DataFrame`]. +Note that the type of the key
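To ground the contract spelled out in this docstring, the following is a minimal, self-contained sketch of calling `applyInPandasWithState` on a streaming DataFrame. The `rate` source, the `id`/`count` column names, and the counting logic are invented for illustration; only the function signature, the tuple-based state access, and the `outputStructType`/`stateStructType`/`outputMode`/`timeoutConf` arguments come from the API described above.

```python
# A minimal sketch of the API discussed above: count rows per key across micro-batches,
# keeping the running total in user-defined state. Source, schema strings and the
# counting logic are illustrative only, not taken from the PR.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.streaming.state import GroupStateTimeout

spark = SparkSession.builder.getOrCreate()

# Any streaming DataFrame with an "id" column would do; `rate` is just a built-in test source.
events = spark.readStream.format("rate").load().withColumn("id", expr("value % 10"))

def count_per_id(key, pdf_iter, state):
    # key: tuple of grouping values, e.g. (3,); state: pyspark.sql.streaming.state.GroupState
    running = state.get[0] if state.exists else 0
    for pdf in pdf_iter:              # an arbitrary number of pandas.DataFrames per group
        running += len(pdf)
    state.update((running,))          # state is read and written as a tuple
    yield pd.DataFrame({"id": [key[0]], "count": [running]})

counts = events.groupBy("id").applyInPandasWithState(
    count_per_id,
    outputStructType="id long, count long",
    stateStructType="count long",
    outputMode="update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

query = counts.writeStream.format("console").outputMode("update").start()
```

Per the description above, when a timeout is configured the same function is also invoked for timed-out state with empty input; with `GroupStateTimeout.NoTimeout` that path never fires.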
[GitHub] [spark] wangyum commented on pull request #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
wangyum commented on PR #37930: URL: https://github.com/apache/spark/pull/37930#issuecomment-1254466657 Merged to master.
[GitHub] [spark] wangyum closed pull request #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
wangyum closed pull request #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel URL: https://github.com/apache/spark/pull/37930
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977144993 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return true if strings of given date format can be parsed as timestamps + * 1. If user provides timestamp format, we will parse strings as timestamps using + * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed + * as timestamp type in this case + * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which + * is more lenient and can parse strings of some date formats as timestamps. + */ + private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { +if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || + (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { + LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) Review Comment: This is a behavior change relative to the Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as a timestamp type when no timestamp pattern is specified.
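As a concrete illustration of the inference rule being discussed, here is a small, self-contained sketch. The file contents, paths, and option values are invented for the example, and the exact inferred types depend on the Spark version and the CSV date/timestamp options in effect:

```python
# Illustrative sketch only: a single CSV column mixing a date string and a timestamp string,
# read once with an explicit timestampFormat (strict parsing) and once without it (lenient
# default parser). The inferred column type in each case is what the change above is about.
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

path = os.path.join(tempfile.mkdtemp(), "mixed.csv")
with open(path, "w") as f:
    f.write("2022-01-01\n")            # date-shaped value
    f.write("2022-01-01 12:34:56\n")   # timestamp-shaped value

# With a user-provided timestampFormat the date row cannot be parsed as a timestamp,
# so per the rule above the mixed column should fall back to StringType.
with_format = (
    spark.read.option("inferSchema", True)
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
    .csv(path)
)
with_format.printSchema()

# Without a timestampFormat the lenient default formatter may accept the date string,
# so the column can still be inferred as a timestamp type.
without_format = spark.read.option("inferSchema", True).csv(path)
without_format.printSchema()
```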
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977144993 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return true if strings of given date format can be parsed as timestamps + * 1. If user provides timestamp format, we will parse strings as timestamps using + * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed + * as timestamp type in this case + * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which + * is more lenient and can parse strings of some date formats as timestamps. + */ + private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { +if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || + (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { + LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) Review Comment: This is a behavior change in terms of the Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as a timestamp when no timestamp pattern is specified.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37963: URL: https://github.com/apache/spark/pull/37963#discussion_r977145998 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -208,6 +213,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. `com.google.common.io.Files.createTempDir` is used instead because SPARK-39102 is not in Spark 3.2, so `JavaUtils` has no `createTempDir` method there.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37962: URL: https://github.com/apache/spark/pull/37962#discussion_r977145784 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -222,6 +227,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. `com.google.common.io.Files.createTempDir` is used instead because SPARK-39102 is not in Spark 3.3, so `JavaUtils` has no `createTempDir` method there.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37963: URL: https://github.com/apache/spark/pull/37963#discussion_r977145998 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -208,6 +213,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. com.google.common.io.Files.createTempDir is used instead because [SPARK-39102](https://issues.apache.org/jira/browse/SPARK-39102) is not in Spark 3.2, so JavaUtils has no `createTempDir` method there.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37963: URL: https://github.com/apache/spark/pull/37963#discussion_r977145998 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -208,6 +213,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. com.google.common.io.Files.createTempDir is used instead because [SPARK-39102](https://issues.apache.org/jira/browse/SPARK-39102) is not in Spark 3.2, so JavaUtils has no createTempDir there.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37962: URL: https://github.com/apache/spark/pull/37962#discussion_r977145784 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -222,6 +227,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. `com.google.common.io.Files.createTempDir` is used instead because SPARK-39102 is not in Spark 3.3, so `JavaUtils` has no `createTempDir` there.
[GitHub] [spark] HeartSaVioR commented on pull request #37894: [DO-NOT-MERGE][SPARK-40435][SS][PYTHON] Add test suites for applyInPandasWithState in PySpark
HeartSaVioR commented on PR #37894: URL: https://github.com/apache/spark/pull/37894#issuecomment-1254457970 cc. @HyukjinKwon @alex-balikov Could you please take a look at the new test suites? Thanks in advance! It's only the last commit on the list.
[GitHub] [spark] yaooqinn commented on a diff in pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
yaooqinn commented on code in PR #37960: URL: https://github.com/apache/spark/pull/37960#discussion_r977145034 ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -107,6 +106,51 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext { FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) } + test("fallback storage APIs - readFully") { Review Comment: SPARK-39200: fallback storage APIs - readFully
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977144993 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return true if strings of given date format can be parsed as timestamps + * 1. If user provides timestamp format, we will parse strings as timestamps using + * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed + * as timestamp type in this case + * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which + * is more lenient and can parse strings of some date formats as timestamps. + */ + private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { +if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || + (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { + LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) Review Comment: There is no easy way to go this behavior as well. Because the timestamp parser
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1254457075 I am not sure I follow what the code snippet is trying to do. Changing the code to:
```
private def receiveLoop() {
  Executors.newSingleThreadExecutor(threadFactory).execute(new MessageLoop)
}
```
does cause the `uncaughtException` to be invoked, as expected. If we catch the `Throwable` in `receiveLoop` before that, then obviously it is no longer an uncaught exception, and so `MyExceptionHandler.uncaughtException` won't be invoked.
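The underlying point, that a handler for uncaught exceptions only runs when nothing catches the error first, can be sketched outside the JVM too. Below is a small Python analogy using `threading.excepthook` (Python 3.8+); it is not Spark's Scala executor code, just an illustration of the same principle:

```python
# Python analogy of the comment above: the installed hook sees an exception only when the
# thread body lets it escape; once the body catches it, the hook is never invoked.
import threading

def on_uncaught(args):
    # args carries exc_type, exc_value, exc_traceback and the thread that raised
    print(f"uncaught in {args.thread.name}: {args.exc_value!r}")

threading.excepthook = on_uncaught  # roughly analogous to a JVM UncaughtExceptionHandler

def loop_lets_it_escape():
    raise RuntimeError("fatal error during initialization")  # reaches on_uncaught

def loop_catches_it():
    try:
        raise RuntimeError("fatal error during initialization")
    except Exception as exc:
        print(f"swallowed before the hook could see it: {exc!r}")

for target in (loop_lets_it_escape, loop_catches_it):
    t = threading.Thread(target=target, name=target.__name__)
    t.start()
    t.join()
```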
[GitHub] [spark] LuciferYang opened a new pull request, #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang opened a new pull request, #37963: URL: https://github.com/apache/spark/pull/37963 ### What changes were proposed in this pull request? After SPARK-17321, `YarnShuffleService` will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` not set `YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the configuration is false, so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data. This pr aims to let `YarnShuffleIntegrationSuite` restart the verification of registeredExecFile reload scenarios, to achieve this goal, this pr make the following changes: 1. Add a new un-document configuration `spark.yarn.shuffle.testing` to `YarnShuffleService`, and Initialize `_recoveryPath` when `_recoveryPath == null && spark.yarn.shuffle.testing == true`. 2. Only set `spark.yarn.shuffle.testing = true` in `YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` is not null to ensure that registeredExecFile reload scenarios will be verified. ### Why are the changes needed? Fix registeredExecFile reload test scenarios. Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true? This configuration has been tried **Hadoop 3.3.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-3 ``` ``` 2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException 2022-09-10T11:44:42.1715234Z at java.net.URLClassLoader.findClass(URLClassLoader.java:387) 2022-09-10T11:44:42.1719347Z at java.lang.ClassLoader.loadClass(ClassLoader.java:419) 2022-09-10T11:44:42.1723090Z at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 2022-09-10T11:44:42.1726759Z at java.lang.ClassLoader.loadClass(ClassLoader.java:352) 2022-09-10T11:44:42.1731028Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313) 2022-09-10T11:44:42.1735424Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370) 2022-09-10T11:44:42.1740303Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1745576Z at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) 2022-09-10T11:44:42.1828858Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1829712Z at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) 2022-09-10T11:44:42.1830633Z at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327) 2022-09-10T11:44:42.1831431Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1832279Z at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112) ``` **Hadoop 2.7.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-2 ``` ``` YarnShuffleIntegrationWithLevelDBBackendSuite: org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite *** ABORTED *** java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. 
Check the setting of yarn.nodemanager.address at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... Run completed in 3 seconds, 992 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` From the above test,
[GitHub] [spark] LuciferYang opened a new pull request, #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang opened a new pull request, #37962: URL: https://github.com/apache/spark/pull/37962 ### What changes were proposed in this pull request? After SPARK-17321, `YarnShuffleService` will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` not set `YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the configuration is false, so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data. This pr aims to let `YarnShuffleIntegrationSuite` restart the verification of registeredExecFile reload scenarios, to achieve this goal, this pr make the following changes: 1. Add a new un-document configuration `spark.yarn.shuffle.testing` to `YarnShuffleService`, and Initialize `_recoveryPath` when `_recoveryPath == null && spark.yarn.shuffle.testing == true`. 2. Only set `spark.yarn.shuffle.testing = true` in `YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` is not null to ensure that registeredExecFile reload scenarios will be verified. ### Why are the changes needed? Fix registeredExecFile reload test scenarios. Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true? This configuration has been tried **Hadoop 3.3.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-3 ``` ``` 2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException 2022-09-10T11:44:42.1715234Z at java.net.URLClassLoader.findClass(URLClassLoader.java:387) 2022-09-10T11:44:42.1719347Z at java.lang.ClassLoader.loadClass(ClassLoader.java:419) 2022-09-10T11:44:42.1723090Z at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 2022-09-10T11:44:42.1726759Z at java.lang.ClassLoader.loadClass(ClassLoader.java:352) 2022-09-10T11:44:42.1731028Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313) 2022-09-10T11:44:42.1735424Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370) 2022-09-10T11:44:42.1740303Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1745576Z at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) 2022-09-10T11:44:42.1828858Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1829712Z at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) 2022-09-10T11:44:42.1830633Z at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327) 2022-09-10T11:44:42.1831431Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1832279Z at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112) ``` **Hadoop 2.7.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-2 ``` ``` YarnShuffleIntegrationWithLevelDBBackendSuite: org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite *** ABORTED *** java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. 
Check the setting of yarn.nodemanager.address at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... Run completed in 3 seconds, 992 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` From the above test,
[GitHub] [spark] beliefer commented on a diff in pull request #37915: [SPARK-40465][SQL] Refactor Decimal so as we can use other underlying implementation
beliefer commented on code in PR #37915: URL: https://github.com/apache/spark/pull/37915#discussion_r977133079 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala: ## @@ -28,52 +28,35 @@ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.unsafe.types.UTF8String /** - * A mutable implementation of BigDecimal that can hold a Long if values are small enough. - * - * The semantics of the fields are as follows: - * - _precision and _scale represent the SQL precision and scale we are looking for - * - If decimalVal is set, it represents the whole decimal value - * - Otherwise, the decimal value is longVal / (10 ** _scale) - * - * Note, for values between -1.0 and 1.0, precision digits are only counted after dot. + * A mutable implementation of BigDecimal that hold a `DecimalOperation`. */ @Unstable -final class Decimal extends Ordered[Decimal] with Serializable { +final class Decimal(initEnabled: Boolean = true) extends Ordered[Decimal] with Serializable { import org.apache.spark.sql.types.Decimal._ - private var decimalVal: BigDecimal = null - private var longVal: Long = 0L - private var _precision: Int = 1 - private var _scale: Int = 0 + private var decimalOperation: DecimalOperation[_] = null Review Comment: `decimalOperation` is not null by default. To reduce the overhead of operations (e.g. `+`), I let it be null. I tested the benchmark: if we give it a default `DecimalOperation` value, these math operations (e.g. `+`) have 5x performance overhead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun opened a new pull request, #37961: [BUILD] Upgrade Scala to 2.13.9
panbingkun opened a new pull request, #37961: URL: https://github.com/apache/spark/pull/37961 ### What changes were proposed in this pull request? This pr aims to update from Scala 2.13.8 to Scala 2.13.9 for Apache Spark 3.4. ### Why are the changes needed? Release notes: https://github.com/scala/scala/releases/tag/v2.13.9 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GA Scala 2.13 job -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37837: [SPARK-40385][SQL] Fix interpreted path for companion object constructor
HyukjinKwon closed pull request #37837: [SPARK-40385][SQL] Fix interpreted path for companion object constructor URL: https://github.com/apache/spark/pull/37837 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37837: [SPARK-40385][SQL] Fix interpreted path for companion object constructor
HyukjinKwon commented on PR #37837: URL: https://github.com/apache/spark/pull/37837#issuecomment-1254446903 Merged to master and branch-3.3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at
HyukjinKwon closed pull request #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at URL: https://github.com/apache/spark/pull/37959 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at
HyukjinKwon commented on PR #37959: URL: https://github.com/apache/spark/pull/37959#issuecomment-1254440082 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on pull request #37941: [SPARK-40501][SQL] Enhance 'SpecialLimits' to support project(..., limit(...))
panbingkun commented on PR #37941: URL: https://github.com/apache/spark/pull/37941#issuecomment-1254439692 > Wait, I'm checking the cause of UT failure
Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
Yikun commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977131978 ## python/pyspark/pandas/groupby.py: ## @@ -18,7 +18,6 @@ """ A wrapper for GroupedData to behave similar to pandas GroupBy. """ - Review Comment: Friendly note, looks like it is not recovered yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
Yikun commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977130360 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,115 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike: +""" +Compute prod of groups. + +.. versionadded:: 3.4.0 + +Parameters +-- +numeric_only : bool, default False +Include only float, int, boolean columns. If None, will attempt to use +everything, then use only numeric data. + +min_count: int, default 0 +The required number of valid values to perform the operation. +If fewer than min_count non-NA values are present the result will be NA. + +Returns +--- +Series or DataFrame +Computed prod of values within each group. + +See Also + +pyspark.pandas.Series.groupby +pyspark.pandas.DataFrame.groupby + +Examples + +>>> df = ps.DataFrame( Review Comment: ```suggestion >>> import numpy as np >>> df = ps.DataFrame( ``` Let's make it as self-contained examples, it means users can copy/paste to python shell directly, see also: SPARK-40005 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977126652 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,115 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike: +""" +Compute prod of groups. + +.. versionadded:: 3.4.0 + +Parameters +-- +numeric_only : bool, default False +Include only float, int, boolean columns. If None, will attempt to use +everything, then use only numeric data. + +min_count: int, default 0 +The required number of valid values to perform the operation. +If fewer than min_count non-NA values are present the result will be NA. + +Returns +--- +Series or DataFrame +Computed prod of values within each group. + +See Also + +pyspark.pandas.Series.groupby +pyspark.pandas.DataFrame.groupby + +Examples + +>>> df = ps.DataFrame( +... { +... "A": [1, 1, 2, 1, 2], +... "B": [np.nan, 2, 3, 4, 5], +... "C": [1, 2, 1, 1, 2], +... "D": [True, False, True, False, True], +... } +... ) + +Groupby one column and return the prod of the remaining columns in +each group. + +>>> df.groupby('A').prod().sort_index() + B C D +A +1 8.0 2 0 +2 15.0 2 1 + +>>> df.groupby('A').prod(min_count=3).sort_index() + B C D +A +1 NaN 2.0 0.0 +2 NaN NaN NaN +""" + +self._validate_agg_columns(numeric_only=numeric_only, function_name="prod") + +groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))] +internal, agg_columns, sdf = self._prepare_reduce( +groupkey_names=groupkey_names, +accepted_spark_types=(NumericType, BooleanType), +bool_to_numeric=True, +) + +psdf: DataFrame = DataFrame(internal) +if len(psdf._internal.column_labels) > 0: + +stat_exprs = [] +for label in psdf._internal.column_labels: +label_name = label[0] +tmp_count_column_name = verify_temp_column_name( +sdf, "__tmp_%s_count_col__" % label_name +) +psser = psdf._psser_for(label) +column = psser._dtype_op.nan_to_null(psser).spark.column +data_type = psser.spark.data_type + +if isinstance(data_type, IntegralType): Review Comment: for `IntegralType `, maybe always returns `LongType`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
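[Editorial illustration for the review question above] A minimal PySpark sketch, not taken from the PR, showing why the cast matters: `F.product` aggregates to a double, so the result only keeps an integral type if it is cast back explicitly (the column names and data here are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (1, 3), (2, 4)], ["A", "B"])

# F.product returns DoubleType; the cast decides the integral result type.
# Casting to "long" (as suggested above) also avoids overflow of smaller integral types.
result = df.groupBy("A").agg(F.product("B").cast("long").alias("B"))
result.printSchema()
result.show()
```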
[GitHub] [spark] cloud-fan commented on a diff in pull request #37915: [SPARK-40465][SQL] Refactor Decimal so as we can use other underlying implementation
cloud-fan commented on code in PR #37915: URL: https://github.com/apache/spark/pull/37915#discussion_r977125557 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala: ## @@ -28,52 +28,35 @@ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.unsafe.types.UTF8String /** - * A mutable implementation of BigDecimal that can hold a Long if values are small enough. - * - * The semantics of the fields are as follows: - * - _precision and _scale represent the SQL precision and scale we are looking for - * - If decimalVal is set, it represents the whole decimal value - * - Otherwise, the decimal value is longVal / (10 ** _scale) - * - * Note, for values between -1.0 and 1.0, precision digits are only counted after dot. + * A mutable implementation of BigDecimal that hold a `DecimalOperation`. */ @Unstable -final class Decimal extends Ordered[Decimal] with Serializable { +final class Decimal(initEnabled: Boolean = true) extends Ordered[Decimal] with Serializable { import org.apache.spark.sql.types.Decimal._ - private var decimalVal: BigDecimal = null - private var longVal: Long = 0L - private var _precision: Int = 1 - private var _scale: Int = 0 + private var decimalOperation: DecimalOperation[_] = null Review Comment: does it mean `decimalOperation` can be null sometimes? This seems fragile. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
mridulm commented on code in PR #37960: URL: https://github.com/apache/spark/pull/37960#discussion_r977119060 ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -18,14 +18,11 @@ package org.apache.spark.storage import java.io.{DataOutputStream, File, FileOutputStream, IOException} import java.nio.file.Files - import scala.concurrent.duration._ - import org.apache.hadoop.conf.Configuration import org.mockito.{ArgumentMatchers => mc} import org.mockito.Mockito.{mock, never, verify, when} import org.scalatest.concurrent.Eventually.{eventually, interval, timeout} - import org.apache.spark.{LocalSparkContext, SparkConf, SparkContext, SparkFunSuite, TestUtils} Review Comment: Revert these whitespace changes. ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -107,6 +106,51 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext { FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) } + test("fallback storage APIs - readFully") { +val conf = new SparkConf(false) + .set("spark.app.id", "testId") + .set(SHUFFLE_COMPRESS, false) + .set(STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED, true) + .set(STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH, +Files.createTempDirectory("tmp").toFile.getAbsolutePath + "/") +val fallbackStorage = new FallbackStorage(conf) +val bmm = new BlockManagerMaster(new NoopRpcEndpointRef(conf), null, conf, false) + +val bm = mock(classOf[BlockManager]) +val dbm = new DiskBlockManager(conf, deleteFilesOnStop = false, isDriver = false) +when(bm.diskBlockManager).thenReturn(dbm) +when(bm.master).thenReturn(bmm) +val resolver = new IndexShuffleBlockResolver(conf, bm) +when(bm.migratableResolver).thenReturn(resolver) + +val length = 10 +val content = new Array[Byte](length) +Random.nextBytes(content) + +val indexFile = resolver.getIndexFile(1, 2L) +tryWithResource(new FileOutputStream(indexFile)) { fos => + tryWithResource(new DataOutputStream(fos)) { dos => +dos.writeLong(0) +dos.writeLong(length) + } +} + +val dataFile = resolver.getDataFile(1, 2L) +tryWithResource(new FileOutputStream(dataFile)) { fos => + tryWithResource(new DataOutputStream(fos)) { dos => +dos.write(content) + } +} + +fallbackStorage.copy(ShuffleBlockInfo(1, 2L), bm) + +assert(fallbackStorage.exists(1, ShuffleIndexBlockId(1, 2L, NOOP_REDUCE_ID).name)) +assert(fallbackStorage.exists(1, ShuffleDataBlockId(1, 2L, NOOP_REDUCE_ID).name)) + +val readResult = FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) +assert(readResult.nioByteBuffer().array().sameElements(content)) Review Comment: This test is not checking for `readFully` and would work even for `read`, depending on whether the read ends up satisfying the request or not (We are relying on what the buffer size might be internally, which is subject to change). As in, the test could work even without the fix. 
## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -38,6 +35,8 @@ import org.apache.spark.shuffle.{IndexShuffleBlockResolver, ShuffleBlockInfo} import org.apache.spark.shuffle.IndexShuffleBlockResolver.NOOP_REDUCE_ID import org.apache.spark.util.Utils.tryWithResource +import scala.util.Random Review Comment: Move to scala block above ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -107,6 +106,51 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext { FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) } + test("fallback storage APIs - readFully") { +val conf = new SparkConf(false) + .set("spark.app.id", "testId") + .set(SHUFFLE_COMPRESS, false) + .set(STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED, true) + .set(STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH, +Files.createTempDirectory("tmp").toFile.getAbsolutePath + "/") +val fallbackStorage = new FallbackStorage(conf) +val bmm = new BlockManagerMaster(new NoopRpcEndpointRef(conf), null, conf, false) + +val bm = mock(classOf[BlockManager]) +val dbm = new DiskBlockManager(conf, deleteFilesOnStop = false, isDriver = false) +when(bm.diskBlockManager).thenReturn(dbm) +when(bm.master).thenReturn(bmm) +val resolver = new IndexShuffleBlockResolver(conf, bm) +when(bm.migratableResolver).thenReturn(resolver) + +val length = 10 +val content = new Array[Byte](length) +Random.nextBytes(content) + +val indexFile = resolver.getIndexFile(1, 2L) +tryWithResource(new FileOutputStream(indexFile)) { fos => + tryWithResource(new DataOutputStream(fos)) { dos => +dos.writeLong(0) +dos.writeLong(length) + } +} + +val dataFile = resolver.getDataFile(1, 2L) +
[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on code in PR #37679: URL: https://github.com/apache/spark/pull/37679#discussion_r977124693 ## sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala: ## @@ -36,7 +36,7 @@ import org.apache.spark.SparkFunSuite import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.FunctionIdentifier import org.apache.spark.sql.catalyst.analysis.NoSuchTableException -import org.apache.spark.sql.catalyst.catalog.SessionCatalog.DEFAULT_DATABASE Review Comment: I think we can revert all the changes in this file -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on code in PR #37679: URL: https://github.com/apache/spark/pull/37679#discussion_r977124552 ## sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: ## @@ -148,13 +148,18 @@ private[sql] class SharedState( val externalCatalog = SharedState.reflect[ExternalCatalog, SparkConf, Configuration]( SharedState.externalCatalogClassName(conf), conf, hadoopConf) -val defaultDbDefinition = CatalogDatabase( - SessionCatalog.DEFAULT_DATABASE, - "default database", - CatalogUtils.stringToURI(conf.get(WAREHOUSE_PATH)), - Map()) // Create default database if it doesn't exist -if (!externalCatalog.databaseExists(SessionCatalog.DEFAULT_DATABASE)) { +// If database name not equals 'default', throw exception +if (!externalCatalog.databaseExists(SQLConf.get.defaultDatabase)) { + if (SessionCatalog.DEFAULT_DATABASE != SQLConf.get.defaultDatabase) { +throw new SparkException(s"Default catalog database '${SQLConf.get.defaultDatabase}' " + + s"not exist, please create it first or change default database to 'default'.") Review Comment: let's use the new error framework: define the error in `error-classes.json` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977124585 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,115 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike: +""" +Compute prod of groups. + +.. versionadded:: 3.4.0 + +Parameters +-- +numeric_only : bool, default False +Include only float, int, boolean columns. If None, will attempt to use +everything, then use only numeric data. + +min_count: int, default 0 +The required number of valid values to perform the operation. +If fewer than min_count non-NA values are present the result will be NA. + +Returns +--- +Series or DataFrame +Computed prod of values within each group. + +See Also + +pyspark.pandas.Series.groupby +pyspark.pandas.DataFrame.groupby + +Examples + +>>> df = ps.DataFrame( +... { +... "A": [1, 1, 2, 1, 2], +... "B": [np.nan, 2, 3, 4, 5], +... "C": [1, 2, 1, 1, 2], +... "D": [True, False, True, False, True], +... } +... ) + +Groupby one column and return the prod of the remaining columns in +each group. + +>>> df.groupby('A').prod().sort_index() + B C D +A +1 8.0 2 0 +2 15.0 2 1 + +>>> df.groupby('A').prod(min_count=3).sort_index() + B C D +A +1 NaN 2.0 0.0 +2 NaN NaN NaN +""" + +self._validate_agg_columns(numeric_only=numeric_only, function_name="prod") + +groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))] +internal, agg_columns, sdf = self._prepare_reduce( +groupkey_names=groupkey_names, +accepted_spark_types=(NumericType, BooleanType), +bool_to_numeric=True, +) + +psdf: DataFrame = DataFrame(internal) +if len(psdf._internal.column_labels) > 0: + +stat_exprs = [] +for label in psdf._internal.column_labels: +label_name = label[0] +tmp_count_column_name = verify_temp_column_name( +sdf, "__tmp_%s_count_col__" % label_name +) +psser = psdf._psser_for(label) +column = psser._dtype_op.nan_to_null(psser).spark.column +data_type = psser.spark.data_type + +if isinstance(data_type, IntegralType): + stat_exprs.append(F.product(column).cast(data_type).alias(label_name)) +else: +stat_exprs.append(F.product(column).alias(label_name)) + +if min_count > 0: + stat_exprs.append(F.count(column).alias(tmp_count_column_name)) + +sdf = sdf.groupby(*groupkey_names).agg(*stat_exprs) Review Comment: Can we simplify the logic like this? ```suggestion for label in psdf._internal.column_labels: psser = psdf._psser_for(label) scol = psser._dtype_op.nan_to_null(psser).spark.column prod_scol = F.product(scol) if min_count > 0: prod_scol = F.when(F.count(scol) < min_count, F.lit(None)).otherwise(F.product(scol)) else: prod_scol = F.product(scol) stat_exprs.append( prod_scol.alias(psser._internal.data_spark_column_names[0]) ) sdf = sdf.groupby(*groupkey_names).agg(*stat_exprs) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977124090 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,115 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike: +""" +Compute prod of groups. + +.. versionadded:: 3.4.0 + +Parameters +-- +numeric_only : bool, default False +Include only float, int, boolean columns. If None, will attempt to use +everything, then use only numeric data. + +min_count: int, default 0 +The required number of valid values to perform the operation. +If fewer than min_count non-NA values are present the result will be NA. + +Returns +--- +Series or DataFrame +Computed prod of values within each group. + +See Also + +pyspark.pandas.Series.groupby +pyspark.pandas.DataFrame.groupby + +Examples + +>>> df = ps.DataFrame( +... { +... "A": [1, 1, 2, 1, 2], +... "B": [np.nan, 2, 3, 4, 5], +... "C": [1, 2, 1, 1, 2], +... "D": [True, False, True, False, True], +... } +... ) + +Groupby one column and return the prod of the remaining columns in +each group. + +>>> df.groupby('A').prod().sort_index() + B C D +A +1 8.0 2 0 +2 15.0 2 1 + +>>> df.groupby('A').prod(min_count=3).sort_index() + B C D +A +1 NaN 2.0 0.0 +2 NaN NaN NaN +""" + +self._validate_agg_columns(numeric_only=numeric_only, function_name="prod") + +groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))] +internal, agg_columns, sdf = self._prepare_reduce( +groupkey_names=groupkey_names, +accepted_spark_types=(NumericType, BooleanType), +bool_to_numeric=True, +) + +psdf: DataFrame = DataFrame(internal) +if len(psdf._internal.column_labels) > 0: + +stat_exprs = [] +for label in psdf._internal.column_labels: +label_name = label[0] +tmp_count_column_name = verify_temp_column_name( +sdf, "__tmp_%s_count_col__" % label_name +) +psser = psdf._psser_for(label) +column = psser._dtype_op.nan_to_null(psser).spark.column +data_type = psser.spark.data_type + +if isinstance(data_type, IntegralType): + stat_exprs.append(F.product(column).cast(data_type).alias(label_name)) +else: +stat_exprs.append(F.product(column).alias(label_name)) + +if min_count > 0: + stat_exprs.append(F.count(column).alias(tmp_count_column_name)) + +sdf = sdf.groupby(*groupkey_names).agg(*stat_exprs) Review Comment: Can we simplify the logic like this? ```suggestion for label in psdf._internal.column_labels: psser = psdf._psser_for(label) scol = psser._dtype_op.nan_to_null(psser).spark.column prod_scol = F.product(scol) if min_count > 0: prod_scol = F.when(F.count(scol) < min_count, F.lit(None)).otherwise(F.product(scol)) else: prod_scol = F.product(scol) stat_exprs.append( prod_scol.alias(psser._internal.data_spark_column_names[0]) ) sdf = sdf.groupby(*groupkey_names).agg(*stat_exprs) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
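[Editorial illustration for the `prod`/`min_count` discussion above] A small plain-pandas sketch, mirroring the values in the docstring being reviewed, of the semantics the pandas-on-Spark implementation is expected to follow; it is only a reference for pandas behavior, not code from the PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [np.nan, 2, 3, 4, 5]})

# Without min_count, NaNs are skipped: A=1 -> 2*4 = 8.0, A=2 -> 3*5 = 15.0
print(df.groupby("A")["B"].prod())

# With min_count=3, both groups have only 2 non-NA values, so both results become NaN.
print(df.groupby("A")["B"].prod(min_count=3))
```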
[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on code in PR #37679: URL: https://github.com/apache/spark/pull/37679#discussion_r977124092 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala: ## @@ -43,7 +44,7 @@ class V2SessionCatalog(catalog: SessionCatalog) extends TableCatalog with FunctionCatalog with SupportsNamespaces with SQLConfHelper { import V2SessionCatalog._ - override val defaultNamespace: Array[String] = Array("default") + override val defaultNamespace: Array[String] = Array(SQLConf.get.defaultDatabase) Review Comment: `catalog.defaultDatabase`? It's already available in `SessionCatalog` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] YetiCuzMountain commented on pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups
YetiCuzMountain commented on PR #37280: URL: https://github.com/apache/spark/pull/37280#issuecomment-1254423954 I found a bug. When you run SHOW CREATE TABLE, the DEFAULT clause does not appear on some fields even if it was defined. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #35391: [SPARK-38098][PYTHON] Add support for ArrayType of nested StructType to arrow-based conversion
HyukjinKwon commented on PR #35391: URL: https://github.com/apache/spark/pull/35391#issuecomment-1254416308 Alright, let's merge this in once the tests pass. @LucaCanali mind fixing the lint failure? ``` starting mypy annotations test... annotations failed mypy checks: python/pyspark/sql/tests/test_dataframe.py:1113: error: Redundant cast to "str" [redundant-cast] Found 1 error in 1 file (checked 341 source files) 1 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
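[Editorial illustration for the mypy failure above] A generic sketch of the `redundant-cast` error and its usual fix; this is not the actual code at line 1113 of test_dataframe.py, only an assumed example of the pattern:

```python
from typing import cast

def get_name() -> str:
    return "spark"

# With warn_redundant_casts enabled, mypy flags a cast to the type the
# expression already has: error: Redundant cast to "str" [redundant-cast]
name = cast(str, get_name())

# The usual fix is to drop the cast entirely.
name = get_name()
```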
[GitHub] [spark] itholic commented on pull request #37955: [SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
itholic commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1254410609 Thanks for the comments! Let me investigate the test failures and make an umbrella ticket if there are many of them. If there are only a few failures, let me handle them in this PR at once. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #37955: [SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
zhengruifeng commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1254400059 > > what about also changing `dev/requirements.txt` > > Maybe we don't need to set an upper bound for pandas in `dev/requirements.txt` because `pip install -r dev/requirements.txt` always install the latest pandas when the version is not specified?? I was hit several times by conflicts caused by the version differences between CI and `dev/requirements.txt`, but in the meantime it keeps us aware of dependency updates, so it's fine to leave it alone here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on pull request #37955: [SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
Yikun commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1254396596 Generally speaking, almost every pandas upgrade will cause CI to fail. For the stability of CI, our strategy has been to use `<=` to pin a specific version and upgrade to a new version manually. @itholic FYI, these two test cases fail due to: - python/pyspark/pandas/tests/indexes/test_category.py.test_append: https://github.com/pandas-dev/pandas/commit/c7b470c3e13f99ce990e23b2a311d3a2c633499c - python/pyspark/pandas/tests/indexes/test_base.py.test_to_frame: https://github.com/pandas-dev/pandas/commit/7dbfe9f95a31aa01dc8288350e785f04a2abf6f0 I haven't taken a deeper look; you could do more investigation and fix the code or fix the tests, just like SPARK-38819. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ukby1234 opened a new pull request, #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
ukby1234 opened a new pull request, #37960: URL: https://github.com/apache/spark/pull/37960 From the bug description, it looks like the fallback storage doesn't read the stream fully, which then causes `org.apache.spark.shuffle.FetchFailedException: Decompression error: Corrupted block detected`. This is an attempt to fix it by reading the underlying stream fully. ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
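[Editorial illustration of the read-fully pattern the fix relies on] A single `read` call on a stream may legitimately return fewer bytes than requested, so the caller has to loop until the buffer is filled. The sketch below is a language-agnostic illustration in Python, not the actual Scala change in FallbackStorage:

```python
import io
from typing import BinaryIO

def read_fully(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, looping over possibly-short reads."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # EOF before n bytes were available
            raise EOFError(f"expected {n} bytes, got only {n - len(chunk or b'') - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Example usage: a plain read() could return a short buffer on some sources,
# while read_fully keeps going until all requested bytes are consumed.
buf = io.BytesIO(b"0123456789")
print(read_fully(buf, 10))  # b'0123456789'
```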
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r977100211 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager) def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( _.containsPattern(UNPIVOT), ruleId) { - // once children and ids are resolved, we can determine values, if non were given - case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty => -up.copy(values = up.child.output.diff(up.ids)) + // once children are resolved, we can determine values from ids and vice versa + // if only either is given + case up: Unpivot if up.childrenResolved && +up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty => +up.copy(values = + Some( +up.child.output.diff(up.ids.get.flatMap(_.references)) Review Comment: I checked with @HyukjinKwon: even the pandas melt/unpivot API supports only columns, not expressions. Let's only allow `Attribute` for id and value columns, for both the SQL and Scala APIs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
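[Editorial illustration of the melt/unpivot behavior mentioned above] A small pandas sketch showing that the id and value parameters take column names rather than expressions; the data is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "a": [3, 4], "b": [5, 6]})

# melt accepts column *names* for id_vars/value_vars; an expression such as a + b
# would have to be materialized into a column first before it could be unpivoted.
print(df.melt(id_vars=["id"], value_vars=["a", "b"], var_name="var", value_name="val"))
```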
[GitHub] [spark] LuciferYang commented on pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios
LuciferYang commented on PR #37938: URL: https://github.com/apache/spark/pull/37938#issuecomment-1254384287 > Thank you, @LuciferYang . Please make two backporting PRs. OK, I will finish this work today -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37799: [SPARK-40331][DOCS] Recommend use Java 11/17 as the runtime environment of Spark
LuciferYang commented on PR #37799: URL: https://github.com/apache/spark/pull/37799#issuecomment-1254383080 > The performance Issue fixed by [JDK-8159720](https://bugs.openjdk.org/browse/JDK-8159720). Seems to be fixed only in Java 9+? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at
HyukjinKwon commented on PR #37959: URL: https://github.com/apache/spark/pull/37959#issuecomment-1254373373 @gengliangwang mind taking a quick look please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37855: [SPARK-40407][SQL] Fix the potential data skew caused by df.repartition
cloud-fan commented on PR #37855: URL: https://github.com/apache/spark/pull/37855#issuecomment-1254371742 @wbo4958 Can you add comments as I asked in https://github.com/apache/spark/pull/37855/files#r975993118 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yabola commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
yabola commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1254370238 @mridulm @tgravescs If you have time, can you also help take a look~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977086276 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -149,13 +149,10 @@ class CSVOptions( val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) /** - * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type) - * if schema inference is enabled. When being used with user-provided schema, tries to parse - * timestamp values as dates if the values do not conform to the timestamp formatter before - * falling back to the backward compatible parsing - the parsed values will be cast to timestamp - * afterwards. + * Infer columns with all valid date entries as date type (otherwise inferred as string type) Review Comment: ```suggestion * Infer columns with all valid date entries as date type (otherwise inferred as string or timestamp type) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977086043 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return true if strings of given date format can be parsed as timestamps + * 1. If user provides timestamp format, we will parse strings as timestamps using + * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed + * as timestamp type in this case + * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which + * is more lenient and can parse strings of some date formats as timestamps. + */ + private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { +if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || + (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { + LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) Review Comment: Do we really need to cover these corner cases? We can just say that we can only parse date as timestamp if neither timestamp pattern nor date pattern is specified. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
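[Editorial illustration of the scenario under discussion] A hedged PySpark sketch of a CSV column mixing date-only and full timestamp strings, read with schema inference and no explicit `dateFormat`/`timestampFormat`, so the lenient default parsers apply. The file path is a temporary one created for the example, and whether the column is inferred as timestamp or string is exactly what the rules above decide:

```python
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a tiny CSV whose 'ts' column mixes a date string and a timestamp string.
path = os.path.join(tempfile.mkdtemp(), "mixed.csv")
with open(path, "w") as f:
    f.write("ts\n2022-01-01\n2022-01-02 12:34:56\n")

# No dateFormat/timestampFormat option is set here.
df = spark.read.option("header", True).option("inferSchema", True).csv(path)
df.printSchema()
df.show(truncate=False)
```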
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977085759 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType Review Comment: let's enrich the comment a bit more ``` This only happens when the timestamp pattern is not specified, as the default timestamp parser is very lenient and can parse date string as well. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #37955: [SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
itholic commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1254369142 Yeah, I think we should make the test pass since pandas-on-Spark should follow the behavior of latest pandas. Let me take a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977085297 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType Review Comment: nvm, I got it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37948: [SPARK-40327][PS][DOCS] Add resampling to API references
HyukjinKwon commented on code in PR #37948: URL: https://github.com/apache/spark/pull/37948#discussion_r977083499 ## python/pyspark/pandas/resample.py: ## @@ -491,3 +737,37 @@ def __getattr__(self, item: str) -> Any: def _handle_output(self, psdf: DataFrame) -> Series: return first_series(psdf).rename(self._psser.name) + + +def _test() -> None: +import os +import doctest +import sys +import numpy +from datetime import datetime +from pyspark.sql import SparkSession +import pyspark.pandas.resample + +os.chdir(os.environ["SPARK_HOME"]) + +globs = pyspark.pandas.resample.__dict__.copy() +globs["np"] = numpy Review Comment: Let's probably import this within the example. `datetime` too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
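A sketch of what the suggestion could look like in the resample doctests: numpy and datetime are imported inside the example itself, so nothing needs to be injected into `globs`. The function name and example data below are hypothetical.

```python
def _resample_doctest_sketch() -> None:
    """
    Hypothetical doctest illustrating the review suggestion: the imports live
    inside the example rather than being added to ``globs``.

    >>> import numpy as np
    >>> from datetime import datetime
    >>> import pyspark.pandas as ps
    >>> idx = [datetime(2022, 1, 1), datetime(2022, 1, 2), datetime(2022, 1, 3)]
    >>> psdf = ps.DataFrame({"A": np.arange(3.0)}, index=idx)
    >>> psdf.resample("3D").sum()  # doctest: +SKIP
    """
```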
[GitHub] [spark] HyukjinKwon opened a new pull request, #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at
HyukjinKwon opened a new pull request, #37959: URL: https://github.com/apache/spark/pull/37959 ### What changes were proposed in this pull request? This PR removes non-ANSI compliant example in element_at. ### Why are the changes needed? ANSI build fails to run the example. https://github.com/apache/spark/actions/runs/3094607589/jobs/5008176959 ``` File "/__w/spark/spark/python/pyspark/sql/functions.py", line 6599, in pyspark.sql.functions.element_at at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2897) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2836) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2825) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2258) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2277) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2302) at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1020) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) at org.apache.spark.rdd.RDD.collect(RDD.scala:1019) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:424) at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3925) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4095) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:512) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4093) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4093) at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3922) at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX_IN_ELEMENT_AT] The index -4 is out of bounds. The array has 3 elements. Use `try_element_at` to tolerate accessing element at invalid index and return NULL instead. 
If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. at org.apache.spark.sql.errors.QueryExecutionErrors$.invalidElementAtIndexError(QueryExecutionErrors.scala:264) at org.apache.spark.sql.errors.QueryExecutionErrors.invalidElementAtIndexError(QueryExecutionErrors.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:43) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:889) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:889) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at
[GitHub] [spark] github-actions[bot] commented on pull request #36485: [SPARK-39128][SQL][HIVE] Log cost time for getting FileStatus in HadoopTableReader
github-actions[bot] commented on PR #36485: URL: https://github.com/apache/spark/pull/36485#issuecomment-1254360769 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36548: [SPARK-38470][CORE] Use error classes in org.apache.spark.partial
github-actions[bot] commented on PR #36548: URL: https://github.com/apache/spark/pull/36548#issuecomment-1254360723 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36495: [SPARK-39136][SQL] JDBCTable support table properties
github-actions[bot] commented on PR #36495: URL: https://github.com/apache/spark/pull/36495#issuecomment-1254360757 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36453: SPARK-39103: SparkContext.addFiles trigger backend exception if it tr…
github-actions[bot] commented on PR #36453: URL: https://github.com/apache/spark/pull/36453#issuecomment-1254360800 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36483: [SPARK-39126][SQL] After eliminating join to one side, that side should take advantage of LocalShuffleRead optimization
github-actions[bot] commented on PR #36483: URL: https://github.com/apache/spark/pull/36483#issuecomment-1254360786 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36540: [SPARK-38466][CORE] Use error classes in org.apache.spark.mapred
github-actions[bot] commented on PR #36540: URL: https://github.com/apache/spark/pull/36540#issuecomment-1254360743 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36613: [WIP][SPARK-30983] Support typed select in Datasets up to the max tuple size
github-actions[bot] commented on PR #36613: URL: https://github.com/apache/spark/pull/36613#issuecomment-1254360704 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36551: [SPARK-38463][CORE] Use error classes in org.apache.spark.input
github-actions[bot] commented on PR #36551: URL: https://github.com/apache/spark/pull/36551#issuecomment-1254360716 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36665: [SPARK-39287][CORE] TaskSchedulerImpl should quickly ignore task finished event if its task was finished state.
github-actions[bot] commented on PR #36665: URL: https://github.com/apache/spark/pull/36665#issuecomment-1254360691 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36668: [SPARK-39291][CORE] Fetch blocks and open stream should not respond a closed channel
github-actions[bot] commented on PR #36668: URL: https://github.com/apache/spark/pull/36668#issuecomment-1254360676 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36678: [SPARK-39297][CORE][UI] bugfix: spark.ui.proxyBase contains proxy or history
github-actions[bot] commented on PR #36678: URL: https://github.com/apache/spark/pull/36678#issuecomment-1254360667 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36798: [SPARK-39408][SQL] Update the buildKeys for DynamicPruningSubquery.withNewPlan
github-actions[bot] commented on PR #36798: URL: https://github.com/apache/spark/pull/36798#issuecomment-1254360629 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36751: [WIP][SPARK-39366][CORE] Do not release write locks on task end.
github-actions[bot] commented on PR #36751: URL: https://github.com/apache/spark/pull/36751#issuecomment-1254360645 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36844: Update ExecutorClassLoader.scala
github-actions[bot] commented on PR #36844: URL: https://github.com/apache/spark/pull/36844#issuecomment-1254360617 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977079883 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType Review Comment: We want to have consistent behavior when timestamp format is not given. When `prefersDate=false`, a column with mixed date and timestamp could be inferred as timestamp if possible. Thus, we added the additional handling here to have similar behavior as above when `prefersDate=true`. Does this make sense? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
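From the user's side, the behavior under discussion can be probed by letting schema inference run over a column that mixes date and timestamp strings. The snippet below is only a sketch: the `prefersDate` option and the exact fallback to `StringType` come from this PR's discussion rather than a released API, so results may differ by Spark version and options.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single CSV column that mixes a date string with a timestamp string.
lines = spark.sparkContext.parallelize(["c0", "2022-09-21", "2022-09-21 10:30:00"])

df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(lines)
)

# With no timestampFormat set, the lenient default parser reads both values,
# so the column is expected to be inferred as a timestamp type; with a strict
# format that cannot parse the date rows, the PR falls back to StringType.
df.printSchema()
```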
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977077958 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType Review Comment: I don't quite understand the rationale here. why can't we directly return `Some(StringType)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #37799: [SPARK-40331][DOCS] Recommend use Java 11/17 as the runtime environment of Spark
wangyum commented on PR #37799: URL: https://github.com/apache/spark/pull/37799#issuecomment-1254347894 The performance issue was fixed by [JDK-8159720](https://bugs.openjdk.org/browse/JDK-8159720). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
sadikovi commented on PR #37933: URL: https://github.com/apache/spark/pull/37933#issuecomment-1254315080 @cloud-fan @HyukjinKwon Do you have any concerns or questions? IMHO, we can merge this PR; it seems that all of the questions have been addressed. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org