[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37964: [SPARK-40434][SS][PYTHON][FOLLOWUP] Address review comments
dongjoon-hyun commented on code in PR #37964: URL: https://github.com/apache/spark/pull/37964#discussion_r977211125 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -257,10 +257,10 @@ def applyInPandasWithState( user-defined state. The value of the state will be presented as a tuple, as well as the update should be performed with the tuple. The corresponding Python types for :class:DataType are supported. Please refer to the page -https://spark.apache.org/docs/latest/sql-ref-datatypes.html (python tab). +https://spark.apache.org/docs/latest/sql-ref-datatypes.html (Python tab). -The size of each DataFrame in both the input and output can be arbitrary. The number of -DataFrames in both the input and output can also be arbitrary. +The size of each `pandas.DataFrame` in both the input and output can be arbitrary. The +number `pandas.DataFrame` in both the input and output can also be arbitrary. Review Comment: `number` -> `number of`?
[GitHub] [spark] zhengruifeng commented on pull request #37953: [SPARK-40510][PS] Implement `ddof` in `Series.cov`
zhengruifeng commented on PR #37953: URL: https://github.com/apache/spark/pull/37953#issuecomment-1254537682 also cc @HyukjinKwon
[GitHub] [spark] beliefer commented on a diff in pull request #37915: [SPARK-40465][SQL] Refactor Decimal so as we can use other underlying implementation
beliefer commented on code in PR #37915: URL: https://github.com/apache/spark/pull/37915#discussion_r977203717 ## sql/catalyst/src/test/scala/org/apache/spark/sql/DecimalBenchmark.scala: ## @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.encoders.RowEncoder +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection +import org.apache.spark.sql.types.{Decimal, DecimalType, StructType} + +/** + * Benchmark for the previous Decimal (without proxy mode) vs refactored Decimal (proxy mode). + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class --jars + * 2. build/sbt "catalyst/Test/runMain " + * 3. generate result: + * SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "catalyst/Test/runMain " + * Results will be written to "benchmarks/DecimalBenchmark-results.txt". Review Comment: OK
[GitHub] [spark] dongjoon-hyun closed pull request #37958: [SPARK-40522][BUILD] Upgrade `kafka` to 3.2.3
dongjoon-hyun closed pull request #37958: [SPARK-40522][BUILD] Upgrade `kafka` to 3.2.3 URL: https://github.com/apache/spark/pull/37958
[GitHub] [spark] LuciferYang commented on pull request #37941: [SPARK-40501][SQL] Enhance 'SpecialLimits' to support project(..., limit(...))
LuciferYang commented on PR #37941: URL: https://github.com/apache/spark/pull/37941#issuecomment-1254535754 cc @wangyum @cloud-fan FYI
[GitHub] [spark] zhengruifeng commented on pull request #37948: [SPARK-40327][PS][DOCS] Add resampling to API references
zhengruifeng commented on PR #37948: URL: https://github.com/apache/spark/pull/37948#issuecomment-1254535505 Merged into master, thank you guys!
[GitHub] [spark] zhengruifeng closed pull request #37948: [SPARK-40327][PS][DOCS] Add resampling to API references
zhengruifeng closed pull request #37948: [SPARK-40327][PS][DOCS] Add resampling to API references URL: https://github.com/apache/spark/pull/37948
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
dongjoon-hyun commented on code in PR #37960: URL: https://github.com/apache/spark/pull/37960#discussion_r977201737 ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -107,6 +106,51 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext { FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) } + test("fallback storage APIs - readFully") { Review Comment: +1 for @yaooqinn 's comment.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
dongjoon-hyun commented on code in PR #37960: URL: https://github.com/apache/spark/pull/37960#discussion_r977201645 ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -18,14 +18,11 @@ package org.apache.spark.storage import java.io.{DataOutputStream, File, FileOutputStream, IOException} import java.nio.file.Files - import scala.concurrent.duration._ - import org.apache.hadoop.conf.Configuration import org.mockito.{ArgumentMatchers => mc} import org.mockito.Mockito.{mock, never, verify, when} import org.scalatest.concurrent.Eventually.{eventually, interval, timeout} - import org.apache.spark.{LocalSparkContext, SparkConf, SparkContext, SparkFunSuite, TestUtils} Review Comment: +1 for @mridulm 's comment.
[GitHub] [spark] dongjoon-hyun commented on pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
dongjoon-hyun commented on PR #37960: URL: https://github.com/apache/spark/pull/37960#issuecomment-1254533780 Thank you, @ukby1234 and @mridulm .
[GitHub] [spark] dongjoon-hyun commented on pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
dongjoon-hyun commented on PR #37963: URL: https://github.com/apache/spark/pull/37963#issuecomment-1254532169 cc @HyukjinKwon for the pycodestyle question.
[GitHub] [spark] HeartSaVioR commented on pull request #37863: [WIP][DO-NOT-MERGE] Reference PR for flatMapGroupsWithState in PySpark
HeartSaVioR commented on PR #37863: URL: https://github.com/apache/spark/pull/37863#issuecomment-1254527601 Closing this since only the test suite PR remains.
[GitHub] [spark] HeartSaVioR closed pull request #37863: [WIP][DO-NOT-MERGE] Reference PR for flatMapGroupsWithState in PySpark
HeartSaVioR closed pull request #37863: [WIP][DO-NOT-MERGE] Reference PR for flatMapGroupsWithState in PySpark URL: https://github.com/apache/spark/pull/37863
[GitHub] [spark] HeartSaVioR commented on pull request #37964: [SPARK-40434][SS][PYTHON][FOLLOWUP] Address review comments
HeartSaVioR commented on PR #37964: URL: https://github.com/apache/spark/pull/37964#issuecomment-1254495026 cc. @HyukjinKwon
[GitHub] [spark] HeartSaVioR opened a new pull request, #37964: [SPARK-40434][SS][PYTHON][FOLLOWUP] Address review comments
HeartSaVioR opened a new pull request, #37964: URL: https://github.com/apache/spark/pull/37964 ### What changes were proposed in this pull request? This PR addresses the review comments from the last round of review from @HyukjinKwon in #37893. ### Why are the changes needed? Better documentation and removing unnecessary code. ### Does this PR introduce _any_ user-facing change? Slight documentation change. ### How was this patch tested? N/A
[GitHub] [spark] cloud-fan commented on a diff in pull request #37915: [SPARK-40465][SQL] Refactor Decimal so as we can use other underlying implementation
cloud-fan commented on code in PR #37915: URL: https://github.com/apache/spark/pull/37915#discussion_r977165972 ## sql/catalyst/src/test/scala/org/apache/spark/sql/DecimalBenchmark.scala: ## @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.encoders.RowEncoder +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection +import org.apache.spark.sql.types.{Decimal, DecimalType, StructType} + +/** + * Benchmark for the previous Decimal (without proxy mode) vs refactored Decimal (proxy mode). + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class --jars + * 2. build/sbt "catalyst/Test/runMain " + * 3. generate result: + * SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "catalyst/Test/runMain " + * Results will be written to "benchmarks/DecimalBenchmark-results.txt". Review Comment: can you run it to generate result and commit the result?
[GitHub] [spark] panbingkun closed pull request #37961: [WIP][SPARK-40526][BUILD] Upgrade Scala to 2.13.9
panbingkun closed pull request #37961: [WIP][SPARK-40526][BUILD] Upgrade Scala to 2.13.9 URL: https://github.com/apache/spark/pull/37961
[GitHub] [spark] panbingkun commented on pull request #37961: [WIP][SPARK-40526][BUILD] Upgrade Scala to 2.13.9
panbingkun commented on PR #37961: URL: https://github.com/apache/spark/pull/37961#issuecomment-1254484226 > @panbingkun there is an incompatible [issue](https://github.com/scala/bug/issues/12641#issuecomment-1252344400), so we already decided to skip Scala 2.13.9 > > [#37943 (comment)](https://github.com/apache/spark/pull/37943#discussion_r975658409) OK, I will close it!
[GitHub] [spark] LuciferYang commented on pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on PR #37963: URL: https://github.com/apache/spark/pull/37963#issuecomment-1254484021 > Python linter failed, re-running it > > ``` > The python3.9 -m black command was not found. Skipping black checks for now. > > starting pycodestyle test... > pycodestyle checks failed: > ./python/pyspark/broadcast.py:102:23: E275 missing whitespace after keyword > ./python/pyspark/ml/tests/test_tuning.py:203:23: E275 missing whitespace after keyword > ./python/pyspark/sql/readwriter.py:662:60: E275 missing whitespace after keyword > ./python/pyspark/sql/readwriter.py:694:60: E275 missing whitespace after keyword > ./python/pyspark/sql/tests/test_context.py:166:19: E275 missing whitespace after keyword > ./python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py:512:15: E275 missing whitespace after keyword > ./python/pyspark/sql/pandas/conversion.py:174:20: E275 missing whitespace after keyw > ``` Hmm... this seems like a known issue? https://github.com/apache/spark/actions/runs/3069718011/jobs/4958686170
[GitHub] [spark] LuciferYang commented on pull request #37961: [WIP][SPARK-40526][BUILD] Upgrade Scala to 2.13.9
LuciferYang commented on PR #37961: URL: https://github.com/apache/spark/pull/37961#issuecomment-1254482266 @panbingkun there is an incompatible [issue](https://github.com/scala/bug/issues/12641#issuecomment-1252344400), so we have already decided to skip Scala 2.13.9: https://github.com/apache/spark/pull/37943#discussion_r975658409
[GitHub] [spark] LuciferYang commented on pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on PR #37963: URL: https://github.com/apache/spark/pull/37963#issuecomment-1254480643 Python linter failed, re-running it:
```
The python3.9 -m black command was not found. Skipping black checks for now.

starting pycodestyle test...
pycodestyle checks failed:
./python/pyspark/broadcast.py:102:23: E275 missing whitespace after keyword
./python/pyspark/ml/tests/test_tuning.py:203:23: E275 missing whitespace after keyword
./python/pyspark/sql/readwriter.py:662:60: E275 missing whitespace after keyword
./python/pyspark/sql/readwriter.py:694:60: E275 missing whitespace after keyword
./python/pyspark/sql/tests/test_context.py:166:19: E275 missing whitespace after keyword
./python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py:512:15: E275 missing whitespace after keyword
./python/pyspark/sql/pandas/conversion.py:174:20: E275 missing whitespace after keyw
```
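For context, E275 is pycodestyle's "missing whitespace after keyword" check. A minimal, hypothetical reproduction (not the flagged Spark code) looks like this:

```python
# E275 is reported when a keyword is immediately followed by an opening parenthesis.
flag = False

if not(flag):   # pycodestyle: E275 missing whitespace after keyword
    print("flagged")

if not flag:    # fixed: whitespace separates the keyword from its operand
    print("clean")
```

Each entry in the log above points to one such occurrence at the listed file, line, and column.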
[GitHub] [spark] LuciferYang commented on pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios
LuciferYang commented on PR #37938: URL: https://github.com/apache/spark/pull/37938#issuecomment-1254479116 > Thank you, @LuciferYang . Please make two backporting PRs. Done: - branch 3.3: https://github.com/apache/spark/pull/37962 - branch 3.2: https://github.com/apache/spark/pull/37963
[GitHub] [spark] HeartSaVioR commented on pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HeartSaVioR commented on PR #37893: URL: https://github.com/apache/spark/pull/37893#issuecomment-1254478824 Thanks @HyukjinKwon and @alex-balikov for the thoughtful reviews and for merging! I'll handle the latest comments as a follow-up PR.
[GitHub] [spark] HyukjinKwon closed pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HyukjinKwon closed pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark URL: https://github.com/apache/spark/pull/37893
[GitHub] [spark] HyukjinKwon commented on pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HyukjinKwon commented on PR #37893: URL: https://github.com/apache/spark/pull/37893#issuecomment-1254477957 My comments are just nits. I will merge this in first to move forward. Merged to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HyukjinKwon commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r977150660 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -216,6 +218,125 @@ def applyInPandas( jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr()) return DataFrame(jdf, self.session) +def applyInPandasWithState( +self, +func: "PandasGroupedMapFunctionWithState", +outputStructType: Union[StructType, str], +stateStructType: Union[StructType, str], +outputMode: str, +timeoutConf: str, +) -> DataFrame: +""" +Applies the given function to each group of data, while maintaining a user-defined +per-group state. The result Dataset will represent the flattened record returned by the +function. + +For a streaming Dataset, the function will be invoked first for all input groups and then +for all timed out states where the input data is set to be empty. Updates to each group's +state will be saved across invocations. Review Comment: ```suggestion For a streaming :class:`DataFrame`, the function will be invoked first for all input groups and then for all timed out states where the input data is set to be empty. Updates to each group's state will be saved across invocations. ``` ## sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala: ## @@ -98,6 +98,16 @@ class ArrowWriter(val root: VectorSchemaRoot, fields: Array[ArrowFieldWriter]) { count += 1 } + def sizeInBytes(): Int = { Review Comment: I think we don't need `sizeInBytes` and `getSizeInBytes ` anymore ## python/pyspark/sql/pandas/group_ops.py: ## @@ -216,6 +218,125 @@ def applyInPandas( jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr()) return DataFrame(jdf, self.session) +def applyInPandasWithState( +self, +func: "PandasGroupedMapFunctionWithState", +outputStructType: Union[StructType, str], +stateStructType: Union[StructType, str], +outputMode: str, +timeoutConf: str, +) -> DataFrame: +""" +Applies the given function to each group of data, while maintaining a user-defined +per-group state. The result Dataset will represent the flattened record returned by the +function. + +For a streaming Dataset, the function will be invoked first for all input groups and then +for all timed out states where the input data is set to be empty. Updates to each group's +state will be saved across invocations. + +The function should take parameters (key, Iterator[`pandas.DataFrame`], state) and +return another Iterator[`pandas.DataFrame`]. The grouping key(s) will be passed as a tuple +of numpy data types, e.g., `numpy.int32` and `numpy.float64`. The state will be passed as +:class:`pyspark.sql.streaming.state.GroupState`. + +For each group, all columns are passed together as `pandas.DataFrame` to the user-function, +and the returned `pandas.DataFrame` across all invocations are combined as a +:class:`DataFrame`. Note that the user function should not make a guess of the number of +elements in the iterator. To process all data, the user function needs to iterate all +elements and process them. On the other hand, the user function is not strictly required to +iterate through all elements in the iterator if it intends to read a part of data. + +The `outputStructType` should be a :class:`StructType` describing the schema of all +elements in the returned value, `pandas.DataFrame`. The column labels of all elements in +returned `pandas.DataFrame` must either match the field names in the defined schema if +specified as strings, or match the field data types by position if not strings, +e.g. integer indices. 
+ +The `stateStructType` should be :class:`StructType` describing the schema of the +user-defined state. The value of the state will be presented as a tuple, as well as the +update should be performed with the tuple. The corresponding Python types for +:class:DataType are supported. Please refer to the page +https://spark.apache.org/docs/latest/sql-ref-datatypes.html (python tab). + +The size of each DataFrame in both the input and output can be arbitrary. The number of +DataFrames in both the input and output can also be arbitrary. + +.. versionadded:: 3.4.0 + +Parameters +-- +func : function +a Python native function to be called on every group. It should take parameters +(key, Iterator[`pandas.DataFrame`], state) and return Iterator[`pandas.DataFrame`]. +Note that the type of the key
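To ground the contract spelled out in this docstring, the following is a minimal, self-contained sketch of calling `applyInPandasWithState` on a streaming DataFrame. The `rate` source, the `id`/`count` column names, and the counting logic are invented for illustration; only the function signature, the tuple-based state access, and the `outputStructType`/`stateStructType`/`outputMode`/`timeoutConf` arguments come from the API described above.

```python
# A minimal sketch of the API discussed above: count rows per key across micro-batches,
# keeping the running total in user-defined state. Source, schema strings and the
# counting logic are illustrative only, not taken from the PR.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.streaming.state import GroupStateTimeout

spark = SparkSession.builder.getOrCreate()

# Any streaming DataFrame with an "id" column would do; `rate` is just a built-in test source.
events = spark.readStream.format("rate").load().withColumn("id", expr("value % 10"))

def count_per_id(key, pdf_iter, state):
    # key: tuple of grouping values, e.g. (3,); state: pyspark.sql.streaming.state.GroupState
    running = state.get[0] if state.exists else 0
    for pdf in pdf_iter:              # an arbitrary number of pandas.DataFrames per group
        running += len(pdf)
    state.update((running,))          # state is read and written as a tuple
    yield pd.DataFrame({"id": [key[0]], "count": [running]})

counts = events.groupBy("id").applyInPandasWithState(
    count_per_id,
    outputStructType="id long, count long",
    stateStructType="count long",
    outputMode="update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

query = counts.writeStream.format("console").outputMode("update").start()
```

Per the description above, when a timeout is configured the same function is also invoked for timed-out state with empty input; with `GroupStateTimeout.NoTimeout` that path never fires.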
[GitHub] [spark] wangyum commented on pull request #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
wangyum commented on PR #37930: URL: https://github.com/apache/spark/pull/37930#issuecomment-1254466657 Merged to master.
[GitHub] [spark] wangyum closed pull request #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
wangyum closed pull request #37930: [SPARK-40487][SQL] Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel URL: https://github.com/apache/spark/pull/37930
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977144993 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return true if strings of given date format can be parsed as timestamps + * 1. If user provides timestamp format, we will parse strings as timestamps using + * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed + * as timestamp type in this case + * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which + * is more lenient and can parse strings of some date formats as timestamps. + */ + private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { +if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || + (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { + LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) Review Comment: This is a behavior change relative to the Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as a timestamp type when no timestamp pattern is specified.
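As a concrete illustration of the inference rule being discussed, here is a small, self-contained sketch. The file contents, paths, and option values are invented for the example, and the exact inferred types depend on the Spark version and the CSV date/timestamp options in effect:

```python
# Illustrative sketch only: a single CSV column mixing a date string and a timestamp string,
# read once with an explicit timestampFormat (strict parsing) and once without it (lenient
# default parser). The inferred column type in each case is what the change above is about.
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

path = os.path.join(tempfile.mkdtemp(), "mixed.csv")
with open(path, "w") as f:
    f.write("2022-01-01\n")            # date-shaped value
    f.write("2022-01-01 12:34:56\n")   # timestamp-shaped value

# With a user-provided timestampFormat the date row cannot be parsed as a timestamp,
# so per the rule above the mixed column should fall back to StringType.
with_format = (
    spark.read.option("inferSchema", True)
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
    .csv(path)
)
with_format.printSchema()

# Without a timestampFormat the lenient default formatter may accept the date string,
# so the column can still be inferred as a timestamp type.
without_format = spark.read.option("inferSchema", True).csv(path)
without_format.printSchema()
```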
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977144993 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return true if strings of given date format can be parsed as timestamps + * 1. If user provides timestamp format, we will parse strings as timestamps using + * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed + * as timestamp type in this case + * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which + * is more lenient and can parse strings of some date formats as timestamps. + */ + private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { +if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || + (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { + LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) Review Comment: This is a behavior change in terms of the Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as a timestamp when no timestamp pattern is specified.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37963: URL: https://github.com/apache/spark/pull/37963#discussion_r977145998 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -208,6 +213,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. `com.google.common.io.Files.createTempDir` is used instead because SPARK-39102 is not in Spark 3.2, so `JavaUtils` has no `createTempDir` method there.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37962: URL: https://github.com/apache/spark/pull/37962#discussion_r977145784 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -222,6 +227,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. `com.google.common.io.Files.createTempDir` is used instead because SPARK-39102 is not in Spark 3.3, so `JavaUtils` has no `createTempDir` method there.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37963: URL: https://github.com/apache/spark/pull/37963#discussion_r977145998 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -208,6 +213,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. com.google.common.io.Files.createTempDir is used instead because [SPARK-39102](https://issues.apache.org/jira/browse/SPARK-39102) is not in Spark 3.2, so JavaUtils has no `createTempDir` method there.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37963: URL: https://github.com/apache/spark/pull/37963#discussion_r977145998 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -208,6 +213,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. com.google.common.io.Files.createTempDir is used instead because [SPARK-39102](https://issues.apache.org/jira/browse/SPARK-39102) is not in Spark 3.2, so JavaUtils has no createTempDir there.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang commented on code in PR #37962: URL: https://github.com/apache/spark/pull/37962#discussion_r977145784 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -222,6 +227,10 @@ protected void serviceInit(Configuration externalConf) throws Exception { boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE); +if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) { + _recoveryPath = new Path(Files.createTempDir().toURI()); Review Comment: There is a little difference here. `com.google.common.io.Files.createTempDir` is used instead because SPARK-39102 is not in Spark 3.3, so `JavaUtils` has no `createTempDir` there.
[GitHub] [spark] HeartSaVioR commented on pull request #37894: [DO-NOT-MERGE][SPARK-40435][SS][PYTHON] Add test suites for applyInPandasWithState in PySpark
HeartSaVioR commented on PR #37894: URL: https://github.com/apache/spark/pull/37894#issuecomment-1254457970 cc. @HyukjinKwon @alex-balikov Could you please take a look at the new test suites? Thanks in advance! It's only the last commit on the list.
[GitHub] [spark] yaooqinn commented on a diff in pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
yaooqinn commented on code in PR #37960: URL: https://github.com/apache/spark/pull/37960#discussion_r977145034 ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -107,6 +106,51 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext { FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) } + test("fallback storage APIs - readFully") { Review Comment: SPARK-39200: fallback storage APIs - readFully
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977144993 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return true if strings of given date format can be parsed as timestamps + * 1. If user provides timestamp format, we will parse strings as timestamps using + * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed + * as timestamp type in this case + * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which + * is more lenient and can parse strings of some date formats as timestamps. + */ + private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { +if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || + (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { + LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) Review Comment: There is no easy way to go this behavior as well. Because the timestamp parser
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1254457075 I am not sure I follow what the code snippet is trying to do. Changing the code to:
```
private def receiveLoop() {
  Executors.newSingleThreadExecutor(threadFactory).execute(new MessageLoop)
}
```
does cause the `uncaughtException` to be invoked, as expected. If we catch the `Throwable` in `receiveLoop` before that, then obviously it is no longer an uncaught exception, and so `MyExceptionHandler.uncaughtException` won't be invoked.
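The underlying point, that a handler for uncaught exceptions only runs when nothing catches the error first, can be sketched outside the JVM too. Below is a small Python analogy using `threading.excepthook` (Python 3.8+); it is not Spark's Scala executor code, just an illustration of the same principle:

```python
# Python analogy of the comment above: the installed hook sees an exception only when the
# thread body lets it escape; once the body catches it, the hook is never invoked.
import threading

def on_uncaught(args):
    # args carries exc_type, exc_value, exc_traceback and the thread that raised
    print(f"uncaught in {args.thread.name}: {args.exc_value!r}")

threading.excepthook = on_uncaught  # roughly analogous to a JVM UncaughtExceptionHandler

def loop_lets_it_escape():
    raise RuntimeError("fatal error during initialization")  # reaches on_uncaught

def loop_catches_it():
    try:
        raise RuntimeError("fatal error during initialization")
    except Exception as exc:
        print(f"swallowed before the hook could see it: {exc!r}")

for target in (loop_lets_it_escape, loop_catches_it):
    t = threading.Thread(target=target, name=target.__name__)
    t.start()
    t.join()
```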
[GitHub] [spark] LuciferYang opened a new pull request, #37963: [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang opened a new pull request, #37963: URL: https://github.com/apache/spark/pull/37963 ### What changes were proposed in this pull request? After SPARK-17321, `YarnShuffleService` will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` not set `YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the configuration is false, so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data. This pr aims to let `YarnShuffleIntegrationSuite` restart the verification of registeredExecFile reload scenarios, to achieve this goal, this pr make the following changes: 1. Add a new un-document configuration `spark.yarn.shuffle.testing` to `YarnShuffleService`, and Initialize `_recoveryPath` when `_recoveryPath == null && spark.yarn.shuffle.testing == true`. 2. Only set `spark.yarn.shuffle.testing = true` in `YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` is not null to ensure that registeredExecFile reload scenarios will be verified. ### Why are the changes needed? Fix registeredExecFile reload test scenarios. Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true? This configuration has been tried **Hadoop 3.3.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-3 ``` ``` 2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException 2022-09-10T11:44:42.1715234Z at java.net.URLClassLoader.findClass(URLClassLoader.java:387) 2022-09-10T11:44:42.1719347Z at java.lang.ClassLoader.loadClass(ClassLoader.java:419) 2022-09-10T11:44:42.1723090Z at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 2022-09-10T11:44:42.1726759Z at java.lang.ClassLoader.loadClass(ClassLoader.java:352) 2022-09-10T11:44:42.1731028Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313) 2022-09-10T11:44:42.1735424Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370) 2022-09-10T11:44:42.1740303Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1745576Z at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) 2022-09-10T11:44:42.1828858Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1829712Z at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) 2022-09-10T11:44:42.1830633Z at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327) 2022-09-10T11:44:42.1831431Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1832279Z at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112) ``` **Hadoop 2.7.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-2 ``` ``` YarnShuffleIntegrationWithLevelDBBackendSuite: org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite *** ABORTED *** java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. 
Check the setting of yarn.nodemanager.address at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... Run completed in 3 seconds, 992 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` From the above test,
[GitHub] [spark] LuciferYang opened a new pull request, #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios
LuciferYang opened a new pull request, #37962: URL: https://github.com/apache/spark/pull/37962 ### What changes were proposed in this pull request? After SPARK-17321, `YarnShuffleService` will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` not set `YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the configuration is false, so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data. This pr aims to let `YarnShuffleIntegrationSuite` restart the verification of registeredExecFile reload scenarios, to achieve this goal, this pr make the following changes: 1. Add a new un-document configuration `spark.yarn.shuffle.testing` to `YarnShuffleService`, and Initialize `_recoveryPath` when `_recoveryPath == null && spark.yarn.shuffle.testing == true`. 2. Only set `spark.yarn.shuffle.testing = true` in `YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` is not null to ensure that registeredExecFile reload scenarios will be verified. ### Why are the changes needed? Fix registeredExecFile reload test scenarios. Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true? This configuration has been tried **Hadoop 3.3.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-3 ``` ``` 2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException 2022-09-10T11:44:42.1715234Z at java.net.URLClassLoader.findClass(URLClassLoader.java:387) 2022-09-10T11:44:42.1719347Z at java.lang.ClassLoader.loadClass(ClassLoader.java:419) 2022-09-10T11:44:42.1723090Z at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 2022-09-10T11:44:42.1726759Z at java.lang.ClassLoader.loadClass(ClassLoader.java:352) 2022-09-10T11:44:42.1731028Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313) 2022-09-10T11:44:42.1735424Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370) 2022-09-10T11:44:42.1740303Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1745576Z at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) 2022-09-10T11:44:42.1828858Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1829712Z at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) 2022-09-10T11:44:42.1830633Z at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327) 2022-09-10T11:44:42.1831431Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 2022-09-10T11:44:42.1832279Z at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112) ``` **Hadoop 2.7.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-2 ``` ``` YarnShuffleIntegrationWithLevelDBBackendSuite: org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite *** ABORTED *** java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. 
Check the setting of yarn.nodemanager.address at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... Run completed in 3 seconds, 992 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` From the above test,
[GitHub] [spark] beliefer commented on a diff in pull request #37915: [SPARK-40465][SQL] Refactor Decimal so as we can use other underlying implementation
beliefer commented on code in PR #37915: URL: https://github.com/apache/spark/pull/37915#discussion_r977133079 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala: ## @@ -28,52 +28,35 @@ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.unsafe.types.UTF8String /** - * A mutable implementation of BigDecimal that can hold a Long if values are small enough. - * - * The semantics of the fields are as follows: - * - _precision and _scale represent the SQL precision and scale we are looking for - * - If decimalVal is set, it represents the whole decimal value - * - Otherwise, the decimal value is longVal / (10 ** _scale) - * - * Note, for values between -1.0 and 1.0, precision digits are only counted after dot. + * A mutable implementation of BigDecimal that hold a `DecimalOperation`. */ @Unstable -final class Decimal extends Ordered[Decimal] with Serializable { +final class Decimal(initEnabled: Boolean = true) extends Ordered[Decimal] with Serializable { import org.apache.spark.sql.types.Decimal._ - private var decimalVal: BigDecimal = null - private var longVal: Long = 0L - private var _precision: Int = 1 - private var _scale: Int = 0 + private var decimalOperation: DecimalOperation[_] = null Review Comment: `decimalOperation` is not null by default. To reduce the overhead of operations (e.g. `+`), I let it be null. I tested the benchmark: if we give it a default `DecimalOperation` value, these math operations (e.g. `+`) have 5x performance overhead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun opened a new pull request, #37961: [BUILD] Upgrade Scala to 2.13.9
panbingkun opened a new pull request, #37961: URL: https://github.com/apache/spark/pull/37961 ### What changes were proposed in this pull request? This pr aims to update from Scala 2.13.8 to Scala 2.13.9 for Apache Spark 3.4. ### Why are the changes needed? Release notes: https://github.com/scala/scala/releases/tag/v2.13.9 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GA Scala 2.13 job -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37837: [SPARK-40385][SQL] Fix interpreted path for companion object constructor
HyukjinKwon closed pull request #37837: [SPARK-40385][SQL] Fix interpreted path for companion object constructor URL: https://github.com/apache/spark/pull/37837 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37837: [SPARK-40385][SQL] Fix interpreted path for companion object constructor
HyukjinKwon commented on PR #37837: URL: https://github.com/apache/spark/pull/37837#issuecomment-1254446903 Merged to master and branch-3.3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at
HyukjinKwon closed pull request #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at URL: https://github.com/apache/spark/pull/37959 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at
HyukjinKwon commented on PR #37959: URL: https://github.com/apache/spark/pull/37959#issuecomment-1254440082 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on pull request #37941: [SPARK-40501][SQL] Enhance 'SpecialLimits' to support project(..., limit(...))
panbingkun commented on PR #37941: URL: https://github.com/apache/spark/pull/37941#issuecomment-1254439692 > Wait, I'm checking the cause of UT failure
Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
Yikun commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977131978 ## python/pyspark/pandas/groupby.py: ## @@ -18,7 +18,6 @@ """ A wrapper for GroupedData to behave similar to pandas GroupBy. """ - Review Comment: Friendly note, looks like it is not recovered yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
Yikun commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977130360 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,115 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike: +""" +Compute prod of groups. + +.. versionadded:: 3.4.0 + +Parameters +-- +numeric_only : bool, default False +Include only float, int, boolean columns. If None, will attempt to use +everything, then use only numeric data. + +min_count: int, default 0 +The required number of valid values to perform the operation. +If fewer than min_count non-NA values are present the result will be NA. + +Returns +--- +Series or DataFrame +Computed prod of values within each group. + +See Also + +pyspark.pandas.Series.groupby +pyspark.pandas.DataFrame.groupby + +Examples + +>>> df = ps.DataFrame( Review Comment: ```suggestion >>> import numpy as np >>> df = ps.DataFrame( ``` Let's make it as self-contained examples, it means users can copy/paste to python shell directly, see also: SPARK-40005 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977126652 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,115 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike: +""" +Compute prod of groups. + +.. versionadded:: 3.4.0 + +Parameters +-- +numeric_only : bool, default False +Include only float, int, boolean columns. If None, will attempt to use +everything, then use only numeric data. + +min_count: int, default 0 +The required number of valid values to perform the operation. +If fewer than min_count non-NA values are present the result will be NA. + +Returns +--- +Series or DataFrame +Computed prod of values within each group. + +See Also + +pyspark.pandas.Series.groupby +pyspark.pandas.DataFrame.groupby + +Examples + +>>> df = ps.DataFrame( +... { +... "A": [1, 1, 2, 1, 2], +... "B": [np.nan, 2, 3, 4, 5], +... "C": [1, 2, 1, 1, 2], +... "D": [True, False, True, False, True], +... } +... ) + +Groupby one column and return the prod of the remaining columns in +each group. + +>>> df.groupby('A').prod().sort_index() + B C D +A +1 8.0 2 0 +2 15.0 2 1 + +>>> df.groupby('A').prod(min_count=3).sort_index() + B C D +A +1 NaN 2.0 0.0 +2 NaN NaN NaN +""" + +self._validate_agg_columns(numeric_only=numeric_only, function_name="prod") + +groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))] +internal, agg_columns, sdf = self._prepare_reduce( +groupkey_names=groupkey_names, +accepted_spark_types=(NumericType, BooleanType), +bool_to_numeric=True, +) + +psdf: DataFrame = DataFrame(internal) +if len(psdf._internal.column_labels) > 0: + +stat_exprs = [] +for label in psdf._internal.column_labels: +label_name = label[0] +tmp_count_column_name = verify_temp_column_name( +sdf, "__tmp_%s_count_col__" % label_name +) +psser = psdf._psser_for(label) +column = psser._dtype_op.nan_to_null(psser).spark.column +data_type = psser.spark.data_type + +if isinstance(data_type, IntegralType): Review Comment: for `IntegralType `, maybe always returns `LongType`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
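[Editorial illustration for the review question above] A minimal PySpark sketch, not taken from the PR, showing why the cast matters: `F.product` aggregates to a double, so the result only keeps an integral type if it is cast back explicitly (the column names and data here are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (1, 3), (2, 4)], ["A", "B"])

# F.product returns DoubleType; the cast decides the integral result type.
# Casting to "long" (as suggested above) also avoids overflow of smaller integral types.
result = df.groupBy("A").agg(F.product("B").cast("long").alias("B"))
result.printSchema()
result.show()
```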
[GitHub] [spark] cloud-fan commented on a diff in pull request #37915: [SPARK-40465][SQL] Refactor Decimal so as we can use other underlying implementation
cloud-fan commented on code in PR #37915: URL: https://github.com/apache/spark/pull/37915#discussion_r977125557 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala: ## @@ -28,52 +28,35 @@ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.unsafe.types.UTF8String /** - * A mutable implementation of BigDecimal that can hold a Long if values are small enough. - * - * The semantics of the fields are as follows: - * - _precision and _scale represent the SQL precision and scale we are looking for - * - If decimalVal is set, it represents the whole decimal value - * - Otherwise, the decimal value is longVal / (10 ** _scale) - * - * Note, for values between -1.0 and 1.0, precision digits are only counted after dot. + * A mutable implementation of BigDecimal that hold a `DecimalOperation`. */ @Unstable -final class Decimal extends Ordered[Decimal] with Serializable { +final class Decimal(initEnabled: Boolean = true) extends Ordered[Decimal] with Serializable { import org.apache.spark.sql.types.Decimal._ - private var decimalVal: BigDecimal = null - private var longVal: Long = 0L - private var _precision: Int = 1 - private var _scale: Int = 0 + private var decimalOperation: DecimalOperation[_] = null Review Comment: does it mean `decimalOperation` can be null sometimes? This seems fragile. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
mridulm commented on code in PR #37960: URL: https://github.com/apache/spark/pull/37960#discussion_r977119060 ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -18,14 +18,11 @@ package org.apache.spark.storage import java.io.{DataOutputStream, File, FileOutputStream, IOException} import java.nio.file.Files - import scala.concurrent.duration._ - import org.apache.hadoop.conf.Configuration import org.mockito.{ArgumentMatchers => mc} import org.mockito.Mockito.{mock, never, verify, when} import org.scalatest.concurrent.Eventually.{eventually, interval, timeout} - import org.apache.spark.{LocalSparkContext, SparkConf, SparkContext, SparkFunSuite, TestUtils} Review Comment: Revert these whitespace changes. ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -107,6 +106,51 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext { FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) } + test("fallback storage APIs - readFully") { +val conf = new SparkConf(false) + .set("spark.app.id", "testId") + .set(SHUFFLE_COMPRESS, false) + .set(STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED, true) + .set(STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH, +Files.createTempDirectory("tmp").toFile.getAbsolutePath + "/") +val fallbackStorage = new FallbackStorage(conf) +val bmm = new BlockManagerMaster(new NoopRpcEndpointRef(conf), null, conf, false) + +val bm = mock(classOf[BlockManager]) +val dbm = new DiskBlockManager(conf, deleteFilesOnStop = false, isDriver = false) +when(bm.diskBlockManager).thenReturn(dbm) +when(bm.master).thenReturn(bmm) +val resolver = new IndexShuffleBlockResolver(conf, bm) +when(bm.migratableResolver).thenReturn(resolver) + +val length = 10 +val content = new Array[Byte](length) +Random.nextBytes(content) + +val indexFile = resolver.getIndexFile(1, 2L) +tryWithResource(new FileOutputStream(indexFile)) { fos => + tryWithResource(new DataOutputStream(fos)) { dos => +dos.writeLong(0) +dos.writeLong(length) + } +} + +val dataFile = resolver.getDataFile(1, 2L) +tryWithResource(new FileOutputStream(dataFile)) { fos => + tryWithResource(new DataOutputStream(fos)) { dos => +dos.write(content) + } +} + +fallbackStorage.copy(ShuffleBlockInfo(1, 2L), bm) + +assert(fallbackStorage.exists(1, ShuffleIndexBlockId(1, 2L, NOOP_REDUCE_ID).name)) +assert(fallbackStorage.exists(1, ShuffleDataBlockId(1, 2L, NOOP_REDUCE_ID).name)) + +val readResult = FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) +assert(readResult.nioByteBuffer().array().sameElements(content)) Review Comment: This test is not checking for `readFully` and would work even for `read`, depending on whether the read ends up satisfying the request or not (We are relying on what the buffer size might be internally, which is subject to change). As in, the test could work even without the fix. 
## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -38,6 +35,8 @@ import org.apache.spark.shuffle.{IndexShuffleBlockResolver, ShuffleBlockInfo} import org.apache.spark.shuffle.IndexShuffleBlockResolver.NOOP_REDUCE_ID import org.apache.spark.util.Utils.tryWithResource +import scala.util.Random Review Comment: Move to scala block above ## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala: ## @@ -107,6 +106,51 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext { FallbackStorage.read(conf, ShuffleBlockId(1, 2L, 0)) } + test("fallback storage APIs - readFully") { +val conf = new SparkConf(false) + .set("spark.app.id", "testId") + .set(SHUFFLE_COMPRESS, false) + .set(STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED, true) + .set(STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH, +Files.createTempDirectory("tmp").toFile.getAbsolutePath + "/") +val fallbackStorage = new FallbackStorage(conf) +val bmm = new BlockManagerMaster(new NoopRpcEndpointRef(conf), null, conf, false) + +val bm = mock(classOf[BlockManager]) +val dbm = new DiskBlockManager(conf, deleteFilesOnStop = false, isDriver = false) +when(bm.diskBlockManager).thenReturn(dbm) +when(bm.master).thenReturn(bmm) +val resolver = new IndexShuffleBlockResolver(conf, bm) +when(bm.migratableResolver).thenReturn(resolver) + +val length = 10 +val content = new Array[Byte](length) +Random.nextBytes(content) + +val indexFile = resolver.getIndexFile(1, 2L) +tryWithResource(new FileOutputStream(indexFile)) { fos => + tryWithResource(new DataOutputStream(fos)) { dos => +dos.writeLong(0) +dos.writeLong(length) + } +} + +val dataFile = resolver.getDataFile(1, 2L) +
[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on code in PR #37679: URL: https://github.com/apache/spark/pull/37679#discussion_r977124693 ## sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala: ## @@ -36,7 +36,7 @@ import org.apache.spark.SparkFunSuite import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.FunctionIdentifier import org.apache.spark.sql.catalyst.analysis.NoSuchTableException -import org.apache.spark.sql.catalyst.catalog.SessionCatalog.DEFAULT_DATABASE Review Comment: I think we can revert all the changes in this file -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on code in PR #37679: URL: https://github.com/apache/spark/pull/37679#discussion_r977124552 ## sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: ## @@ -148,13 +148,18 @@ private[sql] class SharedState( val externalCatalog = SharedState.reflect[ExternalCatalog, SparkConf, Configuration]( SharedState.externalCatalogClassName(conf), conf, hadoopConf) -val defaultDbDefinition = CatalogDatabase( - SessionCatalog.DEFAULT_DATABASE, - "default database", - CatalogUtils.stringToURI(conf.get(WAREHOUSE_PATH)), - Map()) // Create default database if it doesn't exist -if (!externalCatalog.databaseExists(SessionCatalog.DEFAULT_DATABASE)) { +// If database name not equals 'default', throw exception +if (!externalCatalog.databaseExists(SQLConf.get.defaultDatabase)) { + if (SessionCatalog.DEFAULT_DATABASE != SQLConf.get.defaultDatabase) { +throw new SparkException(s"Default catalog database '${SQLConf.get.defaultDatabase}' " + + s"not exist, please create it first or change default database to 'default'.") Review Comment: let's use the new error framework: define the error in `error-classes.json` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977124585 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,115 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike: +""" +Compute prod of groups. + +.. versionadded:: 3.4.0 + +Parameters +-- +numeric_only : bool, default False +Include only float, int, boolean columns. If None, will attempt to use +everything, then use only numeric data. + +min_count: int, default 0 +The required number of valid values to perform the operation. +If fewer than min_count non-NA values are present the result will be NA. + +Returns +--- +Series or DataFrame +Computed prod of values within each group. + +See Also + +pyspark.pandas.Series.groupby +pyspark.pandas.DataFrame.groupby + +Examples + +>>> df = ps.DataFrame( +... { +... "A": [1, 1, 2, 1, 2], +... "B": [np.nan, 2, 3, 4, 5], +... "C": [1, 2, 1, 1, 2], +... "D": [True, False, True, False, True], +... } +... ) + +Groupby one column and return the prod of the remaining columns in +each group. + +>>> df.groupby('A').prod().sort_index() + B C D +A +1 8.0 2 0 +2 15.0 2 1 + +>>> df.groupby('A').prod(min_count=3).sort_index() + B C D +A +1 NaN 2.0 0.0 +2 NaN NaN NaN +""" + +self._validate_agg_columns(numeric_only=numeric_only, function_name="prod") + +groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))] +internal, agg_columns, sdf = self._prepare_reduce( +groupkey_names=groupkey_names, +accepted_spark_types=(NumericType, BooleanType), +bool_to_numeric=True, +) + +psdf: DataFrame = DataFrame(internal) +if len(psdf._internal.column_labels) > 0: + +stat_exprs = [] +for label in psdf._internal.column_labels: +label_name = label[0] +tmp_count_column_name = verify_temp_column_name( +sdf, "__tmp_%s_count_col__" % label_name +) +psser = psdf._psser_for(label) +column = psser._dtype_op.nan_to_null(psser).spark.column +data_type = psser.spark.data_type + +if isinstance(data_type, IntegralType): + stat_exprs.append(F.product(column).cast(data_type).alias(label_name)) +else: +stat_exprs.append(F.product(column).alias(label_name)) + +if min_count > 0: + stat_exprs.append(F.count(column).alias(tmp_count_column_name)) + +sdf = sdf.groupby(*groupkey_names).agg(*stat_exprs) Review Comment: Can we simplify the logic like this? ```suggestion for label in psdf._internal.column_labels: psser = psdf._psser_for(label) scol = psser._dtype_op.nan_to_null(psser).spark.column prod_scol = F.product(scol) if min_count > 0: prod_scol = F.when(F.count(scol) < min_count, F.lit(None)).otherwise(F.product(scol)) else: prod_scol = F.product(scol) stat_exprs.append( prod_scol.alias(psser._internal.data_spark_column_names[0]) ) sdf = sdf.groupby(*groupkey_names).agg(*stat_exprs) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`
zhengruifeng commented on code in PR #37923: URL: https://github.com/apache/spark/pull/37923#discussion_r977124090 ## python/pyspark/pandas/groupby.py: ## @@ -993,6 +993,115 @@ def nth(self, n: int) -> FrameLike: return self._prepare_return(DataFrame(internal)) +def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike: +""" +Compute prod of groups. + +.. versionadded:: 3.4.0 + +Parameters +-- +numeric_only : bool, default False +Include only float, int, boolean columns. If None, will attempt to use +everything, then use only numeric data. + +min_count: int, default 0 +The required number of valid values to perform the operation. +If fewer than min_count non-NA values are present the result will be NA. + +Returns +--- +Series or DataFrame +Computed prod of values within each group. + +See Also + +pyspark.pandas.Series.groupby +pyspark.pandas.DataFrame.groupby + +Examples + +>>> df = ps.DataFrame( +... { +... "A": [1, 1, 2, 1, 2], +... "B": [np.nan, 2, 3, 4, 5], +... "C": [1, 2, 1, 1, 2], +... "D": [True, False, True, False, True], +... } +... ) + +Groupby one column and return the prod of the remaining columns in +each group. + +>>> df.groupby('A').prod().sort_index() + B C D +A +1 8.0 2 0 +2 15.0 2 1 + +>>> df.groupby('A').prod(min_count=3).sort_index() + B C D +A +1 NaN 2.0 0.0 +2 NaN NaN NaN +""" + +self._validate_agg_columns(numeric_only=numeric_only, function_name="prod") + +groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))] +internal, agg_columns, sdf = self._prepare_reduce( +groupkey_names=groupkey_names, +accepted_spark_types=(NumericType, BooleanType), +bool_to_numeric=True, +) + +psdf: DataFrame = DataFrame(internal) +if len(psdf._internal.column_labels) > 0: + +stat_exprs = [] +for label in psdf._internal.column_labels: +label_name = label[0] +tmp_count_column_name = verify_temp_column_name( +sdf, "__tmp_%s_count_col__" % label_name +) +psser = psdf._psser_for(label) +column = psser._dtype_op.nan_to_null(psser).spark.column +data_type = psser.spark.data_type + +if isinstance(data_type, IntegralType): + stat_exprs.append(F.product(column).cast(data_type).alias(label_name)) +else: +stat_exprs.append(F.product(column).alias(label_name)) + +if min_count > 0: + stat_exprs.append(F.count(column).alias(tmp_count_column_name)) + +sdf = sdf.groupby(*groupkey_names).agg(*stat_exprs) Review Comment: Can we simplify the logic like this? ```suggestion for label in psdf._internal.column_labels: psser = psdf._psser_for(label) scol = psser._dtype_op.nan_to_null(psser).spark.column prod_scol = F.product(scol) if min_count > 0: prod_scol = F.when(F.count(scol) < min_count, F.lit(None)).otherwise(F.product(scol)) else: prod_scol = F.product(scol) stat_exprs.append( prod_scol.alias(psser._internal.data_spark_column_names[0]) ) sdf = sdf.groupby(*groupkey_names).agg(*stat_exprs) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
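[Editorial illustration for the `prod`/`min_count` discussion above] A small plain-pandas sketch, mirroring the values in the docstring being reviewed, of the semantics the pandas-on-Spark implementation is expected to follow; it is only a reference for pandas behavior, not code from the PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [np.nan, 2, 3, 4, 5]})

# Without min_count, NaNs are skipped: A=1 -> 2*4 = 8.0, A=2 -> 3*5 = 15.0
print(df.groupby("A")["B"].prod())

# With min_count=3, both groups have only 2 non-NA values, so both results become NaN.
print(df.groupby("A")["B"].prod(min_count=3))
```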
[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on code in PR #37679: URL: https://github.com/apache/spark/pull/37679#discussion_r977124092 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala: ## @@ -43,7 +44,7 @@ class V2SessionCatalog(catalog: SessionCatalog) extends TableCatalog with FunctionCatalog with SupportsNamespaces with SQLConfHelper { import V2SessionCatalog._ - override val defaultNamespace: Array[String] = Array("default") + override val defaultNamespace: Array[String] = Array(SQLConf.get.defaultDatabase) Review Comment: `catalog.defaultDatabase`? It's already available in `SessionCatalog` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] YetiCuzMountain commented on pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups
YetiCuzMountain commented on PR #37280: URL: https://github.com/apache/spark/pull/37280#issuecomment-1254423954 I found a bug. When you run SHOW CREATE TABLE, the DEFAULT clause does not appear on some fields even if it was defined. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #35391: [SPARK-38098][PYTHON] Add support for ArrayType of nested StructType to arrow-based conversion
HyukjinKwon commented on PR #35391: URL: https://github.com/apache/spark/pull/35391#issuecomment-1254416308 Alright, let's merge this in once the tests pass. @LucaCanali mind fixing the lint failure? ``` starting mypy annotations test... annotations failed mypy checks: python/pyspark/sql/tests/test_dataframe.py:1113: error: Redundant cast to "str" [redundant-cast] Found 1 error in 1 file (checked 341 source files) 1 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
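[Editorial illustration for the mypy failure above] A generic sketch of the `redundant-cast` error and its usual fix; this is not the actual code at line 1113 of test_dataframe.py, only an assumed example of the pattern:

```python
from typing import cast

def get_name() -> str:
    return "spark"

# With warn_redundant_casts enabled, mypy flags a cast to the type the
# expression already has: error: Redundant cast to "str" [redundant-cast]
name = cast(str, get_name())

# The usual fix is to drop the cast entirely.
name = get_name()
```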
[GitHub] [spark] itholic commented on pull request #37955: [SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
itholic commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1254410609 Thanks for the comments! Let me investigate the test failures and make an umbrella ticket if there are many of them. If there are only a few failures, let me handle them in this PR at once. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #37955: [SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
zhengruifeng commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1254400059 > > what about also changing `dev/requirements.txt` > > Maybe we don't need to set an upper bound for pandas in `dev/requirements.txt` because `pip install -r dev/requirements.txt` always install the latest pandas when the version is not specified?? I was hit several times by conflicts caused by the version differences between CI and `dev/requirements.txt`, but in the meantime it keeps us aware of dependency updates, so it's fine to leave it alone here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on pull request #37955: [SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
Yikun commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1254396596 Generally speaking, almost every pandas upgrade will cause CI to fail. For the stability of CI, our strategy has been to use `<=` to pin a specific version and upgrade to a new version manually. @itholic FYI, these two test cases fail due to: - python/pyspark/pandas/tests/indexes/test_category.py.test_append: https://github.com/pandas-dev/pandas/commit/c7b470c3e13f99ce990e23b2a311d3a2c633499c - python/pyspark/pandas/tests/indexes/test_base.py.test_to_frame: https://github.com/pandas-dev/pandas/commit/7dbfe9f95a31aa01dc8288350e785f04a2abf6f0 I haven't taken a deeper look; you could do more investigation and fix the code or fix the tests, just like SPARK-38819. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ukby1234 opened a new pull request, #37960: [SPARK-39200][CORE] Make Fallback Storage readFully on content
ukby1234 opened a new pull request, #37960: URL: https://github.com/apache/spark/pull/37960 From the bug description, it looks like the fallback storage doesn't read the stream fully, which then causes `org.apache.spark.shuffle.FetchFailedException: Decompression error: Corrupted block detected`. This is an attempt to fix it by reading the underlying stream fully. ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
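[Editorial illustration of the read-fully pattern the fix relies on] A single `read` call on a stream may legitimately return fewer bytes than requested, so the caller has to loop until the buffer is filled. The sketch below is a language-agnostic illustration in Python, not the actual Scala change in FallbackStorage:

```python
import io
from typing import BinaryIO

def read_fully(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, looping over possibly-short reads."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # EOF before n bytes were available
            raise EOFError(f"expected {n} bytes, got only {n - len(chunk or b'') - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Example usage: a plain read() could return a short buffer on some sources,
# while read_fully keeps going until all requested bytes are consumed.
buf = io.BytesIO(b"0123456789")
print(read_fully(buf, 10))  # b'0123456789'
```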
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r977100211 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager) def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( _.containsPattern(UNPIVOT), ruleId) { - // once children and ids are resolved, we can determine values, if non were given - case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty => -up.copy(values = up.child.output.diff(up.ids)) + // once children are resolved, we can determine values from ids and vice versa + // if only either is given + case up: Unpivot if up.childrenResolved && +up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty => +up.copy(values = + Some( +up.child.output.diff(up.ids.get.flatMap(_.references)) Review Comment: I checked with @HyukjinKwon: even the pandas melt/unpivot API supports only columns, not expressions. Let's only allow `Attribute` for id and value columns, for both the SQL and Scala APIs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
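[Editorial illustration of the melt/unpivot behavior mentioned above] A small pandas sketch showing that the id and value parameters take column names rather than expressions; the data is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "a": [3, 4], "b": [5, 6]})

# melt accepts column *names* for id_vars/value_vars; an expression such as a + b
# would have to be materialized into a column first before it could be unpivoted.
print(df.melt(id_vars=["id"], value_vars=["a", "b"], var_name="var", value_name="val"))
```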
[GitHub] [spark] LuciferYang commented on pull request #37938: [SPARK-40490][YARN][TESTS] Ensure `YarnShuffleIntegrationSuite` tests registeredExecFile reload scenarios
LuciferYang commented on PR #37938: URL: https://github.com/apache/spark/pull/37938#issuecomment-1254384287 > Thank you, @LuciferYang . Please make two backporting PRs. OK, I will finish this work today -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37799: [SPARK-40331][DOCS] Recommend use Java 11/17 as the runtime environment of Spark
LuciferYang commented on PR #37799: URL: https://github.com/apache/spark/pull/37799#issuecomment-1254383080 > The performance Issue fixed by [JDK-8159720](https://bugs.openjdk.org/browse/JDK-8159720). Seems to be fixed only in Java 9+? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at
HyukjinKwon commented on PR #37959: URL: https://github.com/apache/spark/pull/37959#issuecomment-1254373373 @gengliangwang mind taking a quick look please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37855: [SPARK-40407][SQL] Fix the potential data skew caused by df.repartition
cloud-fan commented on PR #37855: URL: https://github.com/apache/spark/pull/37855#issuecomment-1254371742 @wbo4958 Can you add comments as I asked in https://github.com/apache/spark/pull/37855/files#r975993118 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yabola commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
yabola commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1254370238 @mridulm @tgravescs If you have time, can you also help take a look~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977086276 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -149,13 +149,10 @@ class CSVOptions( val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) /** - * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type) - * if schema inference is enabled. When being used with user-provided schema, tries to parse - * timestamp values as dates if the values do not conform to the timestamp formatter before - * falling back to the backward compatible parsing - the parsed values will be cast to timestamp - * afterwards. + * Infer columns with all valid date entries as date type (otherwise inferred as string type) Review Comment: ```suggestion * Infer columns with all valid date entries as date type (otherwise inferred as string or timestamp type) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977086043 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return true if strings of given date format can be parsed as timestamps + * 1. If user provides timestamp format, we will parse strings as timestamps using + * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed + * as timestamp type in this case + * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which + * is more lenient and can parse strings of some date formats as timestamps. + */ + private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { +if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || + (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { + LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) Review Comment: Do we really need to cover these corner cases? We can just say that we can only parse date as timestamp if neither timestamp pattern nor date pattern is specified. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
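[Editorial illustration of the scenario under discussion] A hedged PySpark sketch of a CSV column mixing date-only and full timestamp strings, read with schema inference and no explicit `dateFormat`/`timestampFormat`, so the lenient default parsers apply. The file path is a temporary one created for the example, and whether the column is inferred as timestamp or string is exactly what the rules above decide:

```python
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a tiny CSV whose 'ts' column mixes a date string and a timestamp string.
path = os.path.join(tempfile.mkdtemp(), "mixed.csv")
with open(path, "w") as f:
    f.write("ts\n2022-01-01\n2022-01-02 12:34:56\n")

# No dateFormat/timestampFormat option is set here.
df = spark.read.option("header", True).option("inferSchema", True).csv(path)
df.printSchema()
df.show(truncate=False)
```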
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977085759 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType Review Comment: let's enrich the comment a bit more ``` This only happens when the timestamp pattern is not specified, as the default timestamp parser is very lenient and can parse date string as well. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #37955: [SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
itholic commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1254369142 Yeah, I think we should make the test pass since pandas-on-Spark should follow the behavior of latest pandas. Let me take a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977085297 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType Review Comment: nvm, I got it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37948: [SPARK-40327][PS][DOCS] Add resampling to API references
HyukjinKwon commented on code in PR #37948: URL: https://github.com/apache/spark/pull/37948#discussion_r977083499 ## python/pyspark/pandas/resample.py: ## @@ -491,3 +737,37 @@ def __getattr__(self, item: str) -> Any: def _handle_output(self, psdf: DataFrame) -> Series: return first_series(psdf).rename(self._psser.name) + + +def _test() -> None: +import os +import doctest +import sys +import numpy +from datetime import datetime +from pyspark.sql import SparkSession +import pyspark.pandas.resample + +os.chdir(os.environ["SPARK_HOME"]) + +globs = pyspark.pandas.resample.__dict__.copy() +globs["np"] = numpy Review Comment: Let's probably import this within the example. `datetime` too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
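A sketch of what the suggestion could look like in the resample doctests: numpy and datetime are imported inside the example itself, so nothing needs to be injected into `globs`. The function name and example data below are hypothetical.

```python
def _resample_doctest_sketch() -> None:
    """
    Hypothetical doctest illustrating the review suggestion: the imports live
    inside the example rather than being added to ``globs``.

    >>> import numpy as np
    >>> from datetime import datetime
    >>> import pyspark.pandas as ps
    >>> idx = [datetime(2022, 1, 1), datetime(2022, 1, 2), datetime(2022, 1, 3)]
    >>> psdf = ps.DataFrame({"A": np.arange(3.0)}, index=idx)
    >>> psdf.resample("3D").sum()  # doctest: +SKIP
    """
```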
[GitHub] [spark] HyukjinKwon opened a new pull request, #37959: [SPARK-40142][PYTHON][DOCS][FOLLOW-UP] Remove non-ANSI compliant example in element_at
HyukjinKwon opened a new pull request, #37959: URL: https://github.com/apache/spark/pull/37959 ### What changes were proposed in this pull request? This PR removes non-ANSI compliant example in element_at. ### Why are the changes needed? ANSI build fails to run the example. https://github.com/apache/spark/actions/runs/3094607589/jobs/5008176959 ``` File "/__w/spark/spark/python/pyspark/sql/functions.py", line 6599, in pyspark.sql.functions.element_at at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2897) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2836) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2825) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2258) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2277) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2302) at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1020) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) at org.apache.spark.rdd.RDD.collect(RDD.scala:1019) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:424) at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3925) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4095) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:512) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4093) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4093) at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3922) at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX_IN_ELEMENT_AT] The index -4 is out of bounds. The array has 3 elements. Use `try_element_at` to tolerate accessing element at invalid index and return NULL instead. 
If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. at org.apache.spark.sql.errors.QueryExecutionErrors$.invalidElementAtIndexError(QueryExecutionErrors.scala:264) at org.apache.spark.sql.errors.QueryExecutionErrors.invalidElementAtIndexError(QueryExecutionErrors.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:43) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:889) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:889) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at
[GitHub] [spark] github-actions[bot] commented on pull request #36485: [SPARK-39128][SQL][HIVE] Log cost time for getting FileStatus in HadoopTableReader
github-actions[bot] commented on PR #36485: URL: https://github.com/apache/spark/pull/36485#issuecomment-1254360769 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36548: [SPARK-38470][CORE] Use error classes in org.apache.spark.partial
github-actions[bot] commented on PR #36548: URL: https://github.com/apache/spark/pull/36548#issuecomment-1254360723 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36495: [SPARK-39136][SQL] JDBCTable support table properties
github-actions[bot] commented on PR #36495: URL: https://github.com/apache/spark/pull/36495#issuecomment-1254360757 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36453: SPARK-39103: SparkContext.addFiles trigger backend exception if it tr…
github-actions[bot] commented on PR #36453: URL: https://github.com/apache/spark/pull/36453#issuecomment-1254360800 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36483: [SPARK-39126][SQL] After eliminating join to one side, that side should take advantage of LocalShuffleRead optimization
github-actions[bot] commented on PR #36483: URL: https://github.com/apache/spark/pull/36483#issuecomment-1254360786 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36540: [SPARK-38466][CORE] Use error classes in org.apache.spark.mapred
github-actions[bot] commented on PR #36540: URL: https://github.com/apache/spark/pull/36540#issuecomment-1254360743 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36613: [WIP][SPARK-30983] Support typed select in Datasets up to the max tuple size
github-actions[bot] commented on PR #36613: URL: https://github.com/apache/spark/pull/36613#issuecomment-1254360704 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36551: [SPARK-38463][CORE] Use error classes in org.apache.spark.input
github-actions[bot] commented on PR #36551: URL: https://github.com/apache/spark/pull/36551#issuecomment-1254360716 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36665: [SPARK-39287][CORE] TaskSchedulerImpl should quickly ignore task finished event if its task was finished state.
github-actions[bot] commented on PR #36665: URL: https://github.com/apache/spark/pull/36665#issuecomment-1254360691 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36668: [SPARK-39291][CORE] Fetch blocks and open stream should not respond a closed channel
github-actions[bot] commented on PR #36668: URL: https://github.com/apache/spark/pull/36668#issuecomment-1254360676 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36678: [SPARK-39297][CORE][UI] bugfix: spark.ui.proxyBase contains proxy or history
github-actions[bot] commented on PR #36678: URL: https://github.com/apache/spark/pull/36678#issuecomment-1254360667 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36798: [SPARK-39408][SQL] Update the buildKeys for DynamicPruningSubquery.withNewPlan
github-actions[bot] commented on PR #36798: URL: https://github.com/apache/spark/pull/36798#issuecomment-1254360629 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36751: [WIP][SPARK-39366][CORE] Do not release write locks on task end.
github-actions[bot] commented on PR #36751: URL: https://github.com/apache/spark/pull/36751#issuecomment-1254360645 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36844: Update ExecutorClassLoader.scala
github-actions[bot] commented on PR #36844: URL: https://github.com/apache/spark/pull/36844#issuecomment-1254360617 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977079883 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType Review Comment: We want to have consistent behavior when timestamp format is not given. When `prefersDate=false`, a column with mixed date and timestamp could be inferred as timestamp if possible. Thus, we added the additional handling here to have similar behavior as above when `prefersDate=true`. Does this make sense? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
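From the user's side, the behavior under discussion can be probed by letting schema inference run over a column that mixes date and timestamp strings. The snippet below is only a sketch: the `prefersDate` option and the exact fallback to `StringType` come from this PR's discussion rather than a released API, so results may differ by Spark version and options.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single CSV column that mixes a date string with a timestamp string.
lines = spark.sparkContext.parallelize(["c0", "2022-09-21", "2022-09-21 10:30:00"])

df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(lines)
)

# With no timestampFormat set, the lenient default parser reads both values,
# so the column is expected to be inferred as a timestamp type; with a strict
# format that cannot parse the date rows, the PR falls back to StringType.
df.printSchema()
```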
[GitHub] [spark] cloud-fan commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
cloud-fan commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r977077958 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing a mixture of dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType Review Comment: I don't quite understand the rationale here. why can't we directly return `Some(StringType)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #37799: [SPARK-40331][DOCS] Recommend use Java 11/17 as the runtime environment of Spark
wangyum commented on PR #37799: URL: https://github.com/apache/spark/pull/37799#issuecomment-1254347894 The performance issue was fixed by [JDK-8159720](https://bugs.openjdk.org/browse/JDK-8159720). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
sadikovi commented on PR #37933: URL: https://github.com/apache/spark/pull/37933#issuecomment-1254315080 @cloud-fan @HyukjinKwon Do you have any concerns or questions? IMHO, we can merge this PR; it seems that all of the questions have been addressed. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org