[spark] branch master updated: [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new db7f346729d [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json db7f346729d is described below commit db7f346729d481f6ea6fcc88e381fda33de9b3f1 Author: Max Gekk AuthorDate: Tue May 3 08:28:27 2022 +0300 [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json ### What changes were proposed in this pull request? In the PR, I propose to create two new sub-classes of the error class `INCONSISTENT_BEHAVIOR_CROSS_VERSION`: - READ_ANCIENT_DATETIME - WRITE_ANCIENT_DATETIME and move their error messages from source code to the json file `error-classes.json`. ### Why are the changes needed? 1. To improve maintainability of error messages in the one place. 2. To follow the general rule that bodies of error messages should be placed to the json file, and only parameters are passed from source code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite*" $ build/sbt "test:testOnly *SparkThrowableSuite" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" $ build/sbt "test:testOnly *DateFormatterSuite" $ build/sbt "test:testOnly *DateExpressionsSuite" $ build/sbt "test:testOnly *TimestampFormatterSuite" ``` Closes #36426 from MaxGekk/error-subclass-INCONSISTENT_BEHAVIOR_CROSS_VERSION. Authored-by: Max Gekk Signed-off-by: Max Gekk --- core/src/main/resources/error/error-classes.json | 19 +- .../scala/org/apache/spark/SparkException.scala| 7 --- .../spark/sql/errors/QueryExecutionErrors.scala| 67 ++ .../resources/sql-tests/results/ansi/date.sql.out | 9 ++- .../results/ansi/datetime-parsing-invalid.sql.out | 24 +--- .../sql-tests/results/ansi/timestamp.sql.out | 18 -- .../test/resources/sql-tests/results/date.sql.out | 9 ++- .../results/datetime-formatting-invalid.sql.out| 66 ++--- .../results/datetime-parsing-invalid.sql.out | 24 +--- .../sql-tests/results/json-functions.sql.out | 6 +- .../resources/sql-tests/results/timestamp.sql.out | 18 -- .../results/timestampNTZ/timestamp-ansi.sql.out| 3 +- .../results/timestampNTZ/timestamp.sql.out | 3 +- .../native/stringCastAndExpressions.sql.out| 9 ++- .../sql/errors/QueryExecutionErrorsSuite.scala | 4 +- 15 files changed, 177 insertions(+), 109 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index eacbeec570f..24b50c4209a 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -79,7 +79,24 @@ "message" : [ "Detected an incompatible DataSourceRegister. Please remove the incompatible library from classpath or upgrade it. Error: " ] }, "INCONSISTENT_BEHAVIOR_CROSS_VERSION" : { -"message" : [ "You may get a different result due to the upgrading to Spark >= : " ] +"message" : [ "You may get a different result due to the upgrading to" ], +"subClass" : { + "DATETIME_PATTERN_RECOGNITION" : { +"message" : [ " Spark >= 3.0: \nFail to recognize pattern in the DateTimeFormatter. 1) You can set to 'LEGACY' to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"; ] + }, + "FORMAT_DATETIME_BY_NEW_PARSER" : { +"message" : [ " Spark >= 3.0: \nFail to format it to in the new formatter. You can set\n to 'LEGACY' to restore the behavior before\nSpark 3.0, or set to 'CORRECTED' and treat it as an invalid datetime string.\n" ] + }, + "PARSE_DATETIME_BY_NEW_PARSER" : { +"message" : [ " Spark >= 3.0: \nFail to parse in the new parser. You can set to 'LEGACY' to restore the behavior before Spark 3.0, or set to 'CORRECTED' and treat it as an invalid datetime string." ] + }, + "READ_ANCIENT_DATETIME" : { +"message" : [ " Spark >= 3.0: \nreading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z\nfrom files can be ambiguous, as the files may be written by\nSpark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar\nthat is different from Spark 3.0+'s Proleptic Gregorian calendar.\nSee more details in SPARK-31404. You can set the SQL config or\nthe datasource option '' to 'LEGACY' to rebase the datetime values\nw.r.t. the calen [...] + }, + "WR
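The restructuring above keeps one base message per error class plus one message body per sub-class in `error-classes.json`, so source code only supplies parameter values; the angle-bracket placeholders (such as `<config>`, `<format>`, `<datetime>`) appear to have been stripped by the HTML rendering of this archive, which is why several messages read as if words are missing. The Scala sketch below illustrates only the lookup-and-substitute idea; it is not Spark's actual `SparkThrowableHelper`, the object and method names are invented, and the registry is modelled as an in-memory map with abbreviated message text instead of the real JSON file.

```scala
// Hypothetical sketch of resolving an error class + sub-class into a final message.
// Spark's real implementation reads core/src/main/resources/error/error-classes.json;
// here the templates are inlined so the example is self-contained and runnable.
object ErrorClassSketch {

  // error class -> base message
  private val baseMessages: Map[String, String] = Map(
    "INCONSISTENT_BEHAVIOR_CROSS_VERSION" ->
      "You may get a different result due to the upgrading to")

  // (error class, sub-class) -> sub-class message template (abbreviated for the sketch)
  private val subClassMessages: Map[(String, String), String] = Map(
    ("INCONSISTENT_BEHAVIOR_CROSS_VERSION", "READ_ANCIENT_DATETIME") ->
      (" Spark >= 3.0: reading dates before 1582-10-15 or timestamps before " +
        "1900-01-01T00:00:00Z from <format> files can be ambiguous; set <config> to 'LEGACY' " +
        "to rebase the values, or to 'CORRECTED' to read them as is."),
    ("INCONSISTENT_BEHAVIOR_CROSS_VERSION", "WRITE_ANCIENT_DATETIME") ->
      (" Spark >= 3.0: writing dates before 1582-10-15 or timestamps before " +
        "1900-01-01T00:00:00Z into <format> files can be ambiguous."))

  // Build the final message: concatenate base + sub-class template, then substitute
  // the <name> placeholders with the parameters passed from source code.
  def getMessage(errorClass: String, subClass: String, params: Map[String, String]): String = {
    val template = baseMessages(errorClass) + subClassMessages((errorClass, subClass))
    params.foldLeft(template) { case (msg, (name, value)) => msg.replace(s"<$name>", value) }
  }

  def main(args: Array[String]): Unit = {
    println(getMessage(
      "INCONSISTENT_BEHAVIOR_CROSS_VERSION",
      "READ_ANCIENT_DATETIME",
      // illustrative parameter values only
      Map("format" -> "Parquet", "config" -> "spark.sql.parquet.datetimeRebaseModeInRead")))
  }
}
```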
[spark] branch master updated: [SPARK-39087][SQL] Improve messages of error classes
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 040526391a4 [SPARK-39087][SQL] Improve messages of error classes 040526391a4 is described below commit 040526391a45ad610422a48c05aa69ba5133f922 Author: Max Gekk AuthorDate: Tue May 3 08:17:02 2022 +0300 [SPARK-39087][SQL] Improve messages of error classes ### What changes were proposed in this pull request? In the PR, I propose to modify error messages of the following error classes: - INVALID_JSON_SCHEMA_MAP_TYPE - INCOMPARABLE_PIVOT_COLUMN - INVALID_ARRAY_INDEX_IN_ELEMENT_AT - INVALID_ARRAY_INDEX - DIVIDE_BY_ZERO ### Why are the changes needed? To improve readability of error messages. ### Does this PR introduce _any_ user-facing change? Yes. It changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "test:testOnly *SparkThrowableSuite" ``` Closes #36428 from MaxGekk/error-class-improve-msg. Authored-by: Max Gekk Signed-off-by: Max Gekk --- core/src/main/resources/error/error-classes.json | 12 - .../org/apache/spark/SparkThrowableSuite.scala | 4 +-- .../spark/sql/errors/QueryCompilationErrors.scala | 6 ++--- .../expressions/ArithmeticExpressionSuite.scala| 30 +++--- .../expressions/CollectionExpressionsSuite.scala | 4 +-- .../catalyst/expressions/ComplexTypeSuite.scala| 4 +-- .../expressions/IntervalExpressionsSuite.scala | 10 .../expressions/StringExpressionsSuite.scala | 6 ++--- .../sql/catalyst/util/IntervalUtilsSuite.scala | 2 +- .../resources/sql-tests/results/ansi/array.sql.out | 24 - .../sql-tests/results/ansi/interval.sql.out| 4 +-- .../resources/sql-tests/results/interval.sql.out | 4 +-- .../test/resources/sql-tests/results/pivot.sql.out | 4 +-- .../sql-tests/results/postgreSQL/case.sql.out | 6 ++--- .../sql-tests/results/postgreSQL/int8.sql.out | 6 ++--- .../results/postgreSQL/select_having.sql.out | 2 +- .../results/udf/postgreSQL/udf-case.sql.out| 6 ++--- .../udf/postgreSQL/udf-select_having.sql.out | 2 +- .../sql-tests/results/udf/udf-pivot.sql.out| 4 +-- .../apache/spark/sql/ColumnExpressionSuite.scala | 12 - .../org/apache/spark/sql/DataFrameSuite.scala | 2 +- .../sql/errors/QueryCompilationErrorsSuite.scala | 10 +++- .../sql/errors/QueryExecutionAnsiErrorsSuite.scala | 8 +++--- .../sql/errors/QueryExecutionErrorsSuite.scala | 25 +- .../apache/spark/sql/execution/SQLViewSuite.scala | 4 +-- .../sql/streaming/FileStreamSourceSuite.scala | 2 +- 26 files changed, 101 insertions(+), 102 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index aa38f8b9747..eacbeec570f 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -34,7 +34,7 @@ "sqlState" : "22008" }, "DIVIDE_BY_ZERO" : { -"message" : [ "divide by zero. To return NULL instead, use 'try_divide'. If necessary set to false (except for ANSI interval type) to bypass this error." ], +"message" : [ "Division by zero. To return NULL instead, use `try_divide`. If necessary set to false (except for ANSI interval type) to bypass this error." 
], "sqlState" : "22012" }, "DUPLICATE_KEY" : { @@ -72,7 +72,7 @@ "message" : [ "Grouping sets size cannot be greater than " ] }, "INCOMPARABLE_PIVOT_COLUMN" : { -"message" : [ "Invalid pivot column ''. Pivot columns must be comparable." ], +"message" : [ "Invalid pivot column . Pivot columns must be comparable." ], "sqlState" : "42000" }, "INCOMPATIBLE_DATASOURCE_REGISTER" : { @@ -89,10 +89,10 @@ "message" : [ "" ] }, "INVALID_ARRAY_INDEX" : { -"message" : [ "Invalid index: , numElements: . If necessary set to false to bypass this error." ] +"message" : [ "The index is out of bounds. The array has elements. If necessary set to false to bypass this error." ] }, "INVALID_ARRAY_INDEX_IN_ELEMENT_AT" : { -"message" : [ "Invalid index: , numElements: . To return NULL instead, use 'try_element_at'. If necessary set to false to bypass this error." ] +"message" : [ "The index is out of bounds. The array has elements. To return NULL instead, use `try_element_at`. If necessary set to false to bypass this error." ] }, "INVALID_FIELD_NAME" : {
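The reworded messages steer users toward the NULL-returning `try_*` functions instead of failing under ANSI mode. The snippet below is a minimal, hedged illustration of that behaviour; it assumes Spark 3.3 or later (where `try_divide` and `try_element_at` are available), ANSI mode enabled, and the exact error text will vary by version.

```scala
import org.apache.spark.sql.SparkSession

// Small local demo of the error classes touched by this commit and their
// NULL-returning alternatives. Assumes Spark 3.3+ on the classpath.
object AnsiErrorMessagesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("ansi-error-demo").getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // DIVIDE_BY_ZERO: raised when the query executes under ANSI mode.
    try {
      spark.sql("SELECT 1 / 0").collect()
    } catch {
      case e: Exception => println("DIVIDE_BY_ZERO: " + e.getMessage.split("\n").head)
    }
    // The NULL-returning alternative the message suggests.
    spark.sql("SELECT try_divide(1, 0)").show()

    // INVALID_ARRAY_INDEX_IN_ELEMENT_AT: index 5 is out of bounds for a 3-element array.
    try {
      spark.sql("SELECT element_at(array(1, 2, 3), 5)").collect()
    } catch {
      case e: Exception =>
        println("INVALID_ARRAY_INDEX_IN_ELEMENT_AT: " + e.getMessage.split("\n").head)
    }
    spark.sql("SELECT try_element_at(array(1, 2, 3), 5)").show()

    spark.stop()
  }
}
```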
[GitHub] [spark-website] yaooqinn commented on pull request #386: Add link to ASF events in rest of templates
yaooqinn commented on PR #386:
URL: https://github.com/apache/spark-website/pull/386#issuecomment-1115741623

   +1
[spark] 01/02: Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git commit be5249092e151cbd2f54053d3e66f445b97a460e Author: Hyukjin Kwon AuthorDate: Tue May 3 08:34:39 2022 +0900 Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion" This reverts commit 660a9f845f954b4bf2c3a7d51988b33ae94e3207. --- python/pyspark/sql/tests/test_dataframe.py | 81 -- .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 1 insertion(+), 83 deletions(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index dfdbcb912f7..e3977e81851 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -21,7 +21,6 @@ import shutil import tempfile import time import unittest -import uuid from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -838,86 +837,6 @@ class DataFrameTests(ReusedSQLTestCase): finally: shutil.rmtree(tpath) -<<< HEAD -=== -def test_df_show(self): -# SPARK-35408: ensure better diagnostics if incorrect parameters are passed -# to DataFrame.show - -df = self.spark.createDataFrame([("foo",)]) -df.show(5) -df.show(5, True) -df.show(5, 1, True) -df.show(n=5, truncate="1", vertical=False) -df.show(n=5, truncate=1.5, vertical=False) - -with self.assertRaisesRegex(TypeError, "Parameter 'n'"): -df.show(True) -with self.assertRaisesRegex(TypeError, "Parameter 'vertical'"): -df.show(vertical="foo") -with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): -df.show(truncate="foo") - -def test_df_is_empty(self): -# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. - -# This particular example of DataFrame reproduces an issue in isEmpty call -# which could result in JVM crash. 
-data = [] -for t in range(0, 1): -id = str(uuid.uuid4()) -if t == 0: -for i in range(0, 99): -data.append((id,)) -elif t < 10: -for i in range(0, 75): -data.append((id,)) -elif t < 100: -for i in range(0, 50): -data.append((id,)) -elif t < 1000: -for i in range(0, 25): -data.append((id,)) -else: -for i in range(0, 10): -data.append((id,)) - -tmpPath = tempfile.mkdtemp() -shutil.rmtree(tmpPath) -try: -df = self.spark.createDataFrame(data, ["col"]) -df.coalesce(1).write.parquet(tmpPath) - -res = self.spark.read.parquet(tmpPath).groupBy("col").count() -self.assertFalse(res.rdd.isEmpty()) -finally: -shutil.rmtree(tmpPath) - -@unittest.skipIf( -not have_pandas or not have_pyarrow, -cast(str, pandas_requirement_message or pyarrow_requirement_message), -) -def test_pandas_api(self): -import pandas as pd -from pandas.testing import assert_frame_equal - -sdf = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"]) -psdf_from_sdf = sdf.pandas_api() -psdf_from_sdf_with_index = sdf.pandas_api(index_col="Col1") -pdf = pd.DataFrame({"Col1": ["a", "b", "c"], "Col2": [1, 2, 3]}) -pdf_with_index = pdf.set_index("Col1") - -assert_frame_equal(pdf, psdf_from_sdf.to_pandas()) -assert_frame_equal(pdf_with_index, psdf_from_sdf_with_index.to_pandas()) - -# test for SPARK-36337 -def test_create_nan_decimal_dataframe(self): -self.assertEqual( -self.spark.createDataFrame(data=[Decimal("NaN")], schema="decimal").collect(), -[Row(value=None)], -) - ->>> 9305cc744d2 ([SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion) class QueryExecutionListenerTests(unittest.TestCase, SQLTestUtils): # These tests are separate because it uses 'spark.sql.queryExecutionListeners' which is diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index ca33f6951e1..7fe32636308 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,7 +24,6 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcod
[spark] branch branch-3.1 updated (660a9f845f9 -> 8f6a3a50b4b)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 660a9f845f9 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
     new be5249092e1 Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion"
     new 8f6a3a50b4b [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

The 2 revisions listed above as "new" are entirely new to this repository and will be
described in separate emails. The revisions listed as "add" were already present in the
repository and have only been added to this reference.

Summary of changes:
 python/pyspark/sql/tests/test_dataframe.py | 45 --
 1 file changed, 45 deletions(-)
[spark] 02/02: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git commit 8f6a3a50b4b86bed008250824b0c304df9952762 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 36 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index e3977e81851..6b9ac24d8c1 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -21,6 +21,7 @@ import shutil import tempfile import time import unittest +import uuid from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -837,6 +838,41 @@ class DataFrameTests(ReusedSQLTestCase): finally: shutil.rmtree(tpath) +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + class QueryExecutionListenerTests(unittest.TestCase, SQLTestUtils): # These tests are separate because it uses 'spark.sql.queryExecutionListeners' which is diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 7fe32636308..ca33f6951e1 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -300,7 +301,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 01/02: Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git commit 514d1899a1baf7c1bb5af68aa05e3886a80a0843 Author: Hyukjin Kwon AuthorDate: Tue May 3 08:33:22 2022 +0900 Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion" This reverts commit 4dba99ae359b07f814f68707073414f60616b564. --- python/pyspark/sql/tests/test_dataframe.py | 38 +- .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 2 insertions(+), 39 deletions(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index f2826e29d36..8c9f3304e00 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -20,8 +20,7 @@ import pydoc import shutil import tempfile import time -import uuid -from typing import cast +import unittest from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -874,41 +873,6 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate='foo') -def test_df_is_empty(self): -# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. - -# This particular example of DataFrame reproduces an issue in isEmpty call -# which could result in JVM crash. -data = [] -for t in range(0, 1): -id = str(uuid.uuid4()) -if t == 0: -for i in range(0, 99): -data.append((id,)) -elif t < 10: -for i in range(0, 75): -data.append((id,)) -elif t < 100: -for i in range(0, 50): -data.append((id,)) -elif t < 1000: -for i in range(0, 25): -data.append((id,)) -else: -for i in range(0, 10): -data.append((id,)) - -tmpPath = tempfile.mkdtemp() -shutil.rmtree(tmpPath) -try: -df = self.spark.createDataFrame(data, ["col"]) -df.coalesce(1).write.parquet(tmpPath) - -res = self.spark.read.parquet(tmpPath).groupBy("col").count() -self.assertFalse(res.rdd.isEmpty()) -finally: -shutil.rmtree(tmpPath) - @unittest.skipIf( not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message) # type: ignore diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 667f2c030d5..4885f631138 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,7 +24,6 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} -import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -301,7 +300,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) + new SerDeUtil.AutoBatchedPickler(iter) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 02/02: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git commit 744b5f429ca238e9bdd7081ccb7c23a9de11b422 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 36 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index 8c9f3304e00..79522efe9e9 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -21,6 +21,7 @@ import shutil import tempfile import time import unittest +import uuid from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -873,6 +874,41 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate='foo') +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + @unittest.skipIf( not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message) # type: ignore diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 4885f631138..667f2c030d5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -300,7 +301,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated (4dba99ae359 -> 744b5f429ca)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 4dba99ae359 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
     new 514d1899a1b Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion"
     new 744b5f429ca [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

The 2 revisions listed above as "new" are entirely new to this repository and will be
described in separate emails. The revisions listed as "add" were already present in the
repository and have only been added to this reference.

Summary of changes:
 python/pyspark/sql/tests/test_dataframe.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch branch-3.1 updated: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 660a9f845f9 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion 660a9f845f9 is described below commit 660a9f845f954b4bf2c3a7d51988b33ae94e3207 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 81 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 83 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index e3977e81851..dfdbcb912f7 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -21,6 +21,7 @@ import shutil import tempfile import time import unittest +import uuid from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -837,6 +838,86 @@ class DataFrameTests(ReusedSQLTestCase): finally: shutil.rmtree(tpath) +<<< HEAD +=== +def test_df_show(self): +# SPARK-35408: ensure better diagnostics if incorrect parameters are passed +# to DataFrame.show + +df = self.spark.createDataFrame([("foo",)]) +df.show(5) +df.show(5, True) +df.show(5, 1, True) +df.show(n=5, truncate="1", vertical=False) +df.show(n=5, truncate=1.5, vertical=False) + +with self.assertRaisesRegex(TypeError, "Parameter 'n'"): +df.show(True) +with self.assertRaisesRegex(TypeError, "Parameter 'vertical'"): +df.show(vertical="foo") +with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): +df.show(truncate="foo") + +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + +@unittest.skipIf( +not have_pandas or not have_pyarrow, +cast(str, pandas_requirement_message or pyarrow_requirement_message), +) +def test_pandas_api(self): +import pandas as pd +from pandas.testing import assert_frame_equal + +sdf = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"]) +psdf_from_sdf = sdf.pandas_api() +psdf_from_sdf_with_index = sdf.pandas_api(index_col="Col1") +pdf = pd.DataFrame({"Col1": ["a", "b", "c"], "Col2": [1, 2, 3]}) +pdf_with_index = pdf.set_index("Col1") + +assert_frame_equal(pdf, psdf_from_sdf.to_pa
[spark] branch branch-3.2 updated: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 4dba99ae359 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion 4dba99ae359 is described below commit 4dba99ae359b07f814f68707073414f60616b564 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 38 +- .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 39 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index 8c9f3304e00..f2826e29d36 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -20,7 +20,8 @@ import pydoc import shutil import tempfile import time -import unittest +import uuid +from typing import cast from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -873,6 +874,41 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate='foo') +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + @unittest.skipIf( not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message) # type: ignore diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 4885f631138..667f2c030d5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -300,7 +301,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) } } } -
[spark] branch branch-3.3 updated: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new bd6fd7e1320 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion bd6fd7e1320 is described below commit bd6fd7e1320f689c42c8ef6710f250123a78707d Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion ### What changes were proposed in this pull request? This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. ### Why are the changes needed? Fixes the JVM crash when checking isEmpty() on a dataset. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 36 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index be5e1d9a6e5..fd54c25c705 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -22,6 +22,7 @@ import shutil import tempfile import time import unittest +import uuid from typing import cast from pyspark.sql import SparkSession, Row @@ -1141,6 +1142,41 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate="foo") +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + @unittest.skipIf( not have_pandas or not have_pyarrow, cast(str, pandas_requirement_message or pyarrow_requirement_message), diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 6664acf9572..8d2f788e05c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -301,7 +302,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskCont
[spark] branch master updated: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9305cc744d2 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion 9305cc744d2 is described below commit 9305cc744d27daa6a746d3eb30e7639c63329072 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion ### What changes were proposed in this pull request? This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. ### Why are the changes needed? Fixes the JVM crash when checking isEmpty() on a dataset. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 36 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index ac6b6f68aed..5287826c1b4 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -22,6 +22,7 @@ import shutil import tempfile import time import unittest +import uuid from typing import cast from pyspark.sql import SparkSession, Row @@ -1176,6 +1177,41 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate="foo") +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + @unittest.skipIf( not have_pandas or not have_pyarrow, cast(str, pandas_requirement_message or pyarrow_requirement_message), diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 6664acf9572..8d2f788e05c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -301,7 +302,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) } } } - To unsubscr
[GitHub] [spark-website] srowen closed pull request #386: Add link to ASF events in rest of templates
srowen closed pull request #386: Add link to ASF events in rest of templates
URL: https://github.com/apache/spark-website/pull/386
[spark-website] branch asf-site updated: Add link to ASF events in rest of templates
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new cbf031539  Add link to ASF events in rest of templates
cbf031539 is described below

commit cbf03153968d0f311bf5517ef5f55ce41f0f2874
Author: Sean Owen
AuthorDate: Mon May 2 09:21:14 2022 -0500

    Add link to ASF events in rest of templates

    Required links to ASF events were added to the site in
    https://github.com/apache/spark-website/commit/b899a8353467b9a27c90509daa19f07dba450b38
    but we missed one template that controls the home page.

    Author: Sean Owen

    Closes #386 from srowen/Events2.
---
 _layouts/home.html | 1 +
 site/index.html    | 1 +
 2 files changed, 2 insertions(+)

diff --git a/_layouts/home.html b/_layouts/home.html
index 1b4c20a44..7cf1ee258 100644
--- a/_layouts/home.html
+++ b/_layouts/home.html
@@ -123,6 +123,7 @@
 href="https://www.apache.org/foundation/sponsorship.html">Sponsorship
 https://www.apache.org/foundation/thanks.html">Thanks
 https://www.apache.org/security/">Security
+ https://www.apache.org/events/current-event">Event
diff --git a/site/index.html b/site/index.html
index 1b02ea829..d95cc1583 100644
--- a/site/index.html
+++ b/site/index.html
@@ -119,6 +119,7 @@
 href="https://www.apache.org/foundation/sponsorship.html">Sponsorship
 https://www.apache.org/foundation/thanks.html">Thanks
 https://www.apache.org/security/">Security
+ https://www.apache.org/events/current-event">Event
[GitHub] [spark-website] srowen opened a new pull request, #386: Add link to ASF events in rest of templates
srowen opened a new pull request, #386:
URL: https://github.com/apache/spark-website/pull/386

   Required links to ASF events were added to the site in
   https://github.com/apache/spark-website/commit/b899a8353467b9a27c90509daa19f07dba450b38
   but we missed one template that controls the home page.
[spark] branch branch-3.3 updated: [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 1804f5c8c02  [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS
1804f5c8c02 is described below

commit 1804f5c8c02fd9beade9e986540dac248638e8a5
Author: Hyukjin Kwon
AuthorDate: Mon May 2 17:50:01 2022 +0900

    [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS

    ### What changes were proposed in this pull request?

    Currently the SparkR documentation build fails because it uses `grep -oP`, and the BSD
    grep shipped with macOS does not support the `-P` (Perl regex) option. This PR fixes it
    by using the existing approach already used in the current scripts at:
    https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/R/check-cran.sh#L52

    ### Why are the changes needed?

    To make the dev easier.

    ### Does this PR introduce _any_ user-facing change?

    No, dev-only.

    ### How was this patch tested?

    Manually tested via:

    ```bash
    cd R
    ./create-docs.sh
    ```

    Closes #36423 from HyukjinKwon/SPARK-37474.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 6479455b8db40d584045cdb13e6c3cdfda7a2c0b)
    Signed-off-by: Hyukjin Kwon
---
 R/create-docs.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/create-docs.sh b/R/create-docs.sh
index 1774d5870de..4867fd99e64 100755
--- a/R/create-docs.sh
+++ b/R/create-docs.sh
@@ -55,7 +55,7 @@ pushd pkg/html
 # Determine Spark(R) version
-SPARK_VERSION=$(grep -oP "(?<=Version:\ ).*" ../DESCRIPTION)
+SPARK_VERSION=$(grep Version "../DESCRIPTION" | awk '{print $NF}')
 # Update url
 sed "s/{SPARK_VERSION}/$SPARK_VERSION/" ../pkgdown/_pkgdown_template.yml > ../_pkgdown.yml
[spark] branch master updated: [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6479455b8db  [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS
6479455b8db is described below

commit 6479455b8db40d584045cdb13e6c3cdfda7a2c0b
Author: Hyukjin Kwon
AuthorDate: Mon May 2 17:50:01 2022 +0900

    [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS

    ### What changes were proposed in this pull request?

    Currently the SparkR documentation build fails because it uses `grep -oP`, and the BSD
    grep shipped with macOS does not support the `-P` (Perl regex) option. This PR fixes it
    by using the existing approach already used in the current scripts at:
    https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/R/check-cran.sh#L52

    ### Why are the changes needed?

    To make the dev easier.

    ### Does this PR introduce _any_ user-facing change?

    No, dev-only.

    ### How was this patch tested?

    Manually tested via:

    ```bash
    cd R
    ./create-docs.sh
    ```

    Closes #36423 from HyukjinKwon/SPARK-37474.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 R/create-docs.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/create-docs.sh b/R/create-docs.sh
index 1774d5870de..4867fd99e64 100755
--- a/R/create-docs.sh
+++ b/R/create-docs.sh
@@ -55,7 +55,7 @@ pushd pkg/html
 # Determine Spark(R) version
-SPARK_VERSION=$(grep -oP "(?<=Version:\ ).*" ../DESCRIPTION)
+SPARK_VERSION=$(grep Version "../DESCRIPTION" | awk '{print $NF}')
 # Update url
 sed "s/{SPARK_VERSION}/$SPARK_VERSION/" ../pkgdown/_pkgdown_template.yml > ../_pkgdown.yml