[spark] branch master updated: [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new db7f346729d [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json db7f346729d is described below commit db7f346729d481f6ea6fcc88e381fda33de9b3f1 Author: Max Gekk AuthorDate: Tue May 3 08:28:27 2022 +0300 [SPARK-39085][SQL] Move the error message of `INCONSISTENT_BEHAVIOR_CROSS_VERSION` to error-classes.json ### What changes were proposed in this pull request? In the PR, I propose to create two new sub-classes of the error class `INCONSISTENT_BEHAVIOR_CROSS_VERSION`: - READ_ANCIENT_DATETIME - WRITE_ANCIENT_DATETIME and move their error messages from source code to the json file `error-classes.json`. ### Why are the changes needed? 1. To improve maintainability of error messages in the one place. 2. To follow the general rule that bodies of error messages should be placed to the json file, and only parameters are passed from source code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite*" $ build/sbt "test:testOnly *SparkThrowableSuite" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" $ build/sbt "test:testOnly *DateFormatterSuite" $ build/sbt "test:testOnly *DateExpressionsSuite" $ build/sbt "test:testOnly *TimestampFormatterSuite" ``` Closes #36426 from MaxGekk/error-subclass-INCONSISTENT_BEHAVIOR_CROSS_VERSION. Authored-by: Max Gekk Signed-off-by: Max Gekk --- core/src/main/resources/error/error-classes.json | 19 +- .../scala/org/apache/spark/SparkException.scala| 7 --- .../spark/sql/errors/QueryExecutionErrors.scala| 67 ++ .../resources/sql-tests/results/ansi/date.sql.out | 9 ++- .../results/ansi/datetime-parsing-invalid.sql.out | 24 +--- .../sql-tests/results/ansi/timestamp.sql.out | 18 -- .../test/resources/sql-tests/results/date.sql.out | 9 ++- .../results/datetime-formatting-invalid.sql.out| 66 ++--- .../results/datetime-parsing-invalid.sql.out | 24 +--- .../sql-tests/results/json-functions.sql.out | 6 +- .../resources/sql-tests/results/timestamp.sql.out | 18 -- .../results/timestampNTZ/timestamp-ansi.sql.out| 3 +- .../results/timestampNTZ/timestamp.sql.out | 3 +- .../native/stringCastAndExpressions.sql.out| 9 ++- .../sql/errors/QueryExecutionErrorsSuite.scala | 4 +- 15 files changed, 177 insertions(+), 109 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index eacbeec570f..24b50c4209a 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -79,7 +79,24 @@ "message" : [ "Detected an incompatible DataSourceRegister. Please remove the incompatible library from classpath or upgrade it. Error: " ] }, "INCONSISTENT_BEHAVIOR_CROSS_VERSION" : { -"message" : [ "You may get a different result due to the upgrading to Spark >= : " ] +"message" : [ "You may get a different result due to the upgrading to" ], +"subClass" : { + "DATETIME_PATTERN_RECOGNITION" : { +"message" : [ " Spark >= 3.0: \nFail to recognize pattern in the DateTimeFormatter. 1) You can set to 'LEGACY' to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"; ] + }, + "FORMAT_DATETIME_BY_NEW_PARSER" : { +"message" : [ " Spark >= 3.0: \nFail to format it to in the new formatter. You can set\n to 'LEGACY' to restore the behavior before\nSpark 3.0, or set to 'CORRECTED' and treat it as an invalid datetime string.\n" ] + }, + "PARSE_DATETIME_BY_NEW_PARSER" : { +"message" : [ " Spark >= 3.0: \nFail to parse in the new parser. You can set to 'LEGACY' to restore the behavior before Spark 3.0, or set to 'CORRECTED' and treat it as an invalid datetime string." ] + }, + "READ_ANCIENT_DATETIME" : { +"message" : [ " Spark >= 3.0: \nreading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z\nfrom files can be ambiguous, as the files may be written by\nSpark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar\nthat is different from Spark 3.0+'s Proleptic Gregorian calendar.\nSee more details in SPARK-31404. You can set the SQL config or\nthe datasource option '' to 'LEGACY' to rebase the datetime values\nw.r.t. the calen [...] + }, + "WR
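The restructuring above keeps one base message per error class plus one message body per sub-class in `error-classes.json`, so source code only supplies parameter values; the angle-bracket placeholders (such as `<config>`, `<format>`, `<datetime>`) appear to have been stripped by the HTML rendering of this archive, which is why several messages read as if words are missing. The Scala sketch below illustrates only the lookup-and-substitute idea; it is not Spark's actual `SparkThrowableHelper`, the object and method names are invented, and the registry is modelled as an in-memory map with abbreviated message text instead of the real JSON file.

```scala
// Hypothetical sketch of resolving an error class + sub-class into a final message.
// Spark's real implementation reads core/src/main/resources/error/error-classes.json;
// here the templates are inlined so the example is self-contained and runnable.
object ErrorClassSketch {

  // error class -> base message
  private val baseMessages: Map[String, String] = Map(
    "INCONSISTENT_BEHAVIOR_CROSS_VERSION" ->
      "You may get a different result due to the upgrading to")

  // (error class, sub-class) -> sub-class message template (abbreviated for the sketch)
  private val subClassMessages: Map[(String, String), String] = Map(
    ("INCONSISTENT_BEHAVIOR_CROSS_VERSION", "READ_ANCIENT_DATETIME") ->
      (" Spark >= 3.0: reading dates before 1582-10-15 or timestamps before " +
        "1900-01-01T00:00:00Z from <format> files can be ambiguous; set <config> to 'LEGACY' " +
        "to rebase the values, or to 'CORRECTED' to read them as is."),
    ("INCONSISTENT_BEHAVIOR_CROSS_VERSION", "WRITE_ANCIENT_DATETIME") ->
      (" Spark >= 3.0: writing dates before 1582-10-15 or timestamps before " +
        "1900-01-01T00:00:00Z into <format> files can be ambiguous."))

  // Build the final message: concatenate base + sub-class template, then substitute
  // the <name> placeholders with the parameters passed from source code.
  def getMessage(errorClass: String, subClass: String, params: Map[String, String]): String = {
    val template = baseMessages(errorClass) + subClassMessages((errorClass, subClass))
    params.foldLeft(template) { case (msg, (name, value)) => msg.replace(s"<$name>", value) }
  }

  def main(args: Array[String]): Unit = {
    println(getMessage(
      "INCONSISTENT_BEHAVIOR_CROSS_VERSION",
      "READ_ANCIENT_DATETIME",
      // illustrative parameter values only
      Map("format" -> "Parquet", "config" -> "spark.sql.parquet.datetimeRebaseModeInRead")))
  }
}
```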
[spark] branch master updated: [SPARK-39087][SQL] Improve messages of error classes
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 040526391a4 [SPARK-39087][SQL] Improve messages of error classes 040526391a4 is described below commit 040526391a45ad610422a48c05aa69ba5133f922 Author: Max Gekk AuthorDate: Tue May 3 08:17:02 2022 +0300 [SPARK-39087][SQL] Improve messages of error classes ### What changes were proposed in this pull request? In the PR, I propose to modify error messages of the following error classes: - INVALID_JSON_SCHEMA_MAP_TYPE - INCOMPARABLE_PIVOT_COLUMN - INVALID_ARRAY_INDEX_IN_ELEMENT_AT - INVALID_ARRAY_INDEX - DIVIDE_BY_ZERO ### Why are the changes needed? To improve readability of error messages. ### Does this PR introduce _any_ user-facing change? Yes. It changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "test:testOnly *SparkThrowableSuite" ``` Closes #36428 from MaxGekk/error-class-improve-msg. Authored-by: Max Gekk Signed-off-by: Max Gekk --- core/src/main/resources/error/error-classes.json | 12 - .../org/apache/spark/SparkThrowableSuite.scala | 4 +-- .../spark/sql/errors/QueryCompilationErrors.scala | 6 ++--- .../expressions/ArithmeticExpressionSuite.scala| 30 +++--- .../expressions/CollectionExpressionsSuite.scala | 4 +-- .../catalyst/expressions/ComplexTypeSuite.scala| 4 +-- .../expressions/IntervalExpressionsSuite.scala | 10 .../expressions/StringExpressionsSuite.scala | 6 ++--- .../sql/catalyst/util/IntervalUtilsSuite.scala | 2 +- .../resources/sql-tests/results/ansi/array.sql.out | 24 - .../sql-tests/results/ansi/interval.sql.out| 4 +-- .../resources/sql-tests/results/interval.sql.out | 4 +-- .../test/resources/sql-tests/results/pivot.sql.out | 4 +-- .../sql-tests/results/postgreSQL/case.sql.out | 6 ++--- .../sql-tests/results/postgreSQL/int8.sql.out | 6 ++--- .../results/postgreSQL/select_having.sql.out | 2 +- .../results/udf/postgreSQL/udf-case.sql.out| 6 ++--- .../udf/postgreSQL/udf-select_having.sql.out | 2 +- .../sql-tests/results/udf/udf-pivot.sql.out| 4 +-- .../apache/spark/sql/ColumnExpressionSuite.scala | 12 - .../org/apache/spark/sql/DataFrameSuite.scala | 2 +- .../sql/errors/QueryCompilationErrorsSuite.scala | 10 +++- .../sql/errors/QueryExecutionAnsiErrorsSuite.scala | 8 +++--- .../sql/errors/QueryExecutionErrorsSuite.scala | 25 +- .../apache/spark/sql/execution/SQLViewSuite.scala | 4 +-- .../sql/streaming/FileStreamSourceSuite.scala | 2 +- 26 files changed, 101 insertions(+), 102 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index aa38f8b9747..eacbeec570f 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -34,7 +34,7 @@ "sqlState" : "22008" }, "DIVIDE_BY_ZERO" : { -"message" : [ "divide by zero. To return NULL instead, use 'try_divide'. If necessary set to false (except for ANSI interval type) to bypass this error." ], +"message" : [ "Division by zero. To return NULL instead, use `try_divide`. If necessary set to false (except for ANSI interval type) to bypass this error." 
], "sqlState" : "22012" }, "DUPLICATE_KEY" : { @@ -72,7 +72,7 @@ "message" : [ "Grouping sets size cannot be greater than " ] }, "INCOMPARABLE_PIVOT_COLUMN" : { -"message" : [ "Invalid pivot column ''. Pivot columns must be comparable." ], +"message" : [ "Invalid pivot column . Pivot columns must be comparable." ], "sqlState" : "42000" }, "INCOMPATIBLE_DATASOURCE_REGISTER" : { @@ -89,10 +89,10 @@ "message" : [ "" ] }, "INVALID_ARRAY_INDEX" : { -"message" : [ "Invalid index: , numElements: . If necessary set to false to bypass this error." ] +"message" : [ "The index is out of bounds. The array has elements. If necessary set to false to bypass this error." ] }, "INVALID_ARRAY_INDEX_IN_ELEMENT_AT" : { -"message" : [ "Invalid index: , numElements: . To return NULL instead, use 'try_element_at'. If necessary set to false to bypass this error." ] +"message" : [ "The index is out of bounds. The array has elements. To return NULL instead, use `try_element_at`. If necessary set to false to bypass this error." ] }, "INVALID_FIELD_NAME" : {
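The reworded messages steer users toward the NULL-returning `try_*` functions instead of failing under ANSI mode. The snippet below is a minimal, hedged illustration of that behaviour; it assumes Spark 3.3 or later (where `try_divide` and `try_element_at` are available), ANSI mode enabled, and the exact error text will vary by version.

```scala
import org.apache.spark.sql.SparkSession

// Small local demo of the error classes touched by this commit and their
// NULL-returning alternatives. Assumes Spark 3.3+ on the classpath.
object AnsiErrorMessagesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("ansi-error-demo").getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // DIVIDE_BY_ZERO: raised when the query executes under ANSI mode.
    try {
      spark.sql("SELECT 1 / 0").collect()
    } catch {
      case e: Exception => println("DIVIDE_BY_ZERO: " + e.getMessage.split("\n").head)
    }
    // The NULL-returning alternative the message suggests.
    spark.sql("SELECT try_divide(1, 0)").show()

    // INVALID_ARRAY_INDEX_IN_ELEMENT_AT: index 5 is out of bounds for a 3-element array.
    try {
      spark.sql("SELECT element_at(array(1, 2, 3), 5)").collect()
    } catch {
      case e: Exception =>
        println("INVALID_ARRAY_INDEX_IN_ELEMENT_AT: " + e.getMessage.split("\n").head)
    }
    spark.sql("SELECT try_element_at(array(1, 2, 3), 5)").show()

    spark.stop()
  }
}
```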
[GitHub] [spark-website] yaooqinn commented on pull request #386: Add link to ASF events in rest of templates
yaooqinn commented on PR #386:
URL: https://github.com/apache/spark-website/pull/386#issuecomment-1115741623

   +1
[spark] 01/02: Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git commit be5249092e151cbd2f54053d3e66f445b97a460e Author: Hyukjin Kwon AuthorDate: Tue May 3 08:34:39 2022 +0900 Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion" This reverts commit 660a9f845f954b4bf2c3a7d51988b33ae94e3207. --- python/pyspark/sql/tests/test_dataframe.py | 81 -- .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 1 insertion(+), 83 deletions(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index dfdbcb912f7..e3977e81851 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -21,7 +21,6 @@ import shutil import tempfile import time import unittest -import uuid from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -838,86 +837,6 @@ class DataFrameTests(ReusedSQLTestCase): finally: shutil.rmtree(tpath) -<<< HEAD -=== -def test_df_show(self): -# SPARK-35408: ensure better diagnostics if incorrect parameters are passed -# to DataFrame.show - -df = self.spark.createDataFrame([("foo",)]) -df.show(5) -df.show(5, True) -df.show(5, 1, True) -df.show(n=5, truncate="1", vertical=False) -df.show(n=5, truncate=1.5, vertical=False) - -with self.assertRaisesRegex(TypeError, "Parameter 'n'"): -df.show(True) -with self.assertRaisesRegex(TypeError, "Parameter 'vertical'"): -df.show(vertical="foo") -with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): -df.show(truncate="foo") - -def test_df_is_empty(self): -# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. - -# This particular example of DataFrame reproduces an issue in isEmpty call -# which could result in JVM crash. 
-data = [] -for t in range(0, 1): -id = str(uuid.uuid4()) -if t == 0: -for i in range(0, 99): -data.append((id,)) -elif t < 10: -for i in range(0, 75): -data.append((id,)) -elif t < 100: -for i in range(0, 50): -data.append((id,)) -elif t < 1000: -for i in range(0, 25): -data.append((id,)) -else: -for i in range(0, 10): -data.append((id,)) - -tmpPath = tempfile.mkdtemp() -shutil.rmtree(tmpPath) -try: -df = self.spark.createDataFrame(data, ["col"]) -df.coalesce(1).write.parquet(tmpPath) - -res = self.spark.read.parquet(tmpPath).groupBy("col").count() -self.assertFalse(res.rdd.isEmpty()) -finally: -shutil.rmtree(tmpPath) - -@unittest.skipIf( -not have_pandas or not have_pyarrow, -cast(str, pandas_requirement_message or pyarrow_requirement_message), -) -def test_pandas_api(self): -import pandas as pd -from pandas.testing import assert_frame_equal - -sdf = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"]) -psdf_from_sdf = sdf.pandas_api() -psdf_from_sdf_with_index = sdf.pandas_api(index_col="Col1") -pdf = pd.DataFrame({"Col1": ["a", "b", "c"], "Col2": [1, 2, 3]}) -pdf_with_index = pdf.set_index("Col1") - -assert_frame_equal(pdf, psdf_from_sdf.to_pandas()) -assert_frame_equal(pdf_with_index, psdf_from_sdf_with_index.to_pandas()) - -# test for SPARK-36337 -def test_create_nan_decimal_dataframe(self): -self.assertEqual( -self.spark.createDataFrame(data=[Decimal("NaN")], schema="decimal").collect(), -[Row(value=None)], -) - ->>> 9305cc744d2 ([SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion) class QueryExecutionListenerTests(unittest.TestCase, SQLTestUtils): # These tests are separate because it uses 'spark.sql.queryExecutionListeners' which is diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index ca33f6951e1..7fe32636308 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,7 +24,6 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcod
[spark] branch branch-3.1 updated (660a9f845f9 -> 8f6a3a50b4b)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 660a9f845f9 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
     new be5249092e1 Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion"
     new 8f6a3a50b4b [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

The 2 revisions listed above as "new" are entirely new to this repository and will be
described in separate emails. The revisions listed as "add" were already present in the
repository and have only been added to this reference.

Summary of changes:
 python/pyspark/sql/tests/test_dataframe.py | 45 --
 1 file changed, 45 deletions(-)
[spark] 02/02: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git commit 8f6a3a50b4b86bed008250824b0c304df9952762 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 36 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index e3977e81851..6b9ac24d8c1 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -21,6 +21,7 @@ import shutil import tempfile import time import unittest +import uuid from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -837,6 +838,41 @@ class DataFrameTests(ReusedSQLTestCase): finally: shutil.rmtree(tpath) +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + class QueryExecutionListenerTests(unittest.TestCase, SQLTestUtils): # These tests are separate because it uses 'spark.sql.queryExecutionListeners' which is diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 7fe32636308..ca33f6951e1 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -300,7 +301,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 01/02: Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git commit 514d1899a1baf7c1bb5af68aa05e3886a80a0843 Author: Hyukjin Kwon AuthorDate: Tue May 3 08:33:22 2022 +0900 Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion" This reverts commit 4dba99ae359b07f814f68707073414f60616b564. --- python/pyspark/sql/tests/test_dataframe.py | 38 +- .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 2 insertions(+), 39 deletions(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index f2826e29d36..8c9f3304e00 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -20,8 +20,7 @@ import pydoc import shutil import tempfile import time -import uuid -from typing import cast +import unittest from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -874,41 +873,6 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate='foo') -def test_df_is_empty(self): -# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. - -# This particular example of DataFrame reproduces an issue in isEmpty call -# which could result in JVM crash. -data = [] -for t in range(0, 1): -id = str(uuid.uuid4()) -if t == 0: -for i in range(0, 99): -data.append((id,)) -elif t < 10: -for i in range(0, 75): -data.append((id,)) -elif t < 100: -for i in range(0, 50): -data.append((id,)) -elif t < 1000: -for i in range(0, 25): -data.append((id,)) -else: -for i in range(0, 10): -data.append((id,)) - -tmpPath = tempfile.mkdtemp() -shutil.rmtree(tmpPath) -try: -df = self.spark.createDataFrame(data, ["col"]) -df.coalesce(1).write.parquet(tmpPath) - -res = self.spark.read.parquet(tmpPath).groupBy("col").count() -self.assertFalse(res.rdd.isEmpty()) -finally: -shutil.rmtree(tmpPath) - @unittest.skipIf( not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message) # type: ignore diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 667f2c030d5..4885f631138 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,7 +24,6 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} -import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -301,7 +300,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) + new SerDeUtil.AutoBatchedPickler(iter) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 02/02: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git commit 744b5f429ca238e9bdd7081ccb7c23a9de11b422 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 36 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index 8c9f3304e00..79522efe9e9 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -21,6 +21,7 @@ import shutil import tempfile import time import unittest +import uuid from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -873,6 +874,41 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate='foo') +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + @unittest.skipIf( not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message) # type: ignore diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 4885f631138..667f2c030d5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -300,7 +301,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated (4dba99ae359 -> 744b5f429ca)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 4dba99ae359 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
     new 514d1899a1b Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion"
     new 744b5f429ca [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

The 2 revisions listed above as "new" are entirely new to this repository and will be
described in separate emails. The revisions listed as "add" were already present in the
repository and have only been added to this reference.

Summary of changes:
 python/pyspark/sql/tests/test_dataframe.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch branch-3.1 updated: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 660a9f845f9 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion 660a9f845f9 is described below commit 660a9f845f954b4bf2c3a7d51988b33ae94e3207 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 81 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 83 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index e3977e81851..dfdbcb912f7 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -21,6 +21,7 @@ import shutil import tempfile import time import unittest +import uuid from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -837,6 +838,86 @@ class DataFrameTests(ReusedSQLTestCase): finally: shutil.rmtree(tpath) +<<< HEAD +=== +def test_df_show(self): +# SPARK-35408: ensure better diagnostics if incorrect parameters are passed +# to DataFrame.show + +df = self.spark.createDataFrame([("foo",)]) +df.show(5) +df.show(5, True) +df.show(5, 1, True) +df.show(n=5, truncate="1", vertical=False) +df.show(n=5, truncate=1.5, vertical=False) + +with self.assertRaisesRegex(TypeError, "Parameter 'n'"): +df.show(True) +with self.assertRaisesRegex(TypeError, "Parameter 'vertical'"): +df.show(vertical="foo") +with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): +df.show(truncate="foo") + +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + +@unittest.skipIf( +not have_pandas or not have_pyarrow, +cast(str, pandas_requirement_message or pyarrow_requirement_message), +) +def test_pandas_api(self): +import pandas as pd +from pandas.testing import assert_frame_equal + +sdf = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"]) +psdf_from_sdf = sdf.pandas_api() +psdf_from_sdf_with_index = sdf.pandas_api(index_col="Col1") +pdf = pd.DataFrame({"Col1": ["a", "b", "c"], "Col2": [1, 2, 3]}) +pdf_with_index = pdf.set_index("Col1") + +assert_frame_equal(pdf, psdf_from_sdf.to_pa
[spark] branch branch-3.2 updated: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 4dba99ae359 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion 4dba99ae359 is described below commit 4dba99ae359b07f814f68707073414f60616b564 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 38 +- .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 39 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index 8c9f3304e00..f2826e29d36 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -20,7 +20,8 @@ import pydoc import shutil import tempfile import time -import unittest +import uuid +from typing import cast from pyspark.sql import SparkSession, Row from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, \ @@ -873,6 +874,41 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate='foo') +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + @unittest.skipIf( not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message) # type: ignore diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 4885f631138..667f2c030d5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -300,7 +301,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) } } } -
[spark] branch branch-3.3 updated: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new bd6fd7e1320 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion bd6fd7e1320 is described below commit bd6fd7e1320f689c42c8ef6710f250123a78707d Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion ### What changes were proposed in this pull request? This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. ### Why are the changes needed? Fixes the JVM crash when checking isEmpty() on a dataset. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 36 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index be5e1d9a6e5..fd54c25c705 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -22,6 +22,7 @@ import shutil import tempfile import time import unittest +import uuid from typing import cast from pyspark.sql import SparkSession, Row @@ -1141,6 +1142,41 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate="foo") +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + @unittest.skipIf( not have_pandas or not have_pyarrow, cast(str, pandas_requirement_message or pyarrow_requirement_message), diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 6664acf9572..8d2f788e05c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -301,7 +302,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskCont
[spark] branch master updated: [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9305cc744d2 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion 9305cc744d2 is described below commit 9305cc744d27daa6a746d3eb30e7639c63329072 Author: Ivan Sadikov AuthorDate: Tue May 3 08:30:05 2022 +0900 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion ### What changes were proposed in this pull request? This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. ### Why are the changes needed? Fixes the JVM crash when checking isEmpty() on a dataset. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/tests/test_dataframe.py | 36 ++ .../sql/execution/python/EvaluatePython.scala | 3 +- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py index ac6b6f68aed..5287826c1b4 100644 --- a/python/pyspark/sql/tests/test_dataframe.py +++ b/python/pyspark/sql/tests/test_dataframe.py @@ -22,6 +22,7 @@ import shutil import tempfile import time import unittest +import uuid from typing import cast from pyspark.sql import SparkSession, Row @@ -1176,6 +1177,41 @@ class DataFrameTests(ReusedSQLTestCase): with self.assertRaisesRegex(TypeError, "Parameter 'truncate=foo'"): df.show(truncate="foo") +def test_df_is_empty(self): +# SPARK-39084: Fix df.rdd.isEmpty() resulting in JVM crash. + +# This particular example of DataFrame reproduces an issue in isEmpty call +# which could result in JVM crash. 
+data = [] +for t in range(0, 1): +id = str(uuid.uuid4()) +if t == 0: +for i in range(0, 99): +data.append((id,)) +elif t < 10: +for i in range(0, 75): +data.append((id,)) +elif t < 100: +for i in range(0, 50): +data.append((id,)) +elif t < 1000: +for i in range(0, 25): +data.append((id,)) +else: +for i in range(0, 10): +data.append((id,)) + +tmpPath = tempfile.mkdtemp() +shutil.rmtree(tmpPath) +try: +df = self.spark.createDataFrame(data, ["col"]) +df.coalesce(1).write.parquet(tmpPath) + +res = self.spark.read.parquet(tmpPath).groupBy("col").count() +self.assertFalse(res.rdd.isEmpty()) +finally: +shutil.rmtree(tmpPath) + @unittest.skipIf( not have_pandas or not have_pyarrow, cast(str, pandas_requirement_message or pyarrow_requirement_message), diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala index 6664acf9572..8d2f788e05c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala @@ -24,6 +24,7 @@ import scala.collection.JavaConverters._ import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler} +import org.apache.spark.{ContextAwareIterator, TaskContext} import org.apache.spark.api.python.SerDeUtil import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -301,7 +302,7 @@ object EvaluatePython { def javaToPython(rdd: RDD[Any]): RDD[Array[Byte]] = { rdd.mapPartitions { iter => registerPicklers() // let it called in executor - new SerDeUtil.AutoBatchedPickler(iter) + new SerDeUtil.AutoBatchedPickler(new ContextAwareIterator(TaskContext.get, iter)) } } } - To unsubscr
[GitHub] [spark-website] srowen closed pull request #386: Add link to ASF events in rest of templates
srowen closed pull request #386: Add link to ASF events in rest of templates
URL: https://github.com/apache/spark-website/pull/386
[spark-website] branch asf-site updated: Add link to ASF events in rest of templates
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new cbf031539  Add link to ASF events in rest of templates
cbf031539 is described below

commit cbf03153968d0f311bf5517ef5f55ce41f0f2874
Author: Sean Owen
AuthorDate: Mon May 2 09:21:14 2022 -0500

    Add link to ASF events in rest of templates

    Required links to ASF events were added to the site in
    https://github.com/apache/spark-website/commit/b899a8353467b9a27c90509daa19f07dba450b38
    but we missed one template that controls the home page.

    Author: Sean Owen

    Closes #386 from srowen/Events2.
---
 _layouts/home.html | 1 +
 site/index.html    | 1 +
 2 files changed, 2 insertions(+)

diff --git a/_layouts/home.html b/_layouts/home.html
index 1b4c20a44..7cf1ee258 100644
--- a/_layouts/home.html
+++ b/_layouts/home.html
@@ -123,6 +123,7 @@
 href="https://www.apache.org/foundation/sponsorship.html">Sponsorship
 https://www.apache.org/foundation/thanks.html">Thanks
 https://www.apache.org/security/">Security
+ https://www.apache.org/events/current-event">Event
diff --git a/site/index.html b/site/index.html
index 1b02ea829..d95cc1583 100644
--- a/site/index.html
+++ b/site/index.html
@@ -119,6 +119,7 @@
 href="https://www.apache.org/foundation/sponsorship.html">Sponsorship
 https://www.apache.org/foundation/thanks.html">Thanks
 https://www.apache.org/security/">Security
+ https://www.apache.org/events/current-event">Event
[GitHub] [spark-website] srowen opened a new pull request, #386: Add link to ASF events in rest of templates
srowen opened a new pull request, #386:
URL: https://github.com/apache/spark-website/pull/386

   Required links to ASF events were added to the site in
   https://github.com/apache/spark-website/commit/b899a8353467b9a27c90509daa19f07dba450b38
   but we missed one template that controls the home page.
[spark] branch branch-3.3 updated: [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 1804f5c8c02  [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS
1804f5c8c02 is described below

commit 1804f5c8c02fd9beade9e986540dac248638e8a5
Author: Hyukjin Kwon
AuthorDate: Mon May 2 17:50:01 2022 +0900

    [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS

    ### What changes were proposed in this pull request?

    Currently the SparkR documentation build fails because it uses `grep -oP`, and the BSD
    grep shipped with macOS does not support the `-P` (Perl regex) option. This PR fixes it
    by using the existing approach already used in the current scripts at:
    https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/R/check-cran.sh#L52

    ### Why are the changes needed?

    To make the dev easier.

    ### Does this PR introduce _any_ user-facing change?

    No, dev-only.

    ### How was this patch tested?

    Manually tested via:

    ```bash
    cd R
    ./create-docs.sh
    ```

    Closes #36423 from HyukjinKwon/SPARK-37474.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 6479455b8db40d584045cdb13e6c3cdfda7a2c0b)
    Signed-off-by: Hyukjin Kwon
---
 R/create-docs.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/create-docs.sh b/R/create-docs.sh
index 1774d5870de..4867fd99e64 100755
--- a/R/create-docs.sh
+++ b/R/create-docs.sh
@@ -55,7 +55,7 @@ pushd pkg/html
 # Determine Spark(R) version
-SPARK_VERSION=$(grep -oP "(?<=Version:\ ).*" ../DESCRIPTION)
+SPARK_VERSION=$(grep Version "../DESCRIPTION" | awk '{print $NF}')
 # Update url
 sed "s/{SPARK_VERSION}/$SPARK_VERSION/" ../pkgdown/_pkgdown_template.yml > ../_pkgdown.yml
[spark] branch master updated: [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6479455b8db  [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS
6479455b8db is described below

commit 6479455b8db40d584045cdb13e6c3cdfda7a2c0b
Author: Hyukjin Kwon
AuthorDate: Mon May 2 17:50:01 2022 +0900

    [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS

    ### What changes were proposed in this pull request?

    Currently the SparkR documentation build fails because it uses `grep -oP`, and the BSD
    grep shipped with macOS does not support the `-P` (Perl regex) option. This PR fixes it
    by using the existing approach already used in the current scripts at:
    https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/R/check-cran.sh#L52

    ### Why are the changes needed?

    To make the dev easier.

    ### Does this PR introduce _any_ user-facing change?

    No, dev-only.

    ### How was this patch tested?

    Manually tested via:

    ```bash
    cd R
    ./create-docs.sh
    ```

    Closes #36423 from HyukjinKwon/SPARK-37474.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 R/create-docs.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/create-docs.sh b/R/create-docs.sh
index 1774d5870de..4867fd99e64 100755
--- a/R/create-docs.sh
+++ b/R/create-docs.sh
@@ -55,7 +55,7 @@ pushd pkg/html
 # Determine Spark(R) version
-SPARK_VERSION=$(grep -oP "(?<=Version:\ ).*" ../DESCRIPTION)
+SPARK_VERSION=$(grep Version "../DESCRIPTION" | awk '{print $NF}')
 # Update url
 sed "s/{SPARK_VERSION}/$SPARK_VERSION/" ../pkgdown/_pkgdown_template.yml > ../_pkgdown.yml