[spark] branch master updated (9f7420bc903 -> 5b7f89cee8e)

2022-11-28 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 9f7420bc903 [SPARK-41312][CONNECT][PYTHON] Implement 
DataFrame.withColumnRenamed
 add 5b7f89cee8e [MINOR][PYTHON][DOCS] Fix types and docstring in 
DataFrame.toDF

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41312][CONNECT][PYTHON] Implement DataFrame.withColumnRenamed

2022-11-28 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9f7420bc903 [SPARK-41312][CONNECT][PYTHON] Implement 
DataFrame.withColumnRenamed
9f7420bc903 is described below

commit 9f7420bc9038d921ad7652ad66b13bfbf9faef9a
Author: Rui Wang 
AuthorDate: Tue Nov 29 15:42:11 2022 +0800

[SPARK-41312][CONNECT][PYTHON] Implement DataFrame.withColumnRenamed

### What changes were proposed in this pull request?

Implement DataFrame.withColumnRenamed by reusing the existing Connect proto `RenameColumnsNameByName`.
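
As a quick, hedged illustration of the new client API (a sketch only; `df` and the column names below are hypothetical):
```
# Rename a single column; this is a no-op if "age" is not in the schema.
df2 = df.withColumnRenamed("age", "age_in_years")

# Rename several columns at once through the dict-based variant.
df3 = df.withColumnsRenamed({"age": "age_in_years", "name": "full_name"})
```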

### Why are the changes needed?

API coverage

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes #38831 from amaliujia/withColumnRenamed.

Authored-by: Rui Wang 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py| 48 ++
 python/pyspark/sql/connect/plan.py | 30 ++
 .../sql/tests/connect/test_connect_basic.py| 15 +++
 3 files changed, 93 insertions(+)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 725c7fc90da..2d6e7352df9 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -570,6 +570,54 @@ class DataFrame(object):
 session=self._session,
 )
 
+def withColumnRenamed(self, existing: str, new: str) -> "DataFrame":
+"""Returns a new :class:`DataFrame` by renaming an existing column.
+This is a no-op if schema doesn't contain the given column name.
+
+.. versionadded:: 3.4.0
+
+Parameters
+--
+existing : str
+string, name of the existing column to rename.
+new : str
+string, new name of the column.
+
+Returns
+---
+:class:`DataFrame`
+DataFrame with renamed column.
+"""
+return self.withColumnsRenamed({existing: new})
+
+def withColumnsRenamed(self, colsMap: Dict[str, str]) -> "DataFrame":
+"""
+Returns a new :class:`DataFrame` by renaming multiple columns.
+This is a no-op if schema doesn't contain the given column names.
+
+.. versionadded:: 3.4.0
+   Added support for multiple columns renaming
+
+Parameters
+--
+colsMap : dict
+a dict of existing column names and corresponding desired column 
names.
+Currently, only single map is supported.
+
+Returns
+---
+:class:`DataFrame`
+DataFrame with renamed columns.
+
+See Also
+
+:meth:`withColumnRenamed`
+"""
+if not isinstance(colsMap, dict):
+raise TypeError("colsMap must be dict of existing column name and 
new column name.")
+
+return DataFrame.withPlan(plan.RenameColumnsNameByName(self._plan, 
colsMap), self._session)
+
 def _show_string(
 self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = 
False
 ) -> str:
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index 19889cb9eb8..48eb69ffdc2 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -938,6 +938,36 @@ class Range(LogicalPlan):
 """
 
 
+class RenameColumnsNameByName(LogicalPlan):
+def __init__(self, child: Optional["LogicalPlan"], colsMap: Mapping[str, 
str]) -> None:
+super().__init__(child)
+self._colsMap = colsMap
+
+def plan(self, session: "SparkConnectClient") -> proto.Relation:
+assert self._child is not None
+
+plan = proto.Relation()
+
plan.rename_columns_by_name_to_name_map.input.CopyFrom(self._child.plan(session))
+for k, v in self._colsMap.items():
+plan.rename_columns_by_name_to_name_map.rename_columns_map[k] = v
+return plan
+
+def print(self, indent: int = 0) -> str:
+i = " " * indent
+return f"""{i}"""
+
+def _repr_html_(self) -> str:
+return f"""
+
+   
+  RenameColumns
+  ColsMap: {self._colsMap} 
+  {self._child_repr_()}
+   
+
+"""
+
+
 class NAFill(LogicalPlan):
 def __init__(
 self, child: Optional["LogicalPlan"], cols: Optional[List[str]], 
values: List[Any]
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index eb025fd5d04..3f673edb75f 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -123,6 +123,21 @@ class 

[spark] branch master updated: [SPARK-41260][PYTHON][SS] Cast NumPy instances to Python primitive types in GroupState update

2022-11-28 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7bc71910f9b [SPARK-41260][PYTHON][SS] Cast NumPy instances to Python 
primitive types in GroupState update
7bc71910f9b is described below

commit 7bc71910f9b08183d1e0572eef880e996892fa7d
Author: Hyukjin Kwon 
AuthorDate: Tue Nov 29 16:13:32 2022 +0900

[SPARK-41260][PYTHON][SS] Cast NumPy instances to Python primitive types in 
GroupState update

### What changes were proposed in this pull request?

This PR proposes to cast NumPy instances to Python primitive types in `GroupState.update`. Previously, if we passed a NumPy instance to `GroupState.update`, it failed with an exception like the one below:

```
net.razorvine.pickle.PickleException: expected zero arguments for 
construction of ClassDict (for numpy.dtype). This happens when an 
unsupported/unregistered class is being unpickled that requires construction 
arguments. Fix it by registering a custom IObjectConstructor for this class.
at 
net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:759)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:199)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:109)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:122)
at 
org.apache.spark.sql.api.python.PythonSQLUtils$.$anonfun$toJVMRow$1(PythonSQLUtils.scala:125)
```

### Why are the changes needed?

`applyInPandasWithState` uses pandas instances, so it is very common to extract a NumPy value from a pandas DataFrame and set it in the group state.
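
A minimal sketch of what this enables inside an `applyInPandasWithState` user function (the function body and the column name `value` are hypothetical):
```
def func(key, pdf_iter, state):
    for pdf in pdf_iter:
        # pandas reductions return NumPy scalars (e.g. numpy.int64);
        # GroupState.update now casts them to Python primitives internally.
        count = pdf["value"].sum()
        state.update((count,))
        yield pdf
```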

### Does this PR introduce _any_ user-facing change?

No, because this new API is not released yet.

### How was this patch tested?

Manually tested, and a unit test was added.

Closes #38796 from HyukjinKwon/SPARK-41260.

Authored-by: Hyukjin Kwon 
Signed-off-by: Jungtaek Lim 
---
 python/pyspark/sql/streaming/state.py  | 21 +++-
 .../FlatMapGroupsInPandasWithStateSuite.scala  | 63 ++
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/streaming/state.py 
b/python/pyspark/sql/streaming/state.py
index 66b225e1b10..f0ac427cbea 100644
--- a/python/pyspark/sql/streaming/state.py
+++ b/python/pyspark/sql/streaming/state.py
@@ -19,6 +19,7 @@ import json
 from typing import Tuple, Optional
 
 from pyspark.sql.types import DateType, Row, StructType
+from pyspark.sql.utils import has_numpy
 
 __all__ = ["GroupState", "GroupStateTimeout"]
 
@@ -130,7 +131,25 @@ class GroupState:
 if newValue is None:
 raise ValueError("'None' is not a valid state value")
 
-self._value = Row(*newValue)
+converted = []
+if has_numpy:
+import numpy as np
+
+# In order to convert NumPy types to Python primitive types.
+for v in newValue:
+if isinstance(v, np.generic):
+converted.append(v.tolist())
+# Address a couple of pandas dtypes too.
+elif hasattr(v, "to_pytimedelta"):
+converted.append(v.to_pytimedelta())
+elif hasattr(v, "to_pydatetime"):
+converted.append(v.to_pydatetime())
+else:
+converted.append(v)
+else:
+converted = list(newValue)
+
+self._value = Row(*converted)
 self._defined = True
 self._updated = True
 self._removed = False
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala
index a83f7cce4c1..ca738b805eb 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala
@@ -886,4 +886,67 @@ class FlatMapGroupsInPandasWithStateSuite extends 
StateStoreMetricsTest {
 total = Seq(1), updated = Seq(1), droppedByWatermark = Seq(0), removed 
= Some(Seq(1)))
 )
   }
+
+  test("SPARK-41260: applyInPandasWithState - NumPy instances to JVM rows in 
state") {
+assume(shouldTestPandasUDFs)
+
+val pythonScript =
+  """
+|import pandas as pd
+|import numpy as np
+|import datetime
+|from pyspark.sql.types import StructType, StructField, StringType
+|
+|tpe = StructType([StructField("key", StringType())])
+|
+|def func(key, pdf_iter, state):
+|pdf = pd.DataFrame({
+|'int32': 

[spark] branch master updated: [SPARK-41264][CONNECT][PYTHON] Make Literal support more datatypes

2022-11-28 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4c35c5bb5e5 [SPARK-41264][CONNECT][PYTHON] Make Literal support more 
datatypes
4c35c5bb5e5 is described below

commit 4c35c5bb5e545acb2f46a80218f68e69c868b388
Author: Ruifeng Zheng 
AuthorDate: Tue Nov 29 14:36:10 2022 +0800

[SPARK-41264][CONNECT][PYTHON] Make Literal support more datatypes

### What changes were proposed in this pull request?
1. On the server side, try to match https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L63-L101 and use `CreateArray`, `CreateStruct`, `CreateMap` for complex inputs;

2. On the client side, try to match https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1335-L1349, but do not support `datetime.time` since there is no corresponding SQL type for it.
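
A rough sketch of the Python values that can now be sent as typed literals (hedged; the client entry point shown as `lit` and its import path are assumptions, not confirmed by this commit):
```
import datetime
import decimal

from pyspark.sql.connect.functions import lit  # assumed entry point

literals = [
    lit(None),                         # null
    lit(True),                         # boolean
    lit(b"\x01\x02"),                  # binary
    lit(decimal.Decimal("1.23")),      # decimal
    lit("abc"),                        # string
    lit(datetime.date(2022, 11, 29)),  # date
    lit(datetime.datetime.now()),      # timestamp
    lit(datetime.timedelta(days=1)),   # day-time interval
]
```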

### Why are the changes needed?
Try to support all data types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated tests

Closes #38800 from zhengruifeng/connect_update_literal.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../main/protobuf/spark/connect/expressions.proto  | 103 ++---
 .../src/main/protobuf/spark/connect/types.proto|  16 -
 .../org/apache/spark/sql/connect/dsl/package.scala |   6 +-
 .../sql/connect/planner/SparkConnectPlanner.scala  | 118 --
 .../messages/ConnectProtoMessagesSuite.scala   |   6 +-
 .../connect/planner/SparkConnectPlannerSuite.scala |   2 +-
 .../connect/planner/SparkConnectServiceSuite.scala |   2 +-
 python/pyspark/sql/connect/column.py   | 112 ++---
 python/pyspark/sql/connect/plan.py |   4 +-
 .../pyspark/sql/connect/proto/expressions_pb2.py   | 151 +++
 .../pyspark/sql/connect/proto/expressions_pb2.pyi  | 462 -
 python/pyspark/sql/connect/proto/types_pb2.py  | 126 +++---
 python/pyspark/sql/connect/proto/types_pb2.pyi |  63 ---
 .../connect/test_connect_column_expressions.py |  73 +++-
 .../sql/tests/connect/test_connect_plan_only.py|   4 +-
 15 files changed, 539 insertions(+), 709 deletions(-)

diff --git 
a/connector/connect/src/main/protobuf/spark/connect/expressions.proto 
b/connector/connect/src/main/protobuf/spark/connect/expressions.proto
index 595758f0443..2a1159c1d04 100644
--- a/connector/connect/src/main/protobuf/spark/connect/expressions.proto
+++ b/connector/connect/src/main/protobuf/spark/connect/expressions.proto
@@ -40,37 +40,34 @@ message Expression {
 
   message Literal {
 oneof literal_type {
-  bool boolean = 1;
-  int32 i8 = 2;
-  int32 i16 = 3;
-  int32 i32 = 5;
-  int64 i64 = 7;
-  float fp32 = 10;
-  double fp64 = 11;
-  string string = 12;
-  bytes binary = 13;
-  // Timestamp in units of microseconds since the UNIX epoch.
-  int64 timestamp = 14;
+  bool null = 1;
+  bytes binary = 2;
+  bool boolean = 3;
+
+  int32 byte = 4;
+  int32 short = 5;
+  int32 integer = 6;
+  int64 long = 7;
+  float float = 10;
+  double double = 11;
+  Decimal decimal = 12;
+
+  string string = 13;
+
   // Date in units of days since the UNIX epoch.
   int32 date = 16;
-  // Time in units of microseconds past midnight
-  int64 time = 17;
-  IntervalYearToMonth interval_year_to_month = 19;
-  IntervalDayToSecond interval_day_to_second = 20;
-  string fixed_char = 21;
-  VarChar var_char = 22;
-  bytes fixed_binary = 23;
-  Decimal decimal = 24;
-  Struct struct = 25;
-  Map map = 26;
   // Timestamp in units of microseconds since the UNIX epoch.
-  int64 timestamp_tz = 27;
-  bytes uuid = 28;
-  DataType null = 29; // a typed null literal
-  List list = 30;
-  DataType.Array empty_array = 31;
-  DataType.Map empty_map = 32;
-  UserDefined user_defined = 33;
+  int64 timestamp = 17;
+  // Timestamp in units of microseconds since the UNIX epoch (without 
timezone information).
+  int64 timestamp_ntz = 18;
+
+  CalendarInterval calendar_interval = 19;
+  int32 year_month_interval = 20;
+  int64 day_time_interval = 21;
+
+  Array array = 22;
+  Struct struct = 23;
+  Map map = 24;
 }
 
 // whether the literal type should be treated as a nullable type. Applies 
to
@@ -83,40 +80,20 @@ message Expression {
 // directly declare the type variation).
 uint32 type_variation_reference = 51;
 
-message VarChar {
-  string value = 1;
-  uint32 length = 2;
-}
-
 message Decimal {
-  // little-endian twos-complement integer representation of complete value
-  // 

[GitHub] [spark-website] dongjoon-hyun commented on pull request #428: Add 3.2.3 announcement news, release note and download link

2022-11-28 Thread GitBox


dongjoon-hyun commented on PR #428:
URL: https://github.com/apache/spark-website/pull/428#issuecomment-1330150893

   Merged to `asf-site`. 
   
   Hi, @gengliangwang . Could you help @sunchao with Search Engine setting 
please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] dongjoon-hyun merged pull request #428: Add 3.2.3 announcement news, release note and download link

2022-11-28 Thread GitBox


dongjoon-hyun merged PR #428:
URL: https://github.com/apache/spark-website/pull/428


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] sunchao opened a new pull request, #428: Add 3.2.3 announcement news, release note and download link

2022-11-28 Thread GitBox


sunchao opened a new pull request, #428:
URL: https://github.com/apache/spark-website/pull/428

   - Apache Download: https://downloads.apache.org/spark/spark-3.2.3/
   - Apache Archive: https://archive.apache.org/dist/spark/spark-3.2.3/
   - Maven Central: 
https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.2.3/
   - PySpark: https://pypi.org/project/pyspark/3.2.3/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and `DataFrame.na.drop`

2022-11-28 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 661064a7a38 [SPARK-41148][CONNECT][PYTHON] Implement 
`DataFrame.dropna` and `DataFrame.na.drop`
661064a7a38 is described below

commit 661064a7a3811da27da0d5b024764238d2a1fb3f
Author: Ruifeng Zheng 
AuthorDate: Tue Nov 29 13:27:33 2022 +0800

[SPARK-41148][CONNECT][PYTHON] Implement `DataFrame.dropna` and 
`DataFrame.na.drop`

### What changes were proposed in this pull request?
 Implement `DataFrame.dropna` and `DataFrame.na.drop`
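
A hedged usage sketch of the new methods (column names are hypothetical):
```
df.dropna()                             # drop rows containing any null value
df.dropna(how="all")                    # drop rows only if every value is null
df.dropna(thresh=2, subset=["a", "b"])  # keep rows with at least 2 non-nulls in a, b
df.na.drop(how="any")                   # equivalent entry point via DataFrameNaFunctions
```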

### Why are the changes needed?
For API coverage

### Does this PR introduce _any_ user-facing change?
Yes, new methods.

### How was this patch tested?
added UT

Closes #38819 from zhengruifeng/connect_df_na_drop.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../main/protobuf/spark/connect/relations.proto|  25 
 .../org/apache/spark/sql/connect/dsl/package.scala |  33 +
 .../sql/connect/planner/SparkConnectPlanner.scala  |  18 +++
 .../connect/planner/SparkConnectProtoSuite.scala   |  19 +++
 python/pyspark/sql/connect/dataframe.py|  81 +++
 python/pyspark/sql/connect/plan.py |  39 +
 python/pyspark/sql/connect/proto/relations_pb2.py  | 158 +++--
 python/pyspark/sql/connect/proto/relations_pb2.pyi |  77 ++
 .../sql/tests/connect/test_connect_basic.py|  32 +
 .../sql/tests/connect/test_connect_plan_only.py|  16 +++
 10 files changed, 426 insertions(+), 72 deletions(-)

diff --git a/connector/connect/src/main/protobuf/spark/connect/relations.proto 
b/connector/connect/src/main/protobuf/spark/connect/relations.proto
index a676871c9e0..cbdf6311657 100644
--- a/connector/connect/src/main/protobuf/spark/connect/relations.proto
+++ b/connector/connect/src/main/protobuf/spark/connect/relations.proto
@@ -55,6 +55,7 @@ message Relation {
 
 // NA functions
 NAFill fill_na = 90;
+NADrop drop_na = 91;
 
 // stat functions
 StatSummary summary = 100;
@@ -440,6 +441,30 @@ message NAFill {
   repeated Expression.Literal values = 3;
 }
 
+
+// Drop rows containing null values.
+// It will invoke 'Dataset.na.drop' (same as 'DataFrameNaFunctions.drop') to 
compute the results.
+message NADrop {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) Optional list of column names to consider.
+  //
+  // When it is empty, all the columns in the input relation will be 
considered.
+  repeated string cols = 2;
+
+  // (Optional) The minimum number of non-null and non-NaN values required to 
keep.
+  //
+  // When not set, it is equivalent to the number of considered columns, which 
means
+  // a row will be kept only if all columns are non-null.
+  //
+  // 'how' options ('all', 'any') can be easily converted to this field:
+  //   - 'all' -> set 'min_non_nulls' 1;
+  //   - 'any' -> keep 'min_non_nulls' unset;
+  optional int32 min_non_nulls = 3;
+}
+
+
 // Rename columns on the input relation by the same length of names.
 message RenameColumnsBySameLengthNames {
   // (Required) The input relation of RenameColumnsBySameLengthNames.
diff --git 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
index 61d7abe9e15..dd1c7f0574b 100644
--- 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
+++ 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
@@ -278,6 +278,39 @@ package object dsl {
   .build())
   .build()
   }
+
+  def drop(
+  how: Option[String] = None,
+  minNonNulls: Option[Int] = None,
+  cols: Seq[String] = Seq.empty): Relation = {
+require(!(how.nonEmpty && minNonNulls.nonEmpty))
+require(how.isEmpty || Seq("any", "all").contains(how.get))
+
+val dropna = proto.NADrop
+  .newBuilder()
+  .setInput(logicalPlan)
+
+if (cols.nonEmpty) {
+  dropna.addAllCols(cols.asJava)
+}
+
+var _minNonNulls = -1
+how match {
+  case Some("all") => _minNonNulls = 1
+  case _ =>
+}
+if (minNonNulls.nonEmpty) {
+  _minNonNulls = minNonNulls.get
+}
+if (_minNonNulls > 0) {
+  dropna.setMinNonNulls(_minNonNulls)
+}
+
+Relation
+  .newBuilder()
+  .setDropNa(dropna.build())
+  .build()
+  }
 }
 
 implicit class DslStatFunctions(val logicalPlan: Relation) {
diff --git 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 

[spark] branch master updated (c24077bddf8 -> 23279beec6e)

2022-11-28 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c24077bddf8 [SPARK-41238][CONNECT][PYTHON][FOLLOWUP] Support 
`DayTimeIntervalType` in the client
 add 23279beec6e [SPARK-41308][CONNECT][PYTHON] Improve DataFrame.count()

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py| 12 +---
 python/pyspark/sql/tests/connect/test_connect_basic.py |  8 
 2 files changed, 17 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41238][CONNECT][PYTHON][FOLLOWUP] Support `DayTimeIntervalType` in the client

2022-11-28 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c24077bddf8 [SPARK-41238][CONNECT][PYTHON][FOLLOWUP] Support 
`DayTimeIntervalType` in the client
c24077bddf8 is described below

commit c24077bddf84238bff28788ee49d9ddd01f8e8dd
Author: Ruifeng Zheng 
AuthorDate: Tue Nov 29 09:43:09 2022 +0800

[SPARK-41238][CONNECT][PYTHON][FOLLOWUP] Support `DayTimeIntervalType` in 
the client

### What changes were proposed in this pull request?
Support `DayTimeIntervalType` in the client

### Why are the changes needed?
In https://github.com/apache/spark/pull/38770, I forgot to deal with 
`DayTimeIntervalType`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a test case; the schema should be
```
In [1]: query = """ SELECT INTERVAL '100 10:30' DAY TO MINUTE AS interval 
"""

In [2]: spark.sql(query).schema
Out[2]: StructType([StructField('interval', DayTimeIntervalType(0, 2), 
False)])
```
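
For reference, a small sketch of how those interval bounds map onto the PySpark type (using the public `DayTimeIntervalType` field constants):
```
from pyspark.sql.types import DayTimeIntervalType

# DAY TO MINUTE corresponds to startField=DAY (0) and endField=MINUTE (2),
# i.e. DayTimeIntervalType(0, 2) in the expected schema above.
dt = DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.MINUTE)
```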

Closes #38818 from zhengruifeng/connect_type_time_stamp.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/client.py   | 12 +++-
 python/pyspark/sql/tests/connect/test_connect_basic.py |  7 +++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/connect/client.py 
b/python/pyspark/sql/connect/client.py
index a2a0797c49f..745ca79fda9 100644
--- a/python/pyspark/sql/connect/client.py
+++ b/python/pyspark/sql/connect/client.py
@@ -383,7 +383,17 @@ class SparkConnectClient(object):
 elif schema.HasField("timestamp"):
 return TimestampType()
 elif schema.HasField("day_time_interval"):
-return DayTimeIntervalType()
+start: Optional[int] = (
+schema.day_time_interval.start_field
+if schema.day_time_interval.HasField("start_field")
+else None
+)
+end: Optional[int] = (
+schema.day_time_interval.end_field
+if schema.day_time_interval.HasField("end_field")
+else None
+)
+return DayTimeIntervalType(startField=start, endField=end)
 elif schema.HasField("array"):
 return ArrayType(
 
self._proto_schema_to_pyspark_schema(schema.array.element_type),
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 028819a88ca..8d2aa5a1a35 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -185,6 +185,13 @@ class SparkConnectTests(SparkConnectSQLTestCase):
 self.connect.sql(query).schema,
 )
 
+# test DayTimeIntervalType
+query = """ SELECT INTERVAL '100 10:30' DAY TO MINUTE AS interval """
+self.assertEqual(
+self.spark.sql(query).schema,
+self.connect.sql(query).schema,
+)
+
 # test MapType
 query = """
 SELECT * FROM VALUES


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41306][CONNECT] Improve Connect Expression proto documentation

2022-11-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c7878850f55 [SPARK-41306][CONNECT] Improve Connect Expression proto 
documentation
c7878850f55 is described below

commit c7878850f55643fa061e25f9ffba60edf4f37275
Author: Rui Wang 
AuthorDate: Tue Nov 29 09:02:03 2022 +0800

[SPARK-41306][CONNECT] Improve Connect Expression proto documentation

### What changes were proposed in this pull request?

Improve Connect Expression proto documentation by following 
https://github.com/apache/spark/blob/master/connector/connect/docs/adding-proto-messages.md.

### Why are the changes needed?

Proto Documentation.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

N/A

Closes #38825 from amaliujia/improve_proto_expr_documentation.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../main/protobuf/spark/connect/expressions.proto   | 14 +-
 .../pyspark/sql/connect/proto/expressions_pb2.pyi   | 21 -
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git 
a/connector/connect/src/main/protobuf/spark/connect/expressions.proto 
b/connector/connect/src/main/protobuf/spark/connect/expressions.proto
index 85049f1d14b..595758f0443 100644
--- a/connector/connect/src/main/protobuf/spark/connect/expressions.proto
+++ b/connector/connect/src/main/protobuf/spark/connect/expressions.proto
@@ -142,18 +142,24 @@ message Expression {
   // An unresolved attribute that is not explicitly bound to a specific 
column, but the column
   // is resolved during analysis by name.
   message UnresolvedAttribute {
+// (Required) An identifier that will be parsed by Catalyst parser. This 
should follow the
+// Spark SQL identifier syntax.
 string unparsed_identifier = 1;
   }
 
   // An unresolved function is not explicitly bound to one explicit function, 
but the function
   // is resolved during analysis following Sparks name resolution rules.
   message UnresolvedFunction {
+// (Required) Names parts for the unresolved function.
 repeated string parts = 1;
+
+// (Optional) Function arguments. Empty arguments are allowed.
 repeated Expression arguments = 2;
   }
 
   // Expression as string.
   message ExpressionString {
+// (Required) A SQL expression that will be parsed by Catalyst parser.
 string expression = 1;
   }
 
@@ -162,9 +168,15 @@ message Expression {
   }
 
   message Alias {
+// (Required) The expression that alias will be added on.
 Expression expr = 1;
+
+// (Required) a list of name parts for the alias.
+//
+// Scalar columns only has one name that presents.
 repeated string name = 2;
-// Alias metadata expressed as a JSON map.
+
+// (Optional) Alias metadata expressed as a JSON map.
 optional string metadata = 3;
   }
 }
diff --git a/python/pyspark/sql/connect/proto/expressions_pb2.pyi 
b/python/pyspark/sql/connect/proto/expressions_pb2.pyi
index bf2dc06b685..23ef99c6c48 100644
--- a/python/pyspark/sql/connect/proto/expressions_pb2.pyi
+++ b/python/pyspark/sql/connect/proto/expressions_pb2.pyi
@@ -538,6 +538,9 @@ class Expression(google.protobuf.message.Message):
 
 UNPARSED_IDENTIFIER_FIELD_NUMBER: builtins.int
 unparsed_identifier: builtins.str
+"""(Required) An identifier that will be parsed by Catalyst parser. 
This should follow the
+Spark SQL identifier syntax.
+"""
 def __init__(
 self,
 *,
@@ -560,13 +563,15 @@ class Expression(google.protobuf.message.Message):
 @property
 def parts(
 self,
-) -> 
google.protobuf.internal.containers.RepeatedScalarFieldContainer[builtins.str]: 
...
+) -> 
google.protobuf.internal.containers.RepeatedScalarFieldContainer[builtins.str]:
+"""(Required) Names parts for the unresolved function."""
 @property
 def arguments(
 self,
 ) -> 
google.protobuf.internal.containers.RepeatedCompositeFieldContainer[
 global___Expression
-]: ...
+]:
+"""(Optional) Function arguments. Empty arguments are allowed."""
 def __init__(
 self,
 *,
@@ -585,6 +590,7 @@ class Expression(google.protobuf.message.Message):
 
 EXPRESSION_FIELD_NUMBER: builtins.int
 expression: builtins.str
+"""(Required) A SQL expression that will be parsed by Catalyst 
parser."""
 def __init__(
 self,
 *,
@@ -610,13 +616,18 @@ class Expression(google.protobuf.message.Message):
 NAME_FIELD_NUMBER: builtins.int
 METADATA_FIELD_NUMBER: builtins.int
 @property
-def expr(self) -> 

[spark] branch master updated: [SPARK-41304][CONNECT][PYTHON][DOCS] Add missing docs for DataFrame API

2022-11-28 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c78676e82e2 [SPARK-41304][CONNECT][PYTHON][DOCS] Add missing docs for 
DataFrame API
c78676e82e2 is described below

commit c78676e82e2ec041270519a2d5391da078ab
Author: Rui Wang 
AuthorDate: Tue Nov 29 10:01:52 2022 +0900

[SPARK-41304][CONNECT][PYTHON][DOCS] Add missing docs for DataFrame API

### What changes were proposed in this pull request?

Add missing docs for DataFrame API.

### Why are the changes needed?

Improve documentation

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

N/A

Closes #38824 from amaliujia/improve_pyspark_doc.

Authored-by: Rui Wang 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/dataframe.py | 142 +++-
 1 file changed, 141 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index d77a3717248..068c6322698 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -309,6 +309,21 @@ class DataFrame(object):
 )
 
 def drop(self, *cols: "ColumnOrName") -> "DataFrame":
+"""Returns a new :class:`DataFrame` without specified columns.
+This is a no-op if schema doesn't contain the given column name(s).
+
+.. versionadded:: 3.4.0
+
+Parameters
+--
+cols: str or :class:`Column`
+a name of the column, or the :class:`Column` to drop
+
+Returns
+---
+:class:`DataFrame`
+DataFrame without given columns.
+"""
 _cols = list(cols)
 if any(not isinstance(c, (str, Column)) for c in _cols):
 raise TypeError(
@@ -326,6 +341,23 @@ class DataFrame(object):
 )
 
 def filter(self, condition: Expression) -> "DataFrame":
+"""Filters rows using the given condition.
+
+:func:`where` is an alias for :func:`filter`.
+
+.. versionadded:: 3.4.0
+
+Parameters
+--
+condition : :class:`Column` or str
+a :class:`Column` of :class:`types.BooleanType`
+or a string of SQL expression.
+
+Returns
+---
+:class:`DataFrame`
+Filtered DataFrame.
+"""
 return DataFrame.withPlan(
 plan.Filter(child=self._plan, filter=condition), 
session=self._session
 )
@@ -409,9 +441,39 @@ class DataFrame(object):
 )
 
 def limit(self, n: int) -> "DataFrame":
+"""Limits the result count to the number specified.
+
+.. versionadded:: 3.4.0
+
+Parameters
+--
+num : int
+Number of records to return. Will return this number of records
+or whatever number is available.
+
+Returns
+---
+:class:`DataFrame`
+Subset of the records
+"""
 return DataFrame.withPlan(plan.Limit(child=self._plan, limit=n), 
session=self._session)
 
 def offset(self, n: int) -> "DataFrame":
+"""Returns a new :class: `DataFrame` by skipping the first `n` rows.
+
+.. versionadded:: 3.4.0
+
+Parameters
+--
+num : int
+Number of records to return. Will return this number of records
+or all records if the DataFrame contains less than this number of 
records.
+
+Returns
+---
+:class:`DataFrame`
+Subset of the records
+"""
 return DataFrame.withPlan(plan.Offset(child=self._plan, offset=n), 
session=self._session)
 
 def tail(self, num: int) -> List[Row]:
@@ -427,7 +489,7 @@ class DataFrame(object):
 --
 num : int
 Number of records to return. Will return this number of records
-or whataver number is available.
+or all records if the DataFrame contains less than this number of 
records.
 
 Returns
 ---
@@ -457,6 +519,31 @@ class DataFrame(object):
 withReplacement: bool = False,
 seed: Optional[int] = None,
 ) -> "DataFrame":
+"""Returns a sampled subset of this :class:`DataFrame`.
+
+.. versionadded:: 3.4.0
+
+Parameters
+--
+withReplacement : bool, optional
+Sample with replacement or not (default ``False``).
+fraction : float
+Fraction of rows to generate, range [0.0, 1.0].
+seed : int, optional
+Seed for sampling (default a random seed).
+
+Returns
+---
+:class:`DataFrame`
+Sampled rows from given DataFrame.
+
+ 

[spark] branch master updated (60ab363183b -> 0ff201cc219)

2022-11-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 60ab363183b [SPARK-41197][BUILD] Upgrade Kafka to 3.3.1
 new 8e3e9575099 Revert "[SPARK-41197][BUILD] Upgrade Kafka to 3.3.1"
 new 0ff201cc219 [SPARK-41197][BUILD] Upgrade Kafka to 3.3.1

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] 01/02: Revert "[SPARK-41197][BUILD] Upgrade Kafka to 3.3.1"

2022-11-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 8e3e95750993277568010df9a53f56104ced7304
Author: Dongjoon Hyun 
AuthorDate: Mon Nov 28 12:56:30 2022 -0800

Revert "[SPARK-41197][BUILD] Upgrade Kafka to 3.3.1"

This reverts commit 60ab363183b8b9565ebadfd1b1e826d1d64ae212.
---
 .../src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala   | 2 --
 .../test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala | 2 --
 pom.xml | 2 +-
 3 files changed, 1 insertion(+), 5 deletions(-)

diff --git 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
index 7c9c40883a5..431d9d6b278 100644
--- 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
+++ 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
@@ -537,8 +537,6 @@ class KafkaTestUtils(
 props.put("key.serializer", classOf[StringSerializer].getName)
 // wait for all in-sync replicas to ack sends
 props.put("acks", "all")
-props.put("partitioner.class",
-  
classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)
 setAuthenticationConfigIfNeeded(props)
 props
   }
diff --git 
a/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
 
b/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
index 91fecacb6e7..d341b6977b2 100644
--- 
a/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
+++ 
b/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
@@ -263,8 +263,6 @@ private[kafka010] class KafkaTestUtils extends Logging {
 props.put("key.serializer", classOf[StringSerializer].getName)
 // wait for all in-sync replicas to ack sends
 props.put("acks", "all")
-props.put("partitioner.class",
-  
classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)
 props
   }
 
diff --git a/pom.xml b/pom.xml
index 691ea8f563f..f47bedb18e7 100644
--- a/pom.xml
+++ b/pom.xml
@@ -133,7 +133,7 @@
 
 2.3
 
-3.3.1
+3.2.3
 
 10.14.2.0
 1.12.3


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] 02/02: [SPARK-41197][BUILD] Upgrade Kafka to 3.3.1

2022-11-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 0ff201cc219884d3cbb6844732c681546e53f4d4
Author: Ted Yu 
AuthorDate: Mon Nov 28 12:49:43 2022 -0800

[SPARK-41197][BUILD] Upgrade Kafka to 3.3.1

This PR upgrades Kafka to 3.3.1 release.

The new default partitioner keeps track of how many bytes are produced 
per-partition and once the amount exceeds `batch.size`, it switches to the next 
partition. For spark kafka tests, this will result in records being sent to 
only one partition in some tests.
`KafkaTestUtils.producerConfiguration` is modified to use 
`DefaultPartitioner`.

Kafka 3.3.1 release has new features along with bug fixes: 
https://www.confluent.io/blog/apache-kafka-3-3-0-new-features-and-updates/

No

Existing test suite

Closes #38715 from tedyu/k-33.

Authored-by: Ted Yu 
Signed-off-by: Dongjoon Hyun 
---
 .../src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala   | 2 ++
 .../test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala | 2 ++
 pom.xml | 2 +-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
index 431d9d6b278..7c9c40883a5 100644
--- 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
+++ 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
@@ -537,6 +537,8 @@ class KafkaTestUtils(
 props.put("key.serializer", classOf[StringSerializer].getName)
 // wait for all in-sync replicas to ack sends
 props.put("acks", "all")
+props.put("partitioner.class",
+  
classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)
 setAuthenticationConfigIfNeeded(props)
 props
   }
diff --git 
a/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
 
b/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
index d341b6977b2..91fecacb6e7 100644
--- 
a/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
+++ 
b/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
@@ -263,6 +263,8 @@ private[kafka010] class KafkaTestUtils extends Logging {
 props.put("key.serializer", classOf[StringSerializer].getName)
 // wait for all in-sync replicas to ack sends
 props.put("acks", "all")
+props.put("partitioner.class",
+  
classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)
 props
   }
 
diff --git a/pom.xml b/pom.xml
index f47bedb18e7..691ea8f563f 100644
--- a/pom.xml
+++ b/pom.xml
@@ -133,7 +133,7 @@
 
 2.3
 
-3.2.3
+3.3.1
 
 10.14.2.0
 1.12.3


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41197][BUILD] Upgrade Kafka to 3.3.1

2022-11-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 60ab363183b [SPARK-41197][BUILD] Upgrade Kafka to 3.3.1
60ab363183b is described below

commit 60ab363183b8b9565ebadfd1b1e826d1d64ae212
Author: Dongjoon Hyun 
AuthorDate: Mon Nov 28 12:49:43 2022 -0800

[SPARK-41197][BUILD] Upgrade Kafka to 3.3.1

### What changes were proposed in this pull request?
This PR upgrades Kafka to 3.3.1 release.

The new default partitioner keeps track of how many bytes are produced 
per-partition and once the amount exceeds `batch.size`, it switches to the next 
partition. For spark kafka tests, this will result in records being sent to 
only one partition in some tests.
`KafkaTestUtils.producerConfiguration` is modified to use 
`DefaultPartitioner`.

### Why are the changes needed?
Kafka 3.3.1 release has new features along with bug fixes: 
https://www.confluent.io/blog/apache-kafka-3-3-0-new-features-and-updates/

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite

Closes #38715 from tedyu/k-33.

Lead-authored-by: Dongjoon Hyun 
Co-authored-by: Ted Yu 
Signed-off-by: Dongjoon Hyun 
---
 .../src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala   | 2 ++
 .../test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala | 2 ++
 pom.xml | 2 +-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
index 431d9d6b278..7c9c40883a5 100644
--- 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
+++ 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
@@ -537,6 +537,8 @@ class KafkaTestUtils(
 props.put("key.serializer", classOf[StringSerializer].getName)
 // wait for all in-sync replicas to ack sends
 props.put("acks", "all")
+props.put("partitioner.class",
+  
classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)
 setAuthenticationConfigIfNeeded(props)
 props
   }
diff --git 
a/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
 
b/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
index d341b6977b2..91fecacb6e7 100644
--- 
a/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
+++ 
b/connector/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala
@@ -263,6 +263,8 @@ private[kafka010] class KafkaTestUtils extends Logging {
 props.put("key.serializer", classOf[StringSerializer].getName)
 // wait for all in-sync replicas to ack sends
 props.put("acks", "all")
+props.put("partitioner.class",
+  
classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)
 props
   }
 
diff --git a/pom.xml b/pom.xml
index f47bedb18e7..691ea8f563f 100644
--- a/pom.xml
+++ b/pom.xml
@@ -133,7 +133,7 @@
 
 2.3
 
-3.2.3
+3.3.1
 
 10.14.2.0
 1.12.3


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [SPARK-41185][K8S][DOCS] Remove ARM limitation for YuniKorn from docs

2022-11-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 00185e3d8f9 [SPARK-41185][K8S][DOCS] Remove ARM limitation for 
YuniKorn from docs
00185e3d8f9 is described below

commit 00185e3d8f9a5bea7238f1387543cf01a8fe1fd4
Author: Wilfred Spiegelenburg 
AuthorDate: Mon Nov 28 12:39:27 2022 -0800

[SPARK-41185][K8S][DOCS] Remove ARM limitation for YuniKorn from docs

### What changes were proposed in this pull request?
Remove the limitations section from the K8s documentation for YuniKorn.

### Why are the changes needed?
The limitation section is outdated because YuniKorn is fully supported from 
release 1.1.0 onwards. YuniKorn 1.1.0 is the release that is referenced in the 
documentation.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #38780 from wilfred-s/SPARK-41185.

Authored-by: Wilfred Spiegelenburg 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit bfc9e4ef111e21eee99407309ca6be278617d319)
Signed-off-by: Dongjoon Hyun 
---
 docs/running-on-kubernetes.md | 4 
 1 file changed, 4 deletions(-)

diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index f7f7ec539b8..5a76e6155dc 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -1842,10 +1842,6 @@ Submit Spark jobs with the following extra options:
 Note that `{{APP_ID}}` is the built-in variable that will be substituted with 
Spark job ID automatically.
 With the above configuration, the job will be scheduled by YuniKorn scheduler 
instead of the default Kubernetes scheduler.
 
-# Limitations
-
-- Apache YuniKorn currently only supports x86 Linux, running Spark on ARM64 
(or other platform) with Apache YuniKorn is not supported at present.
-
 ### Stage Level Scheduling Overview
 
 Stage level scheduling is supported on Kubernetes when dynamic allocation is 
enabled. This also requires 
spark.dynamicAllocation.shuffleTracking.enabled to be enabled 
since Kubernetes doesn't support an external shuffle service at this time. The 
order in which containers for different profiles is requested from Kubernetes 
is not guaranteed. Note that since dynamic allocation on Kubernetes requires 
the shuffle tracking feature, this means that executors from previous stages t 
[...]


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (aee49e16188 -> bfc9e4ef111)

2022-11-28 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from aee49e16188 [SPARK-41301][CONNECT] Homogenize Behavior for 
SparkSession.range()
 add bfc9e4ef111 [SPARK-41185][K8S][DOCS] Remove ARM limitation for 
YuniKorn from docs

No new revisions were added by this update.

Summary of changes:
 docs/running-on-kubernetes.md | 4 
 1 file changed, 4 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] sunchao commented on pull request #427: Add docs for Spark 3.2.3

2022-11-28 Thread GitBox


sunchao commented on PR #427:
URL: https://github.com/apache/spark-website/pull/427#issuecomment-1329728990

   Thanks @dongjoon-hyun ! merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] sunchao merged pull request #427: Add docs for Spark 3.2.3

2022-11-28 Thread GitBox


sunchao merged PR #427:
URL: https://github.com/apache/spark-website/pull/427


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] sunchao opened a new pull request, #427: Add docs for Spark 3.2.3

2022-11-28 Thread GitBox


sunchao opened a new pull request, #427:
URL: https://github.com/apache/spark-website/pull/427

   Adds docs from https://dist.apache.org/repos/dist/dev/spark/v3.2.3-rc1-docs/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41301][CONNECT] Homogenize Behavior for SparkSession.range()

2022-11-28 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new aee49e16188 [SPARK-41301][CONNECT] Homogenize Behavior for 
SparkSession.range()
aee49e16188 is described below

commit aee49e161887e3dc15701d2f1c98ddf75e3ceeac
Author: Martin Grund 
AuthorDate: Mon Nov 28 14:47:40 2022 -0400

[SPARK-41301][CONNECT] Homogenize Behavior for SparkSession.range()

### What changes were proposed in this pull request?
In PySpark the `end` parameter to `SparkSession.range` is optional and the 
first parameter is then used with an implicit `0` for `start`. This patch 
homogenizes the behavior for Spark Connect.
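
A minimal sketch of the now-consistent behavior (assuming `spark` is a Spark Connect `SparkSession`):
```
spark.range(10)               # the single argument is treated as `end`, with an implicit start of 0
spark.range(start=0, end=10)  # equivalent explicit form
```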

### Why are the changes needed?
Compatibility.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38822 from grundprinzip/SPARK-41301.

Authored-by: Martin Grund 
Signed-off-by: Herman van Hovell 
---
 python/pyspark/sql/connect/session.py  | 10 --
 python/pyspark/sql/tests/connect/test_connect_basic.py |  4 
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/session.py 
b/python/pyspark/sql/connect/session.py
index c9b76cf47f9..2e24f4e7971 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -257,7 +257,7 @@ class SparkSession(object):
 def range(
 self,
 start: int,
-end: int,
+end: Optional[int] = None,
 step: int = 1,
 numPartitions: Optional[int] = None,
 ) -> DataFrame:
@@ -283,6 +283,12 @@ class SparkSession(object):
 ---
 :class:`DataFrame`
 """
+if end is None:
+actual_end = start
+start = 0
+else:
+actual_end = end
+
 return DataFrame.withPlan(
-Range(start=start, end=end, step=step, 
num_partitions=numPartitions), self
+Range(start=start, end=actual_end, step=step, 
num_partitions=numPartitions), self
 )
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index e0f5f23fdb4..028819a88ca 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -363,6 +363,10 @@ class SparkConnectTests(SparkConnectSQLTestCase):
 self.connect.range(start=0, end=10, step=3, 
numPartitions=2).toPandas(),
 self.spark.range(start=0, end=10, step=3, 
numPartitions=2).toPandas(),
 )
+# SPARK-41301
+self.assert_eq(
+self.connect.range(10).toPandas(), self.connect.range(start=0, 
end=10).toPandas()
+)
 
 def test_create_global_temp_view(self):
 # SPARK-41127: test global temp view creation.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] tag v3.2.3 created (now b53c341e0fe)

2022-11-28 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a change to tag v3.2.3
in repository https://gitbox.apache.org/repos/asf/spark.git


  at b53c341e0fe (commit)
No new revisions were added by this update.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r58283 - /release/spark/spark-3.2.2/

2022-11-28 Thread dongjoon
Author: dongjoon
Date: Mon Nov 28 18:14:42 2022
New Revision: 58283

Log:
Remove Apache Spark 3.2.2 after 3.2.3 release

Removed:
release/spark/spark-3.2.2/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r58282 - /release/spark/KEYS

2022-11-28 Thread dongjoon
Author: dongjoon
Date: Mon Nov 28 18:09:08 2022
New Revision: 58282

Log:
Add Chao Sun's Key

Modified:
release/spark/KEYS

Modified: release/spark/KEYS
==
--- release/spark/KEYS (original)
+++ release/spark/KEYS Mon Nov 28 18:09:08 2022
@@ -1790,3 +1790,62 @@ yS9NpUVtofwbRov8ZgxZJXyQQg3GQnnoRD6SUbW3
 Uadsu9LFJXvYJHO0YKVR+A==
 =eSVG
 -END PGP PUBLIC KEY BLOCK-
+pub   rsa4096 2020-12-07 [SC]
+  DE7FA241EB298D027C97B2A1D8F1A97BE51ECA98
+uid   [ unknown] Chao Sun (CODE SIGNING KEY) 
+sig 3D8F1A97BE51ECA98 2020-12-07  Chao Sun (CODE SIGNING KEY) 

+sub   rsa4096 2020-12-07 [E]
+sig  D8F1A97BE51ECA98 2020-12-07  Chao Sun (CODE SIGNING KEY) 

+
+-BEGIN PGP PUBLIC KEY BLOCK-
+
+mQINBF/OgGsBEADKi+Zl4dKoGGle79sDjMzb2DqZW/+v6zRecelcZAPqvNEhuSzj
+d9fjvKvdDxaV7+tf8rhVDfufrWIv2VXYG7PYPOQrrXCZbnF6Sm6aw39jtd8Qa85O
+Oowk4MgTKAbwLRQFOGLYIDY4E4u4fzrQOOK3Y5snEUqf0FG6DxSJyH7oRo+0azFx
+ImvfbvcXEYC7q/fn5KaSihD7HQIeVa23uebl5ibWsdbzd/YJQsstYKlfMprooc39
+FavaO8D4QigdOBBAYuAM+rJ/IcKiDy1hRvQmJ1QDrLj0Uc+3nszWM3qvOARpQtud
+NkpZGyh1gsxVKaJzdZKuhgrBYaLxPt4B3GuU/qZLJ7/HS8+6uc3gTmYBudEBaAZx
+M9E6JqPFyq0NWzaHb82QV3nWXjgK0E++0z2+JZiqz2s8KAvijSbdDaNUv2u+/aJO
+R8iEUtkxW6pL3tQfYlfQyUr7hGRV8MbJb3cBX2f6mjLKOk5ytBsLTWrv9m+TCBb6
+pBEY2hiALJBJvvva3YxuIXZfw2n6/U02BmOINRs4L8/C5FraQni8HzdeautyXf+x
+iJ76H6dkAzEtEiWH1bdMwIc1kCauOm8dw0aHowOJA8SKQ7XE++Yb0BgHSWYRCHV6
+SYj8B0fa1Ludsc0mcjO5NSJelGg6nWPd1pk3JWOidcFMO+dhTlCNoGFvUQARAQAB
+tDBDaGFvIFN1biAoQ09ERSBTSUdOSU5HIEtFWSkgPHN1bmNoYW9AYXBhY2hlLm9y
+Zz6JAk4EEwEIADgWIQTef6JB6ymNAnyXsqHY8al75R7KmAUCX86AawIbAwULCQgH
+AgYVCgkICwIEFgIDAQIeAQIXgAAKCRDY8al75R7KmJoGD/923ARe9ljzoApCu/mj
+PmNF/b6ilLQhU7kfhxCCz0nkWv1ghIacLBnb7471DXvhayRIa7WAJkS1KMrCfTTx
+KvNbqn6RHBkVPtU5YovLYj9SSNhHHbHIvX95OdjUVEJsfPo76aK9kQVd8voCelG6
+hFgI+v+w1GvlUdGSWO9FWzRQ8Z7JH8PIG1p1H2+/XtKOkxSthKzb37WQ96Z4q5f0
+JjpvesSp5//sLOoS2ikWnew8Wv14vHNxvSd4ovN9oNc0qqGuyw4IDiUCkVFzPr0Z
+Qhy2ISuTahd/l1SeoRONNKawP7xfVthhZiU/zbbQvU5rTScGg/BBZZCOn62aegBH
+Y7/odjiJdY5bv3wETBcmxZ5rlJFPU3dkq8Q7qnWu5hLKbijuZndkKrdOLg8RJCW2
+yeMvQRKDLvcNqGKDkeA/UtY5M9+9fQKoReeTO2Rx/LeO7GiMrnbXnmQuKuLlxhTM
+PmRgmIbvFyHVjzyL1+mLLCjGu4+26PbmaveILZY9QS23vckZ9TRnJvUHSnULbjQ/
+k3eWtohu8KFpyFiBZxUTmJgHST9UpmT7Mat4DFTslopTDtw2DunIakudfaZmEQQj
+L6bhZhsqff+aJQPXV22LPGbWvCSMTqh0KQq8O8e2aKSM/xFuRZsztwtDtSU/CKpF
+xDEiyV3to599x6Gy3RXEQnLHY7kCDQRfzoBrARAAy0ACs7AHSmvv2w7I4gon8F+Q
+9xUhFiCboKTaY1zcDOZUOQbhuJv+oQHHQhiHMlkHqcChjKtE6iQxoQmyhVUuMrrR
+XSvbO4UYS/iDBOuinPmnHppE/WFKillz1mu8oiamHkhfOCphEA7V7m7mFAIQYcza
+TdHrRKxQrm/r9EKXeaxwik6V1N1o10/e7VpQaDkjNik3iUpylHk+dYxTaFEkw+RT
+qhrufiDJW6vbMtYLzuqz6LL/2VH8j8fh3QdSIubycoltnc/p2L0bZORHpL9CWVh+
+sX7AHE3H95xwtaCT0ji4ocncqoqWEnMAMLT4/Gqc0hWo/dDEjZ5603hF9bKTykHZ
+e2f751qAx/Fyf2gPtQ9URLHB+abaQWpJc7fVYTSV32NFRbZJxHDxwrgkV51Vuvfq
+rgoo+Z8j8I66kkvEXyGdDUJM5zFBgVZdHIT65YRdVfkRR2uvxrm+b3jCQjeqLpqz
+wfX1LMIsdbbnluPdr9xudOgAVHAde5dGlA7WuL4eD0Q1iYPctQ4TS8IXWZpVmT7Y
+Q4bxC5y1MjHvCtqXmkwsROGixXP1REVsmbVTLC1KgpNkMcVawt89QIuWTcUw1hAT
+nQeMkN3bjvwps2D0rW90HytCRuRJ0TpMoV4SnYV+4w0r+jM8ADqfd/iSX1ybXl2S
+hOxh+5Dt5B/WjMsALPcAEQEAAYkCNgQYAQgAIBYhBN5/okHrKY0CfJeyodjxqXvl
+HsqYBQJfzoBrAhsMAAoJENjxqXvlHsqYMioP/j4QwPXaQD3VLPsG7YsgHukcOB0U
+b+B+P+ISKLrJxmIXX0dAO7Q7VwD+I64ojSc1zANQmXWu6qTYnWVYPN7Xr6HSefVF
+1gLiRJ7DGvV5Fz4xkZTu7Jf+TUD57WUR2Ddz68mzHOWqhGFCmbirjtKbwvB0M8DT
+qGn/cyGuWNDcmhMgXOwHof/quQrnwZmSokGa3lT8qdGmccUElY7txvD+JH5icxU0
+NMMgeWGuTdKOZPQL5C8Ud1LY9ZK+q74FllzoPOLn4ryqJWQhB+njdcLQnrNZexPr
+huJs9dYvxVPsA2OYFTBDPPxvk2VBrvmiPN3JBr4259WeEOnb6AeSYxY+ZentEZXp
+r8HqyS6Bt258L8YnLDE+xORML0Mk4z0KSZ5FT4aNrfHkUtlgSAYXfug9RuPtPdX3
+E6f5LF9Iz30gzgFQGJsh0lz7+eKdNaeth4ErHv8MXfPQ9yw8gwZ6LcWi9VM3faOn
+4akZM5tNRTReGinnYwhi8yNtw50EY31MUBgb9SS+S52o2p3rbjlrHJB4Spe14g2Z
+P+3d/bY7eHLaFnkIuQR2dzaJti/nf2b/7VQHLm6HyHAigUFFSA4CcV/R659sKacV
+Y2wH1LgDJJsoBLPFNxhgTLjMlErwsZlacmXyogrmOS+ZvgQz/LZ1mIryTAkd1Gym
+JznYPjY83fSKkeCh
+=3Ggj
+-END PGP PUBLIC KEY BLOCK-
\ No newline at end of file



-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r58281 - /dev/spark/v3.2.3-rc1-bin/ /release/spark/spark-3.2.3/

2022-11-28 Thread dongjoon
Author: dongjoon
Date: Mon Nov 28 18:03:37 2022
New Revision: 58281

Log:
Release Apache Spark 3.2.3

Added:
release/spark/spark-3.2.3/
  - copied from r58280, dev/spark/v3.2.3-rc1-bin/
Removed:
dev/spark/v3.2.3-rc1-bin/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41180][SQL] Reuse `INVALID_SCHEMA` instead of `_LEGACY_ERROR_TEMP_1227`

2022-11-28 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bdb4d5e4da5 [SPARK-41180][SQL] Reuse `INVALID_SCHEMA` instead of 
`_LEGACY_ERROR_TEMP_1227`
bdb4d5e4da5 is described below

commit bdb4d5e4da558775df2be712dd8760d5f5f14747
Author: yangjie01 
AuthorDate: Mon Nov 28 20:26:27 2022 +0300

[SPARK-41180][SQL] Reuse `INVALID_SCHEMA` instead of 
`_LEGACY_ERROR_TEMP_1227`

### What changes were proposed in this pull request?
This PR aims to rename `_LEGACY_ERROR_TEMP_1227` to `INVALID_SCHEMA`.
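
A minimal sketch of the kind of call that hits this error class, assuming a plain 
local SparkSession; after this change the `AnalysisException` is expected to carry 
`INVALID_SCHEMA` (with a `PARSE_ERROR` subclass) instead of `_LEGACY_ERROR_TEMP_1227`:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('{"a": 1}',)], ["json"])

try:
    # "NOT A DDL" cannot be parsed as a schema, so analysis of the query fails.
    df.select(from_json(col("json"), "NOT A DDL")).collect()
except Exception as e:
    print(e)
```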

### Why are the changes needed?
Proper names of error classes improve the user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38754 from LuciferYang/SPARK-41180.

Lead-authored-by: yangjie01 
Co-authored-by: YangJie 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   | 23 ---
 project/MimaExcludes.scala |  5 +-
 .../spark/sql/catalyst/expressions/ExprUtils.scala |  2 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  | 23 ---
 .../apache/spark/sql/errors/QueryErrorsBase.scala  |  4 ++
 .../org/apache/spark/sql/types/DataType.scala  | 13 ++--
 .../scala/org/apache/spark/sql/functions.scala |  1 -
 .../sql-tests/results/csv-functions.sql.out| 12 ++--
 .../sql-tests/results/json-functions.sql.out   | 12 ++--
 .../org/apache/spark/sql/CsvFunctionsSuite.scala   |  4 +-
 .../apache/spark/sql/DataFrameFunctionsSuite.scala |  4 +-
 .../org/apache/spark/sql/JsonFunctionsSuite.scala  | 73 --
 12 files changed, 115 insertions(+), 61 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 9f4337d0618..89728777201 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -780,8 +780,21 @@
   },
   "INVALID_SCHEMA" : {
 "message" : [
-  "The expression  is not a valid schema string."
-]
+  "The input schema  is not a valid schema string."
+],
+"subClass" : {
+  "NON_STRING_LITERAL" : {
+"message" : [
+  "The input expression must be string literal and not null."
+]
+  },
+  "PARSE_ERROR" : {
+"message" : [
+  "Cannot parse the schema:",
+  ""
+]
+  }
+}
   },
   "INVALID_SQL_SYNTAX" : {
 "message" : [
@@ -2844,12 +2857,6 @@
   "The SQL config '' was removed in the version . 
"
 ]
   },
-  "_LEGACY_ERROR_TEMP_1227" : {
-"message" : [
-  "",
-  "Failed fallback parsing: "
-]
-  },
   "_LEGACY_ERROR_TEMP_1228" : {
 "message" : [
   "Decimal scale () cannot be greater than precision ()."
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index d8f87a504fa..eed79d1f204 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -120,7 +120,10 @@ object MimaExcludes {
 
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest.apply"),
 
 // [SPARK-41072][SS] Add the error class STREAM_FAILED to 
StreamingQueryException
-
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.StreamingQueryException.this")
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.StreamingQueryException.this"),
+
+// [SPARK-41180][SQL] Reuse INVALID_SCHEMA instead of 
_LEGACY_ERROR_TEMP_1227
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.types.DataType.parseTypeWithFallback")
   )
 
   // Defulat exclude rules
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala
index 3e10b820aa6..e9084442b22 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala
@@ -35,7 +35,7 @@ object ExprUtils extends QueryErrorsBase {
 case s: UTF8String if s != null =>
   val dataType = DataType.fromDDL(s.toString)
   CharVarcharUtils.failIfHasCharVarchar(dataType)
-case _ => throw QueryCompilationErrors.invalidSchemaStringError(exp)
+case _ => throw QueryCompilationErrors.unexpectedSchemaTypeError(exp)
 
   }
 } else {
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index 486bd21b844..ce99bf4aa47 

[spark] branch master updated: [SPARK-41300][CONNECT] Unset schema is interpreted as Schema

2022-11-28 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 40a5751d101 [SPARK-41300][CONNECT] Unset schema is interpreted as 
Schema
40a5751d101 is described below

commit 40a5751d1010e7d811f670131c29a4f40acd7ad2
Author: Martin Grund 
AuthorDate: Mon Nov 28 11:06:06 2022 -0400

[SPARK-41300][CONNECT] Unset schema is interpreted as Schema

### What changes were proposed in this pull request?
When a query reads from a DataSource through Spark Connect without an explicit 
schema, the schema field arrives as an empty string. During processing it was 
treated as if it were set, and the query failed because no schema can be parsed 
from an empty string.

This patch fixes this issue and adds the relevant test for it.
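
A minimal sketch of the scenario this fixes, assuming a Spark Connect session bound 
to `spark` and a hypothetical CSV directory `/tmp/people`:

```
# No explicit schema is passed, so the Connect proto carries an empty schema
# string; with this fix the server infers the schema instead of failing.
df = spark.read.format("csv").option("header", True).load("/tmp/people")
df.collect()
```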

### Why are the changes needed?
Bugfix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38821 from grundprinzip/SPARK-41300.

Authored-by: Martin Grund 
Signed-off-by: Herman van Hovell 
---
 .../spark/sql/connect/planner/SparkConnectPlanner.scala  |  2 +-
 python/pyspark/sql/tests/connect/test_connect_basic.py   | 12 
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index fa5a0068c68..b4eaa03df5d 100644
--- 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -300,7 +300,7 @@ class SparkConnectPlanner(session: SparkSession) {
 val reader = session.read
 reader.format(rel.getDataSource.getFormat)
 localMap.foreach { case (key, value) => reader.option(key, value) }
-if (rel.getDataSource.getSchema != null) {
+if (rel.getDataSource.getSchema != null && 
!rel.getDataSource.getSchema.isEmpty) {
   reader.schema(rel.getDataSource.getSchema)
 }
 reader.load().queryExecution.analyzed
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 97ba34d8269..e0f5f23fdb4 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -552,6 +552,18 @@ class SparkConnectTests(SparkConnectSQLTestCase):
 actualResult = pandasResult.values.tolist()
 self.assertEqual(len(expectResult), len(actualResult))
 
+def test_simple_read_without_schema(self) -> None:
+"""SPARK-41300: Schema not set when reading CSV."""
+writeDf = self.df_text
+tmpPath = tempfile.mkdtemp()
+shutil.rmtree(tmpPath)
+writeDf.write.csv(tmpPath, header=True)
+
+readDf = self.connect.read.format("csv").option("header", 
True).load(path=tmpPath)
+expectResult = set(writeDf.collect())
+pandasResult = set(readDf.collect())
+self.assertEqual(expectResult, pandasResult)
+
 def test_simple_transform(self) -> None:
 """SPARK-41203: Support DF.transform"""
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [SPARK-41254][YARN] bugfix wrong usage when check YarnAllocator.rpIdToYarnResource key existence

2022-11-28 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 090bebd6a63 [SPARK-41254][YARN] bugfix wrong usage when check 
YarnAllocator.rpIdToYarnResource key existence
090bebd6a63 is described below

commit 090bebd6a63fdd69b14d08c459fd5bd2301948e4
Author: John Caveman 
AuthorDate: Mon Nov 28 08:25:00 2022 -0600

[SPARK-41254][YARN] bugfix wrong usage when check 
YarnAllocator.rpIdToYarnResource key existence

### What changes were proposed in this pull request?
Bug fix: `ConcurrentHashMap.contains` checks values rather than keys, so the 
existence check never matched and the map `YarnAllocator.rpIdToYarnResource` 
was updated on every call; `containsKey` is the correct method.

### Why are the changes needed?
This caused duplicated logs during YARN resource allocation, as well as 
unnecessary object creation and GC.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #38790 from CavemanIV/SPARK-41254.

Authored-by: John Caveman 
Signed-off-by: Sean Owen 
(cherry picked from commit bccfe5bca600b3091ea93b4c5d6437af8381973f)
Signed-off-by: Sean Owen 
---
 .../src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
index a85b7174673..16cae4810e4 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@@ -278,7 +278,7 @@ private[yarn] class YarnAllocator(
 
   // if a ResourceProfile hasn't been seen yet, create the corresponding YARN 
Resource for it
   private def createYarnResourceForResourceProfile(rp: ResourceProfile): Unit 
= synchronized {
-if (!rpIdToYarnResource.contains(rp.id)) {
+if (!rpIdToYarnResource.containsKey(rp.id)) {
   // track the resource profile if not already there
   getOrUpdateRunningExecutorForRPId(rp.id)
   logInfo(s"Resource profile ${rp.id} doesn't exist, adding it")


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-41254][YARN] bugfix wrong usage when check YarnAllocator.rpIdToYarnResource key existence

2022-11-28 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 19450452dcd [SPARK-41254][YARN] bugfix wrong usage when check 
YarnAllocator.rpIdToYarnResource key existence
19450452dcd is described below

commit 19450452dcd6134853f6c0db8e755e78cddef922
Author: John Caveman 
AuthorDate: Mon Nov 28 08:25:00 2022 -0600

[SPARK-41254][YARN] bugfix wrong usage when check 
YarnAllocator.rpIdToYarnResource key existence

### What changes were proposed in this pull request?
Bug fix: `ConcurrentHashMap.contains` checks values rather than keys, so the 
existence check never matched and the map `YarnAllocator.rpIdToYarnResource` 
was updated on every call; `containsKey` is the correct method.

### Why are the changes needed?
This caused duplicated logs during YARN resource allocation, as well as 
unnecessary object creation and GC.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #38790 from CavemanIV/SPARK-41254.

Authored-by: John Caveman 
Signed-off-by: Sean Owen 
(cherry picked from commit bccfe5bca600b3091ea93b4c5d6437af8381973f)
Signed-off-by: Sean Owen 
---
 .../src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
index 54ab643f275..26535d672d7 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@@ -276,7 +276,7 @@ private[yarn] class YarnAllocator(
 
   // if a ResourceProfile hasn't been seen yet, create the corresponding YARN 
Resource for it
   private def createYarnResourceForResourceProfile(rp: ResourceProfile): Unit 
= synchronized {
-if (!rpIdToYarnResource.contains(rp.id)) {
+if (!rpIdToYarnResource.containsKey(rp.id)) {
   // track the resource profile if not already there
   getOrUpdateRunningExecutorForRPId(rp.id)
   logInfo(s"Resource profile ${rp.id} doesn't exist, adding it")


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41254][YARN] bugfix wrong usage when check YarnAllocator.rpIdToYarnResource key existence

2022-11-28 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bccfe5bca60 [SPARK-41254][YARN] bugfix wrong usage when check 
YarnAllocator.rpIdToYarnResource key existence
bccfe5bca60 is described below

commit bccfe5bca600b3091ea93b4c5d6437af8381973f
Author: John Caveman 
AuthorDate: Mon Nov 28 08:25:00 2022 -0600

[SPARK-41254][YARN] bugfix wrong usage when check 
YarnAllocator.rpIdToYarnResource key existence

### What changes were proposed in this pull request?
Bug fix: `ConcurrentHashMap.contains` checks values rather than keys, so the 
existence check never matched and the map `YarnAllocator.rpIdToYarnResource` 
was updated on every call; `containsKey` is the correct method.

### Why are the changes needed?
This caused duplicated logs during YARN resource allocation, as well as 
unnecessary object creation and GC.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #38790 from CavemanIV/SPARK-41254.

Authored-by: John Caveman 
Signed-off-by: Sean Owen 
---
 .../src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
index a90ab180d86..ee1d10c204a 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@@ -298,7 +298,7 @@ private[yarn] class YarnAllocator(
 
   // if a ResourceProfile hasn't been seen yet, create the corresponding YARN 
Resource for it
   private def createYarnResourceForResourceProfile(rp: ResourceProfile): Unit 
= synchronized {
-if (!rpIdToYarnResource.contains(rp.id)) {
+if (!rpIdToYarnResource.containsKey(rp.id)) {
   // track the resource profile if not already there
   getOrUpdateRunningExecutorForRPId(rp.id)
   logInfo(s"Resource profile ${rp.id} doesn't exist, adding it")


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41114][CONNECT][PYTHON][FOLLOW-UP] Python Client support for local data

2022-11-28 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5fbf7e2e22c [SPARK-41114][CONNECT][PYTHON][FOLLOW-UP] Python Client 
support for local data
5fbf7e2e22c is described below

commit 5fbf7e2e22c92f6a506e88ef6d5b5d5fea2447ea
Author: Martin Grund 
AuthorDate: Mon Nov 28 09:41:03 2022 -0400

[SPARK-41114][CONNECT][PYTHON][FOLLOW-UP] Python Client support for local 
data

### What changes were proposed in this pull request?
Since the Spark Connect server now supports reading local data from the 
client, this patch implements the necessary changes in the Python client to 
support reading from a local Pandas DataFrame.

```
import pandas

# `spark` is an active Spark Connect session and `lit` is the literal helper
# from the Connect client (imports were elided in the original message).
pdf = pandas.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
df = spark.createDataFrame(pdf)
rows = df.filter(df.a == lit(3)).collect()
assert len(rows) == 1
assert rows[0][0] == 3
assert rows[0][1] == "c"
```

### Why are the changes needed?
Compatibility

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38803 from grundprinzip/SPARK-41114.

Authored-by: Martin Grund 
Signed-off-by: Herman van Hovell 
---
 .../sql/connect/planner/SparkConnectPlanner.scala  |  3 ++
 python/pyspark/sql/connect/plan.py | 32 +-
 python/pyspark/sql/connect/session.py  | 30 
 .../sql/tests/connect/test_connect_basic.py| 14 ++
 4 files changed, 78 insertions(+), 1 deletion(-)

diff --git 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index 40d4ecc7556..fa5a0068c68 100644
--- 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -278,6 +278,9 @@ class SparkConnectPlanner(session: SparkSession) {
 val (rows, structType) = ArrowConverters.fromBatchWithSchemaIterator(
   Iterator(rel.getData.toByteArray),
   TaskContext.get())
+if (structType == null) {
+  throw InvalidPlanInput(s"Input data for LocalRelation does not produce a 
schema.")
+}
 val attributes = structType.toAttributes
 val proj = UnsafeProjection.create(attributes, attributes)
 new logical.LocalRelation(attributes, rows.map(r => proj(r).copy()).toSeq)
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index 7a57168fa73..805628cfe5b 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -25,7 +25,8 @@ from typing import (
 TYPE_CHECKING,
 Mapping,
 )
-
+import pandas
+import pyarrow as pa
 import pyspark.sql.connect.proto as proto
 from pyspark.sql.connect.column import (
 Column,
@@ -177,6 +178,35 @@ class Read(LogicalPlan):
 """
 
 
+class LocalRelation(LogicalPlan):
+"""Creates a LocalRelation plan object based on a Pandas DataFrame."""
+
+def __init__(self, pdf: "pandas.DataFrame") -> None:
+super().__init__(None)
+self._pdf = pdf
+
+def plan(self, session: "SparkConnectClient") -> proto.Relation:
+sink = pa.BufferOutputStream()
+table = pa.Table.from_pandas(self._pdf)
+with pa.ipc.new_stream(sink, table.schema) as writer:
+for b in table.to_batches():
+writer.write_batch(b)
+
+plan = proto.Relation()
+plan.local_relation.data = sink.getvalue().to_pybytes()
+return plan
+
+def print(self, indent: int = 0) -> str:
+return f"{' ' * indent}\n"
+
+def _repr_html_(self) -> str:
+return """
+
+LocalRelation
+
+"""
+
+
 class ShowString(LogicalPlan):
 def __init__(
 self, child: Optional["LogicalPlan"], numRows: int, truncate: int, 
vertical: bool
diff --git a/python/pyspark/sql/connect/session.py 
b/python/pyspark/sql/connect/session.py
index 92f58140eac..c9b76cf47f9 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -17,6 +17,7 @@
 
 from threading import RLock
 from typing import Optional, Any, Union, Dict, cast, overload
+import pandas as pd
 
 import pyspark.sql.types
 from pyspark.sql.connect.client import SparkConnectClient
@@ -24,6 +25,7 @@ from pyspark.sql.connect.dataframe import DataFrame
 from pyspark.sql.connect.plan import SQL, Range
 from pyspark.sql.connect.readwriter import DataFrameReader
 from pyspark.sql.utils import to_str
+from . import plan
 from ._typing import OptionalPrimitiveType
 
 
@@ -205,6 

[spark] branch master updated: [SPARK-41293][SQL][TESTS] Code cleanup for `assertXXX` methods in `ExpressionTypeCheckingSuite`

2022-11-28 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c3de4ca1477 [SPARK-41293][SQL][TESTS] Code cleanup for `assertXXX` 
methods in `ExpressionTypeCheckingSuite`
c3de4ca1477 is described below

commit c3de4ca14772fa6dff703b662a561a9e65e23d9e
Author: yangjie01 
AuthorDate: Mon Nov 28 14:55:28 2022 +0300

[SPARK-41293][SQL][TESTS] Code cleanup for `assertXXX` methods in 
`ExpressionTypeCheckingSuite`

### What changes were proposed in this pull request?
This PR does some code cleanup for the `assertXXX` methods in 
`ExpressionTypeCheckingSuite`:

1. Reuse `analysisException` instead of duplicating 
`intercept[AnalysisException](assertSuccess(expr))` in the `assertErrorForXXX` 
methods.
2. Remove the `assertError` method, which is no longer used.
3. Change the access scope of the `assertErrorForXXX` methods to `private`, 
since they are only used in `ExpressionTypeCheckingSuite`.

### Why are the changes needed?
Code clean up.

### Does this PR introduce _any_ user-facing change?
No, just for test

### How was this patch tested?
Pass GitHub Actions

Closes #38820 from LuciferYang/SPARK-41293.

Authored-by: yangjie01 
Signed-off-by: Max Gekk 
---
 .../analysis/ExpressionTypeCheckingSuite.scala | 41 ++
 1 file changed, 11 insertions(+), 30 deletions(-)

diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ExpressionTypeCheckingSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ExpressionTypeCheckingSuite.scala
index d406ec8f74a..6202d1e367a 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ExpressionTypeCheckingSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ExpressionTypeCheckingSuite.scala
@@ -44,65 +44,46 @@ class ExpressionTypeCheckingSuite extends SparkFunSuite 
with SQLHelper with Quer
 intercept[AnalysisException](assertSuccess(expr))
   }
 
-  def assertError(expr: Expression, errorMessage: String): Unit = {
-val e = intercept[AnalysisException] {
-  assertSuccess(expr)
-}
-assert(e.getMessage.contains(
-  s"cannot resolve '${expr.sql}' due to data type mismatch:"))
-assert(e.getMessage.contains(errorMessage))
-  }
-
-  def assertSuccess(expr: Expression): Unit = {
+  private def assertSuccess(expr: Expression): Unit = {
 val analyzed = testRelation.select(expr.as("c")).analyze
 SimpleAnalyzer.checkAnalysis(analyzed)
   }
 
-  def assertErrorForBinaryDifferingTypes(
+  private def assertErrorForBinaryDifferingTypes(
   expr: Expression, messageParameters: Map[String, String]): Unit = {
 checkError(
-  exception = intercept[AnalysisException] {
-assertSuccess(expr)
-  },
+  exception = analysisException(expr),
   errorClass = "DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES",
   parameters = messageParameters)
   }
 
-  def assertErrorForOrderingTypes(
+  private def assertErrorForOrderingTypes(
   expr: Expression, messageParameters: Map[String, String]): Unit = {
 checkError(
-  exception = intercept[AnalysisException] {
-assertSuccess(expr)
-  },
+  exception = analysisException(expr),
   errorClass = "DATATYPE_MISMATCH.INVALID_ORDERING_TYPE",
   parameters = messageParameters)
   }
 
-  def assertErrorForDataDifferingTypes(
+  private def assertErrorForDataDifferingTypes(
   expr: Expression, messageParameters: Map[String, String]): Unit = {
 checkError(
-  exception = intercept[AnalysisException] {
-assertSuccess(expr)
-  },
+  exception = analysisException(expr),
   errorClass = "DATATYPE_MISMATCH.DATA_DIFF_TYPES",
   parameters = messageParameters)
   }
 
-  def assertErrorForWrongNumParameters(
+  private def assertErrorForWrongNumParameters(
   expr: Expression, messageParameters: Map[String, String]): Unit = {
 checkError(
-  exception = intercept[AnalysisException] {
-assertSuccess(expr)
-  },
+  exception = analysisException(expr),
   errorClass = "DATATYPE_MISMATCH.WRONG_NUM_ARGS",
   parameters = messageParameters)
   }
 
-  def assertForWrongType(expr: Expression, messageParameters: Map[String, 
String]): Unit = {
+  private def assertForWrongType(expr: Expression, messageParameters: 
Map[String, String]): Unit = {
 checkError(
-  exception = intercept[AnalysisException] {
-assertSuccess(expr)
-  },
+  exception = analysisException(expr),
   errorClass = "DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE",
   parameters = messageParameters)
   }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional 

[spark-docker] branch master updated: [SPARK-41287][INFRA] Add test workflow to help self-build image test in fork repo

2022-11-28 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new cfcbeac  [SPARK-41287][INFRA] Add test workflow to help self-build 
image test in fork repo
cfcbeac is described below

commit cfcbeac5d2b922a5ee7dfd2b4a5cf08072c827b7
Author: Yikun Jiang 
AuthorDate: Mon Nov 28 17:55:18 2022 +0800

[SPARK-41287][INFRA] Add test workflow to help self-build image test in 
fork repo

### What changes were proposed in this pull request?
This patch adds a test workflow so that developers can test self-built images 
in their fork repos.


![image](https://user-images.githubusercontent.com/1736354/204183109-e2341397-251e-42a0-b5f7-c1c1f9334ff9.png)

such like:
- 
https://github.com/Yikun/spark-docker/actions/runs/3552072792/jobs/5966742869
- 
https://github.com/Yikun/spark-docker/actions/runs/3561513498/jobs/5982485960

### Why are the changes needed?
It helps devs/users test their own images in their fork repos.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test in my fork repo:
https://github.com/Yikun/spark-docker/actions/workflows/test.yml

Closes #26 from Yikun/test-workflow.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/main.yml  | 28 +++--
 .github/workflows/publish.yml   |  2 +-
 .github/workflows/{publish.yml => test.yml} | 62 -
 3 files changed, 60 insertions(+), 32 deletions(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index ebafcdc..fd37990 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -37,13 +37,18 @@ on:
 required: true
 type: string
 default: 11
+  build:
+description: Build the image or not.
+required: false
+type: boolean
+default: true
   publish:
 description: Publish the image or not.
 required: false
 type: boolean
 default: false
   repository:
-description: The registry to be published (Avaliable only when publish 
is selected).
+description: The registry to be published/tested. (Available only in 
publish/test workflow)
 required: false
 type: string
 default: ghcr.io/apache/spark-docker
@@ -52,6 +57,11 @@ on:
 required: false
 type: string
 default: python
+  image-tag:
+type: string
+description: The image tag to be tested. (Available only in test 
workflow)
+required: false
+default: latest
 
 jobs:
   main:
@@ -83,11 +93,18 @@ jobs:
   esac
   TAG=scala${{ inputs.scala }}-java${{ inputs.java }}-$SUFFIX
 
-  REPO_OWNER=$(echo "${{ github.repository_owner }}" | tr '[:upper:]' 
'[:lower:]')
-  TEST_REPO=localhost:5000/$REPO_OWNER/spark-docker
   IMAGE_NAME=spark
   IMAGE_PATH=${{ inputs.spark }}/$TAG
-  UNIQUE_IMAGE_TAG=${{ inputs.spark }}-$TAG
+  if [ "${{ inputs.build }}" == "true" ]; then
+# Use the local registry to build and test
+REPO_OWNER=$(echo "${{ github.repository_owner }}" | tr 
'[:upper:]' '[:lower:]')
+TEST_REPO=localhost:5000/$REPO_OWNER/spark-docker
+UNIQUE_IMAGE_TAG=${{ inputs.spark }}-$TAG
+  else
+# Use specified {repository}/spark:{image-tag} image to test
+TEST_REPO=${{ inputs.repository }}
+UNIQUE_IMAGE_TAG=${{ inputs.image-tag }}
+  fi
   IMAGE_URL=$TEST_REPO/$IMAGE_NAME:$UNIQUE_IMAGE_TAG
 
   PUBLISH_REPO=${{ inputs.repository }}
@@ -119,15 +136,18 @@ jobs:
   echo "PUBLISH_IMAGE_URL:"${PUBLISH_IMAGE_URL}
 
   - name: Build - Set up QEMU
+if: ${{ inputs.build }}
 uses: docker/setup-qemu-action@v2
 
   - name: Build - Set up Docker Buildx
+if: ${{ inputs.build }}
 uses: docker/setup-buildx-action@v2
 with:
   # This required by local registry
   driver-opts: network=host
 
   - name: Build - Build and push test image
+if: ${{ inputs.build }}
 uses: docker/build-push-action@v3
 with:
   context: ${{ env.IMAGE_PATH }}
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
index 4a07f5d..2941cfb 100644
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -36,7 +36,7 @@ on:
 type: boolean
 required: true
   repository:
-description: The registry to be published (Avaliable only when publish 
is true).
+description: The registry to be published (Available only when publish 
is true).
 required: false
 default: ghcr.io/apache/spark-docker
  

[spark] branch master updated: [SPARK-41273][BUILD] Update plugins to latest versions

2022-11-28 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 604700c400e [SPARK-41273][BUILD] Update plugins to latest versions
604700c400e is described below

commit 604700c400ef97b86e3025f8ac026b469eaf96ee
Author: panbingkun 
AuthorDate: Mon Nov 28 17:33:08 2022 +0800

[SPARK-41273][BUILD] Update plugins to latest versions

### What changes were proposed in this pull request?
This PR updates plugins to latest versions.

### Why are the changes needed?
This brings improvements and bug fixes like the following:

- maven-install-plugin (from 3.0.0-M1 to 3.1.0)
https://github.com/apache/maven-install-plugin/releases

https://github.com/apache/maven-install-plugin/compare/maven-install-plugin-3.0.0-M1...maven-install-plugin-3.1.0

- maven-deploy-plugin (from 3.0.0-M1 to 3.0.0)
https://github.com/apache/maven-deploy-plugin/releases

https://github.com/apache/maven-deploy-plugin/compare/maven-deploy-plugin-3.0.0-M1...maven-deploy-plugin-3.0.0

- maven-javadoc-plugin (from 3.4.0 to 3.4.1)
https://github.com/apache/maven-javadoc-plugin/releases

https://github.com/apache/maven-javadoc-plugin/compare/maven-javadoc-plugin-3.4.0...maven-javadoc-plugin-3.4.1

- maven-jar-plugin (from 3.2.2 to 3.3.0)
https://github.com/apache/maven-jar-plugin/releases

https://github.com/apache/maven-jar-plugin/compare/maven-jar-plugin-3.2.2...maven-jar-plugin-3.3.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA; tested with the existing code.

Closes #38809 from panbingkun/SPARK-41273.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 pom.xml | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pom.xml b/pom.xml
index b78a9f18809..f47bedb18e7 100644
--- a/pom.xml
+++ b/pom.xml
@@ -3082,7 +3082,7 @@
 
   org.apache.maven.plugins
   maven-javadoc-plugin
-  3.4.0
+  3.4.1
   
 
   -Xdoclint:all
@@ -3155,12 +3155,12 @@
 
   org.apache.maven.plugins
   maven-install-plugin
-  3.0.0-M1
+  3.1.0
 
 
   org.apache.maven.plugins
   maven-deploy-plugin
-  3.0.0-M1
+  3.0.0
 
 
   org.apache.maven.plugins
@@ -3773,7 +3773,7 @@
   
 org.apache.maven.plugins
 maven-jar-plugin
-3.2.2
+3.3.0
 
   test-jar
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [MINOR][DOC] Fix typo in pydoc for pyspark.sql.function.from_utc_timestamp

2022-11-28 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f7cfce5e237 [MINOR][DOC] Fix typo in pydoc for 
pyspark.sql.function.from_utc_timestamp
f7cfce5e237 is described below

commit f7cfce5e237adccff28c8d2c02423e5afb57e00c
Author: Fredrik Mile 
AuthorDate: Mon Nov 28 17:31:03 2022 +0800

[MINOR][DOC] Fix typo in pydoc for pyspark.sql.function.from_utc_timestamp

### What changes were proposed in this pull request?

Typo fix in pydoc for pyspark.sql.function.from_utc_timestamp

upported -> supported

### Why are the changes needed?

Great documentation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #38817 from mile95/master.

Authored-by: Fredrik Mile 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/functions.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index ad1bc488e87..3aeb48adea7 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -4737,7 +4737,7 @@ def to_utc_timestamp(timestamp: "ColumnOrName", tz: 
"ColumnOrName") -> Column:
 be in the format of either region-based zone IDs or zone offsets. 
Region IDs must
 have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets 
must be in
 the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' 
and 'Z' are
-upported as aliases of '+00:00'. Other short names are not recommended 
to use
+supported as aliases of '+00:00'. Other short names are not 
recommended to use
 because they can be ambiguous.
 
 .. versionchanged:: 2.4.0


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41272][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_2019

2022-11-28 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d979736a9eb [SPARK-41272][SQL] Assign a name to the error class 
_LEGACY_ERROR_TEMP_2019
d979736a9eb is described below

commit d979736a9eb754725d33fd5baca88a1c1a8c23ce
Author: panbingkun 
AuthorDate: Mon Nov 28 12:01:02 2022 +0300

[SPARK-41272][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_2019

### What changes were proposed in this pull request?
In the PR, I propose to assign the name `NULL_MAP_KEY` to the error class 
`_LEGACY_ERROR_TEMP_2019`.
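
A minimal sketch of how the error surfaces, assuming a plain local SparkSession; 
with this change the `SparkRuntimeException` is expected to report the 
`NULL_MAP_KEY` error class:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

try:
    # Map keys must not be null, so building this map fails at runtime.
    spark.sql("SELECT map(CAST(NULL AS STRING), 1)").collect()
except Exception as e:
    print(e)
```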

### Why are the changes needed?
Proper names of error classes should improve user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #38808 from panbingkun/LEGACY_2019.

Authored-by: panbingkun 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   | 10 ++---
 .../spark/sql/errors/QueryExecutionErrors.scala|  2 +-
 .../catalyst/encoders/ExpressionEncoderSuite.scala | 20 +++---
 .../expressions/CollectionExpressionsSuite.scala   | 10 ++---
 .../catalyst/expressions/ComplexTypeSuite.scala| 11 +++---
 .../expressions/ExpressionEvalHelper.scala | 43 +-
 .../expressions/ObjectExpressionsSuite.scala   | 10 ++---
 .../catalyst/util/ArrayBasedMapBuilderSuite.scala  |  8 +++-
 .../apache/spark/sql/DataFrameFunctionsSuite.scala | 38 ++-
 9 files changed, 113 insertions(+), 39 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 1246e870e0d..9f4337d0618 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -895,6 +895,11 @@
   "The comparison result is null. If you want to handle null as 0 (equal), 
you can set \"spark.sql.legacy.allowNullComparisonResultInArraySort\" to 
\"true\"."
 ]
   },
+  "NULL_MAP_KEY" : {
+"message" : [
+  "Cannot use null as map key."
+]
+  },
   "NUMERIC_OUT_OF_SUPPORTED_RANGE" : {
 "message" : [
   "The value  cannot be interpreted as a numeric since it has more 
than 38 digits."
@@ -3504,11 +3509,6 @@
   "class `` is not supported by `MapObjects` as resulting collection."
 ]
   },
-  "_LEGACY_ERROR_TEMP_2019" : {
-"message" : [
-  "Cannot use null as map key!"
-]
-  },
   "_LEGACY_ERROR_TEMP_2020" : {
 "message" : [
   "Couldn't find a valid constructor on "
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
index 5db54f7f4cf..15dfa581c59 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
@@ -444,7 +444,7 @@ private[sql] object QueryExecutionErrors extends 
QueryErrorsBase {
 
   def nullAsMapKeyNotAllowedError(): SparkRuntimeException = {
 new SparkRuntimeException(
-  errorClass = "_LEGACY_ERROR_TEMP_2019",
+  errorClass = "NULL_MAP_KEY",
   messageParameters = Map.empty)
   }
 
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
index 9b481b13fee..e9336405a53 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
@@ -24,7 +24,7 @@ import java.util.Arrays
 import scala.collection.mutable.ArrayBuffer
 import scala.reflect.runtime.universe.TypeTag
 
-import org.apache.spark.SparkArithmeticException
+import org.apache.spark.{SparkArithmeticException, SparkRuntimeException}
 import org.apache.spark.sql.{Encoder, Encoders}
 import org.apache.spark.sql.catalyst.{FooClassWithEnum, FooEnum, OptionalData, 
PrimitiveData, ScroogeLikeExample}
 import org.apache.spark.sql.catalyst.analysis.AnalysisTest
@@ -539,14 +539,24 @@ class ExpressionEncoderSuite extends 
CodegenInterpretedPlanTest with AnalysisTes
 
   test("null check for map key: String") {
 val toRow = ExpressionEncoder[Map[String, Int]]().createSerializer()
-val e = intercept[RuntimeException](toRow(Map(("a", 1), (null, 2
-assert(e.getMessage.contains("Cannot use null as map key"))
+val e = intercept[SparkRuntimeException](toRow(Map(("a", 1), (null, 2
+assert(e.getCause.isInstanceOf[SparkRuntimeException])
+checkError(
+  exception = 

[spark] branch master updated (1a8f2952225 -> c1df6a0ddd8)

2022-11-28 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1a8f2952225 [SPARK-41291][CONNECT][PYTHON] `DataFrame.explain` should print and return None
 add c1df6a0ddd8 [SPARK-41003][SQL] BHJ LeftAnti does not update 
numOutputRows when codegen is disabled

No new revisions were added by this update.

Summary of changes:
 .../execution/joins/BroadcastHashJoinExec.scala|  3 +++
 .../sql/execution/metric/SQLMetricsSuite.scala | 25 +-
 2 files changed, 27 insertions(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org