[spark] branch branch-3.2 updated (affd7a4 -> 4543ac6)

2021-09-21 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.


from affd7a4  [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency
 add 4543ac6  [SPARK-36771][PYTHON][3.2] Fix `pop` of Categorical Series

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/series.py|  8 ++--
 python/pyspark/pandas/tests/test_series.py | 25 +
 2 files changed, 31 insertions(+), 2 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-36771][PYTHON] Fix `pop` of Categorical Series

2021-09-21 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 079a9c5  [SPARK-36771][PYTHON] Fix `pop` of Categorical Series
079a9c5 is described below

commit 079a9c52925818532b57c9cec1ddd31be723885e
Author: Xinrong Meng 
AuthorDate: Tue Sep 21 14:11:21 2021 -0700

[SPARK-36771][PYTHON] Fix `pop` of Categorical Series

### What changes were proposed in this pull request?
Fix `pop` of Categorical Series to be consistent with the latest pandas 
(1.3.2) behavior.

### Why are the changes needed?
As reported in https://github.com/databricks/koalas/issues/2198, pandas API on Spark 
behaves differently from pandas for `pop` on a Categorical Series.
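For context, here is a minimal plain-pandas sketch (not part of this patch) of why the old behavior surfaced the category code instead of the label: a Categorical stores integer codes alongside a separate list of categories, and the fix below maps the stored code back through `self.dtype.categories`.
```py
import pandas as pd

# A pandas Categorical stores integer codes plus a separate list of labels.
ser = pd.Series(["a", "b", "c", "a"], dtype="category")
print(ser.cat.codes.tolist())         # [0, 1, 2, 0]  <- what is stored internally
print(ser.dtype.categories.tolist())  # ['a', 'b', 'c']
print(ser.dtype.categories[0])        # 'a'  <- code mapped back to its label
```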

### Does this PR introduce _any_ user-facing change?
Yes, results of `pop` of Categorical Series change.

From
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
0
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
0
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```

To
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
'a'
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
'a'
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```

### How was this patch tested?
Unit tests.

Closes #34052 from xinrong-databricks/cat_pop.

Authored-by: Xinrong Meng 
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/series.py|  8 ++--
 python/pyspark/pandas/tests/test_series.py | 25 +
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index d72c08d..da0d2fb 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -47,7 +47,7 @@ import numpy as np
 import pandas as pd
 from pandas.core.accessor import CachedAccessor
 from pandas.io.formats.printing import pprint_thing
-from pandas.api.types import is_list_like, is_hashable
+from pandas.api.types import is_list_like, is_hashable, CategoricalDtype
 from pandas.tseries.frequencies import DateOffset
 from pyspark.sql import functions as F, Column, DataFrame as SparkDataFrame
 from pyspark.sql.types import (
@@ -4098,7 +4098,11 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
             pdf = sdf.limit(2).toPandas()
             length = len(pdf)
             if length == 1:
-                return pdf[internal.data_spark_column_names[0]].iloc[0]
+                val = pdf[internal.data_spark_column_names[0]].iloc[0]
+                if isinstance(self.dtype, CategoricalDtype):
+                    return self.dtype.categories[val]
+                else:
+                    return val
 
             item_string = name_like_string(item)
             sdf = sdf.withColumn(SPARK_DEFAULT_INDEX_NAME, SF.lit(str(item_string)))
diff --git a/python/pyspark/pandas/tests/test_series.py 
b/python/pyspark/pandas/tests/test_series.py
index 09e5d30..b7bb121 100644
--- a/python/pyspark/pandas/tests/test_series.py
+++ b/python/pyspark/pandas/tests/test_series.py
@@ -1669,6 +1669,31 @@ class SeriesTest(PandasOnSparkTestCase, SQLTestUtils):
         with self.assertRaisesRegex(KeyError, msg):
             psser.pop(("lama", "speed", "x"))
 
+        pser = pd.Series(["a", "b", "c", "a"], dtype="category")
+        psser = ps.from_pandas(pser)
+
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3.0"):
+            self.assert_eq(psser.pop(0), pser.pop(0))
+            self.assert_eq(psser, pser)
+
+            self.assert_eq(psser.pop(3), pser.pop(3))
+            self.assert_eq(psser, pser)
+        else:
+            # Before pandas 1.3.0, `pop` modifies the dtype of categorical series wrongly.
+            self.assert_eq(psser.pop(0), "a")
+            self.assert_eq(
+                psser,
+                pd.Series(
+                    pd.Categorical(["b", "c", "a"], categories=["a", "b", "c"]), index=[1, 2, 3]
+                ),
+            )
+
+            self.assert_eq(psser.pop(3), "a")
+            self.assert_eq(
+                psser,
+                pd.Series(pd.Categorical(["b", "c"], categories=["a", "b", "c"]), index=[1, 2]),
+            )
+
     def test_replace(self):
 

[spark] branch master updated: [SPARK-36615][CORE] Register shutdown hook earlier when start SC

2021-09-21 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b7d99e3  [SPARK-36615][CORE] Register shutdown hook earlier when start 
SC
b7d99e3 is described below

commit b7d99e3eea5f9c0b3d11ec578d6aa0720c256eeb
Author: Angerszh 
AuthorDate: Tue Sep 21 13:23:14 2021 -0500

[SPARK-36615][CORE] Register shutdown hook earlier when start SC

### What changes were proposed in this pull request?
Users often press Ctrl+C to stop a SparkContext (SC) that is still starting, for 
example while it is registering with YARN in client mode and resources are tight.

At that point the SC has not yet registered its shutdown hook, so `sc.stop()` is 
never invoked when the application exits.
We should register the shutdown hook earlier when starting a SparkContext.
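As a rough analogy (plain Python, not the actual Scala change), the ordering matters because the cleanup hook must be in place before the slow, interruptible part of startup begins:
```py
import atexit
import time

class Context:
    def __init__(self):
        self.stopped = False
        # Register the "shutdown hook" first ...
        atexit.register(self.stop)
        # ... then do the slow startup the user may interrupt with Ctrl+C.
        # Because the hook is already registered, cleanup still runs on exit.
        time.sleep(60)  # stand-in for waiting on YARN resources

    def stop(self):
        if not self.stopped:
            self.stopped = True
            print("stopping context, releasing resources")

if __name__ == "__main__":
    Context()
```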

### Why are the changes needed?

Make sure we invoke `sc.stop()` when a starting SparkContext application is killed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #33869 from AngersZh/SPARK-36615.

Authored-by: Angerszh 
Signed-off-by: Mridul Muralidharan 
---
 .../main/scala/org/apache/spark/SparkContext.scala | 29 +++---
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 3404a0f..e27499a15 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -645,20 +645,6 @@ class SparkContext(config: SparkConf) extends Logging {
     // Attach the driver metrics servlet handler to the web ui after the metrics system is started.
     _env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))
 
-    // Post init
-    _taskScheduler.postStartHook()
-    if (isLocal) {
-      _env.metricsSystem.registerSource(Executor.executorSourceLocalModeOnly)
-    }
-    _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
-    _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
-    _env.metricsSystem.registerSource(new JVMCPUSource())
-    _executorMetricsSource.foreach(_.register(_env.metricsSystem))
-    _executorAllocationManager.foreach { e =>
-      _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
-    }
-    appStatusSource.foreach(_env.metricsSystem.registerSource(_))
-    _plugins.foreach(_.registerMetrics(applicationId))
     // Make sure the context is stopped if the user forgets about it. This avoids leaving
     // unfinished event logs around after the JVM exits cleanly. It doesn't help if the JVM
     // is killed, though.
@@ -673,6 +659,21 @@ class SparkContext(config: SparkConf) extends Logging {
           logWarning("Ignoring Exception while stopping SparkContext from shutdown hook", e)
       }
     }
+
+    // Post init
+    _taskScheduler.postStartHook()
+    if (isLocal) {
+      _env.metricsSystem.registerSource(Executor.executorSourceLocalModeOnly)
+    }
+    _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
+    _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
+    _env.metricsSystem.registerSource(new JVMCPUSource())
+    _executorMetricsSource.foreach(_.register(_env.metricsSystem))
+    _executorAllocationManager.foreach { e =>
+      _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
+    }
+    appStatusSource.foreach(_env.metricsSystem.registerSource(_))
+    _plugins.foreach(_.registerMetrics(applicationId))
   } catch {
     case NonFatal(e) =>
       logError("Error initializing SparkContext.", e)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency

2021-09-21 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new affd7a4  [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency
affd7a4 is described below

commit affd7a4d47576be863c13df2c6e8068fcea3ba7c
Author: Gengliang Wang 
AuthorDate: Tue Sep 21 10:57:20 2021 -0700

[SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency

### What changes were proposed in this pull request?

Remove `com.github.rdblue:brotli-codec:0.1.1` dependency.

### Why are the changes needed?

As Stephen Coy pointed out on the dev list, we should not have the 
`com.github.rdblue:brotli-codec:0.1.1` dependency, which is not available on 
Maven Central; removing it avoids possible artifact changes on `Jitpack.io`.
Also, the dependency is used for tests only. I suggest removing it now to 
unblock the 3.2.0 release ASAP.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests.

Closes #34059 from gengliangwang/removeDeps.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit ba5708d944c5e38db750ad480668e524672ee963)
Signed-off-by: Dongjoon Hyun 
---
 pom.xml  | 6 --
 project/SparkBuild.scala | 4 +---
 sql/core/pom.xml | 6 --
 .../spark/sql/execution/datasources/FileSourceCodecSuite.scala   | 9 +++--
 4 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/pom.xml b/pom.xml
index 5adbe8a..4a3bd71 100644
--- a/pom.xml
+++ b/pom.xml
@@ -300,12 +300,6 @@
 false
   
 
-
-  jitpack.io
-  https://jitpack.io
-  Jitpack.io repository
-  
-
   
   
 
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index b9068cc..b1531a6 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -274,9 +274,7 @@ object SparkBuild extends PomBuild {
   "gcs-maven-central-mirror" at 
"https://maven-central.storage-download.googleapis.com/maven2/;,
   DefaultMavenRepository,
   Resolver.mavenLocal,
-  Resolver.file("ivyLocal", file(Path.userHome.absolutePath + 
"/.ivy2/local"))(Resolver.ivyStylePatterns),
-  // needed for brotli-codec
-  "jitpack.io" at "https://jitpack.io;
+  Resolver.file("ivyLocal", file(Path.userHome.absolutePath + 
"/.ivy2/local"))(Resolver.ivyStylePatterns)
 ),
 externalResolvers := resolvers.value,
 otherResolvers := SbtPomKeys.mvnLocalRepository(dotM2 => 
Seq(Resolver.file("dotM2", dotM2))).value,
diff --git a/sql/core/pom.xml b/sql/core/pom.xml
index 3cdd9a4..e023377 100644
--- a/sql/core/pom.xml
+++ b/sql/core/pom.xml
@@ -184,12 +184,6 @@
   htmlunit-driver
   test
 
-
-  com.github.rdblue
-  brotli-codec
-  0.1.1
-  test
-
   
   
 
target/scala-${scala.binary.version}/classes
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
index 3c226d6..ac1fd1c 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
@@ -56,13 +56,10 @@ class ParquetCodecSuite extends FileSourceCodecSuite {
   override def format: String = "parquet"
   override val codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
   // Exclude "lzo" because it is GPL-licenced so not included in Hadoop.
+  // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
+  // on Maven Central.
   override protected def availableCodecs: Seq[String] =
-    if (System.getProperty("os.arch") == "aarch64") {
-      // Exclude "brotli" due to PARQUET-1975.
-      Seq("none", "uncompressed", "snappy", "lz4", "gzip", "zstd")
-    } else {
-      Seq("none", "uncompressed", "snappy", "lz4", "gzip", "brotli", "zstd")
-    }
+    Seq("none", "uncompressed", "snappy", "lz4", "gzip", "zstd")
 }
 
 class OrcCodecSuite extends FileSourceCodecSuite {

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency

2021-09-21 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ba5708d  [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency
ba5708d is described below

commit ba5708d944c5e38db750ad480668e524672ee963
Author: Gengliang Wang 
AuthorDate: Tue Sep 21 10:57:20 2021 -0700

[SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency

### What changes were proposed in this pull request?

Remove `com.github.rdblue:brotli-codec:0.1.1` dependency.

### Why are the changes needed?

As Stephen Coy pointed out on the dev list, we should not have the 
`com.github.rdblue:brotli-codec:0.1.1` dependency, which is not available on 
Maven Central; removing it avoids possible artifact changes on `Jitpack.io`.
Also, the dependency is used for tests only. I suggest removing it now to 
unblock the 3.2.0 release ASAP.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests.

Closes #34059 from gengliangwang/removeDeps.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
---
 pom.xml  | 6 --
 project/SparkBuild.scala | 4 +---
 sql/core/pom.xml | 6 --
 .../spark/sql/execution/datasources/FileSourceCodecSuite.scala   | 9 +++--
 4 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/pom.xml b/pom.xml
index b99f6af..a849b74 100644
--- a/pom.xml
+++ b/pom.xml
@@ -300,12 +300,6 @@
 false
   
 
-
-  jitpack.io
-  https://jitpack.io
-  Jitpack.io repository
-  
-
   
   
 
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index b9068cc..b1531a6 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -274,9 +274,7 @@ object SparkBuild extends PomBuild {
   "gcs-maven-central-mirror" at 
"https://maven-central.storage-download.googleapis.com/maven2/;,
   DefaultMavenRepository,
   Resolver.mavenLocal,
-  Resolver.file("ivyLocal", file(Path.userHome.absolutePath + 
"/.ivy2/local"))(Resolver.ivyStylePatterns),
-  // needed for brotli-codec
-  "jitpack.io" at "https://jitpack.io;
+  Resolver.file("ivyLocal", file(Path.userHome.absolutePath + 
"/.ivy2/local"))(Resolver.ivyStylePatterns)
 ),
 externalResolvers := resolvers.value,
 otherResolvers := SbtPomKeys.mvnLocalRepository(dotM2 => 
Seq(Resolver.file("dotM2", dotM2))).value,
diff --git a/sql/core/pom.xml b/sql/core/pom.xml
index f9c12d1..42a3d5e 100644
--- a/sql/core/pom.xml
+++ b/sql/core/pom.xml
@@ -184,12 +184,6 @@
   htmlunit-driver
   test
 
-
-  com.github.rdblue
-  brotli-codec
-  0.1.1
-  test
-
   
   
 
target/scala-${scala.binary.version}/classes
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
index 3c226d6..ac1fd1c 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
@@ -56,13 +56,10 @@ class ParquetCodecSuite extends FileSourceCodecSuite {
   override def format: String = "parquet"
   override val codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
   // Exclude "lzo" because it is GPL-licenced so not included in Hadoop.
+  // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
+  // on Maven Central.
   override protected def availableCodecs: Seq[String] =
-    if (System.getProperty("os.arch") == "aarch64") {
-      // Exclude "brotli" due to PARQUET-1975.
-      Seq("none", "uncompressed", "snappy", "lz4", "gzip", "zstd")
-    } else {
-      Seq("none", "uncompressed", "snappy", "lz4", "gzip", "brotli", "zstd")
-    }
+    Seq("none", "uncompressed", "snappy", "lz4", "gzip", "zstd")
 }
 
 class OrcCodecSuite extends FileSourceCodecSuite {

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-36769][PYTHON] Improve `filter` of single-indexed DataFrame

2021-09-21 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 33e463c  [SPARK-36769][PYTHON] Improve `filter` of single-indexed 
DataFrame
33e463c is described below

commit 33e463ccf99d09ad8a743d32104f590e204da93d
Author: Xinrong Meng 
AuthorDate: Tue Sep 21 10:20:15 2021 -0700

[SPARK-36769][PYTHON] Improve `filter` of single-indexed DataFrame

### What changes were proposed in this pull request?
Improve `filter` of single-indexed DataFrame by replacing a long Project 
with Filter or Join.

### Why are the changes needed?
When the given `items` has many elements, a long Project is introduced.
We can replace it with `Column.isin` or with a join, depending on the length 
of `items`, for better performance.
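A hedged PySpark sketch of the two strategies (the column name and threshold variable here are illustrative; the actual option is `compute.isin_limit`, as shown in the diff below):
```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(1000).withColumnRenamed("id", "index")
items = [3, 5, 7]
isin_limit = 80  # stand-in for ps.get_option("compute.isin_limit")

if len(items) <= isin_limit:
    # Few items: a single Column.isin predicate keeps the plan small.
    filtered = sdf.where(F.col("index").isin(items))
else:
    # Many items: broadcast the items and use a left-semi join instead of
    # building one huge isin/OR expression.
    item_df = spark.createDataFrame([(i,) for i in items], ["__item__"])
    filtered = sdf.join(
        F.broadcast(item_df), on=sdf["index"] == item_df["__item__"], how="left_semi"
    )

filtered.show()
```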

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33998 from xinrong-databricks/impr_filter.

Authored-by: Xinrong Meng 
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/frame.py| 28 +++
 python/pyspark/pandas/tests/test_dataframe.py |  7 +++
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 09efef2..cba1db1 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -9995,13 +9995,25 @@ defaultdict(<class 'list'>, {'col..., 'col...})]
                 raise ValueError("items should be a list-like object.")
             if axis == 0:
                 if len(index_scols) == 1:
-                    col = None
-                    for item in items:
-                        if col is None:
-                            col = index_scols[0] == SF.lit(item)
-                        else:
-                            col = col | (index_scols[0] == SF.lit(item))
-                elif len(index_scols) > 1:
+                    if len(items) <= ps.get_option("compute.isin_limit"):
+                        col = index_scols[0].isin([SF.lit(item) for item in items])
+                        return DataFrame(self._internal.with_filter(col))
+                    else:
+                        item_sdf_col = verify_temp_column_name(
+                            self._internal.spark_frame, "__item__"
+                        )
+                        item_sdf = default_session().createDataFrame(
+                            pd.DataFrame({item_sdf_col: items})
+                        )
+                        joined_sdf = self._internal.spark_frame.join(
+                            other=F.broadcast(item_sdf),
+                            on=(index_scols[0] == scol_for(item_sdf, item_sdf_col)),
+                            how="semi",
+                        )
+
+                        return DataFrame(self._internal.with_new_sdf(joined_sdf))
+
+                else:
                     # for multi-index
                     col = None
                     for item in items:
@@ -10019,7 +10031,7 @@ defaultdict(<class 'list'>, {'col..., 'col...})]
                             col = midx_col
                         else:
                             col = col | midx_col
-                return DataFrame(self._internal.with_filter(col))
+                    return DataFrame(self._internal.with_filter(col))
         else:
             return self[items]
     elif like is not None:
diff --git a/python/pyspark/pandas/tests/test_dataframe.py 
b/python/pyspark/pandas/tests/test_dataframe.py
index 20aecc2..3cfbc03 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -4313,6 +4313,13 @@ class DataFrameTest(PandasOnSparkTestCase, SQLTestUtils):
             psdf.filter(items=["ab", "aa"], axis=0).sort_index(),
             pdf.filter(items=["ab", "aa"], axis=0).sort_index(),
         )
+
+        with option_context("compute.isin_limit", 0):
+            self.assert_eq(
+                psdf.filter(items=["ab", "aa"], axis=0).sort_index(),
+                pdf.filter(items=["ab", "aa"], axis=0).sort_index(),
+            )
+
         self.assert_eq(
             psdf.filter(items=["ba", "db"], axis=1).sort_index(),
             pdf.filter(items=["ba", "db"], axis=1).sort_index(),

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-36814][SQL] Make class ColumnarBatch extendable

2021-09-21 Thread dbtsai
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 688b95b  [SPARK-36814][SQL] Make class ColumnarBatch extendable
688b95b is described below

commit 688b95b136571fa559f26e6582fc3fc296f9d1bf
Author: Yufei Gu 
AuthorDate: Tue Sep 21 15:24:55 2021 +

[SPARK-36814][SQL] Make class ColumnarBatch extendable

### What changes were proposed in this pull request?
Change class ColumnarBatch to a non-final class

### Why are the changes needed?
To support better vectorized reading across multiple data sources, ColumnarBatch 
needs to be extendable. For example, to support row-level deletes 
(https://github.com/apache/iceberg/issues/3141) in Iceberg's vectorized read path, we 
need to filter out deleted rows in a batch, which requires ColumnarBatch to be 
extendable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No test needed.

Closes #34054 from flyrain/columnarbatch-extendable.

Authored-by: Yufei Gu 
Signed-off-by: DB Tsai 
---
 .../src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java 
b/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java
index a2feac8..b5c3ed7 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java
@@ -31,7 +31,7 @@ import org.apache.spark.unsafe.types.UTF8String;
  * the entire data loading process.
  */
 @Evolving
-public final class ColumnarBatch implements AutoCloseable {
+public class ColumnarBatch implements AutoCloseable {
   private int numRows;
   private final ColumnVector[] columns;
 

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-36807][SQL] Merge ANSI interval types to a tightest common type

2021-09-21 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 7fa88b2  [SPARK-36807][SQL] Merge ANSI interval types to a tightest 
common type
7fa88b2 is described below

commit 7fa88b28a56a5be4fa78dbb690c9c72e8d856b56
Author: Max Gekk 
AuthorDate: Tue Sep 21 10:20:16 2021 +0300

[SPARK-36807][SQL] Merge ANSI interval types to a tightest common type

### What changes were proposed in this pull request?
In the PR, I propose to modify `StructType` to support merging of ANSI 
interval types with different fields.
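For illustration, the "tightest common type" rule amounts to taking the smaller start field and the larger end field (a plain-Python sketch, not Spark code; the field ordinals mirror Spark's YEAR < MONTH and DAY < HOUR < MINUTE < SECOND ordering):
```py
YEAR, MONTH = 0, 1                      # year-month interval fields
DAY, HOUR, MINUTE, SECOND = 0, 1, 2, 3  # day-time interval fields

def merge_interval(left, right):
    """left/right are (start_field, end_field) pairs from the same interval family."""
    return (min(left[0], right[0]), max(left[1], right[1]))

# (HOUR, SECOND) merged with (DAY, MINUTE) widens to (DAY, SECOND),
# matching the expectations in the new StructTypeSuite test.
assert merge_interval((HOUR, SECOND), (DAY, MINUTE)) == (DAY, SECOND)
```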

### Why are the changes needed?
This will allow merging of schemas from different datasource files.

### Does this PR introduce _any_ user-facing change?
No, the ANSI interval types haven't been released yet.

### How was this patch tested?
Added new test to `StructTypeSuite`.

Closes #34049 from MaxGekk/merge-ansi-interval-types.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
(cherry picked from commit d2340f8e1c342354e1a67d468b35e86e3496ccf9)
Signed-off-by: Max Gekk 
---
 .../org/apache/spark/sql/types/StructType.scala|  6 ++
 .../apache/spark/sql/types/StructTypeSuite.scala   | 23 ++
 2 files changed, 29 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
index 83ee191..c9862cb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
@@ -653,6 +653,12 @@ object StructType extends AbstractDataType {
       case (leftUdt: UserDefinedType[_], rightUdt: UserDefinedType[_])
         if leftUdt.userClass == rightUdt.userClass => leftUdt
 
+      case (YearMonthIntervalType(lstart, lend), YearMonthIntervalType(rstart, rend)) =>
+        YearMonthIntervalType(Math.min(lstart, rstart).toByte, Math.max(lend, rend).toByte)
+
+      case (DayTimeIntervalType(lstart, lend), DayTimeIntervalType(rstart, rend)) =>
+        DayTimeIntervalType(Math.min(lstart, rstart).toByte, Math.max(lend, rend).toByte)
+
       case (leftType, rightType) if leftType == rightType =>
         leftType
 
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala
index 8db3831..8cc04c7 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala
@@ -25,7 +25,9 @@ import org.apache.spark.sql.catalyst.plans.SQLHelper
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.{DayTimeIntervalType => DT}
 import org.apache.spark.sql.types.{YearMonthIntervalType => YM}
+import org.apache.spark.sql.types.DayTimeIntervalType._
 import org.apache.spark.sql.types.StructType.fromDDL
+import org.apache.spark.sql.types.YearMonthIntervalType._
 
 class StructTypeSuite extends SparkFunSuite with SQLHelper {
 
@@ -382,4 +384,25 @@ class StructTypeSuite extends SparkFunSuite with SQLHelper {
     assert(e.getMessage.contains(
       "Field name a2.element.C.name is invalid: a2.element.c is not a struct"))
   }
+
+  test("SPARK-36807: Merge ANSI interval types to a tightest common type") {
+    Seq(
+      (YM(YEAR), YM(YEAR)) -> YM(YEAR),
+      (YM(YEAR), YM(MONTH)) -> YM(YEAR, MONTH),
+      (YM(MONTH), YM(MONTH)) -> YM(MONTH),
+      (YM(YEAR, MONTH), YM(YEAR)) -> YM(YEAR, MONTH),
+      (YM(YEAR, MONTH), YM(YEAR, MONTH)) -> YM(YEAR, MONTH),
+      (DT(DAY), DT(DAY)) -> DT(DAY),
+      (DT(SECOND), DT(SECOND)) -> DT(SECOND),
+      (DT(DAY), DT(SECOND)) -> DT(DAY, SECOND),
+      (DT(HOUR, SECOND), DT(DAY, MINUTE)) -> DT(DAY, SECOND),
+      (DT(HOUR, MINUTE), DT(DAY, SECOND)) -> DT(DAY, SECOND)
+    ).foreach { case ((i1, i2), expected) =>
+      val st1 = new StructType().add("interval", i1)
+      val st2 = new StructType().add("interval", i2)
+      val expectedStruct = new StructType().add("interval", expected)
+      assert(st1.merge(st2) === expectedStruct)
+      assert(st2.merge(st1) === expectedStruct)
+    }
+  }
 }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (cc182fe -> d2340f8)

2021-09-21 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from cc182fe  [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has 
NaN value
 add d2340f8  [SPARK-36807][SQL] Merge ANSI interval types to a tightest 
common type

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/types/StructType.scala|  6 ++
 .../apache/spark/sql/types/StructTypeSuite.scala   | 23 ++
 2 files changed, 29 insertions(+)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org