[spark] branch branch-3.2 updated (affd7a4 -> 4543ac6)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from affd7a4  [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency
     add 4543ac6  [SPARK-36771][PYTHON][3.2] Fix `pop` of Categorical Series

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/series.py            |  8 ++--
 python/pyspark/pandas/tests/test_series.py | 25 +
 2 files changed, 31 insertions(+), 2 deletions(-)
[spark] branch master updated: [SPARK-36771][PYTHON] Fix `pop` of Categorical Series
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 079a9c5  [SPARK-36771][PYTHON] Fix `pop` of Categorical Series

079a9c5 is described below

commit 079a9c52925818532b57c9cec1ddd31be723885e
Author: Xinrong Meng
AuthorDate: Tue Sep 21 14:11:21 2021 -0700

    [SPARK-36771][PYTHON] Fix `pop` of Categorical Series

    ### What changes were proposed in this pull request?
    Fix `pop` of Categorical Series to be consistent with the latest pandas (1.3.2) behavior.

    ### Why are the changes needed?
    As https://github.com/databricks/koalas/issues/2198, pandas API on Spark behaves differently from pandas on `pop` of Categorical Series.

    ### Does this PR introduce _any_ user-facing change?
    Yes, results of `pop` of Categorical Series change.

    From
    ```py
    >>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
    >>> psser
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    >>> psser.pop(0)
    0
    >>> psser
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    >>> psser.pop(3)
    0
    >>> psser
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    ```

    To
    ```py
    >>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
    >>> psser
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    >>> psser.pop(0)
    'a'
    >>> psser
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    >>> psser.pop(3)
    'a'
    >>> psser
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    ```

    ### How was this patch tested?
    Unit tests.

    Closes #34052 from xinrong-databricks/cat_pop.

    Authored-by: Xinrong Meng
    Signed-off-by: Takuya UESHIN
---
 python/pyspark/pandas/series.py            |  8 ++--
 python/pyspark/pandas/tests/test_series.py | 25 +
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index d72c08d..da0d2fb 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -47,7 +47,7 @@ import numpy as np
 import pandas as pd
 from pandas.core.accessor import CachedAccessor
 from pandas.io.formats.printing import pprint_thing
-from pandas.api.types import is_list_like, is_hashable
+from pandas.api.types import is_list_like, is_hashable, CategoricalDtype
 from pandas.tseries.frequencies import DateOffset
 from pyspark.sql import functions as F, Column, DataFrame as SparkDataFrame
 from pyspark.sql.types import (
@@ -4098,7 +4098,11 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
         pdf = sdf.limit(2).toPandas()
         length = len(pdf)
         if length == 1:
-            return pdf[internal.data_spark_column_names[0]].iloc[0]
+            val = pdf[internal.data_spark_column_names[0]].iloc[0]
+            if isinstance(self.dtype, CategoricalDtype):
+                return self.dtype.categories[val]
+            else:
+                return val

         item_string = name_like_string(item)
         sdf = sdf.withColumn(SPARK_DEFAULT_INDEX_NAME, SF.lit(str(item_string)))
diff --git a/python/pyspark/pandas/tests/test_series.py b/python/pyspark/pandas/tests/test_series.py
index 09e5d30..b7bb121 100644
--- a/python/pyspark/pandas/tests/test_series.py
+++ b/python/pyspark/pandas/tests/test_series.py
@@ -1669,6 +1669,31 @@ class SeriesTest(PandasOnSparkTestCase, SQLTestUtils):
         with self.assertRaisesRegex(KeyError, msg):
             psser.pop(("lama", "speed", "x"))

+        pser = pd.Series(["a", "b", "c", "a"], dtype="category")
+        psser = ps.from_pandas(pser)
+
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3.0"):
+            self.assert_eq(psser.pop(0), pser.pop(0))
+            self.assert_eq(psser, pser)
+
+            self.assert_eq(psser.pop(3), pser.pop(3))
+            self.assert_eq(psser, pser)
+        else:
+            # Before pandas 1.3.0, `pop` modifies the dtype of categorical series wrongly.
+            self.assert_eq(psser.pop(0), "a")
+            self.assert_eq(
+                psser,
+                pd.Series(
+                    pd.Categorical(["b", "c", "a"], categories=["a", "b", "c"]), index=[1, 2, 3]
+                ),
+            )
+
+            self.assert_eq(psser.pop(3), "a")
+            self.assert_eq(
+                psser,
+                pd.Series(pd.Categorical(["b", "c"], categories=["a", "b", "c"]), index=[1, 2]),
+            )
+
     def test_replace(self):
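
The essence of the fix above is that pandas-on-Spark stores a categorical column as its integer category codes, so a single-row `pop` surfaced the raw code instead of the label. A minimal plain-pandas sketch of the mapping the patch adds (only the `dtype.categories[...]` lookup comes from the diff; the rest is illustration):

```py
import pandas as pd

pser = pd.Series(["a", "b", "c", "a"], dtype="category")

# The categorical column is physically stored as integer codes, so popping
# a single row used to surface the code (0) rather than the label ("a").
code = pser.cat.codes.iloc[0]

# The fix maps the code back through the CategoricalDtype's categories,
# which is what pandas' own Series.pop returns.
label = pser.dtype.categories[code]
assert label == "a"
```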
[spark] branch master updated: [SPARK-36615][CORE] Register shutdown hook earlier when start SC
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b7d99e3  [SPARK-36615][CORE] Register shutdown hook earlier when start SC

b7d99e3 is described below

commit b7d99e3eea5f9c0b3d11ec578d6aa0720c256eeb
Author: Angerszh
AuthorDate: Tue Sep 21 13:23:14 2021 -0500

    [SPARK-36615][CORE] Register shutdown hook earlier when start SC

    ### What changes were proposed in this pull request?
    Users often press Ctrl+C to stop a SparkContext that is still starting, for example while registering with YARN in client mode when resources are tight. At that point the shutdown hook has not been registered yet, so `sc.stop()` is never invoked when the application exits. This PR registers the shutdown hook earlier when starting a SparkContext.

    ### Why are the changes needed?
    Make sure `sc.stop()` is invoked when a starting SparkContext application is killed.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Not needed

    Closes #33869 from AngersZh/SPARK-36615.

    Authored-by: Angerszh
    Signed-off-by: Mridul Muralidharan
---
 .../main/scala/org/apache/spark/SparkContext.scala | 29 +++---
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 3404a0f..e27499a15 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -645,20 +645,6 @@ class SparkContext(config: SparkConf) extends Logging {
     // Attach the driver metrics servlet handler to the web ui after the metrics system is started.
     _env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))

-    // Post init
-    _taskScheduler.postStartHook()
-    if (isLocal) {
-      _env.metricsSystem.registerSource(Executor.executorSourceLocalModeOnly)
-    }
-    _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
-    _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
-    _env.metricsSystem.registerSource(new JVMCPUSource())
-    _executorMetricsSource.foreach(_.register(_env.metricsSystem))
-    _executorAllocationManager.foreach { e =>
-      _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
-    }
-    appStatusSource.foreach(_env.metricsSystem.registerSource(_))
-    _plugins.foreach(_.registerMetrics(applicationId))
     // Make sure the context is stopped if the user forgets about it. This avoids leaving
     // unfinished event logs around after the JVM exits cleanly. It doesn't help if the JVM
     // is killed, though.
@@ -673,6 +659,21 @@ class SparkContext(config: SparkConf) extends Logging {
           logWarning("Ignoring Exception while stopping SparkContext from shutdown hook", e)
       }
     }
+
+    // Post init
+    _taskScheduler.postStartHook()
+    if (isLocal) {
+      _env.metricsSystem.registerSource(Executor.executorSourceLocalModeOnly)
+    }
+    _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
+    _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
+    _env.metricsSystem.registerSource(new JVMCPUSource())
+    _executorMetricsSource.foreach(_.register(_env.metricsSystem))
+    _executorAllocationManager.foreach { e =>
+      _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
+    }
+    appStatusSource.foreach(_env.metricsSystem.registerSource(_))
+    _plugins.foreach(_.registerMetrics(applicationId))
   } catch {
     case NonFatal(e) =>
       logError("Error initializing SparkContext.", e)
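
The ordering change matters because a shutdown hook only helps if it is installed before the step a user is likely to interrupt. A hedged Python analogy of the same pattern, with `atexit` standing in for Spark's `ShutdownHookManager` and `slow_post_init` a hypothetical stand-in for registering with a resource manager:

```py
import atexit
import time

class Context:
    def __init__(self) -> None:
        self.stopped = False
        # Register the cleanup hook FIRST: if the user hits Ctrl+C during
        # the slow startup below, stop() still runs on interpreter exit.
        atexit.register(self.stop)
        self.slow_post_init()

    def slow_post_init(self) -> None:
        time.sleep(5)  # stand-in for waiting on cluster resources

    def stop(self) -> None:
        if not self.stopped:
            self.stopped = True
            print("context stopped cleanly")

ctx = Context()
```

Before the patch, the equivalent of `atexit.register` ran after the post-init work, so an interrupt during startup skipped cleanup entirely.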
[spark] branch branch-3.2 updated: [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new affd7a4  [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency

affd7a4 is described below

commit affd7a4d47576be863c13df2c6e8068fcea3ba7c
Author: Gengliang Wang
AuthorDate: Tue Sep 21 10:57:20 2021 -0700

    [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency

    ### What changes were proposed in this pull request?
    Remove the `com.github.rdblue:brotli-codec:0.1.1` dependency.

    ### Why are the changes needed?
    As Stephen Coy pointed out in the dev list, we should not have the `com.github.rdblue:brotli-codec:0.1.1` dependency, which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`. Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    GA tests.

    Closes #34059 from gengliangwang/removeDeps.

    Authored-by: Gengliang Wang
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit ba5708d944c5e38db750ad480668e524672ee963)
    Signed-off-by: Dongjoon Hyun
---
 pom.xml                                                         | 6 --
 project/SparkBuild.scala                                        | 4 +---
 sql/core/pom.xml                                                | 6 --
 .../spark/sql/execution/datasources/FileSourceCodecSuite.scala  | 9 +++--
 4 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/pom.xml b/pom.xml
index 5adbe8a..4a3bd71 100644
--- a/pom.xml
+++ b/pom.xml
@@ -300,12 +300,6 @@
         <enabled>false</enabled>
       </snapshots>
     </repository>
-    <repository>
-      <id>jitpack.io</id>
-      <url>https://jitpack.io</url>
-      <name>Jitpack.io repository</name>
-    </repository>
   </repositories>
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index b9068cc..b1531a6 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -274,9 +274,7 @@ object SparkBuild extends PomBuild {
       "gcs-maven-central-mirror" at "https://maven-central.storage-download.googleapis.com/maven2/",
       DefaultMavenRepository,
       Resolver.mavenLocal,
-      Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns),
-      // needed for brotli-codec
-      "jitpack.io" at "https://jitpack.io"
+      Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns)
     ),
     externalResolvers := resolvers.value,
     otherResolvers := SbtPomKeys.mvnLocalRepository(dotM2 => Seq(Resolver.file("dotM2", dotM2))).value,
diff --git a/sql/core/pom.xml b/sql/core/pom.xml
index 3cdd9a4..e023377 100644
--- a/sql/core/pom.xml
+++ b/sql/core/pom.xml
@@ -184,12 +184,6 @@
       <artifactId>htmlunit-driver</artifactId>
       <scope>test</scope>
     </dependency>
-    <dependency>
-      <groupId>com.github.rdblue</groupId>
-      <artifactId>brotli-codec</artifactId>
-      <version>0.1.1</version>
-      <scope>test</scope>
-    </dependency>
   </dependencies>
   <build>
     <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
index 3c226d6..ac1fd1c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
@@ -56,13 +56,10 @@ class ParquetCodecSuite extends FileSourceCodecSuite {
   override def format: String = "parquet"
   override val codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
   // Exclude "lzo" because it is GPL-licenced so not included in Hadoop.
+  // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
+  // on Maven Central.
   override protected def availableCodecs: Seq[String] =
-    if (System.getProperty("os.arch") == "aarch64") {
-      // Exclude "brotli" due to PARQUET-1975.
-      Seq("none", "uncompressed", "snappy", "lz4", "gzip", "zstd")
-    } else {
-      Seq("none", "uncompressed", "snappy", "lz4", "gzip", "brotli", "zstd")
-    }
+    Seq("none", "uncompressed", "snappy", "lz4", "gzip", "zstd")
 }

 class OrcCodecSuite extends FileSourceCodecSuite {
[spark] branch master updated: [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ba5708d  [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency

ba5708d is described below

commit ba5708d944c5e38db750ad480668e524672ee963
Author: Gengliang Wang
AuthorDate: Tue Sep 21 10:57:20 2021 -0700

    [SPARK-36670][FOLLOWUP][TEST] Remove brotli-codec dependency

    ### What changes were proposed in this pull request?
    Remove the `com.github.rdblue:brotli-codec:0.1.1` dependency.

    ### Why are the changes needed?
    As Stephen Coy pointed out in the dev list, we should not have the `com.github.rdblue:brotli-codec:0.1.1` dependency, which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`. Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    GA tests.

    Closes #34059 from gengliangwang/removeDeps.

    Authored-by: Gengliang Wang
    Signed-off-by: Dongjoon Hyun
---
 pom.xml                                                         | 6 --
 project/SparkBuild.scala                                        | 4 +---
 sql/core/pom.xml                                                | 6 --
 .../spark/sql/execution/datasources/FileSourceCodecSuite.scala  | 9 +++--
 4 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/pom.xml b/pom.xml
index b99f6af..a849b74 100644
--- a/pom.xml
+++ b/pom.xml
@@ -300,12 +300,6 @@
         <enabled>false</enabled>
       </snapshots>
     </repository>
-    <repository>
-      <id>jitpack.io</id>
-      <url>https://jitpack.io</url>
-      <name>Jitpack.io repository</name>
-    </repository>
   </repositories>
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index b9068cc..b1531a6 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -274,9 +274,7 @@ object SparkBuild extends PomBuild {
       "gcs-maven-central-mirror" at "https://maven-central.storage-download.googleapis.com/maven2/",
       DefaultMavenRepository,
       Resolver.mavenLocal,
-      Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns),
-      // needed for brotli-codec
-      "jitpack.io" at "https://jitpack.io"
+      Resolver.file("ivyLocal", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns)
    ),
     externalResolvers := resolvers.value,
     otherResolvers := SbtPomKeys.mvnLocalRepository(dotM2 => Seq(Resolver.file("dotM2", dotM2))).value,
diff --git a/sql/core/pom.xml b/sql/core/pom.xml
index f9c12d1..42a3d5e 100644
--- a/sql/core/pom.xml
+++ b/sql/core/pom.xml
@@ -184,12 +184,6 @@
       <artifactId>htmlunit-driver</artifactId>
       <scope>test</scope>
     </dependency>
-    <dependency>
-      <groupId>com.github.rdblue</groupId>
-      <artifactId>brotli-codec</artifactId>
-      <version>0.1.1</version>
-      <scope>test</scope>
-    </dependency>
   </dependencies>
   <build>
     <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
index 3c226d6..ac1fd1c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
@@ -56,13 +56,10 @@ class ParquetCodecSuite extends FileSourceCodecSuite {
   override def format: String = "parquet"
   override val codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
   // Exclude "lzo" because it is GPL-licenced so not included in Hadoop.
+  // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
+  // on Maven Central.
   override protected def availableCodecs: Seq[String] =
-    if (System.getProperty("os.arch") == "aarch64") {
-      // Exclude "brotli" due to PARQUET-1975.
-      Seq("none", "uncompressed", "snappy", "lz4", "gzip", "zstd")
-    } else {
-      Seq("none", "uncompressed", "snappy", "lz4", "gzip", "brotli", "zstd")
-    }
+    Seq("none", "uncompressed", "snappy", "lz4", "gzip", "zstd")
 }

 class OrcCodecSuite extends FileSourceCodecSuite {
[spark] branch master updated: [SPARK-36769][PYTHON] Improve `filter` of single-indexed DataFrame
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 33e463c  [SPARK-36769][PYTHON] Improve `filter` of single-indexed DataFrame

33e463c is described below

commit 33e463ccf99d09ad8a743d32104f590e204da93d
Author: Xinrong Meng
AuthorDate: Tue Sep 21 10:20:15 2021 -0700

    [SPARK-36769][PYTHON] Improve `filter` of single-indexed DataFrame

    ### What changes were proposed in this pull request?
    Improve `filter` of single-indexed DataFrame by replacing a long Project with Filter or Join.

    ### Why are the changes needed?
    When the given `items` have too many elements, a long Project is introduced. We may replace that with `Column.isin` or with a join, depending on the length of `items`, for better performance.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Unit tests.

    Closes #33998 from xinrong-databricks/impr_filter.

    Authored-by: Xinrong Meng
    Signed-off-by: Takuya UESHIN
---
 python/pyspark/pandas/frame.py                | 28 +++
 python/pyspark/pandas/tests/test_dataframe.py |  7 +++
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 09efef2..cba1db1 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -9995,13 +9995,25 @@ defaultdict(<class 'list'>, {'col..., 'col...})]
                 raise ValueError("items should be a list-like object.")
             if axis == 0:
                 if len(index_scols) == 1:
-                    col = None
-                    for item in items:
-                        if col is None:
-                            col = index_scols[0] == SF.lit(item)
-                        else:
-                            col = col | (index_scols[0] == SF.lit(item))
-                elif len(index_scols) > 1:
+                    if len(items) <= ps.get_option("compute.isin_limit"):
+                        col = index_scols[0].isin([SF.lit(item) for item in items])
+                        return DataFrame(self._internal.with_filter(col))
+                    else:
+                        item_sdf_col = verify_temp_column_name(
+                            self._internal.spark_frame, "__item__"
+                        )
+                        item_sdf = default_session().createDataFrame(
+                            pd.DataFrame({item_sdf_col: items})
+                        )
+                        joined_sdf = self._internal.spark_frame.join(
+                            other=F.broadcast(item_sdf),
+                            on=(index_scols[0] == scol_for(item_sdf, item_sdf_col)),
+                            how="semi",
+                        )
+
+                        return DataFrame(self._internal.with_new_sdf(joined_sdf))
+
+                else:
                     # for multi-index
                     col = None
                     for item in items:
@@ -10019,7 +10031,7 @@ defaultdict(<class 'list'>, {'col..., 'col...})]
                                 col = midx_col
                             else:
                                 col = col | midx_col
-            return DataFrame(self._internal.with_filter(col))
+                    return DataFrame(self._internal.with_filter(col))
             else:
                 return self[items]
         elif like is not None:
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 20aecc2..3cfbc03 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -4313,6 +4313,13 @@ class DataFrameTest(PandasOnSparkTestCase, SQLTestUtils):
             psdf.filter(items=["ab", "aa"], axis=0).sort_index(),
             pdf.filter(items=["ab", "aa"], axis=0).sort_index(),
         )
+
+        with option_context("compute.isin_limit", 0):
+            self.assert_eq(
+                psdf.filter(items=["ab", "aa"], axis=0).sort_index(),
+                pdf.filter(items=["ab", "aa"], axis=0).sort_index(),
+            )
+
         self.assert_eq(
             psdf.filter(items=["ba", "db"], axis=1).sort_index(),
             pdf.filter(items=["ba", "db"], axis=1).sort_index(),
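
The change is a size-based switch: a short `items` list becomes a single `isin` predicate, while a long list becomes a broadcast semi join against a one-column frame of keys, avoiding a huge OR-chain in the query plan. A rough pandas-only sketch of the same switch (the `ISIN_LIMIT` constant is illustrative; Spark reads the real threshold from the `compute.isin_limit` option):

```py
import pandas as pd

ISIN_LIMIT = 1000  # illustrative threshold; not Spark's actual default

def filter_by_index(df: pd.DataFrame, items: list) -> pd.DataFrame:
    if len(items) <= ISIN_LIMIT:
        # Few items: a plain membership predicate is cheapest.
        return df[df.index.isin(items)]
    # Many items: join against a keys frame instead of building a long
    # OR-chain; deduplicate keys to keep semi-join semantics.
    keys = pd.DataFrame(index=pd.Index(items).unique())
    return df.join(keys, how="inner")

df = pd.DataFrame({"x": [1, 2, 3]}, index=["aa", "ab", "ba"])
print(filter_by_index(df, ["ab", "aa"]))  # keeps rows "aa" and "ab"
```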
[spark] branch master updated: [SPARK-36814][SQL] Make class ColumnarBatch extendable
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 688b95b  [SPARK-36814][SQL] Make class ColumnarBatch extendable

688b95b is described below

commit 688b95b136571fa559f26e6582fc3fc296f9d1bf
Author: Yufei Gu
AuthorDate: Tue Sep 21 15:24:55 2021 +0000

    [SPARK-36814][SQL] Make class ColumnarBatch extendable

    ### What changes were proposed in this pull request?
    Change class ColumnarBatch to a non-final class.

    ### Why are the changes needed?
    To support better vectorized reading in multiple data sources, ColumnarBatch needs to be extendable. For example, to support row-level delete (https://github.com/apache/iceberg/issues/3141) in Iceberg's vectorized read, we need to filter out deleted rows in a batch, which requires ColumnarBatch to be extendable.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    No test needed.

    Closes #34054 from flyrain/columnarbatch-extendable.

    Authored-by: Yufei Gu
    Signed-off-by: DB Tsai
---
 .../src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java b/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java
index a2feac8..b5c3ed7 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java
@@ -31,7 +31,7 @@ import org.apache.spark.unsafe.types.UTF8String;
  * the entire data loading process.
  */
 @Evolving
-public final class ColumnarBatch implements AutoCloseable {
+public class ColumnarBatch implements AutoCloseable {
   private int numRows;
   private final ColumnVector[] columns;
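
Dropping `final` exists purely to allow subclasses such as the Iceberg reader described above, which needs to present a batch minus its deleted rows. A hypothetical Python sketch of that shape (`ColumnarBatch` here is a stand-in class, not Spark's Java one):

```py
class ColumnarBatch:
    """Stand-in for a batch of column vectors (not Spark's Java class)."""

    def __init__(self, columns, num_rows):
        self.columns = columns
        self.num_rows = num_rows

    def row_iterator(self):
        for i in range(self.num_rows):
            yield tuple(col[i] for col in self.columns)

class DeleteFilteredBatch(ColumnarBatch):
    """What removing `final` enables: same contract, deleted rows hidden."""

    def __init__(self, columns, num_rows, deleted):
        super().__init__(columns, num_rows)
        self.deleted = set(deleted)

    def row_iterator(self):
        # Skip row positions marked as deleted (e.g. by Iceberg delete files).
        for i in range(self.num_rows):
            if i not in self.deleted:
                yield tuple(col[i] for col in self.columns)

batch = DeleteFilteredBatch([[10, 20, 30], ["a", "b", "c"]], 3, deleted={1})
print(list(batch.row_iterator()))  # [(10, 'a'), (30, 'c')]
```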
[spark] branch branch-3.2 updated: [SPARK-36807][SQL] Merge ANSI interval types to a tightest common type
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 7fa88b2  [SPARK-36807][SQL] Merge ANSI interval types to a tightest common type

7fa88b2 is described below

commit 7fa88b28a56a5be4fa78dbb690c9c72e8d856b56
Author: Max Gekk
AuthorDate: Tue Sep 21 10:20:16 2021 +0300

    [SPARK-36807][SQL] Merge ANSI interval types to a tightest common type

    ### What changes were proposed in this pull request?
    In the PR, I propose to modify `StructType` to support merging of ANSI interval types with different fields.

    ### Why are the changes needed?
    This will allow merging of schemas from different datasource files.

    ### Does this PR introduce _any_ user-facing change?
    No, the ANSI interval types haven't been released yet.

    ### How was this patch tested?
    Added new test to `StructTypeSuite`.

    Closes #34049 from MaxGekk/merge-ansi-interval-types.

    Authored-by: Max Gekk
    Signed-off-by: Max Gekk
    (cherry picked from commit d2340f8e1c342354e1a67d468b35e86e3496ccf9)
    Signed-off-by: Max Gekk
---
 .../org/apache/spark/sql/types/StructType.scala    |  6 ++
 .../apache/spark/sql/types/StructTypeSuite.scala   | 23 ++
 2 files changed, 29 insertions(+)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
index 83ee191..c9862cb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
@@ -653,6 +653,12 @@ object StructType extends AbstractDataType {
       case (leftUdt: UserDefinedType[_], rightUdt: UserDefinedType[_])
         if leftUdt.userClass == rightUdt.userClass => leftUdt

+      case (YearMonthIntervalType(lstart, lend), YearMonthIntervalType(rstart, rend)) =>
+        YearMonthIntervalType(Math.min(lstart, rstart).toByte, Math.max(lend, rend).toByte)
+
+      case (DayTimeIntervalType(lstart, lend), DayTimeIntervalType(rstart, rend)) =>
+        DayTimeIntervalType(Math.min(lstart, rstart).toByte, Math.max(lend, rend).toByte)
+
       case (leftType, rightType) if leftType == rightType =>
         leftType
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala
index 8db3831..8cc04c7 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala
@@ -25,7 +25,9 @@ import org.apache.spark.sql.catalyst.plans.SQLHelper
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.{DayTimeIntervalType => DT}
 import org.apache.spark.sql.types.{YearMonthIntervalType => YM}
+import org.apache.spark.sql.types.DayTimeIntervalType._
 import org.apache.spark.sql.types.StructType.fromDDL
+import org.apache.spark.sql.types.YearMonthIntervalType._

 class StructTypeSuite extends SparkFunSuite with SQLHelper {

@@ -382,4 +384,25 @@ class StructTypeSuite extends SparkFunSuite with SQLHelper {
     assert(e.getMessage.contains(
       "Field name a2.element.C.name is invalid: a2.element.c is not a struct"))
   }
+
+  test("SPARK-36807: Merge ANSI interval types to a tightest common type") {
+    Seq(
+      (YM(YEAR), YM(YEAR)) -> YM(YEAR),
+      (YM(YEAR), YM(MONTH)) -> YM(YEAR, MONTH),
+      (YM(MONTH), YM(MONTH)) -> YM(MONTH),
+      (YM(YEAR, MONTH), YM(YEAR)) -> YM(YEAR, MONTH),
+      (YM(YEAR, MONTH), YM(YEAR, MONTH)) -> YM(YEAR, MONTH),
+      (DT(DAY), DT(DAY)) -> DT(DAY),
+      (DT(SECOND), DT(SECOND)) -> DT(SECOND),
+      (DT(DAY), DT(SECOND)) -> DT(DAY, SECOND),
+      (DT(HOUR, SECOND), DT(DAY, MINUTE)) -> DT(DAY, SECOND),
+      (DT(HOUR, MINUTE), DT(DAY, SECOND)) -> DT(DAY, SECOND)
+    ).foreach { case ((i1, i2), expected) =>
+      val st1 = new StructType().add("interval", i1)
+      val st2 = new StructType().add("interval", i2)
+      val expectedStruct = new StructType().add("interval", expected)
+      assert(st1.merge(st2) === expectedStruct)
+      assert(st2.merge(st1) === expectedStruct)
+    }
+  }
 }
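
The merge rule added above is an interval union over ordered field positions: take the smaller start field and the larger end field of the two types. A small Python sketch of that rule, with illustrative orderings in place of Spark's internal byte codes:

```py
# Field positions mirror YearMonthIntervalType (YEAR < MONTH) and
# DayTimeIntervalType (DAY < HOUR < MINUTE < SECOND) in Spark.
YEAR, MONTH = 0, 1
DAY, HOUR, MINUTE, SECOND = 0, 1, 2, 3

def merge_interval_types(left, right):
    """Tightest common type of two (start, end) interval field ranges."""
    (lstart, lend), (rstart, rend) = left, right
    return (min(lstart, rstart), max(lend, rend))

# INTERVAL YEAR merged with INTERVAL MONTH widens to INTERVAL YEAR TO MONTH.
assert merge_interval_types((YEAR, YEAR), (MONTH, MONTH)) == (YEAR, MONTH)
# INTERVAL HOUR TO SECOND with INTERVAL DAY TO MINUTE gives DAY TO SECOND.
assert merge_interval_types((HOUR, SECOND), (DAY, MINUTE)) == (DAY, SECOND)
```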
[spark] branch master updated (cc182fe -> d2340f8)
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from cc182fe  [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value
     add d2340f8  [SPARK-36807][SQL] Merge ANSI interval types to a tightest common type

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/types/StructType.scala    |  6 ++
 .../apache/spark/sql/types/StructTypeSuite.scala   | 23 ++
 2 files changed, 29 insertions(+)