[jira] [Assigned] (SPARK-42309) Assign name to _LEGACY_ERROR_TEMP_1204
[ https://issues.apache.org/jira/browse/SPARK-42309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42309: Assignee: Apache Spark > Assign name to _LEGACY_ERROR_TEMP_1204 > -- > > Key: SPARK-42309 > URL: https://issues.apache.org/jira/browse/SPARK-42309 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42309) Assign name to _LEGACY_ERROR_TEMP_1204
[ https://issues.apache.org/jira/browse/SPARK-42309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42309: Assignee: (was: Apache Spark) > Assign name to _LEGACY_ERROR_TEMP_1204 > -- > > Key: SPARK-42309 > URL: https://issues.apache.org/jira/browse/SPARK-42309 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42309) Assign name to _LEGACY_ERROR_TEMP_1204
[ https://issues.apache.org/jira/browse/SPARK-42309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685723#comment-17685723 ] Apache Spark commented on SPARK-42309: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/39937 > Assign name to _LEGACY_ERROR_TEMP_1204 > -- > > Key: SPARK-42309 > URL: https://issues.apache.org/jira/browse/SPARK-42309 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42267) Support left_outer join
[ https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42267: Assignee: (was: Apache Spark) > Support left_outer join > --- > > Key: SPARK-42267 > URL: https://issues.apache.org/jira/browse/SPARK-42267 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > ``` > >>> df = spark.range(1) > >>> df2 = spark.range(2) > >>> df.join(df2, how="left_outer") > Traceback (most recent call last): > File "", line 1, in > File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", > line 438, in join > plan.Join(left=self._plan, right=other._plan, on=on, how=how), > File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line > 730, in __init__ > raise NotImplementedError( > NotImplementedError: > Unsupported join type: left_outer. Supported join types > include: > "inner", "outer", "full", "fullouter", "full_outer", > "leftouter", "left", "left_outer", "rightouter", > "right", "right_outer", "leftsemi", "left_semi", > "semi", "leftanti", "left_anti", "anti", "cross", > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42267) Support left_outer join
[ https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685722#comment-17685722 ] Apache Spark commented on SPARK-42267: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39938 > Support left_outer join > --- > > Key: SPARK-42267 > URL: https://issues.apache.org/jira/browse/SPARK-42267 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > ``` > >>> df = spark.range(1) > >>> df2 = spark.range(2) > >>> df.join(df2, how="left_outer") > Traceback (most recent call last): > File "", line 1, in > File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", > line 438, in join > plan.Join(left=self._plan, right=other._plan, on=on, how=how), > File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line > 730, in __init__ > raise NotImplementedError( > NotImplementedError: > Unsupported join type: left_outer. Supported join types > include: > "inner", "outer", "full", "fullouter", "full_outer", > "leftouter", "left", "left_outer", "rightouter", > "right", "right_outer", "leftsemi", "left_semi", > "semi", "leftanti", "left_anti", "anti", "cross", > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42267) Support left_outer join
[ https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42267: Assignee: Apache Spark > Support left_outer join > --- > > Key: SPARK-42267 > URL: https://issues.apache.org/jira/browse/SPARK-42267 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > ``` > >>> df = spark.range(1) > >>> df2 = spark.range(2) > >>> df.join(df2, how="left_outer") > Traceback (most recent call last): > File "", line 1, in > File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", > line 438, in join > plan.Join(left=self._plan, right=other._plan, on=on, how=how), > File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line > 730, in __init__ > raise NotImplementedError( > NotImplementedError: > Unsupported join type: left_outer. Supported join types > include: > "inner", "outer", "full", "fullouter", "full_outer", > "leftouter", "left", "left_outer", "rightouter", > "right", "right_outer", "leftsemi", "left_semi", > "semi", "leftanti", "left_anti", "anti", "cross", > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
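[Editor's note] The NotImplementedError quoted above comes from the Connect client's mapping of join-type strings to plan enums; supporting `left_outer` amounts to accepting that alias in the mapping. A minimal sketch of such alias normalization in plain Python (the names below are illustrative, not the actual `pyspark.sql.connect.plan` internals):

```python
# Map every user-facing join-type string from the error message above to
# one canonical join type. Aliases differing only by underscores (e.g.
# "leftouter" vs "left_outer") resolve to the same canonical name.
_JOIN_ALIASES = {
    "inner": "inner",
    "cross": "cross",
    "outer": "full_outer", "full": "full_outer",
    "fullouter": "full_outer", "full_outer": "full_outer",
    "left": "left_outer", "leftouter": "left_outer", "left_outer": "left_outer",
    "right": "right_outer", "rightouter": "right_outer", "right_outer": "right_outer",
    "semi": "left_semi", "leftsemi": "left_semi", "left_semi": "left_semi",
    "anti": "left_anti", "leftanti": "left_anti", "left_anti": "left_anti",
}

def normalize_join_type(how: str) -> str:
    """Return the canonical join type for a user-supplied string."""
    try:
        return _JOIN_ALIASES[how.lower()]
    except KeyError:
        raise NotImplementedError(f"Unsupported join type: {how}") from None
```

With a table like this, `df.join(df2, how="left_outer")` resolves to the same plan as `how="leftouter"` instead of raising.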
[jira] [Assigned] (SPARK-42024) createDataFrame should coerce types of string float to float
[ https://issues.apache.org/jira/browse/SPARK-42024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42024: - Assignee: Ruifeng Zheng > createDataFrame should corse types of string float to float > --- > > Key: SPARK-42024 > URL: https://issues.apache.org/jira/browse/SPARK-42024 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > > {code} > pyspark/sql/tests/test_types.py:245 > (TypesParityTests.test_infer_schema_upcast_float_to_string) > self = testMethod=test_infer_schema_upcast_float_to_string> > def test_infer_schema_upcast_float_to_string(self): > > df = self.spark.createDataFrame([[1.33, 1], ["2.1", 1]], schema=["a", > > "b"]) > ../test_types.py:247: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > ../../connect/session.py:282: in createDataFrame > _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in > _data]) > pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist > ??? > pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist > ??? > pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays > ??? > pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays > ??? > pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays > ??? > pyarrow/array.pxi:320: in pyarrow.lib.array > ??? > pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array > ??? > pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status > ??? > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > > ??? > E pyarrow.lib.ArrowInvalid: Could not convert '2.1' with type str: tried to > convert to double > pyarrow/error.pxi:100: ArrowInvalid > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42024) createDataFrame should coerce types of string float to float
[ https://issues.apache.org/jira/browse/SPARK-42024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42024. --- Resolution: Resolved > createDataFrame should corse types of string float to float > --- > > Key: SPARK-42024 > URL: https://issues.apache.org/jira/browse/SPARK-42024 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > pyspark/sql/tests/test_types.py:245 > (TypesParityTests.test_infer_schema_upcast_float_to_string) > self = testMethod=test_infer_schema_upcast_float_to_string> > def test_infer_schema_upcast_float_to_string(self): > > df = self.spark.createDataFrame([[1.33, 1], ["2.1", 1]], schema=["a", > > "b"]) > ../test_types.py:247: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > ../../connect/session.py:282: in createDataFrame > _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in > _data]) > pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist > ??? > pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist > ??? > pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays > ??? > pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays > ??? > pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays > ??? > pyarrow/array.pxi:320: in pyarrow.lib.array > ??? > pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array > ??? > pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status > ??? > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > > ??? > E pyarrow.lib.ArrowInvalid: Could not convert '2.1' with type str: tried to > convert to double > pyarrow/error.pxi:100: ArrowInvalid > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
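[Editor's note] The ArrowInvalid above arises because the column mixes a Python float (1.33) and a numeric string ("2.1"), so Arrow cannot infer a single type. A standalone sketch of the coercion idea (the real fix lives in the Connect `createDataFrame` path; this helper is illustrative only):

```python
def coerce_numeric_strings(rows):
    """Per column: if floats and strings are mixed, cast the strings to
    float so a single Arrow type can be inferred for the column."""
    columns = list(zip(*rows))
    coerced_cols = []
    for col in columns:
        has_float = any(isinstance(v, float) for v in col)
        has_str = any(isinstance(v, str) for v in col)
        if has_float and has_str:
            # float("2.1") succeeds for numeric strings; a non-numeric
            # string would raise ValueError here (real Spark surfaces a
            # proper error message instead).
            col = tuple(float(v) if isinstance(v, str) else v for v in col)
        coerced_cols.append(col)
    return [list(row) for row in zip(*coerced_cols)]
```

Applied to the failing input `[[1.33, 1], ["2.1", 1]]`, every value in the first column becomes a float, so `pa.Table.from_pylist` no longer sees a str where it expects a double.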
[jira] [Created] (SPARK-42380) Upgrade maven to 3.9.0
Yang Jie created SPARK-42380: Summary: Upgrade maven to 3.9.0 Key: SPARK-42380 URL: https://issues.apache.org/jira/browse/SPARK-42380 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Yang Jie {code:java} [ERROR] An error occurred attempting to read POM org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen
[jira] [Assigned] (SPARK-42378) Make `DataFrame.select` support `a.*`
[ https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42378: - Assignee: Ruifeng Zheng > Make `DataFrame.select` support `a.*` > - > > Key: SPARK-42378 > URL: https://issues.apache.org/jira/browse/SPARK-42378 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42378) Make `DataFrame.select` support `a.*`
[ https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42378. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39934 [https://github.com/apache/spark/pull/39934] > Make `DataFrame.select` support `a.*` > - > > Key: SPARK-42378 > URL: https://issues.apache.org/jira/browse/SPARK-42378 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once
[ https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-42372. -- Fix Version/s: 3.4.0 Assignee: Kent Yao Resolution: Fixed issue resolved by https://github.com/apache/spark/pull/39929 > Improve performance of HiveGenericUDTF by making inputProjection instantiate > once > - > > Key: SPARK-42372 > URL: https://issues.apache.org/jira/browse/SPARK-42372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.0 > > > {code:java} > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 1574 1680 > 118 0.7 1501.1 1.0X > +Hive UDTF dup 4 2642 3076 > 588 0.4 2519.9 0.6X > + > diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > new file mode 100644 > index 00..8af8b6582c > --- /dev/null > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 712 789 > 101 1.5 678.7 1.0X > +Hive UDTF dup 4 1212 1294 > 78 0.9 1156.0 0.6X > + {code} > over 2x performance gain via a benchmarking -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
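[Editor's note] The optimization above is a classic hoisting of an expensive object construction out of a per-row loop. A plain-Python sketch of the before/after shape (a stand-in model, not Spark's actual HiveGenericUDTF code):

```python
class Projection:
    """Stand-in for Spark's input projection: assume constructing it is
    the expensive part (field resolution, setup work per instance)."""
    def __init__(self, num_fields):
        self.getters = [lambda row, i=i: row[i] for i in range(num_fields)]
    def __call__(self, row):
        return tuple(g(row) for g in self.getters)

def eval_per_row(rows):
    # Before the fix: a fresh projection is built for every input row.
    return [Projection(len(rows[0]))(row) for row in rows]

def eval_hoisted(rows):
    # After the fix: instantiate the projection once, reuse it for all rows.
    proj = Projection(len(rows[0]))
    return [proj(row) for row in rows]
```

Both variants produce identical output; only the construction count changes, which is where the benchmark's ~2x gain comes from.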
[jira] [Assigned] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao reassigned SPARK-40045: -- Assignee: caican > The order of filtering predicates is not reasonable > --- > > Key: SPARK-40045 > URL: https://issues.apache.org/jira/browse/SPARK-40045 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: caican >Assignee: caican >Priority: Major > Fix For: 3.4.0 > > > {code:java} > select id, data FROM testcat.ns1.ns2.table > where id =2 > and md5(data) = '8cde774d6f7333752ed72cacddb05126' > and trim(data) = 'a' {code} > Based on the SQL, we currently get the filters in the following order: > {code:java} > // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a))` comes before `(id#22L = 2)` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND > (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a)) AND (id#22L = 2)) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code} > In this predicate order, all data needs to participate in the evaluation, > even if some data does not meet the later filtering criteria and it may > causes spark tasks to execute slowly. > > So i think that filtering predicates that need to be evaluated should > automatically be placed to the far right to avoid data that does not meet the > criteria being evaluated. 
> > As shown below: > {noformat} > // `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = > 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = > 2) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a))) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
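[Editor's note] The reordering proposed above relies on short-circuit evaluation: once a cheap predicate like `id = 2` fails, the expensive `md5`/`trim` predicates are never evaluated for that row. A plain-Python sketch with an illustrative cost model (not Spark's optimizer code; the hash constant is the one from the issue's example query):

```python
import hashlib

md5_calls = {"n": 0}  # count evaluations of the expensive predicate

def _md5_pred(row):
    md5_calls["n"] += 1
    return (hashlib.md5(row["data"].encode()).hexdigest()
            == "8cde774d6f7333752ed72cacddb05126")

# (name, rough cost, predicate) -- costs are illustrative.
PREDICATES = [
    ("md5_eq", 100, _md5_pred),
    ("trim_eq", 10, lambda row: row["data"].strip() == "a"),
    ("id_eq", 1, lambda row: row["id"] == 2),
]

def filter_rows(rows):
    # Evaluate cheap predicates first; all() short-circuits, so rows that
    # fail `id = 2` never reach the md5 computation.
    ordered = sorted(PREDICATES, key=lambda p: p[1])
    return [r for r in rows if all(pred(r) for _, _, pred in ordered)]
```

Running this over five `id = 1` rows and one `id = 2` row evaluates md5 only once, which is the effect the proposed plan change aims for.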
[jira] [Assigned] (SPARK-42315) Assign name to _LEGACY_ERROR_TEMP_2092
[ https://issues.apache.org/jira/browse/SPARK-42315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42315: Assignee: Haejoon Lee > Assign name to _LEGACY_ERROR_TEMP_2092 > -- > > Key: SPARK-42315 > URL: https://issues.apache.org/jira/browse/SPARK-42315 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42315) Assign name to _LEGACY_ERROR_TEMP_2092
[ https://issues.apache.org/jira/browse/SPARK-42315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42315. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39889 [https://github.com/apache/spark/pull/39889] > Assign name to _LEGACY_ERROR_TEMP_2092 > -- > > Key: SPARK-42315 > URL: https://issues.apache.org/jira/browse/SPARK-42315 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42358) Provide more details in ExecutorUpdated sent in Master.removeWorker
[ https://issues.apache.org/jira/browse/SPARK-42358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42358. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39903 [https://github.com/apache/spark/pull/39903] > Provide more details in ExecutorUpdated sent in Master.removeWorker > --- > > Key: SPARK-42358 > URL: https://issues.apache.org/jira/browse/SPARK-42358 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 3.5.0 > > > Currently field `message` in `ExecutorUpdated` sent in Master.removeWorker is > always `Some("worker lost")`. We should provide more information in the > message instead to better differentiate the cause of the worker removal. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42358) Provide more details in ExecutorUpdated sent in Master.removeWorker
[ https://issues.apache.org/jira/browse/SPARK-42358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42358: - Assignee: Bo Zhang > Provide more details in ExecutorUpdated sent in Master.removeWorker > --- > > Key: SPARK-42358 > URL: https://issues.apache.org/jira/browse/SPARK-42358 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > > Currently field `message` in `ExecutorUpdated` sent in Master.removeWorker is > always `Some("worker lost")`. We should provide more information in the > message instead to better differentiate the cause of the worker removal. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40045. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39892 [https://github.com/apache/spark/pull/39892] > The order of filtering predicates is not reasonable > --- > > Key: SPARK-40045 > URL: https://issues.apache.org/jira/browse/SPARK-40045 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: caican >Priority: Major > Fix For: 3.4.0 > > > {code:java} > select id, data FROM testcat.ns1.ns2.table > where id =2 > and md5(data) = '8cde774d6f7333752ed72cacddb05126' > and trim(data) = 'a' {code} > Based on the SQL, we currently get the filters in the following order: > {code:java} > // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a))` comes before `(id#22L = 2)` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND > (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a)) AND (id#22L = 2)) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code} > In this predicate order, all data needs to participate in the evaluation, > even if some data does not meet the later filtering criteria and it may > causes spark tasks to execute slowly. > > So i think that filtering predicates that need to be evaluated should > automatically be placed to the far right to avoid data that does not meet the > criteria being evaluated. 
> > As shown below: > {noformat} > // `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = > 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))` > == Physical Plan == *(1) Project [id#22L, data#23] > +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = > 2) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND > (trim(data#23, None) = a))) > +- BatchScan[id#22L, data#23] class > org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
[ https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685681#comment-17685681 ] Apache Spark commented on SPARK-42379: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/39936 > Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists > > > Key: SPARK-42379 > URL: https://issues.apache.org/jira/browse/SPARK-42379 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Priority: Major > > Other methods in FileSystemBasedCheckpointFileManager already uses > FileSystem.exists for all cases checking existence of the path. Use > FileSystem.exists in FileSystemBasedCheckpointFileManager.exists, which is > consistent with other methods in FileSystemBasedCheckpointFileManager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
[ https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42379: Assignee: (was: Apache Spark) > Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists > > > Key: SPARK-42379 > URL: https://issues.apache.org/jira/browse/SPARK-42379 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Priority: Major > > Other methods in FileSystemBasedCheckpointFileManager already uses > FileSystem.exists for all cases checking existence of the path. Use > FileSystem.exists in FileSystemBasedCheckpointFileManager.exists, which is > consistent with other methods in FileSystemBasedCheckpointFileManager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
[ https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42379: Assignee: Apache Spark > Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists > > > Key: SPARK-42379 > URL: https://issues.apache.org/jira/browse/SPARK-42379 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > Other methods in FileSystemBasedCheckpointFileManager already uses > FileSystem.exists for all cases checking existence of the path. Use > FileSystem.exists in FileSystemBasedCheckpointFileManager.exists, which is > consistent with other methods in FileSystemBasedCheckpointFileManager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
[ https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685680#comment-17685680 ] Apache Spark commented on SPARK-42379: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/39936 > Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists > > > Key: SPARK-42379 > URL: https://issues.apache.org/jira/browse/SPARK-42379 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Priority: Major > > Other methods in FileSystemBasedCheckpointFileManager already uses > FileSystem.exists for all cases checking existence of the path. Use > FileSystem.exists in FileSystemBasedCheckpointFileManager.exists, which is > consistent with other methods in FileSystemBasedCheckpointFileManager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
Jungtaek Lim created SPARK-42379: Summary: Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists Key: SPARK-42379 URL: https://issues.apache.org/jira/browse/SPARK-42379 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Jungtaek Lim Other methods in FileSystemBasedCheckpointFileManager already use FileSystem.exists wherever they check for a path's existence. Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists as well, for consistency with the other methods in FileSystemBasedCheckpointFileManager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33807) Data Source V2: Remove read specific distributions
[ https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685674#comment-17685674 ] Chao Sun commented on SPARK-33807: -- This is actually already resolved as part of SPARK-37377. > Data Source V2: Remove read specific distributions > -- > > Key: SPARK-33807 > URL: https://issues.apache.org/jira/browse/SPARK-33807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anton Okolnychyi >Priority: Major > > We should remove the read-specific distributions for DS V2 as discussed > [here|https://github.com/apache/spark/pull/30706#discussion_r543059827]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33807) Data Source V2: Remove read specific distributions
[ https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-33807: Assignee: (was: Chao Sun) > Data Source V2: Remove read specific distributions > -- > > Key: SPARK-33807 > URL: https://issues.apache.org/jira/browse/SPARK-33807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anton Okolnychyi >Priority: Major > > We should remove the read-specific distributions for DS V2 as discussed > [here|https://github.com/apache/spark/pull/30706#discussion_r543059827]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33807) Data Source V2: Remove read specific distributions
[ https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-33807: Assignee: Chao Sun > Data Source V2: Remove read specific distributions > -- > > Key: SPARK-33807 > URL: https://issues.apache.org/jira/browse/SPARK-33807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anton Okolnychyi >Assignee: Chao Sun >Priority: Major > > We should remove the read-specific distributions for DS V2 as discussed > [here|https://github.com/apache/spark/pull/30706#discussion_r543059827]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685671#comment-17685671 ] Dongjoon Hyun commented on SPARK-41053: --- Hi, [~Gengliang.Wang]. Shall we resolve this issue? > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > Labels: releasenotes > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support only RocksDB, instead of both LevelDB & RocksDB, in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
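The write/read-time vs. size tradeoff in the benchmark above can be sketched in plain Python. This is not Spark's actual serializer: the record's field names are hypothetical, and `pickle` merely stands in for Protobuf as "a binary format with no extra compression pass".

```python
import gzip
import json
import pickle

# Toy stand-in for one SQLExecutionUIData record (field names are
# hypothetical, not Spark's real schema).
record = {"executionId": 42,
          "description": "collect at <console>:1",
          "metricValues": list(range(100))}

# Current approach: serialize to JSON, then gzip before writing to the KV store.
json_gzip = gzip.compress(json.dumps(record).encode("utf-8"))

# Proposed approach (approximated with pickle instead of Protobuf): a binary
# serializer whose output is NOT compressed again, relying on the KV store's
# own block compression instead.
binary = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# Both must round-trip losslessly; the difference the benchmark measures is
# CPU time per record and on-disk / in-memory size.
assert json.loads(gzip.decompress(json_gzip).decode("utf-8")) == record
assert pickle.loads(binary) == record
```

The table's numbers follow the same pattern: skipping the gzip pass makes writes and reads several times faster, at the cost of somewhat larger output.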
[jira] [Updated] (SPARK-35563) [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
[ https://issues.apache.org/jira/browse/SPARK-35563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35563: -- Priority: Major (was: Blocker) > [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows > -- > > Key: SPARK-35563 > URL: https://issues.apache.org/jira/browse/SPARK-35563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: Robert Joseph Evans >Priority: Major > Labels: data-loss > > I think this impacts a lot more versions of Spark, but I don't know for sure > because it takes a long time to test. As a part of doing corner case > validation testing for spark rapids I found that if a window function has > more than {{Int.MaxValue + 1}} rows, the result is silently truncated to that > many rows. I have only tested this on 3.0.2 with {{row_number}}, but I > suspect it will impact others as well. This is a really rare corner case, but > because it is silent data corruption I personally think it is quite serious. > {code:scala} > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.functions._ > val windowSpec = Window.partitionBy("a").orderBy("b") > val df = spark.range(Int.MaxValue.toLong + 100).selectExpr("1 as a", "id as > b") > spark.time(df.select(col("a"), col("b"), > row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), > desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20)) > +-----+----------+ > |  dir|     count| > +-----+----------+ > |false|2147483647| > | true|         1| > +-----+----------+ > Time taken: 1139089 ms > Int.MaxValue.toLong + 100 > res15: Long = 2147483747 > 2147483647L + 1 > res16: Long = 2147483648 > {code} > I had to make sure that I ran the above with at least 64GiB of heap for the > executor (I did it in local mode and it worked, but took forever to run). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
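The single negative `rn` in the repro above is consistent with a signed 32-bit row counter wrapping around. A minimal sketch of that behaviour (plain Python, mimicking JVM Int arithmetic, not Spark's actual window implementation):

```python
INT_MAX = 2**31 - 1  # JVM Int.MaxValue = 2147483647

def to_int32(n: int) -> int:
    """Wrap an arbitrary integer to signed 32-bit two's complement,
    the way JVM Int arithmetic overflows."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

# row_number stays positive up to Int.MaxValue...
assert to_int32(INT_MAX) == 2147483647
# ...and the very next row number wraps negative, matching the single
# `rn < 0` row observed in the repro.
assert to_int32(INT_MAX + 1) == -2147483648
```

This also explains why the counts sum to 2147483648 rather than the 2147483747 rows in the input: everything past the wrap point is silently dropped.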
[jira] [Commented] (SPARK-33807) Data Source V2: Remove read specific distributions
[ https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685668#comment-17685668 ] Dongjoon Hyun commented on SPARK-33807: --- According to the discussion, I lowered the `Priority` from `Blocker` to `Major`. > Data Source V2: Remove read specific distributions > -- > > Key: SPARK-33807 > URL: https://issues.apache.org/jira/browse/SPARK-33807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anton Okolnychyi >Priority: Blocker > > We should remove the read-specific distributions for DS V2 as discussed > [here|https://github.com/apache/spark/pull/30706#discussion_r543059827]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33807) Data Source V2: Remove read specific distributions
[ https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33807: -- Priority: Major (was: Blocker) > Data Source V2: Remove read specific distributions > -- > > Key: SPARK-33807 > URL: https://issues.apache.org/jira/browse/SPARK-33807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anton Okolnychyi >Priority: Major > > We should remove the read-specific distributions for DS V2 as discussed > [here|https://github.com/apache/spark/pull/30706#discussion_r543059827]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42210) Standardize registered pickled Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-42210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685667#comment-17685667 ] Apache Spark commented on SPARK-42210: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/39860 > Standardize registered pickled Python UDFs > -- > > Key: SPARK-42210 > URL: https://issues.apache.org/jira/browse/SPARK-42210 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement spark.udf. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42210) Standardize registered pickled Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-42210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42210: Assignee: Apache Spark > Standardize registered pickled Python UDFs > -- > > Key: SPARK-42210 > URL: https://issues.apache.org/jira/browse/SPARK-42210 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Implement spark.udf. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42210) Standardize registered pickled Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-42210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42210: Assignee: (was: Apache Spark) > Standardize registered pickled Python UDFs > -- > > Key: SPARK-42210 > URL: https://issues.apache.org/jira/browse/SPARK-42210 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement spark.udf. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42210) Standardize registered pickled Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-42210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685666#comment-17685666 ] Apache Spark commented on SPARK-42210: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/39860 > Standardize registered pickled Python UDFs > -- > > Key: SPARK-42210 > URL: https://issues.apache.org/jira/browse/SPARK-42210 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement spark.udf. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42244) Refine error message by using Python types.
[ https://issues.apache.org/jira/browse/SPARK-42244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685655#comment-17685655 ] Apache Spark commented on SPARK-42244: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/39935 > Refine error message by using Python types. > --- > > Key: SPARK-42244 > URL: https://issues.apache.org/jira/browse/SPARK-42244 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Currently, the type name in error message is mixed like `string` and `str`. > We might need to consolidate them into one rule. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42371) Add scripts to start and stop Spark Connect server
[ https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42371. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39928 [https://github.com/apache/spark/pull/39928] > Add scripts to start and stop Spark Connect server > -- > > Key: SPARK-42371 > URL: https://issues.apache.org/jira/browse/SPARK-42371 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently, there is no proper way to start and stop the Spark Connect server. > Now it requires you to start it with, for example, a Spark shell: > {code} > # For development, > ./bin/spark-shell \ >--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > {code} > # For released Spark versions > ./bin/spark-shell \ > --packages org.apache.spark:spark-connect_2.12:3.4.0 \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > which is awkward. > We need some dedicated scripts for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42371) Add scripts to start and stop Spark Connect server
[ https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42371: Assignee: Hyukjin Kwon > Add scripts to start and stop Spark Connect server > -- > > Key: SPARK-42371 > URL: https://issues.apache.org/jira/browse/SPARK-42371 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently, there is no proper way to start and stop the Spark Connect server. > Now it requires you to start it with, for example, a Spark shell: > {code} > # For development, > ./bin/spark-shell \ >--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > {code} > # For released Spark versions > ./bin/spark-shell \ > --packages org.apache.spark:spark-connect_2.12:3.4.0 \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > which is awkward. > We need some dedicated scripts for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[ https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40819: - Fix Version/s: 3.2.4 3.3.2 > Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type > instead of automatically converting to LongType > > > Key: SPARK-40819 > URL: https://issues.apache.org/jira/browse/SPARK-40819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.3.2, 3.4.0 >Reporter: Alfred Davidson >Assignee: Alfred Davidson >Priority: Critical > Labels: regression > Fix For: 3.2.4, 3.3.2, 3.4.0 > > > Since 3.2 parquet files containing attributes with type "INT64 > (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read > throws: > > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: > INT64 (TIMESTAMP(NANOS,true)) > at > org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at 
scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76) > {code} > Prior to 3.2 successfully reads the parquet automatically converting to a > LongType. 
> I believe work part of https://issues.apache.org/jira/browse/SPARK-34661 > introduced the change in behaviour, more specifically here: > [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154] > which throws the QueryCompilationErrors.illegalParquetTypeError -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
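The underlying mismatch above can be sketched with plain integer arithmetic (this is not Spark's code, and the timestamp value is purely illustrative): Spark's TimestampType holds microseconds since the epoch in a 64-bit long, so a Parquet `TIMESTAMP(NANOS)` value cannot be mapped onto it losslessly, whereas surfacing the raw INT64 as LongType (the pre-3.2 behaviour) loses nothing.

```python
# Hypothetical INT64 nanosecond timestamp from a Parquet file.
nanos = 1_665_672_000_123_456_789

# Pre-3.2 behaviour: expose the raw INT64 as LongType -- lossless.
as_long = nanos
assert as_long == nanos

# Mapping to a microsecond TimestampType would drop sub-microsecond digits.
micros = nanos // 1_000
assert micros * 1_000 != nanos
assert nanos - micros * 1_000 == 789  # the lost nanosecond remainder
```

Since 3.2 Spark refuses the conversion outright with `illegalParquetTypeError` instead of falling back to LongType, which is the regression this ticket tracks.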
[jira] [Resolved] (SPARK-42244) Refine error message by using Python types.
[ https://issues.apache.org/jira/browse/SPARK-42244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42244. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39815 [https://github.com/apache/spark/pull/39815] > Refine error message by using Python types. > --- > > Key: SPARK-42244 > URL: https://issues.apache.org/jira/browse/SPARK-42244 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Currently, the type name in error message is mixed like `string` and `str`. > We might need to consolidate them into one rule. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42244) Refine error message by using Python types.
[ https://issues.apache.org/jira/browse/SPARK-42244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42244: Assignee: Haejoon Lee > Refine error message by using Python types. > --- > > Key: SPARK-42244 > URL: https://issues.apache.org/jira/browse/SPARK-42244 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Currently, the type name in error message is mixed like `string` and `str`. > We might need to consolidate them into one rule. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42301) Assign name to _LEGACY_ERROR_TEMP_1129
[ https://issues.apache.org/jira/browse/SPARK-42301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42301. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39871 [https://github.com/apache/spark/pull/39871] > Assign name to _LEGACY_ERROR_TEMP_1129 > -- > > Key: SPARK-42301 > URL: https://issues.apache.org/jira/browse/SPARK-42301 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42301) Assign name to _LEGACY_ERROR_TEMP_1129
[ https://issues.apache.org/jira/browse/SPARK-42301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42301: Assignee: Haejoon Lee > Assign name to _LEGACY_ERROR_TEMP_1129 > -- > > Key: SPARK-42301 > URL: https://issues.apache.org/jira/browse/SPARK-42301 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42254) Assign name to _LEGACY_ERROR_TEMP_1117
[ https://issues.apache.org/jira/browse/SPARK-42254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42254: Assignee: Haejoon Lee > Assign name to _LEGACY_ERROR_TEMP_1117 > -- > > Key: SPARK-42254 > URL: https://issues.apache.org/jira/browse/SPARK-42254 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42254) Assign name to _LEGACY_ERROR_TEMP_1117
[ https://issues.apache.org/jira/browse/SPARK-42254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42254. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39837 [https://github.com/apache/spark/pull/39837] > Assign name to _LEGACY_ERROR_TEMP_1117 > -- > > Key: SPARK-42254 > URL: https://issues.apache.org/jira/browse/SPARK-42254 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42249) Refining html strings in error messages
[ https://issues.apache.org/jira/browse/SPARK-42249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42249. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39820 [https://github.com/apache/spark/pull/39820] > Refining html strings in error messages > --- > > Key: SPARK-42249 > URL: https://issues.apache.org/jira/browse/SPARK-42249 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Using relative path for html string -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42249) Refining html strings in error messages
[ https://issues.apache.org/jira/browse/SPARK-42249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42249: Assignee: Haejoon Lee > Refining html strings in error messages > --- > > Key: SPARK-42249 > URL: https://issues.apache.org/jira/browse/SPARK-42249 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Using relative path for html string -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42378) Make `DataFrame.select` support `a.*`
[ https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685633#comment-17685633 ] Apache Spark commented on SPARK-42378: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39934 > Make `DataFrame.select` support `a.*` > - > > Key: SPARK-42378 > URL: https://issues.apache.org/jira/browse/SPARK-42378 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42378) Make `DataFrame.select` support `a.*`
[ https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42378: Assignee: Apache Spark > Make `DataFrame.select` support `a.*` > - > > Key: SPARK-42378 > URL: https://issues.apache.org/jira/browse/SPARK-42378 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42378) Make `DataFrame.select` support `a.*`
[ https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42378: Assignee: (was: Apache Spark) > Make `DataFrame.select` support `a.*` > - > > Key: SPARK-42378 > URL: https://issues.apache.org/jira/browse/SPARK-42378 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42378) Make `DataFrame.select` support `a.*`
Ruifeng Zheng created SPARK-42378: - Summary: Make `DataFrame.select` support `a.*` Key: SPARK-42378 URL: https://issues.apache.org/jira/browse/SPARK-42378 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685623#comment-17685623 ] Herman van Hövell commented on SPARK-39375: --- [~xkrogen] Regarding the external classes. It is early days. We will submit a patch in the next couple of days that will allow REPL generated code. A step after would be jars and probably other artifacts. > SPIP: Spark Connect - A client and server interface for Apache Spark > > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Epic > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Critical > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. 
(3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. > > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42352) Upgrade maven to 3.8.7
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42352. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39896 [https://github.com/apache/spark/pull/39896] > Upgrade maven to 3.8.7 > -- > > Key: SPARK-42352 > URL: https://issues.apache.org/jira/browse/SPARK-42352 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > [https://maven.apache.org/docs/3.8.7/release-notes.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42352) Upgrade maven to 3.8.7
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42352: Assignee: Yang Jie > Upgrade maven to 3.8.7 > -- > > Key: SPARK-42352 > URL: https://issues.apache.org/jira/browse/SPARK-42352 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > [https://maven.apache.org/docs/3.8.7/release-notes.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42094) Support `fill_value` for `ps.Series.add`
[ https://issues.apache.org/jira/browse/SPARK-42094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42094: Assignee: Haejoon Lee > Support `fill_value` for `ps.Series.add` > > > Key: SPARK-42094 > URL: https://issues.apache.org/jira/browse/SPARK-42094 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > For pandas function parity: > https://pandas.pydata.org/docs/reference/api/pandas.Series.add.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42094) Support `fill_value` for `ps.Series.add`
[ https://issues.apache.org/jira/browse/SPARK-42094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42094. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39790 [https://github.com/apache/spark/pull/39790] > Support `fill_value` for `ps.Series.add` > > > Key: SPARK-42094 > URL: https://issues.apache.org/jira/browse/SPARK-42094 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.5.0 > > > For pandas function parity: > https://pandas.pydata.org/docs/reference/api/pandas.Series.add.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
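For context, the `fill_value` semantics being matched are those of plain pandas, which the pandas API on Spark mirrors: a missing operand on either side is replaced by `fill_value` before the addition, while positions missing on both sides stay NaN. A small pandas sketch:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, None])
b = pd.Series([10.0, None, 30.0])

# fill_value=0 substitutes 0 for the missing operand before adding.
out = a.add(b, fill_value=0)
assert out.tolist() == [11.0, 2.0, 30.0]
```

With this ticket, `ps.Series([1.0, 2.0, None]).add(...)` is expected to accept the same keyword.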
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685619#comment-17685619 ] Herman van Hövell commented on SPARK-39375: --- [~xkrogen] the current work on UDFs is somewhat orthogonal to the way we execute UDFs. The current work uses the existing backend for execution. We can change the way we execute the UDFs later on, it would involve a small change to how we plan the UDF on the server side. I do think running the udfs in a separate process has merit (better isolation, lower blast radius, etc...). However it will have profound impact on performance since UDF execution will break the execution pipeline in pieces, it requires starting cold(ish) java processes, etc... > SPIP: Spark Connect - A client and server interface for Apache Spark > > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Epic > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Critical > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. 
There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. 
Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. > > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685618#comment-17685618 ] Hyukjin Kwon commented on SPARK-39375: -- cc [~zhenli] [~hvanhovell] [~grundprinzip-db] ^ FYI > SPIP: Spark Connect - A client and server interface for Apache Spark > > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Epic > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Critical > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. > > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42377) Test Framework for Connect Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685613#comment-17685613 ] Apache Spark commented on SPARK-42377: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/39933 > Test Framework for Connect Scala Client > --- > > Key: SPARK-42377 > URL: https://issues.apache.org/jira/browse/SPARK-42377 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42377) Test Framework for Connect Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42377: Assignee: (was: Apache Spark) > Test Framework for Connect Scala Client > --- > > Key: SPARK-42377 > URL: https://issues.apache.org/jira/browse/SPARK-42377 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42377) Test Framework for Connect Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685612#comment-17685612 ] Apache Spark commented on SPARK-42377: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/39933 > Test Framework for Connect Scala Client > --- > > Key: SPARK-42377 > URL: https://issues.apache.org/jira/browse/SPARK-42377 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42377) Test Framework for Connect Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42377: Assignee: Apache Spark > Test Framework for Connect Scala Client > --- > > Key: SPARK-42377 > URL: https://issues.apache.org/jira/browse/SPARK-42377 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42377) Test Framework for Connect Scala Client
Herman van Hövell created SPARK-42377: - Summary: Test Framework for Connect Scala Client Key: SPARK-42377 URL: https://issues.apache.org/jira/browse/SPARK-42377 Project: Spark Issue Type: Task Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685600#comment-17685600 ] Erik Krogen commented on SPARK-39375: - UDFs are a complex space, e.g. for Scala the current impl completed in SPARK-42283 cannot handle externally defined classes, which are a common requirement in UDFs. It's also a notable design decision that we are choosing to process UDFs in the Spark Connect server session, vs. a sidecar process like a UDF server that can provide isolation between different UDFs (e.g. as [supported by Presto|https://github.com/prestodb/presto/issues/14053] and [leveraged heavily by Meta|https://www.databricks.com/session_na21/portable-udfs-write-once-run-anywhere]). It would be nice to see more discussion on the merits of various approaches to UDFs in the Spark Connect framework and a clear plan, rather than pushing forward with them piecemeal. It's of course reasonable that UDFs were left out of scope for the original SPIP, but based on that omission I was expecting we would have a subsequent discussion on UDFs for Spark Connect before starting implementation for them. > SPIP: Spark Connect - A client and server interface for Apache Spark > > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Epic > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Critical > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. 
Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. 
From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. > > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language.
[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685466#comment-17685466 ] Ritika Maheshwari commented on SPARK-42346:
---
Hello, I added three rows to input_table. Still no error. I do have DPP enabled.

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 12.0.2)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(("a","b"),("c","d"),("e","f")).toDF("surname","first_name")
df: org.apache.spark.sql.DataFrame = [surname: string, first_name: string]

scala> df.createOrReplaceTempView("input_table")

scala> spark.sql("select(Select Count(Distinct first_name) from input_table) As distinct_value_count from input_table Union all select (select count(Distinct surname) from input_table) as distinct_value_count from input_table").show()
+--------------------+
|distinct_value_count|
+--------------------+
|                   3|
|                   3|
|                   3|
|                   3|
|                   3|
|                   3|
+--------------------+

AdaptiveSparkPlan isFinalPlan=false
+- Union
   :- Project [cast(Subquery subquery#145, [id=#571] as string) AS distinct_value_count#161]
   :  :  +- Subquery subquery#145, [id=#571]
   :  :     +- AdaptiveSparkPlan isFinalPlan=false
   :  :        +- HashAggregate(keys=[], functions=[count(distinct first_name#8)], output=[count(DISTINCT first_name)#152L])
   :  :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#569]
   :  :              +- HashAggregate(keys=[], functions=[partial_count(distinct first_name#8)], output=[count#167L])
   :  :                 +- HashAggregate(keys=[first_name#8], functions=[], output=[first_name#8])
   :  :                    +- Exchange hashpartitioning(first_name#8, 200), ENSURE_REQUIREMENTS, [id=#565]
   :  :                       +- HashAggregate(keys=[first_name#8], functions=[], output=[first_name#8])
   :  :                          +- LocalTableScan [first_name#8]
   :  +- LocalTableScan [_1#2, _2#3]
   +- Project [cast(Subquery subquery#147, [id=#590] as string) AS distinct_value_count#163]
      :  +- Subquery subquery#147, [id=#590]
      :     +- AdaptiveSparkPlan isFinalPlan=false
      :        +- HashAggregate(keys=[], functions=[count(distinct surname#7)], output=[count(DISTINCT surname)#154L])
      :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#588]
      :              +- HashAggregate(keys=[], functions=[partial_count(distinct surname#7)], output=[count#170L])
      :                 +- HashAggregate(keys=[surname#7], functions=[], output=[surname#7])
      :                    +- Exchange hashpartitioning(surname#7, 200), ENSURE_REQUIREMENTS, [id=#584]
      :                       +- HashAggregate(keys=[surname#7], functions=[], output=[surname#7])
      :                          +- LocalTableScan [surname#7]
      +- LocalTableScan [_1#149, _2#150]

> distinct(count colname) with UNION ALL causes query analyzer bug
>
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Reporter: Robin
> Assignee: Peter Toth
> Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
>
> If you combine a UNION ALL with a count(distinct colname) you get a query analyzer bug.
>
> This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1.
>
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{ \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{ (SELECT Count(DISTINCT first_name) FROM input_table) }}
> {{ AS distinct_value_count}}
> {{FROM input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{ (SELECT Count(DISTINCT surname) FROM input_table) }}
> {{ AS distinct_value_count}}
> {{FROM input_table """}}
> {{spark.sql(sql).toPandas()}}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
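The intended semantics of the reprex can be sanity-checked without Spark: each scalar subquery computes one distinct count over the whole table, and that constant is emitted once per input row on each side of the UNION ALL. The sketch below is a plain-Python model of those semantics (the `distinct_count` helper is illustrative, not a Spark API):

```python
# Pure-Python model of the SPARK-42346 reprex semantics: each scalar
# subquery is a single distinct count over the whole input table, and
# the constant is emitted once per input row on each UNION ALL side.

rows = [("a", "b")]  # (surname, first_name), as in the original reprex

def distinct_count(values):
    """Mirror of COUNT(DISTINCT col): number of distinct values."""
    return len(set(values))

first_name_count = distinct_count(fn for _, fn in rows)
surname_count = distinct_count(sn for sn, _ in rows)

# UNION ALL keeps duplicates: one output row per input row on each side.
result = [first_name_count] * len(rows) + [surname_count] * len(rows)
print(result)  # [1, 1] -- what the query should return, analyzer bug aside
```

With Ritika's three-row table the same model yields six rows of 3, matching her show() output above.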
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685384#comment-17685384 ] Gera Shegalov commented on SPARK-41793: --- Another interpretation of why the pre-3.4 count of 1 may be actually correct could be that regardless of whether the window frame bound values overflow or not the current row is always part of the window it defines. Whether or not it should be the case can be clarified in the doc. > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42369) Fix constructor for java.nio.DirectByteBuffer for Java 21+
[ https://issues.apache.org/jira/browse/SPARK-42369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42369: -- Issue Type: Improvement (was: Bug) > Fix constructor for java.nio.DirectByteBuffer for Java 21+ > -- > > Key: SPARK-42369 > URL: https://issues.apache.org/jira/browse/SPARK-42369 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 3.5.0 >Reporter: Ludovic Henry >Assignee: Ludovic Henry >Priority: Major > Fix For: 3.5.0 > > > In the latest JDK, the constructor {{DirectByteBuffer(long, int)}} was > replaced with {{{}DirectByteBuffer(long, long){}}}. We just want to support > both by probing for the legacy one first and falling back to the newer one > second. > This change is completely transparent for the end-user, and makes sure Spark > works transparently on the latest JDK as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42369) Fix constructor for java.nio.DirectByteBuffer for Java 21+
[ https://issues.apache.org/jira/browse/SPARK-42369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42369: - Assignee: Ludovic Henry > Fix constructor for java.nio.DirectByteBuffer for Java 21+ > -- > > Key: SPARK-42369 > URL: https://issues.apache.org/jira/browse/SPARK-42369 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.5.0 >Reporter: Ludovic Henry >Assignee: Ludovic Henry >Priority: Major > > In the latest JDK, the constructor {{DirectByteBuffer(long, int)}} was > replaced with {{{}DirectByteBuffer(long, long){}}}. We just want to support > both by probing for the legacy one first and falling back to the newer one > second. > This change is completely transparent for the end-user, and makes sure Spark > works transparently on the latest JDK as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42369) Fix constructor for java.nio.DirectByteBuffer for Java 21+
[ https://issues.apache.org/jira/browse/SPARK-42369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42369. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39909 [https://github.com/apache/spark/pull/39909] > Fix constructor for java.nio.DirectByteBuffer for Java 21+ > -- > > Key: SPARK-42369 > URL: https://issues.apache.org/jira/browse/SPARK-42369 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.5.0 >Reporter: Ludovic Henry >Assignee: Ludovic Henry >Priority: Major > Fix For: 3.5.0 > > > In the latest JDK, the constructor {{DirectByteBuffer(long, int)}} was > replaced with {{{}DirectByteBuffer(long, long){}}}. We just want to support > both by probing for the legacy one first and falling back to the newer one > second. > This change is completely transparent for the end-user, and makes sure Spark > works transparently on the latest JDK as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42376) Introduce watermark propagation among operators
[ https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42376: Assignee: Apache Spark > Introduce watermark propagation among operators > --- > > Key: SPARK-42376 > URL: https://issues.apache.org/jira/browse/SPARK-42376 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > With introduction of SPARK-40925, we enabled workloads containing multiple > stateful operators in a single streaming query. > The JIRA ticket clearly described out-of-scope, "Here we propose fixing the > late record filtering in stateful operators to allow chaining of stateful > operators {*}which do not produce delayed records (like time-interval join or > potentially flatMapGroupsWithState){*}". > We identified production use case for stream-stream time-interval join > followed by stateful operator (e.g. window aggregation), and propose to > address such use case via this ticket. > The design will be described in the PR, but the sketched idea is introducing > simulation of watermark propagation among operators. As of now, Spark > considers all stateful operators to have same input watermark and output > watermark, which introduced the limitation. With this ticket, we construct > the logic to simulate watermark propagation so that each operator can have > its own (input watermark, output watermark). Operators introducing delayed > records will produce delayed output watermark, and downstream operator can > take the delay into account as input watermark will be adjusted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42376) Introduce watermark propagation among operators
[ https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685350#comment-17685350 ] Apache Spark commented on SPARK-42376: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/39931 > Introduce watermark propagation among operators > --- > > Key: SPARK-42376 > URL: https://issues.apache.org/jira/browse/SPARK-42376 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Priority: Major > > With introduction of SPARK-40925, we enabled workloads containing multiple > stateful operators in a single streaming query. > The JIRA ticket clearly described out-of-scope, "Here we propose fixing the > late record filtering in stateful operators to allow chaining of stateful > operators {*}which do not produce delayed records (like time-interval join or > potentially flatMapGroupsWithState){*}". > We identified production use case for stream-stream time-interval join > followed by stateful operator (e.g. window aggregation), and propose to > address such use case via this ticket. > The design will be described in the PR, but the sketched idea is introducing > simulation of watermark propagation among operators. As of now, Spark > considers all stateful operators to have same input watermark and output > watermark, which introduced the limitation. With this ticket, we construct > the logic to simulate watermark propagation so that each operator can have > its own (input watermark, output watermark). Operators introducing delayed > records will produce delayed output watermark, and downstream operator can > take the delay into account as input watermark will be adjusted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42376) Introduce watermark propagation among operators
[ https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42376: Assignee: (was: Apache Spark) > Introduce watermark propagation among operators > --- > > Key: SPARK-42376 > URL: https://issues.apache.org/jira/browse/SPARK-42376 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Priority: Major > > With introduction of SPARK-40925, we enabled workloads containing multiple > stateful operators in a single streaming query. > The JIRA ticket clearly described out-of-scope, "Here we propose fixing the > late record filtering in stateful operators to allow chaining of stateful > operators {*}which do not produce delayed records (like time-interval join or > potentially flatMapGroupsWithState){*}". > We identified production use case for stream-stream time-interval join > followed by stateful operator (e.g. window aggregation), and propose to > address such use case via this ticket. > The design will be described in the PR, but the sketched idea is introducing > simulation of watermark propagation among operators. As of now, Spark > considers all stateful operators to have same input watermark and output > watermark, which introduced the limitation. With this ticket, we construct > the logic to simulate watermark propagation so that each operator can have > its own (input watermark, output watermark). Operators introducing delayed > records will produce delayed output watermark, and downstream operator can > take the delay into account as input watermark will be adjusted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37099) Introduce a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685328#comment-17685328 ] Apache Spark commented on SPARK-37099: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39930 > Introduce a rank-based filter to optimize top-k computation > --- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Attachments: q67.png, q67_optimized.png, skewed_window.png > > > At JD, we found that more than 90% of window function usage follows this > pattern: > {code:java} > select (... (row_number|rank|dense_rank) () over( [partition by ...] order > by ... ) as rn) > where rn (==|<|<=) k and other conditions{code} > > However, the existing physical plan is not optimal: > > 1. We should select local top-k records within each partition, and then > compute the global top-k; this helps reduce the shuffle amount. > > For these three rank functions (row_number|rank|dense_rank), the rank of a > key computed on a partial dataset is always <= its final rank computed on > the whole dataset, so we can safely discard rows with partial rank > k > anywhere. > > > 2. Skewed window: some partitions are skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
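The safety argument in point 1 — a key's rank on a partial dataset never exceeds its final rank, so rows whose partial rank exceeds k can be discarded early — can be sketched in a few lines of plain Python (an illustrative model only, not Spark's physical plan):

```python
import heapq

def local_then_global_topk(partitions, k):
    # Local step: keep at most the k smallest rows per partition. A row whose
    # rank within its own partition exceeds k already has final rank > k, so
    # discarding it here cannot change the global answer, while shrinking the
    # amount of data that must be shuffled.
    local = [heapq.nsmallest(k, part) for part in partitions]
    # Global step: the final top-k is computed over the surviving rows only.
    return heapq.nsmallest(k, [row for part in local for row in part])

partitions = [[9, 1, 7], [4, 8, 2], [6, 3, 5]]
# Matches the naive global top-k over all rows:
assert local_then_global_topk(partitions, 2) == [1, 2]
```

The same argument holds for row_number, rank, and dense_rank; only the tie-handling of the per-partition ordering differs.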
[jira] [Commented] (SPARK-42376) Introduce watermark propagation among operators
[ https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685315#comment-17685315 ] Jungtaek Lim commented on SPARK-42376: -- Will submit a PR soon. > Introduce watermark propagation among operators > --- > > Key: SPARK-42376 > URL: https://issues.apache.org/jira/browse/SPARK-42376 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Priority: Major > > With the introduction of SPARK-40925, we enabled workloads containing multiple > stateful operators in a single streaming query. > That JIRA ticket clearly described what was out of scope: "Here we propose fixing the > late record filtering in stateful operators to allow chaining of stateful > operators {*}which do not produce delayed records (like time-interval join or > potentially flatMapGroupsWithState){*}". > We identified a production use case for a stream-stream time-interval join > followed by a stateful operator (e.g. window aggregation), and propose to > address this use case via this ticket. > The design will be described in the PR, but the sketched idea is to introduce a > simulation of watermark propagation among operators. As of now, Spark > considers all stateful operators to have the same input watermark and output > watermark, which introduced the limitation. With this ticket, we construct > the logic to simulate watermark propagation so that each operator can have > its own (input watermark, output watermark). Operators introducing delayed > records will produce a delayed output watermark, and the downstream operator can > take the delay into account as its input watermark will be adjusted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42376) Introduce watermark propagation among operators
Jungtaek Lim created SPARK-42376: Summary: Introduce watermark propagation among operators Key: SPARK-42376 URL: https://issues.apache.org/jira/browse/SPARK-42376 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Jungtaek Lim With the introduction of SPARK-40925, we enabled workloads containing multiple stateful operators in a single streaming query. That JIRA ticket clearly described what was out of scope: "Here we propose fixing the late record filtering in stateful operators to allow chaining of stateful operators {*}which do not produce delayed records (like time-interval join or potentially flatMapGroupsWithState){*}". We identified a production use case for a stream-stream time-interval join followed by a stateful operator (e.g. window aggregation), and propose to address this use case via this ticket. The design will be described in the PR, but the sketched idea is to introduce a simulation of watermark propagation among operators. As of now, Spark considers all stateful operators to have the same input watermark and output watermark, which introduced the limitation. With this ticket, we construct the logic to simulate watermark propagation so that each operator can have its own (input watermark, output watermark). Operators introducing delayed records will produce a delayed output watermark, and the downstream operator can take the delay into account as its input watermark will be adjusted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
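The sketched idea above can be illustrated with a toy simulation (a hypothetical model with made-up names, not Spark's actual implementation): each operator's input watermark comes from its upstream operator's output watermark, and an operator that may emit delayed records — such as a time-interval join — lowers its output watermark by its maximum delay, which the downstream operator then sees as its input watermark.

```python
def propagate_watermarks(operators, source_watermark_ms):
    """Simulate watermark propagation along a chain of stateful operators.

    `operators` is a list of (name, max_delay_ms) pairs in topological order.
    Returns (name, input_watermark_ms, output_watermark_ms) per operator.
    """
    input_wm = source_watermark_ms
    result = []
    for name, delay_ms in operators:
        # An operator that can emit records up to delay_ms late must lower
        # its output watermark accordingly; a delay of 0 passes it through.
        output_wm = input_wm - delay_ms
        result.append((name, input_wm, output_wm))
        input_wm = output_wm  # the downstream operator sees the adjusted value
    return result

# A time-interval join that may emit records up to 10s late, followed by a
# window aggregation that introduces no extra delay:
stages = propagate_watermarks(
    [("time-interval-join", 10_000), ("window-agg", 0)], 60_000)
# → [("time-interval-join", 60000, 50000), ("window-agg", 50000, 50000)]
```

Under today's single-watermark model both operators would see 60000, and the aggregation would wrongly drop the join's delayed output as late.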
[jira] [Assigned] (SPARK-42136) Refactor BroadcastHashJoinExec output partitioning generation
[ https://issues.apache.org/jira/browse/SPARK-42136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42136: --- Assignee: Peter Toth > Refactor BroadcastHashJoinExec output partitioning generation > - > > Key: SPARK-42136 > URL: https://issues.apache.org/jira/browse/SPARK-42136 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42136) Refactor BroadcastHashJoinExec output partitioning generation
[ https://issues.apache.org/jira/browse/SPARK-42136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42136. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 38038 [https://github.com/apache/spark/pull/38038] > Refactor BroadcastHashJoinExec output partitioning generation > - > > Key: SPARK-42136 > URL: https://issues.apache.org/jira/browse/SPARK-42136 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42375) Point out the user-facing documentation in Spark Connect server startup
Hyukjin Kwon created SPARK-42375: Summary: Point out the user-facing documentation in Spark Connect server startup Key: SPARK-42375 URL: https://issues.apache.org/jira/browse/SPARK-42375 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon See -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42374) User-facing documentation
[ https://issues.apache.org/jira/browse/SPARK-42374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42374: - Description: Should provide user-facing documentation so end users know how to use Spark Connect. > User-facing documentation > - > > Key: SPARK-42374 > URL: https://issues.apache.org/jira/browse/SPARK-42374 > Project: Spark > Issue Type: Documentation > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Should provide user-facing documentation so end users know how to use Spark > Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42374) User-facing documentation
[ https://issues.apache.org/jira/browse/SPARK-42374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42374: Assignee: Haejoon Lee > User-facing documentation > - > > Key: SPARK-42374 > URL: https://issues.apache.org/jira/browse/SPARK-42374 > Project: Spark > Issue Type: Documentation > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Haejoon Lee >Priority: Major > > Should provide user-facing documentation so end users know how to use Spark > Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42375) Point out the user-facing documentation in Spark Connect server startup
[ https://issues.apache.org/jira/browse/SPARK-42375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42375: - Description: See SPARK-42375 in SparkSubmit.scala (was: See ) > Point out the user-facing documentation in Spark Connect server startup > --- > > Key: SPARK-42375 > URL: https://issues.apache.org/jira/browse/SPARK-42375 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See SPARK-42375 in SparkSubmit.scala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42374) User-facing documentation
Hyukjin Kwon created SPARK-42374: Summary: User-facing documentation Key: SPARK-42374 URL: https://issues.apache.org/jira/browse/SPARK-42374 Project: Spark Issue Type: Documentation Components: Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42367) DataFrame.drop should handle duplicated columns properly
[ https://issues.apache.org/jira/browse/SPARK-42367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-42367: -- Summary: DataFrame.drop should handle duplicated columns properly (was: DataFrame.drop could handle duplicated columns) > DataFrame.drop should handle duplicated columns properly > > > Key: SPARK-42367 > URL: https://issues.apache.org/jira/browse/SPARK-42367 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > {code:java} > >>> df.join(df2, df.name == df2.name, 'inner').show() > +---++--++ > |age|name|height|name| > +---++--++ > | 16| Bob|85| Bob| > | 14| Tom|80| Tom| > +---++--++ > >>> df.join(df2, df.name == df2.name, 'inner').drop('name').show() > +---+--+ > |age|height| > +---+--+ > | 16|85| > | 14|80| > +---+--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
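The distinction the ticket is after can be modeled in a few lines of plain Python (an illustrative model, not Spark code): after a join produces two columns with the same name, dropping by a bare column name removes every matching column, while dropping by a qualified reference removes only the column from that side of the join.

```python
def drop(columns, target):
    """Model columns as (origin, name) pairs, as in a joined DataFrame.

    Dropping by a qualified (origin, name) reference removes only that
    exact column; dropping by a bare name removes all columns that share it.
    """
    if isinstance(target, tuple):  # qualified reference, e.g. df2.name
        return [c for c in columns if c != target]
    return [c for c in columns if c[1] != target]  # bare name: drop all matches

# Columns of df.join(df2, ...) from the example above:
joined = [("df", "age"), ("df", "name"), ("df2", "height"), ("df2", "name")]

drop(joined, "name")           # both name columns gone, as in drop('name')
drop(joined, ("df2", "name"))  # only df2's name column gone
```

The real fix lives in Connect's and PySpark's `DataFrame.drop`; the `(origin, name)` encoding here is just a stand-in for column identity.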
[jira] [Assigned] (SPARK-42373) Remove unused blank line removal from CSVExprUtils
[ https://issues.apache.org/jira/browse/SPARK-42373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42373: Assignee: Apache Spark > Remove unused blank line removal from CSVExprUtils > -- > > Key: SPARK-42373 > URL: https://issues.apache.org/jira/browse/SPARK-42373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Assignee: Apache Spark >Priority: Minor > > The non-multiline CSV read codepath contains references to removal of blank > lines throughout. This is not necessary as blank lines are removed by the > parser. Furthermore, it causes confusion, indicating that blank lines are > removed at this point when instead they are already omitted from the data. > The multiline code-path does not explicitly remove blank lines leading to > what looks like disparity in behavior between the two. > The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need > to explicitly skip lines, and this should be respected in {{CSVUtils}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42373) Remove unused blank line removal from CSVExprUtils
[ https://issues.apache.org/jira/browse/SPARK-42373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42373: Assignee: (was: Apache Spark) > Remove unused blank line removal from CSVExprUtils > -- > > Key: SPARK-42373 > URL: https://issues.apache.org/jira/browse/SPARK-42373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Priority: Minor > > The non-multiline CSV read codepath contains references to removal of blank > lines throughout. This is not necessary as blank lines are removed by the > parser. Furthermore, it causes confusion, indicating that blank lines are > removed at this point when instead they are already omitted from the data. > The multiline code-path does not explicitly remove blank lines leading to > what looks like disparity in behavior between the two. > The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need > to explicitly skip lines, and this should be respected in {{CSVUtils}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42373) Remove unused blank line removal from CSVExprUtils
[ https://issues.apache.org/jira/browse/SPARK-42373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685268#comment-17685268 ] Apache Spark commented on SPARK-42373: -- User 'ted-jenks' has created a pull request for this issue: https://github.com/apache/spark/pull/39927 > Remove unused blank line removal from CSVExprUtils > -- > > Key: SPARK-42373 > URL: https://issues.apache.org/jira/browse/SPARK-42373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Priority: Minor > > The non-multiline CSV read codepath contains references to removal of blank > lines throughout. This is not necessary as blank lines are removed by the > parser. Furthermore, it causes confusion, indicating that blank lines are > removed at this point when instead they are already omitted from the data. > The multiline code-path does not explicitly remove blank lines leading to > what looks like disparity in behavior between the two. > The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need > to explicitly skip lines, and this should be respected in {{CSVUtils}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42373) Remove unused blank line removal from CSVExprUtils
Willi Raschkowski created SPARK-42373: - Summary: Remove unused blank line removal from CSVExprUtils Key: SPARK-42373 URL: https://issues.apache.org/jira/browse/SPARK-42373 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1 Reporter: Willi Raschkowski The non-multiline CSV read codepath contains references to removal of blank lines throughout. This is not necessary, as blank lines are removed by the parser. Furthermore, it causes confusion, indicating that blank lines are removed at this point when instead they are already omitted from the data. The multiline codepath does not explicitly remove blank lines, leading to what looks like a disparity in behavior between the two. The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need to explicitly skip lines, and this should be respected in {{CSVUtils}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
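The behavioral difference being described can be sketched in plain Python (illustrative only, with hypothetical function names — not Spark's CSVExprUtils): when the line reader already yields no record for an empty physical line, an upstream blank-line removal pass is dead code, whereas the Dataset[String] path receives raw strings and must skip blanks itself.

```python
def parse_line(line):
    # a minimal stand-in for the per-line CSV parser
    return line.split(",")

def read_non_multiline(physical_lines):
    # the line reader itself never emits empty lines, so a second
    # blank-line removal pass before parsing would be redundant
    return [parse_line(l) for l in physical_lines if l]

def read_dataset_of_strings(strings):
    # rows of a Dataset[String] arrive unfiltered, so this path really
    # does need to skip blank entries explicitly before parsing
    return [parse_line(s) for s in strings if s.strip()]

read_non_multiline(["a,b", "", "1,2"])        # blank physical line ignored
read_dataset_of_strings(["a,b", "  ", "1,2"])  # blank string row skipped
```

The ticket's point is that only the second function's filter is genuinely needed; the first one's is duplicated work that misleads readers of the code.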
[jira] [Assigned] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once
[ https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42372: Assignee: (was: Apache Spark) > Improve performance of HiveGenericUDTF by making inputProjection instantiate > once > - > > Key: SPARK-42372 > URL: https://issues.apache.org/jira/browse/SPARK-42372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > > {code:java} > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 1574 1680 > 118 0.7 1501.1 1.0X > +Hive UDTF dup 4 2642 3076 > 588 0.4 2519.9 0.6X > + > diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > new file mode 100644 > index 00..8af8b6582c > --- /dev/null > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 712 789 > 101 1.5 678.7 1.0X > +Hive UDTF dup 4 1212 1294 > 78 0.9 1156.0 0.6X > + {code} > over 2x performance gain via a benchmarking -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once
[ https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685237#comment-17685237 ] Apache Spark commented on SPARK-42372: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/39929 > Improve performance of HiveGenericUDTF by making inputProjection instantiate > once > - > > Key: SPARK-42372 > URL: https://issues.apache.org/jira/browse/SPARK-42372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > > {code:java} > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 1574 1680 > 118 0.7 1501.1 1.0X > +Hive UDTF dup 4 2642 3076 > 588 0.4 2519.9 0.6X > + > diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > new file mode 100644 > index 00..8af8b6582c > --- /dev/null > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 712 789 > 101 1.5 678.7 1.0X > +Hive UDTF dup 4 1212 1294 > 78 0.9 1156.0 0.6X > + {code} > over 2x performance gain via a benchmarking -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once
[ https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42372: Assignee: Apache Spark > Improve performance of HiveGenericUDTF by making inputProjection instantiate > once > - > > Key: SPARK-42372 > URL: https://issues.apache.org/jira/browse/SPARK-42372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > {code:java} > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 1574 1680 > 118 0.7 1501.1 1.0X > +Hive UDTF dup 4 2642 3076 > 588 0.4 2519.9 0.6X > + > diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > new file mode 100644 > index 00..8af8b6582c > --- /dev/null > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 712 789 > 101 1.5 678.7 1.0X > +Hive UDTF dup 4 1212 1294 > 78 0.9 1156.0 0.6X > + {code} > over 2x performance gain via a benchmarking -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once
[ https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-42372: - Description: {code:java} +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt @@ -0,0 +1,7 @@ +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative + +Hive UDTF dup 2 1574 1680 118 0.7 1501.1 1.0X +Hive UDTF dup 4 2642 3076 588 0.4 2519.9 0.6X + diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt new file mode 100644 index 00..8af8b6582c --- /dev/null +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt @@ -0,0 +1,7 @@ +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative + +Hive UDTF dup 2 712 789 101 1.5 678.7 1.0X +Hive UDTF dup 4 1212 1294 78 0.9 1156.0 0.6X + {code} over 2x performance gain via a benchmarking > Improve performance of HiveGenericUDTF by making inputProjection instantiate > once > - > > Key: SPARK-42372 > URL: https://issues.apache.org/jira/browse/SPARK-42372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > > {code:java} > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 1574 1680 > 118 0.7 1501.1 1.0X > +Hive UDTF dup 4 2642 3076 > 588 0.4 2519.9 0.6X > + > diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > 
new file mode 100644 > index 00..8af8b6582c > --- /dev/null > +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt > @@ -0,0 +1,7 @@ > +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1 > +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > +Hive UDTF benchmark: Best Time(ms) Avg Time(ms) > Stdev(ms) Rate(M/s) Per Row(ns) Relative > + > +Hive UDTF dup 2 712 789 > 101 1.5 678.7 1.0X > +Hive UDTF dup 4 1212 1294 > 78 0.9 1156.0 0.6X > + {code} > over 2x performance gain via a benchmarking -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once
Kent Yao created SPARK-42372: Summary: Improve performance of HiveGenericUDTF by making inputProjection instantiate once Key: SPARK-42372 URL: https://issues.apache.org/jira/browse/SPARK-42372 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
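The optimization amounts to hoisting a per-row object construction out of the hot loop. A language-neutral sketch in Python (hypothetical names; the actual change concerns HiveGenericUDTF's inputProjection in Scala):

```python
class Projection:
    """Stand-in for an input projection: selects a subset of row fields."""
    def __init__(self, indices):
        self.indices = indices
    def __call__(self, row):
        return tuple(row[i] for i in self.indices)

def eval_per_row(rows, indices):
    # before: a fresh Projection is constructed for every input row,
    # paying the allocation/initialization cost once per row
    return [Projection(indices)(row) for row in rows]

def eval_once(rows, indices):
    # after: the Projection is instantiated once and reused across rows,
    # which is where the reported ~2x gain comes from
    proj = Projection(indices)
    return [proj(row) for row in rows]

rows = [(1, "a", True), (2, "b", False)]
# Both variants produce identical results; only the per-row cost differs.
assert eval_per_row(rows, [0, 1]) == eval_once(rows, [0, 1]) == [(1, "a"), (2, "b")]
```

The sketch only shows the structural change; the benchmark numbers above come from the JVM implementation, where per-row instantiation is far more expensive than this toy class.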
[jira] [Commented] (SPARK-42371) Add scripts to start and stop Spark Connect server
[ https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685208#comment-17685208 ] Apache Spark commented on SPARK-42371: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39928 > Add scripts to start and stop Spark Connect server > -- > > Key: SPARK-42371 > URL: https://issues.apache.org/jira/browse/SPARK-42371 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, there is no proper way to start and stop the Spark Connect server. > Now it requires you to start it with, for example, a Spark shell: > {code} > # For development, > ./bin/spark-shell \ >--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > {code} > # For released Spark versions > ./bin/spark-shell \ > --packages org.apache.spark:spark-connect_2.12:3.4.0 \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > which is awkward. > We need some dedicated scripts for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42371) Add scripts to start and stop Spark Connect server
[ https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685209#comment-17685209 ] Apache Spark commented on SPARK-42371: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39928 > Add scripts to start and stop Spark Connect server > -- > > Key: SPARK-42371 > URL: https://issues.apache.org/jira/browse/SPARK-42371 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, there is no proper way to start and stop the Spark Connect server. > Now it requires you to start it with, for example, a Spark shell: > {code} > # For development, > ./bin/spark-shell \ >--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > {code} > # For released Spark versions > ./bin/spark-shell \ > --packages org.apache.spark:spark-connect_2.12:3.4.0 \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > which is awkward. > We need some dedicated scripts for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42371) Add scripts to start and stop Spark Connect server
[ https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42371: Assignee: (was: Apache Spark) > Add scripts to start and stop Spark Connect server > -- > > Key: SPARK-42371 > URL: https://issues.apache.org/jira/browse/SPARK-42371 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, there is no proper way to start and stop the Spark Connect server. > Now it requires you to start it with, for example, a Spark shell: > {code} > # For development, > ./bin/spark-shell \ >--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > {code} > # For released Spark versions > ./bin/spark-shell \ > --packages org.apache.spark:spark-connect_2.12:3.4.0 \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > which is awkward. > We need some dedicated scripts for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42371) Add scripts to start and stop Spark Connect server
[ https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42371: Assignee: Apache Spark > Add scripts to start and stop Spark Connect server > -- > > Key: SPARK-42371 > URL: https://issues.apache.org/jira/browse/SPARK-42371 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Currently, there is no proper way to start and stop the Spark Connect server. > Now it requires you to start it with, for example, a Spark shell: > {code} > # For development, > ./bin/spark-shell \ >--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > {code} > # For released Spark versions > ./bin/spark-shell \ > --packages org.apache.spark:spark-connect_2.12:3.4.0 \ > --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin > {code} > which is awkward. > We need some dedicated scripts for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42266) Local mode should work with IPython
[ https://issues.apache.org/jira/browse/SPARK-42266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685207#comment-17685207 ] Hyukjin Kwon commented on SPARK-42266: -- Let me take a look > Local mode should work with IPython > --- > > Key: SPARK-42266 > URL: https://issues.apache.org/jira/browse/SPARK-42266 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > {code:java} > (spark_dev) ➜ spark git:(master) bin/pyspark --remote "local[*]" > Python 3.9.15 (main, Nov 24 2022, 08:28:41) > Type 'copyright', 'credits' or 'license' for more information > IPython 8.9.0 -- An enhanced Interactive Python. Type '?' for help. > /Users/ruifeng.zheng/Dev/spark/python/pyspark/shell.py:45: UserWarning: > Failed to initialize Spark session. > warnings.warn("Failed to initialize Spark session.") > Traceback (most recent call last): > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/shell.py", line 40, in > > spark = SparkSession.builder.getOrCreate() > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/session.py", line > 429, in getOrCreate > from pyspark.sql.connect.session import SparkSession as RemoteSparkSession > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/__init__.py", line > 21, in > from pyspark.sql.connect.dataframe import DataFrame # noqa: F401 > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 35, in > import pandas > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/__init__.py", > line 29, in > from pyspark.pandas.missing.general_functions import > MissingPandasLikeGeneralFunctions > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/__init__.py", > line 34, in > require_minimum_pandas_version() > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/utils.py", > line 37, in require_minimum_pandas_version > if LooseVersion(pandas.__version__) < > 
LooseVersion(minimum_pandas_version): > AttributeError: partially initialized module 'pandas' has no attribute > '__version__' (most likely due to a circular import) > [TerminalIPythonApp] WARNING | Unknown error in handling PYTHONSTARTUP file > /Users/ruifeng.zheng/Dev/spark//python/pyspark/shell.py: > --- > AttributeErrorTraceback (most recent call last) > File ~/Dev/spark/python/pyspark/shell.py:40 > 38 try: > 39 # Creates pyspark.sql.connect.SparkSession. > ---> 40 spark = SparkSession.builder.getOrCreate() > 41 except Exception: > File ~/Dev/spark/python/pyspark/sql/session.py:429, in > SparkSession.Builder.getOrCreate(self) > 428 with SparkContext._lock: > --> 429 from pyspark.sql.connect.session import SparkSession as > RemoteSparkSession > 431 if ( > 432 SparkContext._active_spark_context is None > 433 and SparkSession._instantiatedSession is None > 434 ): > File ~/Dev/spark/python/pyspark/sql/connect/__init__.py:21 > 18 """Currently Spark Connect is very experimental and the APIs to > interact with > 19 Spark through this API are can be changed at any time without > warning.""" > ---> 21 from pyspark.sql.connect.dataframe import DataFrame # noqa: F401 > 22 from pyspark.sql.pandas.utils import ( > 23 require_minimum_pandas_version, > 24 require_minimum_pyarrow_version, > 25 require_minimum_grpc_version, > 26 ) > File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:35 > 34 import random > ---> 35 import pandas > 36 import json > File ~/Dev/spark/python/pyspark/pandas/__init__.py:29 > 27 from typing import Any > ---> 29 from pyspark.pandas.missing.general_functions import > MissingPandasLikeGeneralFunctions > 30 from pyspark.pandas.missing.scalars import MissingPandasLikeScalars > File ~/Dev/spark/python/pyspark/pandas/__init__.py:34 > 33 try: > ---> 34 require_minimum_pandas_version() > 35 require_minimum_pyarrow_version() > File ~/Dev/spark/python/pyspark/sql/pandas/utils.py:37, in > require_minimum_pandas_version() > 34 raise ImportError( > 35 "Pandas 
>= %s must be installed; however, " "it was not > found." % minimum_pandas_version > 36 ) from raised_error > ---> 37 if LooseVersion(pandas.__version__) < > LooseVersion(minimum_pandas_version): > 38 raise ImportError( > 39 "Pandas >= %s must be installed; however, " > 40 "your version was %s." % (minimum_pandas_version, > pandas.__version__) > 41 ) > AttributeError: partially initialized module 'pandas' has no attribute > '__version__'
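The traceback above is Python's standard symptom of a circular import: `import pandas` resolves to a module whose body is still executing, so `__version__` has not been defined yet by the time `require_minimum_pandas_version()` reads it. A minimal, self-contained sketch of that failure mode (the `fakepandas` package and its layout are invented for illustration; they only mimic the shape of the real import cycle):

```python
# Reproduce "partially initialized module ... has no attribute '__version__'":
# a module that is still executing its own body cannot expose attributes it
# defines later to code it imports in the middle of that body.
import os
import sys
import tempfile
import textwrap

pkg_root = tempfile.mkdtemp()
os.makedirs(os.path.join(pkg_root, "fakepandas"))

# __init__.py imports a submodule *before* defining __version__, just as the
# real import chain pulls in version checks while the module is mid-import.
with open(os.path.join(pkg_root, "fakepandas", "__init__.py"), "w") as f:
    f.write(textwrap.dedent("""\
        import fakepandas.checker  # runs while fakepandas is half-initialized
        __version__ = "1.0"        # too late: checker already needed this
    """))

# checker.py imports the parent package back and reads __version__.
with open(os.path.join(pkg_root, "fakepandas", "checker.py"), "w") as f:
    f.write(textwrap.dedent("""\
        import fakepandas
        _ = fakepandas.__version__  # AttributeError raised here
    """))

sys.path.insert(0, pkg_root)
try:
    import fakepandas
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

On Python 3.8+ the message names the half-built module and suggests a circular import, matching the shell output quoted above.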
[jira] [Created] (SPARK-42371) Add scripts to start and stop Spark Connect server
Hyukjin Kwon created SPARK-42371: Summary: Add scripts to start and stop Spark Connect server Key: SPARK-42371 URL: https://issues.apache.org/jira/browse/SPARK-42371 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon Currently, there is no proper way to start and stop the Spark Connect server; it has to be started with, for example, a Spark shell: {code} # For development, ./bin/spark-shell \ --jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \ --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin {code} {code} # For released Spark versions ./bin/spark-shell \ --packages org.apache.spark:spark-connect_2.12:3.4.0 \ --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin {code} which is awkward. We need some dedicated scripts for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
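As a rough sketch of the direction (the server class name and flags below are assumptions pieced together from the spark-shell commands above, not the scripts eventually merged for this ticket), a dedicated launcher mostly needs to assemble a single spark-submit invocation; here it only builds and prints the command so the wiring can be inspected:

```python
# Sketch of a dedicated Spark Connect launcher: build the spark-submit argv
# once instead of asking users to piggyback on an interactive spark-shell.
# The --class value is an assumption, not a confirmed entry point.
import shlex

def connect_server_command(spark_version="3.4.0"):
    """Return the argv for launching a Spark Connect server (sketch)."""
    return [
        "./bin/spark-submit",
        "--class", "org.apache.spark.sql.connect.service.SparkConnectServer",
        "--packages", "org.apache.spark:spark-connect_2.12:" + spark_version,
        "--conf", "spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin",
    ]

print(shlex.join(connect_server_command()))
```

A real start script would exec this command under daemon-style management and record a pid file so a matching stop script could terminate the server.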
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685177#comment-17685177 ] Hyukjin Kwon commented on SPARK-39375: -- [~xkrogen] I just saw this. Some of the work is already done and merged; the initial work was done in https://github.com/apache/spark/pull/39585. I thought it's actually not that complicated: one general layer shared by all Scala, Python, etc. UDFs, which contains the actual Python UDF implementation. > SPIP: Spark Connect - A client and server interface for Apache Spark > > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Epic > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Critical > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. Most of the limitations stem from the tightly > coupled Spark driver architecture and the fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL, and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. 
(2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. 
> > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language.
[jira] [Resolved] (SPARK-41289) Feature parity: Catalog API
[ https://issues.apache.org/jira/browse/SPARK-41289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41289. -- Resolution: Done > Feature parity: Catalog API > --- > > Key: SPARK-41289 > URL: https://issues.apache.org/jira/browse/SPARK-41289 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical >
[jira] [Updated] (SPARK-42370) Spark History Server fails to start on CentOS7 aarch64
[ https://issues.apache.org/jira/browse/SPARK-42370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhiguo Wu updated SPARK-42370: -- Description: When I run `./sbin/start-history-server.sh` I get the error below !image-2023-02-07-16-54-43-593.png! Although we already use org.openlabtesting.leveldbjni, which works on aarch64, we still load org.fusesource.hawtjni.runtime.Library from the wrong jar file. When we run `export SPARK_DAEMON_JAVA_OPTS=-verbose:class`, we can see the class is loaded from jline-2.14.6.jar, whereas the correct class file is under leveldbjni-all-1.8.jar. Incorrect (now): [Loaded org.fusesource.hawtjni.runtime.Library from file:/yourdir/spark/jars/jline-2.14.6.jar] Correct (expected): [Loaded org.fusesource.hawtjni.runtime.Library from file:/yourdir/spark/jars/leveldbjni-all-1.8.jar] was: When I run `./sbin/start-history-server.sh` I'll get the error below !image-2023-02-07-16-54-43-593.png! Although we already use org.openlabtesting.leveldbjni on aarch64, which can works on aarch64, we still load org.fusesource.hawtjni.runtime.Library on wrong jar file we can see the class is load from jline-2.14.6.jar where the correct class file is under leveldbjni-all-1.8.jar when we run export SPARK_DAEMON_JAVA_OPTS=-verbose:class Incorrect: [Loaded org.fusesource.hawtjni.runtime.Library from file:/yourdir/spark/jars/jline-2.14.6.jar] Correct: [Loaded org.fusesource.hawtjni.runtime.Library from file:/yourdir/spark/jars/leveldbjni-all-1.8.jar] > Spark History Server fails to start on CentOS7 aarch64 > -- > > Key: SPARK-42370 > URL: https://issues.apache.org/jira/browse/SPARK-42370 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: Zhiguo Wu >Priority: Major > Attachments: image-2023-02-07-16-54-43-593.png > > > When I run `./sbin/start-history-server.sh` > I get the error below > !image-2023-02-07-16-54-43-593.png! 
> > Although we already use org.openlabtesting.leveldbjni, which works on > aarch64, we still load org.fusesource.hawtjni.runtime.Library from the wrong > jar file. When we run `export SPARK_DAEMON_JAVA_OPTS=-verbose:class`, we can > see the class is loaded from jline-2.14.6.jar, whereas the correct class file > is under leveldbjni-all-1.8.jar. > > Incorrect (now): > [Loaded org.fusesource.hawtjni.runtime.Library from > file:/yourdir/spark/jars/jline-2.14.6.jar] > Correct (expected): > [Loaded org.fusesource.hawtjni.runtime.Library from > file:/yourdir/spark/jars/leveldbjni-all-1.8.jar]
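To see every candidate jar that bundles the conflicting class, without restarting the JVM under -verbose:class, the jars directory can be scanned directly. A hedged sketch (the `$SPARK_HOME/jars` layout in the usage comment is an assumption about a typical install):

```python
# Scan a directory of jars for a given class entry, to find duplicates like
# org/fusesource/hawtjni/runtime/Library.class shipping in multiple jars.
import glob
import os
import zipfile

def jars_containing(jars_dir, class_entry):
    """Return the basenames of jars under jars_dir that bundle class_entry."""
    hits = []
    for jar in sorted(glob.glob(os.path.join(jars_dir, "*.jar"))):
        try:
            with zipfile.ZipFile(jar) as zf:  # a jar is just a zip archive
                if class_entry in zf.namelist():
                    hits.append(os.path.basename(jar))
        except zipfile.BadZipFile:
            continue  # skip corrupt archives rather than abort the scan
    return hits

# Example usage against a Spark install (path is an assumption):
# jars_containing(os.path.expandvars("$SPARK_HOME/jars"),
#                 "org/fusesource/hawtjni/runtime/Library.class")
```

If this reports both jline-2.14.6.jar and leveldbjni-all-1.8.jar, classpath ordering decides which copy of the class wins, matching the behavior described above.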