[jira] [Assigned] (SPARK-42309) Assign name to _LEGACY_ERROR_TEMP_1204

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42309:


Assignee: Apache Spark

> Assign name to _LEGACY_ERROR_TEMP_1204
> --
>
> Key: SPARK-42309
> URL: https://issues.apache.org/jira/browse/SPARK-42309
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
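For context, the _LEGACY_ERROR_TEMP_* tickets in this digest all follow one pattern: an opaque, auto-generated key in Spark's error-classes.json is replaced by a descriptive error class name. A schematic sketch with a placeholder name (the real name comes from the linked pull request):

{code:python}
# Schematic only: both entries are simplified, and "INVALID_SOMETHING" is an
# invented placeholder, not the name the pull request actually assigns.
before = {"_LEGACY_ERROR_TEMP_1204": {"message": ["<original message text>"]}}
after = {"INVALID_SOMETHING": {"message": ["<original message text>"]}}  # named, stable key
{code}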




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42309) Assign name to _LEGACY_ERROR_TEMP_1204

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42309:


Assignee: (was: Apache Spark)

> Assign name to _LEGACY_ERROR_TEMP_1204
> --
>
> Key: SPARK-42309
> URL: https://issues.apache.org/jira/browse/SPARK-42309
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>







[jira] [Commented] (SPARK-42309) Assign name to _LEGACY_ERROR_TEMP_1204

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685723#comment-17685723
 ] 

Apache Spark commented on SPARK-42309:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39937

> Assign name to _LEGACY_ERROR_TEMP_1204
> --
>
> Key: SPARK-42309
> URL: https://issues.apache.org/jira/browse/SPARK-42309
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>







[jira] [Assigned] (SPARK-42267) Support left_outer join

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42267:


Assignee: (was: Apache Spark)

> Support left_outer join
> ---
>
> Key: SPARK-42267
> URL: https://issues.apache.org/jira/browse/SPARK-42267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> ```
> >>> df = spark.range(1)
> >>> df2 = spark.range(2)
> >>> df.join(df2, how="left_outer")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", 
> line 438, in join
> plan.Join(left=self._plan, right=other._plan, on=on, how=how),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line 
> 730, in __init__
> raise NotImplementedError(
> NotImplementedError: 
> Unsupported join type: left_outer. Supported join types 
> include:
> "inner", "outer", "full", "fullouter", "full_outer",
> "leftouter", "left", "left_outer", "rightouter",
> "right", "right_outer", "leftsemi", "left_semi",
> "semi", "leftanti", "left_anti", "anti", "cross",
> ```
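Notably, "left_outer" appears in the error's own list of supported types, so the failure is in how the client matches the string rather than in the set of types. A hypothetical sketch (my own illustration, not the actual plan.py patch) of alias normalization that would accept every listed spelling:

{code:python}
# Hypothetical alias normalization; names are illustrative, not taken from
# pyspark/sql/connect/plan.py.
def normalize_join_type(how: str) -> str:
    key = how.lower().replace("_", "")   # "left_outer" -> "leftouter"
    aliases = {
        "inner": "inner",
        "cross": "cross",
        "outer": "full_outer", "full": "full_outer", "fullouter": "full_outer",
        "left": "left_outer", "leftouter": "left_outer",
        "right": "right_outer", "rightouter": "right_outer",
        "semi": "left_semi", "leftsemi": "left_semi",
        "anti": "left_anti", "leftanti": "left_anti",
    }
    if key not in aliases:
        raise NotImplementedError(f"Unsupported join type: {how}")
    return aliases[key]

assert normalize_join_type("left_outer") == "left_outer"  # the failing case
{code}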






[jira] [Commented] (SPARK-42267) Support left_outer join

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685722#comment-17685722
 ] 

Apache Spark commented on SPARK-42267:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39938

> Support left_outer join
> ---
>
> Key: SPARK-42267
> URL: https://issues.apache.org/jira/browse/SPARK-42267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> ```
> >>> df = spark.range(1)
> >>> df2 = spark.range(2)
> >>> df.join(df2, how="left_outer")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", 
> line 438, in join
> plan.Join(left=self._plan, right=other._plan, on=on, how=how),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line 
> 730, in __init__
> raise NotImplementedError(
> NotImplementedError: 
> Unsupported join type: left_outer. Supported join types 
> include:
> "inner", "outer", "full", "fullouter", "full_outer",
> "leftouter", "left", "left_outer", "rightouter",
> "right", "right_outer", "leftsemi", "left_semi",
> "semi", "leftanti", "left_anti", "anti", "cross",
> ```






[jira] [Assigned] (SPARK-42267) Support left_outer join

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42267:


Assignee: Apache Spark

> Support left_outer join
> ---
>
> Key: SPARK-42267
> URL: https://issues.apache.org/jira/browse/SPARK-42267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> ```
> >>> df = spark.range(1)
> >>> df2 = spark.range(2)
> >>> df.join(df2, how="left_outer")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", 
> line 438, in join
> plan.Join(left=self._plan, right=other._plan, on=on, how=how),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line 
> 730, in __init__
> raise NotImplementedError(
> NotImplementedError: 
> Unsupported join type: left_outer. Supported join types 
> include:
> "inner", "outer", "full", "fullouter", "full_outer",
> "leftouter", "left", "left_outer", "rightouter",
> "right", "right_outer", "leftsemi", "left_semi",
> "semi", "leftanti", "left_anti", "anti", "cross",
> ```






[jira] [Assigned] (SPARK-42024) createDataFrame should coerce types of string float to float

2023-02-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42024:
-

Assignee: Ruifeng Zheng

> createDataFrame should coerce types of string float to float
> ---
>
> Key: SPARK-42024
> URL: https://issues.apache.org/jira/browse/SPARK-42024
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code}
> pyspark/sql/tests/test_types.py:245 
> (TypesParityTests.test_infer_schema_upcast_float_to_string)
> self = <TypesParityTests testMethod=test_infer_schema_upcast_float_to_string>
> def test_infer_schema_upcast_float_to_string(self):
> >   df = self.spark.createDataFrame([[1.33, 1], ["2.1", 1]], schema=["a", 
> > "b"])
> ../test_types.py:247: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> ../../connect/session.py:282: in createDataFrame
> _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in 
> _data])
> pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist
> ???
> pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist
> ???
> pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays
> ???
> pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays
> ???
> pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays
> ???
> pyarrow/array.pxi:320: in pyarrow.lib.array
> ???
> pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array
> ???
> pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> >   ???
> E   pyarrow.lib.ArrowInvalid: Could not convert '2.1' with type str: tried to 
> convert to double
> pyarrow/error.pxi:100: ArrowInvalid
> {code}
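The failure happens before any Spark logic runs: pyarrow infers float64 for column "a" from the first row and then refuses the string '2.1'. A rough sketch of the coercion idea, my own illustration rather than the actual session.py fix: upcast a mixed column to string before handing the rows to pyarrow.

{code:python}
import pyarrow as pa

def coerce_mixed_columns(rows, cols):
    # Pivot row-oriented data into columns, mirroring what happens before
    # pa.Table.from_pylist is called.
    columns = {c: [row[i] for row in rows] for i, c in enumerate(cols)}
    for name, values in columns.items():
        kinds = {type(v) for v in values if v is not None}
        if len(kinds) > 1:  # mixed, e.g. {float, str}: upcast everything to str
            columns[name] = [None if v is None else str(v) for v in values]
    return pa.table(columns)

table = coerce_mixed_columns([[1.33, 1], ["2.1", 1]], ["a", "b"])
print(table.schema)  # column "a" becomes string, "b" stays int64
{code}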






[jira] [Resolved] (SPARK-42024) createDataFrame should coerce types of string float to float

2023-02-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42024.
---
Resolution: Resolved

> createDataFrame should coerce types of string float to float
> ---
>
> Key: SPARK-42024
> URL: https://issues.apache.org/jira/browse/SPARK-42024
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> pyspark/sql/tests/test_types.py:245 
> (TypesParityTests.test_infer_schema_upcast_float_to_string)
> self = <TypesParityTests testMethod=test_infer_schema_upcast_float_to_string>
> def test_infer_schema_upcast_float_to_string(self):
> >   df = self.spark.createDataFrame([[1.33, 1], ["2.1", 1]], schema=["a", 
> > "b"])
> ../test_types.py:247: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> ../../connect/session.py:282: in createDataFrame
> _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in 
> _data])
> pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist
> ???
> pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist
> ???
> pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays
> ???
> pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays
> ???
> pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays
> ???
> pyarrow/array.pxi:320: in pyarrow.lib.array
> ???
> pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array
> ???
> pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> >   ???
> E   pyarrow.lib.ArrowInvalid: Could not convert '2.1' with type str: tried to 
> convert to double
> pyarrow/error.pxi:100: ArrowInvalid
> {code}






[jira] [Created] (SPARK-42380) Upgrade maven to 3.9.0

2023-02-07 Thread Yang Jie (Jira)
Yang Jie created SPARK-42380:


 Summary: Upgrade maven to 3.9.0
 Key: SPARK-42380
 URL: https://issues.apache.org/jira/browse/SPARK-42380
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


{code:java}
[ERROR] An error occurred attempting to read POM
org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml 
decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen 

[jira] [Assigned] (SPARK-42378) Make `DataFrame.select` support `a.*`

2023-02-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42378:
-

Assignee: Ruifeng Zheng

> Make `DataFrame.select` support `a.*`
> -
>
> Key: SPARK-42378
> URL: https://issues.apache.org/jira/browse/SPARK-42378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>
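For reference, this is the select pattern the ticket enables on the Connect client, shown with standard PySpark semantics (assumes a running SparkSession; the struct column "a" is illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT named_struct('x', 1, 'y', 2) AS a")
df.select("a.*").show()  # "a.*" expands the struct into columns x and y
{code}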







[jira] [Resolved] (SPARK-42378) Make `DataFrame.select` support `a.*`

2023-02-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42378.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39934
[https://github.com/apache/spark/pull/39934]

> Make `DataFrame.select` support `a.*`
> -
>
> Key: SPARK-42378
> URL: https://issues.apache.org/jira/browse/SPARK-42378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once

2023-02-07 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-42372.
--
Fix Version/s: 3.4.0
 Assignee: Kent Yao
   Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/39929

> Improve performance of HiveGenericUDTF by making inputProjection instantiate 
> once
> -
>
> Key: SPARK-42372
> URL: https://issues.apache.org/jira/browse/SPARK-42372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:                      Best Time(ms)   Avg Time(ms)   
> Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> +
> +Hive UDTF dup 2                                    1574           1680       
>   118          0.7        1501.1       1.0X
> +Hive UDTF dup 4                                    2642           3076       
>   588          0.4        2519.9       0.6X
> +
> diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt 
> b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> new file mode 100644
> index 00..8af8b6582c
> --- /dev/null
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:                      Best Time(ms)   Avg Time(ms)   
> Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> +
> +Hive UDTF dup 2                                     712            789       
>   101          1.5         678.7       1.0X
> +Hive UDTF dup 4                                    1212           1294       
>    78          0.9        1156.0       0.6X
> + {code}
> over 2x performance gain in the benchmark above
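The optimization itself is generic: hoist a per-row allocation out of the hot loop. A self-contained Python analogue of the Scala change (my own illustration, not the actual HiveGenericUDTF code):

{code:python}
import time

class Projection:
    """Stand-in for Spark's InterpretedProjection: costly to build, cheap to apply."""
    def __init__(self, width=1000):
        self.table = [i * 2 for i in range(width)]  # simulated setup cost
        self.width = width

    def __call__(self, row):
        return self.table[row % self.width]

rows = range(50_000)

t0 = time.perf_counter()
before = [Projection()(r) for r in rows]   # old behavior: one projection per row
t1 = time.perf_counter()
proj = Projection()                        # new behavior: instantiate once
after = [proj(r) for r in rows]
t2 = time.perf_counter()
assert before == after
print(f"per-row construction: {t1 - t0:.2f}s, hoisted: {t2 - t1:.2f}s")
{code}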






[jira] [Assigned] (SPARK-40045) The order of filtering predicates is not reasonable

2023-02-07 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao reassigned SPARK-40045:
--

Assignee: caican

> The order of filtering predicates is not reasonable
> ---
>
> Key: SPARK-40045
> URL: https://issues.apache.org/jira/browse/SPARK-40045
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: caican
>Assignee: caican
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> select id, data FROM testcat.ns1.ns2.table
> where id =2
> and md5(data) = '8cde774d6f7333752ed72cacddb05126'
> and trim(data) = 'a' {code}
> Based on the SQL, we currently get the filters in the following order:
> {code:java}
> // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND 
> (trim(data#23, None) = a))` comes before `(id#22L = 2)`
> == Physical Plan == *(1) Project [id#22L, data#23]
>  +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND 
> (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND 
> (trim(data#23, None) = a)) AND (id#22L = 2))
>     +- BatchScan[id#22L, data#23] class 
> org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
> In this predicate order, all rows have to participate in the evaluation, even 
> when they do not meet the later filtering criteria, and this may cause Spark 
> tasks to execute slowly.
>  
> So I think expensive filtering predicates should automatically be moved to 
> the far right, so that rows which fail the cheap criteria are never evaluated 
> against them.
>  
> As shown below:
> {noformat}
> //  `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 
> 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))`
> == Physical Plan == *(1) Project [id#22L, data#23]
>  +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 
> 2) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND 
> (trim(data#23, None) = a)))
>     +- BatchScan[id#22L, data#23] class 
> org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
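The principle behind the proposal: with conjunctive filters short-circuiting left to right, cheap selective predicates should run first so that expensive ones only see surviving rows. A toy illustration of the effect (my own, not Spark's optimizer code):

{code:python}
import hashlib

rows = [{"id": i, "data": "a"} for i in range(100_000)]

cheap = lambda r: r["id"] == 2                     # plain comparison: ~free
expensive = lambda r: hashlib.md5(r["data"].encode()).hexdigest().startswith("0c")

# Evaluating cheap() first means expensive() runs on ~1 row instead of 100,000,
# because Python's `and` short-circuits just like a conjunctive Filter.
matches = [r for r in rows if cheap(r) and expensive(r)]
print(len(matches))
{code}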






[jira] [Assigned] (SPARK-42315) Assign name to _LEGACY_ERROR_TEMP_2092

2023-02-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42315:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2092
> --
>
> Key: SPARK-42315
> URL: https://issues.apache.org/jira/browse/SPARK-42315
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>







[jira] [Resolved] (SPARK-42315) Assign name to _LEGACY_ERROR_TEMP_2092

2023-02-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42315.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39889
[https://github.com/apache/spark/pull/39889]

> Assign name to _LEGACY_ERROR_TEMP_2092
> --
>
> Key: SPARK-42315
> URL: https://issues.apache.org/jira/browse/SPARK-42315
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-42358) Provide more details in ExecutorUpdated sent in Master.removeWorker

2023-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42358.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39903
[https://github.com/apache/spark/pull/39903]

> Provide more details in ExecutorUpdated sent in Master.removeWorker
> ---
>
> Key: SPARK-42358
> URL: https://issues.apache.org/jira/browse/SPARK-42358
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 3.5.0
>
>
> Currently, the field `message` in the `ExecutorUpdated` sent in 
> Master.removeWorker is always `Some("worker lost")`. We should instead provide 
> more information in the message, to better differentiate the cause of the 
> worker removal.






[jira] [Assigned] (SPARK-42358) Provide more details in ExecutorUpdated sent in Master.removeWorker

2023-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42358:
-

Assignee: Bo Zhang

> Provide more details in ExecutorUpdated sent in Master.removeWorker
> ---
>
> Key: SPARK-42358
> URL: https://issues.apache.org/jira/browse/SPARK-42358
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
>
> Currently, the field `message` in the `ExecutorUpdated` sent in 
> Master.removeWorker is always `Some("worker lost")`. We should instead provide 
> more information in the message, to better differentiate the cause of the 
> worker removal.






[jira] [Resolved] (SPARK-40045) The order of filtering predicates is not reasonable

2023-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40045.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39892
[https://github.com/apache/spark/pull/39892]

> The order of filtering predicates is not reasonable
> ---
>
> Key: SPARK-40045
> URL: https://issues.apache.org/jira/browse/SPARK-40045
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: caican
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> select id, data FROM testcat.ns1.ns2.table
> where id =2
> and md5(data) = '8cde774d6f7333752ed72cacddb05126'
> and trim(data) = 'a' {code}
> Based on the SQL, we currently get the filters in the following order:
> {code:java}
> // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND 
> (trim(data#23, None) = a))` comes before `(id#22L = 2)`
> == Physical Plan == *(1) Project [id#22L, data#23]
>  +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND 
> (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND 
> (trim(data#23, None) = a)) AND (id#22L = 2))
>     +- BatchScan[id#22L, data#23] class 
> org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
> In this predicate order, all rows have to participate in the evaluation, even 
> when they do not meet the later filtering criteria, and this may cause Spark 
> tasks to execute slowly.
>  
> So I think expensive filtering predicates should automatically be moved to 
> the far right, so that rows which fail the cheap criteria are never evaluated 
> against them.
>  
> As shown below:
> {noformat}
> //  `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 
> 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))`
> == Physical Plan == *(1) Project [id#22L, data#23]
>  +- *(1) Filter isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 
> 2) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND 
> (trim(data#23, None) = a)))
>     +- BatchScan[id#22L, data#23] class 
> org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}






[jira] [Commented] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685681#comment-17685681
 ] 

Apache Spark commented on SPARK-42379:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/39936

> Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
> 
>
> Key: SPARK-42379
> URL: https://issues.apache.org/jira/browse/SPARK-42379
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Other methods in FileSystemBasedCheckpointFileManager already use 
> FileSystem.exists wherever they check for the existence of a path. Use 
> FileSystem.exists in FileSystemBasedCheckpointFileManager.exists as well, for 
> consistency with the other methods in FileSystemBasedCheckpointFileManager.






[jira] [Assigned] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42379:


Assignee: (was: Apache Spark)

> Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
> 
>
> Key: SPARK-42379
> URL: https://issues.apache.org/jira/browse/SPARK-42379
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Other methods in FileSystemBasedCheckpointFileManager already use 
> FileSystem.exists wherever they check for the existence of a path. Use 
> FileSystem.exists in FileSystemBasedCheckpointFileManager.exists as well, for 
> consistency with the other methods in FileSystemBasedCheckpointFileManager.






[jira] [Assigned] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42379:


Assignee: Apache Spark

> Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
> 
>
> Key: SPARK-42379
> URL: https://issues.apache.org/jira/browse/SPARK-42379
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Other methods in FileSystemBasedCheckpointFileManager already use 
> FileSystem.exists wherever they check for the existence of a path. Use 
> FileSystem.exists in FileSystemBasedCheckpointFileManager.exists as well, for 
> consistency with the other methods in FileSystemBasedCheckpointFileManager.






[jira] [Commented] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685680#comment-17685680
 ] 

Apache Spark commented on SPARK-42379:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/39936

> Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
> 
>
> Key: SPARK-42379
> URL: https://issues.apache.org/jira/browse/SPARK-42379
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Other methods in FileSystemBasedCheckpointFileManager already use 
> FileSystem.exists wherever they check for the existence of a path. Use 
> FileSystem.exists in FileSystemBasedCheckpointFileManager.exists as well, for 
> consistency with the other methods in FileSystemBasedCheckpointFileManager.






[jira] [Created] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-07 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-42379:


 Summary: Use FileSystem.exists in 
FileSystemBasedCheckpointFileManager.exists
 Key: SPARK-42379
 URL: https://issues.apache.org/jira/browse/SPARK-42379
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.5.0
Reporter: Jungtaek Lim


Other methods in FileSystemBasedCheckpointFileManager already use 
FileSystem.exists wherever they check for the existence of a path. Use 
FileSystem.exists in FileSystemBasedCheckpointFileManager.exists as well, for 
consistency with the other methods in FileSystemBasedCheckpointFileManager.






[jira] [Commented] (SPARK-33807) Data Source V2: Remove read specific distributions

2023-02-07 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685674#comment-17685674
 ] 

Chao Sun commented on SPARK-33807:
--

This is actually already resolved as part of SPARK-37377.

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Assigned] (SPARK-33807) Data Source V2: Remove read specific distributions

2023-02-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-33807:


Assignee: (was: Chao Sun)

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Assigned] (SPARK-33807) Data Source V2: Remove read specific distributions

2023-02-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-33807:


Assignee: Chao Sun

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Assignee: Chao Sun
>Priority: Major
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2023-02-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685671#comment-17685671
 ] 

Dongjoon Hyun commented on SPARK-41053:
---

Hi, [~Gengliang.Wang]. Shall we resolve this issue?

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: releasenotes
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server(SHS) becomes more scalable for 
> processing large applications by supporting a persistent 
> KV-store(LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> put memory pressure on the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead, and their write/read performance 
> is fast enough to serve the write/read workload of the live UI. The SHS can 
> also leverage the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of live UI. As for event logs, 
> it is optional. The current serializer for UI data is JSON. When writing 
> persistent KV-store, there is GZip compression. Since there is compression 
> support in RocksDB/LevelDB, the new serializer won’t compress the output 
> before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
> Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, rather than both LevelDB and 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj
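A self-contained analogue of the serializer trade-off the table above benchmarks, using pickle as a stand-in for Protobuf (illustration only, not Spark's KVStore code): a binary codec is much faster to encode and decode than JSON+gzip, at the cost of a larger in-memory footprint.

{code:python}
import gzip, json, pickle, time

record = {"executionId": 1, "description": "q" * 200, "metrics": list(range(50))}

def bench(encode, decode, n=20_000):
    t0 = time.perf_counter()
    blobs = [encode(record) for _ in range(n)]
    t1 = time.perf_counter()
    _ = [decode(b) for b in blobs]
    t2 = time.perf_counter()
    return (t1 - t0) / n * 1e6, (t2 - t1) / n * 1e6  # avg write/read in microseconds

print("json+gzip:", bench(lambda r: gzip.compress(json.dumps(r).encode()),
                          lambda b: json.loads(gzip.decompress(b))))
print("pickle:   ", bench(pickle.dumps, pickle.loads))
{code}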






[jira] [Updated] (SPARK-35563) [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows

2023-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35563:
--
Priority: Major  (was: Blocker)

> [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
> --
>
> Key: SPARK-35563
> URL: https://issues.apache.org/jira/browse/SPARK-35563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Major
>  Labels: data-loss
>
> I think this impacts a lot more versions of Spark, but I don't know for sure 
> because it takes a long time to test. As a part of doing corner-case 
> validation testing for Spark RAPIDS, I found that if a window function has 
> more than {{Int.MaxValue + 1}} rows, the result is silently truncated to that 
> many rows. I have only tested this on 3.0.2 with {{row_number}}, but I 
> suspect it will impact others as well. This is a really rare corner case, but 
> because it is silent data corruption I personally think it is quite serious.
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> val windowSpec = Window.partitionBy("a").orderBy("b")
> val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as 
> b")
> spark.time(df.select(col("a"), col("b"), 
> row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), 
> desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20))
> +-----+----------+
> |  dir|     count|
> +-----+----------+
> |false|2147483647|
> | true|         1|
> +-----+----------+
> Time taken: 1139089 ms
> Int.MaxValue.toLong + 100
> res15: Long = 2147483747
> 2147483647L + 1
> res16: Long = 2147483648
> {code}
> I had to make sure that I ran the above with at least 64GiB of heap for the 
> executor (I did it in local mode and it worked, but it took forever to run).






[jira] [Commented] (SPARK-33807) Data Source V2: Remove read specific distributions

2023-02-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685668#comment-17685668
 ] 

Dongjoon Hyun commented on SPARK-33807:
---

According to the discussion, I lowered the `Priority` from `Blocker` to `Major`.

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Updated] (SPARK-33807) Data Source V2: Remove read specific distributions

2023-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33807:
--
Priority: Major  (was: Blocker)

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Commented] (SPARK-42210) Standardize registered pickled Python UDFs

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685667#comment-17685667
 ] 

Apache Spark commented on SPARK-42210:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39860

> Standardize registered pickled Python UDFs
> --
>
> Key: SPARK-42210
> URL: https://issues.apache.org/jira/browse/SPARK-42210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement spark.udf.
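The surface being standardized is the session-level registration API; the behavior to match is the long-standing PySpark one (assumes a running SparkSession):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Register a pickled Python function as a SQL UDF via spark.udf.
spark.udf.register("plus_one", lambda x: x + 1, "int")
spark.sql("SELECT plus_one(41) AS v").show()  # prints 42 in column v
{code}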






[jira] [Assigned] (SPARK-42210) Standardize registered pickled Python UDFs

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42210:


Assignee: Apache Spark

> Standardize registered pickled Python UDFs
> --
>
> Key: SPARK-42210
> URL: https://issues.apache.org/jira/browse/SPARK-42210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Implement spark.udf.






[jira] [Assigned] (SPARK-42210) Standardize registered pickled Python UDFs

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42210:


Assignee: (was: Apache Spark)

> Standardize registered pickled Python UDFs
> --
>
> Key: SPARK-42210
> URL: https://issues.apache.org/jira/browse/SPARK-42210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement spark.udf.






[jira] [Commented] (SPARK-42210) Standardize registered pickled Python UDFs

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685666#comment-17685666
 ] 

Apache Spark commented on SPARK-42210:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39860

> Standardize registered pickled Python UDFs
> --
>
> Key: SPARK-42210
> URL: https://issues.apache.org/jira/browse/SPARK-42210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement spark.udf.






[jira] [Commented] (SPARK-42244) Refine error message by using Python types.

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685655#comment-17685655
 ] 

Apache Spark commented on SPARK-42244:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39935

> Refine error message by using Python types.
> ---
>
> Key: SPARK-42244
> URL: https://issues.apache.org/jira/browse/SPARK-42244
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the type names used in error messages are inconsistent, mixing 
> spellings such as `string` and `str`.
> We might need to consolidate them under one rule.
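One rule the ticket could converge on, shown as a hypothetical illustration rather than the merged change: always report Python's own type names.

{code:python}
# Hypothetical mapping from mixed spellings to Python type names.
CANONICAL = {"string": "str", "integer": "int", "boolean": "bool",
             "dictionary": "dict"}

def normalize_type_name(name: str) -> str:
    return CANONICAL.get(name.lower(), name)

assert normalize_type_name("string") == "str"
{code}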






[jira] [Resolved] (SPARK-42371) Add scripts to start and stop Spark Connect server

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42371.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39928
[https://github.com/apache/spark/pull/39928]

> Add scripts to start and stop Spark Connect server
> --
>
> Key: SPARK-42371
> URL: https://issues.apache.org/jira/browse/SPARK-42371
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, there is no proper way to start and stop the Spark Connect server; 
> you have to start it with, for example, a Spark shell:
> {code}
> # For development,
> ./bin/spark-shell \
>--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> {code}
> # For released Spark versions
> ./bin/spark-shell \
>   --packages org.apache.spark:spark-connect_2.12:3.4.0 \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> which is awkward.
> We need some dedicated scripts for it.






[jira] [Assigned] (SPARK-42371) Add scripts to start and stop Spark Connect server

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42371:


Assignee: Hyukjin Kwon

> Add scripts to start and stop Spark Connect server
> --
>
> Key: SPARK-42371
> URL: https://issues.apache.org/jira/browse/SPARK-42371
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, there is no proper way to start and stop the Spark Connect server; 
> you have to start it with, for example, a Spark shell:
> {code}
> # For development,
> ./bin/spark-shell \
>--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> {code}
> # For released Spark versions
> ./bin/spark-shell \
>   --packages org.apache.spark:spark-connect_2.12:3.4.0 \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> which is awkward.
> We need some dedicated scripts for it.






[jira] [Updated] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40819:
-
Fix Version/s: 3.2.4
   3.3.2

> Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type 
> instead of automatically converting to LongType 
> 
>
> Key: SPARK-40819
> URL: https://issues.apache.org/jira/browse/SPARK-40819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.3.2, 3.4.0
>Reporter: Alfred Davidson
>Assignee: Alfred Davidson
>Priority: Critical
>  Labels: regression
> Fix For: 3.2.4, 3.3.2, 3.4.0
>
>
> Since 3.2, parquet files containing attributes of type "INT64 
> (TIMESTAMP(NANOS, true))" are no longer readable; attempting to read them 
> throws:
>  
> {code:java}
> Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: 
> INT64 (TIMESTAMP(NANOS,true))
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521)
>   at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76)
>  {code}
> Prior to 3.2, Spark read such parquet files successfully, automatically 
> converting the column to a LongType.
> I believe work that was part of https://issues.apache.org/jira/browse/SPARK-34661 
> introduced the change in behaviour, more specifically here: 
> [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154]
> which throws QueryCompilationErrors.illegalParquetTypeError.
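One way to produce a file with exactly that physical/logical type is pyarrow itself (my own repro sketch; parquet format version "2.6" preserves nanosecond precision instead of coercing it away):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# tz="UTC" yields isAdjustedToUTC=true, i.e. INT64 (TIMESTAMP(NANOS,true)).
t = pa.table({"ts": pa.array([1_600_000_000_000_000_000],
                             type=pa.timestamp("ns", tz="UTC"))})
pq.write_table(t, "/tmp/nanos.parquet", version="2.6")
{code}

Reading /tmp/nanos.parquet with spark.read.parquet on an affected version should then raise the AnalysisException shown above.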






[jira] [Resolved] (SPARK-42244) Refine error message by using Python types.

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42244.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39815
[https://github.com/apache/spark/pull/39815]

> Refine error message by using Python types.
> ---
>
> Key: SPARK-42244
> URL: https://issues.apache.org/jira/browse/SPARK-42244
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the type names used in error messages are inconsistent, mixing 
> spellings such as `string` and `str`.
> We might need to consolidate them under one rule.






[jira] [Assigned] (SPARK-42244) Refine error message by using Python types.

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42244:


Assignee: Haejoon Lee

> Refine error message by using Python types.
> ---
>
> Key: SPARK-42244
> URL: https://issues.apache.org/jira/browse/SPARK-42244
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Currently, the type names used in error messages are inconsistent, mixing 
> spellings such as `string` and `str`.
> We might need to consolidate them under one rule.






[jira] [Resolved] (SPARK-42301) Assign name to _LEGACY_ERROR_TEMP_1129

2023-02-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42301.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39871
[https://github.com/apache/spark/pull/39871]

> Assign name to _LEGACY_ERROR_TEMP_1129
> --
>
> Key: SPARK-42301
> URL: https://issues.apache.org/jira/browse/SPARK-42301
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-42301) Assign name to _LEGACY_ERROR_TEMP_1129

2023-02-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42301:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_1129
> --
>
> Key: SPARK-42301
> URL: https://issues.apache.org/jira/browse/SPARK-42301
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>







[jira] [Assigned] (SPARK-42254) Assign name to _LEGACY_ERROR_TEMP_1117

2023-02-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42254:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_1117
> --
>
> Key: SPARK-42254
> URL: https://issues.apache.org/jira/browse/SPARK-42254
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>







[jira] [Resolved] (SPARK-42254) Assign name to _LEGACY_ERROR_TEMP_1117

2023-02-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42254.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39837
[https://github.com/apache/spark/pull/39837]

> Assign name to _LEGACY_ERROR_TEMP_1117
> --
>
> Key: SPARK-42254
> URL: https://issues.apache.org/jira/browse/SPARK-42254
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-42249) Refining html strings in error messages

2023-02-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42249.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39820
[https://github.com/apache/spark/pull/39820]

> Refining html strings in error messages
> ---
>
> Key: SPARK-42249
> URL: https://issues.apache.org/jira/browse/SPARK-42249
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Use a relative path for the html string.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42249) Refining html strings in error messages

2023-02-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42249:


Assignee: Haejoon Lee

> Refining html strings in error messages
> ---
>
> Key: SPARK-42249
> URL: https://issues.apache.org/jira/browse/SPARK-42249
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Use a relative path for the html string.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42378) Make `DataFrame.select` support `a.*`

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685633#comment-17685633
 ] 

Apache Spark commented on SPARK-42378:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39934

> Make `DataFrame.select` support `a.*`
> -
>
> Key: SPARK-42378
> URL: https://issues.apache.org/jira/browse/SPARK-42378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
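
For context, a minimal sketch of the behavior this sub-task targets, written
against the plain PySpark API (the alias and column names are illustrative,
not taken from the ticket):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Alias a DataFrame, then select all of its columns with a qualified star.
df = spark.range(3).withColumnRenamed("id", "x").alias("a")

# "a.*" should expand to every column of the aliased relation "a",
# matching the existing non-Connect DataFrame behavior.
df.select("a.*").show()
{code}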




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42378) Make `DataFrame.select` support `a.*`

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42378:


Assignee: Apache Spark

> Make `DataFrame.select` support `a.*`
> -
>
> Key: SPARK-42378
> URL: https://issues.apache.org/jira/browse/SPARK-42378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42378) Make `DataFrame.select` support `a.*`

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42378:


Assignee: (was: Apache Spark)

> Make `DataFrame.select` support `a.*`
> -
>
> Key: SPARK-42378
> URL: https://issues.apache.org/jira/browse/SPARK-42378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42378) Make `DataFrame.select` support `a.*`

2023-02-07 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42378:
-

 Summary: Make `DataFrame.select` support `a.*`
 Key: SPARK-42378
 URL: https://issues.apache.org/jira/browse/SPARK-42378
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2023-02-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685623#comment-17685623
 ] 

Herman van Hövell commented on SPARK-39375:
---

[~xkrogen] Regarding the external classes: it is early days. We will submit a 
patch in the next couple of days that will allow REPL-generated code. A next 
step would be jars and probably other artifacts.

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.
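
As a rough sketch of the decoupled client experience described above, a Python
client connects to a remote Spark Connect server roughly like this (the
endpoint URL is a placeholder, and this assumes a server is already running):

{code:python}
from pyspark.sql import SparkSession

# Build a session against a remote Spark Connect endpoint instead of a
# local driver; 15002 is the default Spark Connect port.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# DataFrame operations are assembled client-side as unresolved plans and
# only executed on the server when an action runs.
spark.range(5).selectExpr("id * 2 AS doubled").show()
{code}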



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42352) Upgrade maven to 3.8.7

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42352.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39896
[https://github.com/apache/spark/pull/39896]

> Upgrade maven to 3.8.7
> --
>
> Key: SPARK-42352
> URL: https://issues.apache.org/jira/browse/SPARK-42352
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> [https://maven.apache.org/docs/3.8.7/release-notes.html]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42352) Upgrade maven to 3.8.7

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42352:


Assignee: Yang Jie

> Upgrade maven to 3.8.7
> --
>
> Key: SPARK-42352
> URL: https://issues.apache.org/jira/browse/SPARK-42352
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> [https://maven.apache.org/docs/3.8.7/release-notes.html]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42094) Support `fill_value` for `ps.Series.add`

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42094:


Assignee: Haejoon Lee

> Support `fill_value` for `ps.Series.add`
> 
>
> Key: SPARK-42094
> URL: https://issues.apache.org/jira/browse/SPARK-42094
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> For pandas function parity: 
> https://pandas.pydata.org/docs/reference/api/pandas.Series.add.html
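
For reference, a small sketch of the pandas-parity behavior this adds, with
illustrative values (not taken from the ticket):

{code:python}
import pyspark.pandas as ps

s1 = ps.Series([1.0, 2.0, None])
s2 = ps.Series([10.0, None, 30.0])

# With fill_value, an element missing on only one side is replaced by the
# fill value before the addition, mirroring pandas.Series.add; positions
# missing on both sides remain NaN.
print(s1.add(s2, fill_value=0.0))  # expected: 11.0, 2.0, 30.0
{code}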



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42094) Support `fill_value` for `ps.Series.add`

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42094.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39790
[https://github.com/apache/spark/pull/39790]

> Support `fill_value` for `ps.Series.add`
> 
>
> Key: SPARK-42094
> URL: https://issues.apache.org/jira/browse/SPARK-42094
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> For pandas function parity: 
> https://pandas.pydata.org/docs/reference/api/pandas.Series.add.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2023-02-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685619#comment-17685619
 ] 

Herman van Hövell commented on SPARK-39375:
---

[~xkrogen] the current work on UDFs is somewhat orthogonal to the way we 
execute UDFs. The current work uses the existing backend for execution. We can 
change the way we execute the UDFs later on; it would involve a small change to 
how we plan the UDF on the server side.

I do think running the UDFs in a separate process has merit (better isolation, 
lower blast radius, etc.). However, it would have a profound impact on 
performance, since UDF execution would break the execution pipeline into 
pieces, require starting cold(ish) Java processes, etc.

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2023-02-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685618#comment-17685618
 ] 

Hyukjin Kwon commented on SPARK-39375:
--

cc [~zhenli] [~hvanhovell] [~grundprinzip-db] ^ FYI

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42377) Test Framework for Connect Scala Client

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685613#comment-17685613
 ] 

Apache Spark commented on SPARK-42377:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/39933

> Test Framework for Connect Scala Client
> ---
>
> Key: SPARK-42377
> URL: https://issues.apache.org/jira/browse/SPARK-42377
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42377) Test Framework for Connect Scala Client

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42377:


Assignee: (was: Apache Spark)

> Test Framework for Connect Scala Client
> ---
>
> Key: SPARK-42377
> URL: https://issues.apache.org/jira/browse/SPARK-42377
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42377) Test Framework for Connect Scala Client

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685612#comment-17685612
 ] 

Apache Spark commented on SPARK-42377:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/39933

> Test Framework for Connect Scala Client
> ---
>
> Key: SPARK-42377
> URL: https://issues.apache.org/jira/browse/SPARK-42377
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42377) Test Framework for Connect Scala Client

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42377:


Assignee: Apache Spark

> Test Framework for Connect Scala Client
> ---
>
> Key: SPARK-42377
> URL: https://issues.apache.org/jira/browse/SPARK-42377
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42377) Test Framework for Connect Scala Client

2023-02-07 Thread Jira
Herman van Hövell created SPARK-42377:
-

 Summary: Test Framework for Connect Scala Client
 Key: SPARK-42377
 URL: https://issues.apache.org/jira/browse/SPARK-42377
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2023-02-07 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685600#comment-17685600
 ] 

Erik Krogen commented on SPARK-39375:
-

UDFs are a complex space, e.g. for Scala the current impl completed in 
SPARK-42283 cannot handle externally defined classes, which are a common 
requirement in UDFs. It's also a notable design decision that we are choosing 
to process UDFs in the Spark Connect server session, vs. a sidecar process like 
a UDF server that can provide isolation between different UDFs (e.g. as 
[supported by Presto|https://github.com/prestodb/presto/issues/14053] and 
[leveraged heavily by 
Meta|https://www.databricks.com/session_na21/portable-udfs-write-once-run-anywhere]).
 It would be nice to see more discussion on the merits of various approaches to 
UDFs in the Spark Connect framework and a clear plan, rather than pushing 
forward with them piecemeal. It's of course reasonable that UDFs were left out 
of scope for the original SPIP, but based on that omission I was expecting we 
would have a subsequent discussion on UDFs for Spark Connect before starting 
implementation for them.

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.

[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-07 Thread Ritika Maheshwari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685466#comment-17685466
 ] 

Ritika Maheshwari commented on SPARK-42346:
---

Hello, I added three rows to input_table. Still no error. I do have DPP enabled.


Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 12.0.2)

Type in expressions to have them evaluated.

Type :help for more information.

 

scala> val df = Seq(("a","b"),("c","d"),("e","f")).toDF("surname","first_name")

df: org.apache.spark.sql.DataFrame = [surname: string, first_name: string]

 

scala> df.createOrReplaceTempView("input_table")

 

scala> spark.sql("select(Select Count(Distinct first_name) from input_table) As 
distinct_value_count from input_table Union all select (select count(Distinct 
surname) from input_table) as distinct_value_count from input_table").show()

+--------------------+
|distinct_value_count|
+--------------------+
|                   3|
|                   3|
|                   3|
|                   3|
|                   3|
|                   3|
+--------------------+

 


AdaptiveSparkPlan isFinalPlan=false
+- Union
   :- Project [cast(Subquery subquery#145, [id=#571] as string) AS 
distinct_value_count#161]
   :  :  +- Subquery subquery#145, [id=#571]
   :  :     +- AdaptiveSparkPlan isFinalPlan=false
   :  :        +- HashAggregate(keys=[], functions=[count(distinct 
first_name#8)], output=[count(DISTINCT first_name)#152L])
   :  :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#569]
   :  :              +- HashAggregate(keys=[], 
functions=[partial_count(distinct first_name#8)], output=[count#167L])
   :  :                 +- HashAggregate(keys=[first_name#8], functions=[], 
output=[first_name#8])
   :  :                    +- Exchange hashpartitioning(first_name#8, 200), 
ENSURE_REQUIREMENTS, [id=#565]
   :  :                       +- HashAggregate(keys=[first_name#8], 
functions=[], output=[first_name#8])
   :  :                          +- LocalTableScan [first_name#8]
   :  +- LocalTableScan [_1#2, _2#3]
   +- Project [cast(Subquery subquery#147, [id=#590] as string) AS 
distinct_value_count#163]
      :  +- Subquery subquery#147, [id=#590]
      :     +- AdaptiveSparkPlan isFinalPlan=false
      :        +- HashAggregate(keys=[], functions=[count(distinct surname#7)], 
output=[count(DISTINCT surname)#154L])
      :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#588]
      :              +- HashAggregate(keys=[], 
functions=[partial_count(distinct surname#7)], output=[count#170L])
      :                 +- HashAggregate(keys=[surname#7], functions=[], 
output=[surname#7])
      :                    +- Exchange hashpartitioning(surname#7, 200), 
ENSURE_REQUIREMENTS, [id=#584]
      :                       +- HashAggregate(keys=[surname#7], functions=[], 
output=[surname#7])
      :                          +- LocalTableScan [surname#7]
      +- LocalTableScan [_1#149, _2#150]

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-07 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685384#comment-17685384
 ] 

Gera Shegalov commented on SPARK-41793:
---

Another interpretation of why the pre-3.4 count of 1 may actually be correct 
could be that, regardless of whether the window frame bound values overflow or 
not, the current row is always part of the window it defines. Whether or not 
that should be the case can be clarified in the doc.

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> [Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42369) Fix constructor for java.nio.DirectByteBuffer for Java 21+

2023-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42369:
--
Issue Type: Improvement  (was: Bug)

> Fix constructor for java.nio.DirectByteBuffer for Java 21+
> --
>
> Key: SPARK-42369
> URL: https://issues.apache.org/jira/browse/SPARK-42369
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 3.5.0
>Reporter: Ludovic Henry
>Assignee: Ludovic Henry
>Priority: Major
> Fix For: 3.5.0
>
>
> In the latest JDK, the constructor {{DirectByteBuffer(long, int)}} was 
> replaced with {{{}DirectByteBuffer(long, long){}}}. We just want to support 
> both by probing for the legacy one first and falling back to the newer one 
> second.
> This change is completely transparent for the end-user, and makes sure Spark 
> works transparently on the latest JDK as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42369) Fix constructor for java.nio.DirectByteBuffer for Java 21+

2023-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42369:
-

Assignee: Ludovic Henry

> Fix constructor for java.nio.DirectByteBuffer for Java 21+
> --
>
> Key: SPARK-42369
> URL: https://issues.apache.org/jira/browse/SPARK-42369
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.5.0
>Reporter: Ludovic Henry
>Assignee: Ludovic Henry
>Priority: Major
>
> In the latest JDK, the constructor {{DirectByteBuffer(long, int)}} was 
> replaced with {{{}DirectByteBuffer(long, long){}}}. We just want to support 
> both by probing for the legacy one first and falling back to the newer one 
> second.
> This change is completely transparent for the end-user, and makes sure Spark 
> works transparently on the latest JDK as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42369) Fix constructor for java.nio.DirectByteBuffer for Java 21+

2023-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42369.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39909
[https://github.com/apache/spark/pull/39909]

> Fix constructor for java.nio.DirectByteBuffer for Java 21+
> --
>
> Key: SPARK-42369
> URL: https://issues.apache.org/jira/browse/SPARK-42369
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.5.0
>Reporter: Ludovic Henry
>Assignee: Ludovic Henry
>Priority: Major
> Fix For: 3.5.0
>
>
> In the latest JDK, the constructor {{DirectByteBuffer(long, int)}} was 
> replaced with {{{}DirectByteBuffer(long, long){}}}. We just want to support 
> both by probing for the legacy one first and falling back to the newer one 
> second.
> This change is completely transparent for the end-user, and makes sure Spark 
> works transparently on the latest JDK as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42376) Introduce watermark propagation among operators

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42376:


Assignee: Apache Spark

> Introduce watermark propagation among operators
> ---
>
> Key: SPARK-42376
> URL: https://issues.apache.org/jira/browse/SPARK-42376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> With introduction of SPARK-40925, we enabled workloads containing multiple 
> stateful operators in a single streaming query.
> The JIRA ticket clearly described out-of-scope, "Here we propose fixing the 
> late record filtering in stateful operators to allow chaining of stateful 
> operators {*}which do not produce delayed records (like time-interval join or 
> potentially flatMapGroupsWithState){*}".
> We identified a production use case for stream-stream time-interval join 
> followed by a stateful operator (e.g. window aggregation), and propose to 
> address that use case via this ticket.
> The design will be described in the PR, but the sketched idea is introducing 
> simulation of watermark propagation among operators. As of now, Spark 
> considers all stateful operators to have the same input watermark and output 
> watermark, which introduced the limitation. With this ticket, we construct 
> the logic to simulate watermark propagation so that each operator can have 
> its own (input watermark, output watermark). Operators introducing delayed 
> records will produce delayed output watermark, and downstream operator can 
> take the delay into account as input watermark will be adjusted.
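
To make the target workload concrete, here is a hedged PySpark sketch of the
query shape this ticket enables, a stream-stream time-interval join feeding a
window aggregation (sources, column names, and thresholds are illustrative,
not from the ticket):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

impressions = (
    spark.readStream.format("rate").load()
    .select(F.col("value").alias("imp_ad_id"), F.col("timestamp").alias("imp_time"))
    .withWatermark("imp_time", "10 seconds")
)
clicks = (
    spark.readStream.format("rate").load()
    .select(F.col("value").alias("click_ad_id"), F.col("timestamp").alias("click_time"))
    .withWatermark("click_time", "10 seconds")
)

# Stream-stream time-interval join: the join itself can emit records that
# are "late" relative to a single global watermark.
joined = impressions.join(
    clicks,
    F.expr("""
        imp_ad_id = click_ad_id AND
        click_time BETWEEN imp_time AND imp_time + INTERVAL 30 seconds
    """),
)

# Window aggregation downstream of the join -- the chaining that per-operator
# watermark propagation is meant to support.
counts = joined.groupBy(F.window("imp_time", "1 minute")).count()

query = counts.writeStream.format("console").outputMode("append").start()
{code}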



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42376) Introduce watermark propagation among operators

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685350#comment-17685350
 ] 

Apache Spark commented on SPARK-42376:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/39931

> Introduce watermark propagation among operators
> ---
>
> Key: SPARK-42376
> URL: https://issues.apache.org/jira/browse/SPARK-42376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> With introduction of SPARK-40925, we enabled workloads containing multiple 
> stateful operators in a single streaming query.
> The JIRA ticket clearly described out-of-scope, "Here we propose fixing the 
> late record filtering in stateful operators to allow chaining of stateful 
> operators {*}which do not produce delayed records (like time-interval join or 
> potentially flatMapGroupsWithState){*}".
> We identified a production use case for stream-stream time-interval join 
> followed by a stateful operator (e.g. window aggregation), and propose to 
> address that use case via this ticket.
> The design will be described in the PR, but the sketched idea is introducing 
> simulation of watermark propagation among operators. As of now, Spark 
> considers all stateful operators to have the same input watermark and output 
> watermark, which introduced the limitation. With this ticket, we construct 
> the logic to simulate watermark propagation so that each operator can have 
> its own (input watermark, output watermark). Operators introducing delayed 
> records will produce delayed output watermark, and downstream operator can 
> take the delay into account as input watermark will be adjusted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42376) Introduce watermark propagation among operators

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42376:


Assignee: (was: Apache Spark)

> Introduce watermark propagation among operators
> ---
>
> Key: SPARK-42376
> URL: https://issues.apache.org/jira/browse/SPARK-42376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> With introduction of SPARK-40925, we enabled workloads containing multiple 
> stateful operators in a single streaming query.
> The JIRA ticket clearly described out-of-scope, "Here we propose fixing the 
> late record filtering in stateful operators to allow chaining of stateful 
> operators {*}which do not produce delayed records (like time-interval join or 
> potentially flatMapGroupsWithState){*}".
> We identified a production use case for stream-stream time-interval join 
> followed by a stateful operator (e.g. window aggregation), and propose to 
> address that use case via this ticket.
> The design will be described in the PR, but the sketched idea is introducing 
> simulation of watermark propagation among operators. As of now, Spark 
> considers all stateful operators to have the same input watermark and output 
> watermark, which introduced the limitation. With this ticket, we construct 
> the logic to simulate watermark propagation so that each operator can have 
> its own (input watermark, output watermark). Operators introducing delayed 
> records will produce delayed output watermark, and downstream operator can 
> take the delay into account as input watermark will be adjusted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37099) Introduce a rank-based filter to optimize top-k computation

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685328#comment-17685328
 ] 

Apache Spark commented on SPARK-37099:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39930

> Introduce a rank-based filter to optimize top-k computation
> ---
>
> Key: SPARK-37099
> URL: https://issues.apache.org/jira/browse/SPARK-37099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Attachments: q67.png, q67_optimized.png, skewed_window.png
>
>
> In JD, we found that more than 90% of window function usage follows this 
> pattern:
> {code:java}
>  select (... (row_number|rank|dense_rank) () over( [partition by ...] order 
> by ... ) as rn)
> where rn (==|<|<=) k and other conditions{code}
>  
> However, the existing physical plan is not optimal:
>  
> 1. We should select the local top-k records within each partition, and then 
> compute the global top-k. This can help reduce the shuffle amount.
>  
> For these three rank functions (row_number|rank|dense_rank), the rank of a 
> key computed on a partial dataset is always <= its final rank computed on 
> the whole dataset, so we can safely discard rows with partial rank > k, 
> anywhere.
>  
> 2. Skewed window: some partitions are skewed and take a long time to finish 
> computation.
>  
> A real-world skewed-window case in our system is attached.
>  
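
A concrete instance of the pattern above, sketched in PySpark with an
illustrative table; the idea is that a rank-based filter can discard rows
whose partial rank already exceeds k before the global shuffle:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 3), ("b", 5), ("b", 4)],
    "grp STRING, score INT",
).createOrReplaceTempView("t")

# Top-2 rows per group via the row_number-plus-filter pattern.
spark.sql("""
    SELECT grp, score, rn
    FROM (
        SELECT grp, score,
               row_number() OVER (PARTITION BY grp ORDER BY score DESC) AS rn
        FROM t
    )
    WHERE rn <= 2
""").show()
{code}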



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42376) Introduce watermark propagation among operators

2023-02-07 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685315#comment-17685315
 ] 

Jungtaek Lim commented on SPARK-42376:
--

Will submit a PR soon.

> Introduce watermark propagation among operators
> ---
>
> Key: SPARK-42376
> URL: https://issues.apache.org/jira/browse/SPARK-42376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> With introduction of SPARK-40925, we enabled workloads containing multiple 
> stateful operators in a single streaming query.
> The JIRA ticket clearly described out-of-scope, "Here we propose fixing the 
> late record filtering in stateful operators to allow chaining of stateful 
> operators {*}which do not produce delayed records (like time-interval join or 
> potentially flatMapGroupsWithState){*}".
> We identified a production use case for stream-stream time-interval join 
> followed by a stateful operator (e.g. window aggregation), and propose to 
> address that use case via this ticket.
> The design will be described in the PR, but the sketched idea is introducing 
> simulation of watermark propagation among operators. As of now, Spark 
> considers all stateful operators to have the same input watermark and output 
> watermark, which introduced the limitation. With this ticket, we construct 
> the logic to simulate watermark propagation so that each operator can have 
> its own (input watermark, output watermark). Operators introducing delayed 
> records will produce delayed output watermark, and downstream operator can 
> take the delay into account as input watermark will be adjusted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42376) Introduce watermark propagation among operators

2023-02-07 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-42376:


 Summary: Introduce watermark propagation among operators
 Key: SPARK-42376
 URL: https://issues.apache.org/jira/browse/SPARK-42376
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.5.0
Reporter: Jungtaek Lim


With introduction of SPARK-40925, we enabled workloads containing multiple 
stateful operators in a single streaming query.

The JIRA ticket clearly described out-of-scope, "Here we propose fixing the 
late record filtering in stateful operators to allow chaining of stateful 
operators {*}which do not produce delayed records (like time-interval join or 
potentially flatMapGroupsWithState){*}".

We identified a production use case for stream-stream time-interval join 
followed by a stateful operator (e.g. window aggregation), and propose to 
address that use case via this ticket.

The design will be described in the PR, but the sketched idea is introducing 
simulation of watermark propagation among operators. As of now, Spark considers 
all stateful operators to have the same input watermark and output watermark, which 
introduced the limitation. With this ticket, we construct the logic to simulate 
watermark propagation so that each operator can have its own (input watermark, 
output watermark). Operators introducing delayed records will produce delayed 
output watermark, and downstream operator can take the delay into account as 
input watermark will be adjusted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42136) Refactor BroadcastHashJoinExec output partitioning generation

2023-02-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42136:
---

Assignee: Peter Toth

> Refactor BroadcastHashJoinExec output partitioning generation
> -
>
> Key: SPARK-42136
> URL: https://issues.apache.org/jira/browse/SPARK-42136
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42136) Refactor BroadcastHashJoinExec output partitioning generation

2023-02-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42136.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 38038
[https://github.com/apache/spark/pull/38038]

> Refactor BroadcastHashJoinExec output partitioning generation
> -
>
> Key: SPARK-42136
> URL: https://issues.apache.org/jira/browse/SPARK-42136
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42375) Point out the user-facing documentation in Spark Connect server startup

2023-02-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-42375:


 Summary: Point out the user-facing documentation in Spark Connect 
server startup
 Key: SPARK-42375
 URL: https://issues.apache.org/jira/browse/SPARK-42375
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


See 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42374) User-facing documentation

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-42374:
-
Description: Should provide the user-facing documentation so end users know 
how to use Spark Connect.

> User-facing documentation
> -
>
> Key: SPARK-42374
> URL: https://issues.apache.org/jira/browse/SPARK-42374
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Should provide the user-facing documentation so end users know how to use 
> Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42374) User-facing documentation

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42374:


Assignee: Haejoon Lee

> User-facing documentation
> -
>
> Key: SPARK-42374
> URL: https://issues.apache.org/jira/browse/SPARK-42374
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Haejoon Lee
>Priority: Major
>
> Should provide the user-facing documentation so end users know how to use 
> Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42375) Point out the user-facing documentation in Spark Connect server startup

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-42375:
-
Description: See SPARK-42375 in SparkSubmit.scala  (was: See )

> Point out the user-facing documentation in Spark Connect server startup
> ---
>
> Key: SPARK-42375
> URL: https://issues.apache.org/jira/browse/SPARK-42375
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See SPARK-42375 in SparkSubmit.scala



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42374) User-facing documentation

2023-02-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-42374:


 Summary: User-facing documentation
 Key: SPARK-42374
 URL: https://issues.apache.org/jira/browse/SPARK-42374
 Project: Spark
  Issue Type: Documentation
  Components: Connect
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42367) DataFrame.drop should handle duplicated columns properly

2023-02-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-42367:
--
Summary: DataFrame.drop should handle duplicated columns properly  (was: 
DataFrame.drop could handle duplicated columns)

> DataFrame.drop should handle duplicated columns properly
> 
>
> Key: SPARK-42367
> URL: https://issues.apache.org/jira/browse/SPARK-42367
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> {code:java}
> >>> df.join(df2, df.name == df2.name, 'inner').show()
> +---++--++
> |age|name|height|name|
> +---++--++
> | 16| Bob|85| Bob|
> | 14| Tom|80| Tom|
> +---++--++
> >>> df.join(df2, df.name == df2.name, 'inner').drop('name').show()
> +---+--+
> |age|height|
> +---+--+
> | 16|85|
> | 14|80|
> +---+--+
> {code}
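
As an aside on the report above: dropping by column reference, rather than by 
name, is the unambiguous alternative. A minimal PySpark sketch, with made-up 
data mirroring the example (not from the original report):

{code:python}
# Assumed setup mirroring the example above; not from the original report.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(14, "Tom"), (16, "Bob")], ["age", "name"])
df2 = spark.createDataFrame([("Tom", 80), ("Bob", 85)], ["name", "height"])

joined = df.join(df2, df.name == df2.name, "inner")
joined.drop("name").show()    # drops *both* 'name' columns (behavior at issue)
joined.drop(df2.name).show()  # drops only the 'name' column coming from df2
{code}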



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42373) Remove unused blank line removal from CSVExprUtils

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42373:


Assignee: Apache Spark

> Remove unused blank line removal from CSVExprUtils
> --
>
> Key: SPARK-42373
> URL: https://issues.apache.org/jira/browse/SPARK-42373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Willi Raschkowski
>Assignee: Apache Spark
>Priority: Minor
>
> The non-multiline CSV read codepath contains references to blank-line 
> removal throughout. This is unnecessary, as blank lines are already removed 
> by the parser. It also causes confusion by suggesting that blank lines are 
> removed at this point, when in fact they are already absent from the data. 
> The multiline codepath does not explicitly remove blank lines, which makes 
> the two look inconsistent.
> The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need 
> to explicitly skip lines, and this should be respected in {{CSVUtils}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42373) Remove unused blank line removal from CSVExprUtils

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42373:


Assignee: (was: Apache Spark)

> Remove unused blank line removal from CSVExprUtils
> --
>
> Key: SPARK-42373
> URL: https://issues.apache.org/jira/browse/SPARK-42373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Willi Raschkowski
>Priority: Minor
>
> The non-multiline CSV read codepath contains references to blank-line 
> removal throughout. This is unnecessary, as blank lines are already removed 
> by the parser. It also causes confusion by suggesting that blank lines are 
> removed at this point, when in fact they are already absent from the data. 
> The multiline codepath does not explicitly remove blank lines, which makes 
> the two look inconsistent.
> The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need 
> to explicitly skip lines, and this should be respected in {{CSVUtils}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42373) Remove unused blank line removal from CSVExprUtils

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685268#comment-17685268
 ] 

Apache Spark commented on SPARK-42373:
--

User 'ted-jenks' has created a pull request for this issue:
https://github.com/apache/spark/pull/39927

> Remove unused blank line removal from CSVExprUtils
> --
>
> Key: SPARK-42373
> URL: https://issues.apache.org/jira/browse/SPARK-42373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Willi Raschkowski
>Priority: Minor
>
> The non-multiline CSV read codepath contains references to blank-line 
> removal throughout. This is unnecessary, as blank lines are already removed 
> by the parser. It also causes confusion by suggesting that blank lines are 
> removed at this point, when in fact they are already absent from the data. 
> The multiline codepath does not explicitly remove blank lines, which makes 
> the two look inconsistent.
> The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need 
> to explicitly skip lines, and this should be respected in {{CSVUtils}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42373) Remove unused blank line removal from CSVExprUtils

2023-02-07 Thread Willi Raschkowski (Jira)
Willi Raschkowski created SPARK-42373:
-

 Summary: Remove unused blank line removal from CSVExprUtils
 Key: SPARK-42373
 URL: https://issues.apache.org/jira/browse/SPARK-42373
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.1
Reporter: Willi Raschkowski


The non-multiline CSV read codepath contains references to blank-line removal 
throughout. This is unnecessary, as blank lines are already removed by the 
parser. It also causes confusion by suggesting that blank lines are removed at 
this point, when in fact they are already absent from the data. The multiline 
codepath does not explicitly remove blank lines, which makes the two look 
inconsistent.

The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need to 
explicitly skip lines, and this should be respected in {{CSVUtils}}.
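
A hedged illustration of that Dataset-of-strings codepath (via the RDD variant 
exposed in Python; the data is made up): no file-based parser sees this input 
first, so blank lines have to be skipped explicitly here.

{code:python}
# Hypothetical data; illustrates the string-input codepath only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = spark.sparkContext.parallelize(["name,age", "Tom,14", "", "Bob,16"])

# spark.read.csv also accepts an RDD of CSV rows; the empty string above is
# the kind of blank line this codepath must skip itself.
spark.read.csv(rows, header=True).show()
{code}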



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42372:


Assignee: (was: Apache Spark)

> Improve performance of HiveGenericUDTF by making inputProjection instantiate 
> once
> -
>
> Key: SPARK-42372
> URL: https://issues.apache.org/jira/browse/SPARK-42372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> +---------------------------------------------------------------------------------------------
> +Hive UDTF dup 2                1574          1680        118        0.7       1501.1      1.0X
> +Hive UDTF dup 4                2642          3076        588        0.4       2519.9      0.6X
> +
> diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> new file mode 100644
> index 00..8af8b6582c
> --- /dev/null
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> +---------------------------------------------------------------------------------------------
> +Hive UDTF dup 2                 712           789        101        1.5        678.7      1.0X
> +Hive UDTF dup 4                1212          1294         78        0.9       1156.0      0.6X
> +
> {code}
> An over 2x performance gain, per the benchmark above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685237#comment-17685237
 ] 

Apache Spark commented on SPARK-42372:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/39929

> Improve performance of HiveGenericUDTF by making inputProjection instantiate 
> once
> -
>
> Key: SPARK-42372
> URL: https://issues.apache.org/jira/browse/SPARK-42372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> +---------------------------------------------------------------------------------------------
> +Hive UDTF dup 2                1574          1680        118        0.7       1501.1      1.0X
> +Hive UDTF dup 4                2642          3076        588        0.4       2519.9      0.6X
> +
> diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> new file mode 100644
> index 00..8af8b6582c
> --- /dev/null
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> +---------------------------------------------------------------------------------------------
> +Hive UDTF dup 2                 712           789        101        1.5        678.7      1.0X
> +Hive UDTF dup 4                1212          1294         78        0.9       1156.0      0.6X
> +
> {code}
> An over 2x performance gain, per the benchmark above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42372:


Assignee: Apache Spark

> Improve performance of HiveGenericUDTF by making inputProjection instantiate 
> once
> -
>
> Key: SPARK-42372
> URL: https://issues.apache.org/jira/browse/SPARK-42372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> +---------------------------------------------------------------------------------------------
> +Hive UDTF dup 2                1574          1680        118        0.7       1501.1      1.0X
> +Hive UDTF dup 4                2642          3076        588        0.4       2519.9      0.6X
> +
> diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> new file mode 100644
> index 00..8af8b6582c
> --- /dev/null
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> +---------------------------------------------------------------------------------------------
> +Hive UDTF dup 2                 712           789        101        1.5        678.7      1.0X
> +Hive UDTF dup 4                1212          1294         78        0.9       1156.0      0.6X
> +
> {code}
> An over 2x performance gain, per the benchmark above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once

2023-02-07 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-42372:
-
Description: 
{code:java}
+++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt
@@ -0,0 +1,7 @@
+OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+---------------------------------------------------------------------------------------------
+Hive UDTF dup 2                1574          1680        118        0.7       1501.1      1.0X
+Hive UDTF dup 4                2642          3076        588        0.4       2519.9      0.6X
+
diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
new file mode 100644
index 00..8af8b6582c
--- /dev/null
+++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
@@ -0,0 +1,7 @@
+OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+---------------------------------------------------------------------------------------------
+Hive UDTF dup 2                 712           789        101        1.5        678.7      1.0X
+Hive UDTF dup 4                1212          1294         78        0.9       1156.0      0.6X
+
{code}
An over 2x performance gain, per the benchmark above.

> Improve performance of HiveGenericUDTF by making inputProjection instantiate 
> once
> -
>
> Key: SPARK-42372
> URL: https://issues.apache.org/jira/browse/SPARK-42372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-per-row-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> +---------------------------------------------------------------------------------------------
> +Hive UDTF dup 2                1574          1680        118        0.7       1501.1      1.0X
> +Hive UDTF dup 4                2642          3076        588        0.4       2519.9      0.6X
> +
> diff --git a/sql/hive/benchmarks/HiveUDFBenchmark-results.txt b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> new file mode 100644
> index 00..8af8b6582c
> --- /dev/null
> +++ b/sql/hive/benchmarks/HiveUDFBenchmark-results.txt
> @@ -0,0 +1,7 @@
> +OpenJDK 64-Bit Server VM 1.8.0_352-bre_2022_12_13_23_06-b00 on Mac OS X 13.1
> +Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> +Hive UDTF benchmark:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> +---------------------------------------------------------------------------------------------
> +Hive UDTF dup 2                 712           789        101        1.5        678.7      1.0X
> +Hive UDTF dup 4                1212          1294         78        0.9       1156.0      0.6X
> +
> {code}
> An over 2x performance gain, per the benchmark above.
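
The shape of the fix, as a hedged sketch in Python (the names are illustrative, 
not Spark internals): hoist the projection out of the per-row path so it is 
instantiated once.

{code:python}
# Illustrative sketch of the optimization pattern; not Spark's actual code.

def make_projection(fields):
    # Stand-in for an expensive-to-construct row projection.
    return lambda row: tuple(row[f] for f in fields)

def eval_rows_slow(rows, fields):
    # Anti-pattern: builds a fresh projection for every input row.
    return [make_projection(fields)(row) for row in rows]

def eval_rows_fast(rows, fields):
    # The fix: instantiate the projection once and reuse it for all rows.
    project = make_projection(fields)
    return [project(row) for row in rows]

print(eval_rows_fast([{"a": 1, "b": 2}], ["a"]))  # [(1,)]
{code}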



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42372) Improve performance of HiveGenericUDTF by making inputProjection instantiate once

2023-02-07 Thread Kent Yao (Jira)
Kent Yao created SPARK-42372:


 Summary: Improve performance of HiveGenericUDTF by making 
inputProjection instantiate once
 Key: SPARK-42372
 URL: https://issues.apache.org/jira/browse/SPARK-42372
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42371) Add scripts to start and stop Spark Connect server

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685208#comment-17685208
 ] 

Apache Spark commented on SPARK-42371:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39928

> Add scripts to start and stop Spark Connect server
> --
>
> Key: SPARK-42371
> URL: https://issues.apache.org/jira/browse/SPARK-42371
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, there is no proper way to start and stop the Spark Connect server; 
> it has to be started with, for example, a Spark shell:
> {code}
> # For development,
> ./bin/spark-shell \
>--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> {code}
> # For released Spark versions
> ./bin/spark-shell \
>   --packages org.apache.spark:spark-connect_2.12:3.4.0 \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> which is awkward.
> We need some dedicated scripts for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42371) Add scripts to start and stop Spark Connect server

2023-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685209#comment-17685209
 ] 

Apache Spark commented on SPARK-42371:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39928

> Add scripts to start and stop Spark Connect server
> --
>
> Key: SPARK-42371
> URL: https://issues.apache.org/jira/browse/SPARK-42371
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, there is no proper way to start and stop the Spark Connect server; 
> it has to be started with, for example, a Spark shell:
> {code}
> # For development,
> ./bin/spark-shell \
>--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> {code}
> # For released Spark versions
> ./bin/spark-shell \
>   --packages org.apache.spark:spark-connect_2.12:3.4.0 \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> which is awkward.
> We need some dedicated scripts for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42371) Add scripts to start and stop Spark Connect server

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42371:


Assignee: (was: Apache Spark)

> Add scripts to start and stop Spark Connect server
> --
>
> Key: SPARK-42371
> URL: https://issues.apache.org/jira/browse/SPARK-42371
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, there is no proper way to start and stop the Spark Connect server; 
> it has to be started with, for example, a Spark shell:
> {code}
> # For development,
> ./bin/spark-shell \
>--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> {code}
> # For released Spark versions
> ./bin/spark-shell \
>   --packages org.apache.spark:spark-connect_2.12:3.4.0 \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> which is awkward.
> We need some dedicated scripts for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42371) Add scripts to start and stop Spark Connect server

2023-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42371:


Assignee: Apache Spark

> Add scripts to start and stop Spark Connect server
> --
>
> Key: SPARK-42371
> URL: https://issues.apache.org/jira/browse/SPARK-42371
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, there is no proper way to start and stop the Spark Connect server; 
> it has to be started with, for example, a Spark shell:
> {code}
> # For development,
> ./bin/spark-shell \
>--jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> {code}
> # For released Spark versions
> ./bin/spark-shell \
>   --packages org.apache.spark:spark-connect_2.12:3.4.0 \
>   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
> {code}
> which is awkward.
> We need some dedicated scripts for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42266) Local mode should work with IPython

2023-02-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685207#comment-17685207
 ] 

Hyukjin Kwon commented on SPARK-42266:
--

Let me take a look

> Local mode should work with IPython
> ---
>
> Key: SPARK-42266
> URL: https://issues.apache.org/jira/browse/SPARK-42266
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> {code:java}
> (spark_dev) ➜  spark git:(master) bin/pyspark --remote "local[*]"
> Python 3.9.15 (main, Nov 24 2022, 08:28:41) 
> Type 'copyright', 'credits' or 'license' for more information
> IPython 8.9.0 -- An enhanced Interactive Python. Type '?' for help.
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/shell.py:45: UserWarning: 
> Failed to initialize Spark session.
>   warnings.warn("Failed to initialize Spark session.")
> Traceback (most recent call last):
>   File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/shell.py", line 40, in 
> 
> spark = SparkSession.builder.getOrCreate()
>   File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/session.py", line 
> 429, in getOrCreate
> from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
>   File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/__init__.py", line 
> 21, in 
> from pyspark.sql.connect.dataframe import DataFrame  # noqa: F401
>   File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
> line 35, in 
> import pandas
>   File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/__init__.py", 
> line 29, in 
> from pyspark.pandas.missing.general_functions import 
> MissingPandasLikeGeneralFunctions
>   File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/__init__.py", 
> line 34, in 
> require_minimum_pandas_version()
>   File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/utils.py", 
> line 37, in require_minimum_pandas_version
> if LooseVersion(pandas.__version__) < 
> LooseVersion(minimum_pandas_version):
> AttributeError: partially initialized module 'pandas' has no attribute 
> '__version__' (most likely due to a circular import)
> [TerminalIPythonApp] WARNING | Unknown error in handling PYTHONSTARTUP file 
> /Users/ruifeng.zheng/Dev/spark//python/pyspark/shell.py:
> ---
> AttributeErrorTraceback (most recent call last)
> File ~/Dev/spark/python/pyspark/shell.py:40
>  38 try:
>  39 # Creates pyspark.sql.connect.SparkSession.
> ---> 40 spark = SparkSession.builder.getOrCreate()
>  41 except Exception:
> File ~/Dev/spark/python/pyspark/sql/session.py:429, in 
> SparkSession.Builder.getOrCreate(self)
> 428 with SparkContext._lock:
> --> 429 from pyspark.sql.connect.session import SparkSession as 
> RemoteSparkSession
> 431 if (
> 432 SparkContext._active_spark_context is None
> 433 and SparkSession._instantiatedSession is None
> 434 ):
> File ~/Dev/spark/python/pyspark/sql/connect/__init__.py:21
>  18 """Currently Spark Connect is very experimental and the APIs to 
> interact with
>  19 Spark through this API are can be changed at any time without 
> warning."""
> ---> 21 from pyspark.sql.connect.dataframe import DataFrame  # noqa: F401
>  22 from pyspark.sql.pandas.utils import (
>  23 require_minimum_pandas_version,
>  24 require_minimum_pyarrow_version,
>  25 require_minimum_grpc_version,
>  26 )
> File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:35
>  34 import random
> ---> 35 import pandas
>  36 import json
> File ~/Dev/spark/python/pyspark/pandas/__init__.py:29
>  27 from typing import Any
> ---> 29 from pyspark.pandas.missing.general_functions import 
> MissingPandasLikeGeneralFunctions
>  30 from pyspark.pandas.missing.scalars import MissingPandasLikeScalars
> File ~/Dev/spark/python/pyspark/pandas/__init__.py:34
>  33 try:
> ---> 34 require_minimum_pandas_version()
>  35 require_minimum_pyarrow_version()
> File ~/Dev/spark/python/pyspark/sql/pandas/utils.py:37, in 
> require_minimum_pandas_version()
>  34 raise ImportError(
>  35 "Pandas >= %s must be installed; however, " "it was not 
> found." % minimum_pandas_version
>  36 ) from raised_error
> ---> 37 if LooseVersion(pandas.__version__) < 
> LooseVersion(minimum_pandas_version):
>  38 raise ImportError(
>  39 "Pandas >= %s must be installed; however, "
>  40 "your version was %s." % (minimum_pandas_version, 
> pandas.__version__)
>  41 )
> AttributeError: partially initialized module 'pandas' has no attribute 
> '__version__' 
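
For readers unfamiliar with this failure mode, a minimal hypothetical two-file 
reproduction of the same error class (unrelated to the pyspark modules above):

{code:python}
# a.py
import b          # starts loading b before module 'a' has finished loading
value = 42

# b.py
import a          # re-enters the half-initialized module 'a'
print(a.value)

# Running `python b.py` fails with:
# AttributeError: partially initialized module 'a' has no attribute 'value'
# (most likely due to a circular import)
{code}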

[jira] [Created] (SPARK-42371) Add scripts to start and stop Spark Connect server

2023-02-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-42371:


 Summary: Add scripts to start and stop Spark Connect server
 Key: SPARK-42371
 URL: https://issues.apache.org/jira/browse/SPARK-42371
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


Currently, there is no proper way to start and stop the Spark Connect server; 
it has to be started with, for example, a Spark shell:

{code}
# For development,
./bin/spark-shell \
   --jars `ls connector/connect/target/**/spark-connect*SNAPSHOT.jar` \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
{code}

{code}
# For released Spark versions
./bin/spark-shell \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
{code}

which is awkward.

We need some dedicated scripts for it.
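
Once such scripts exist, a client would attach remotely. A minimal hedged 
sketch (the sc://localhost endpoint and its default port are assumptions):

{code:python}
# Assumes a Spark Connect server is already running locally.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
spark.range(5).show()
{code}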



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2023-02-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685177#comment-17685177
 ] 

Hyukjin Kwon commented on SPARK-39375:
--

[~xkrogen] I just saw this. Some of the work is done and merged; some initial 
work was done in https://github.com/apache/spark/pull/39585. I thought it's 
actually not that complicated: one general layer shared by all Scala, Python, 
etc. UDFs, which contains the actual Python UDF implementation.

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41289) Feature parity: Catalog API

2023-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41289.
--
Resolution: Done

> Feature parity: Catalog API
> ---
>
> Key: SPARK-41289
> URL: https://issues.apache.org/jira/browse/SPARK-41289
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42370) Spark History Server fails to start on CentOS7 aarch64

2023-02-07 Thread Zhiguo Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiguo Wu updated SPARK-42370:
--
Description: 
When I run `./sbin/start-history-server.sh`, I get the error below:

!image-2023-02-07-16-54-43-593.png!

Although we already use org.openlabtesting.leveldbjni, which works on aarch64,

we still load org.fusesource.hawtjni.runtime.Library from the wrong jar file.

When we run `export SPARK_DAEMON_JAVA_OPTS=-verbose:class`, we can see that the 
class is loaded from jline-2.14.6.jar, while the correct class file is in 
leveldbjni-all-1.8.jar:

 

Incorrect(now):

[Loaded org.fusesource.hawtjni.runtime.Library from 
file:/yourdir/spark/jars/jline-2.14.6.jar] 

Correct(expected):

[Loaded org.fusesource.hawtjni.runtime.Library from 
file:/yourdir/spark/jars/leveldbjni-all-1.8.jar] 

  was:
When I run `./sbin/start-history-server.sh`

I'll get the error below

!image-2023-02-07-16-54-43-593.png!

 

Although we already use org.openlabtesting.leveldbjni on aarch64, which can 
works on aarch64,

we still load org.fusesource.hawtjni.runtime.Library on wrong jar file

we can see the class is load from jline-2.14.6.jar where the correct class file 
is under leveldbjni-all-1.8.jar when we run export 
SPARK_DAEMON_JAVA_OPTS=-verbose:class

 

Incorrect:

[Loaded org.fusesource.hawtjni.runtime.Library from 
file:/yourdir/spark/jars/jline-2.14.6.jar] 

Correct:

[Loaded org.fusesource.hawtjni.runtime.Library from 
file:/yourdir/spark/jars/leveldbjni-all-1.8.jar] 


> Spark History Server fails to start on CentOS7 aarch64
> --
>
> Key: SPARK-42370
> URL: https://issues.apache.org/jira/browse/SPARK-42370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3
>Reporter: Zhiguo Wu
>Priority: Major
> Attachments: image-2023-02-07-16-54-43-593.png
>
>
> When I run `./sbin/start-history-server.sh`, I get the error below:
> !image-2023-02-07-16-54-43-593.png!
> Although we already use org.openlabtesting.leveldbjni, which works on aarch64,
> we still load org.fusesource.hawtjni.runtime.Library from the wrong jar file.
> When we run `export SPARK_DAEMON_JAVA_OPTS=-verbose:class`, we can see that 
> the class is loaded from jline-2.14.6.jar, while the correct class file is in 
> leveldbjni-all-1.8.jar:
>  
> Incorrect(now):
> [Loaded org.fusesource.hawtjni.runtime.Library from 
> file:/yourdir/spark/jars/jline-2.14.6.jar] 
> Correct(expected):
> [Loaded org.fusesource.hawtjni.runtime.Library from 
> file:/yourdir/spark/jars/leveldbjni-all-1.8.jar] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


