[jira] [Resolved] (SPARK-38433) Add Shell Code Style Check Action

2022-09-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38433.
--
Resolution: Won't Fix

> Add Shell Code Style Check Action
> -
>
> Key: SPARK-38433
> URL: https://issues.apache.org/jira/browse/SPARK-38433
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> There is no shell script style check in the current Spark GitHub Actions. Some 
> shell scripts are written incorrectly and behave abnormally in edge cases. They 
> also fail the checks of the shellcheck plugin, for example in IDEA or in 
> shellcheck GitHub Actions.






[jira] [Updated] (SPARK-40552) Upgrade protobuf-python from 4.21.5 to 4.21.6

2022-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40552:
-
Priority: Minor  (was: Major)

> Upgrade protobuf-python from 4.21.5 to 4.21.6
> -
>
> Key: SPARK-40552
> URL: https://issues.apache.org/jira/browse/SPARK-40552
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0
>
>
> [CVE-2022-1941|https://nvd.nist.gov/vuln/detail/CVE-2022-1941]
> [Github|https://github.com/advisories/GHSA-8gq9-2x98-w8hf]






[jira] [Resolved] (SPARK-40552) Upgrade protobuf-python from 4.21.5 to 4.21.6

2022-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40552.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37991
[https://github.com/apache/spark/pull/37991]

> Upgrade protobuf-python from 4.21.5 to 4.21.6
> -
>
> Key: SPARK-40552
> URL: https://issues.apache.org/jira/browse/SPARK-40552
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>
> [CVE-2022-1941|https://nvd.nist.gov/vuln/detail/CVE-2022-1941]
> [Github|https://github.com/advisories/GHSA-8gq9-2x98-w8hf]






[jira] [Updated] (SPARK-40552) Upgrade protobuf-python from 4.21.5 to 4.21.6

2022-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40552:
-
Component/s: Build

> Upgrade protobuf-python from 4.21.5 to 4.21.6
> -
>
> Key: SPARK-40552
> URL: https://issues.apache.org/jira/browse/SPARK-40552
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Connect
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0
>
>
> [CVE-2022-1941|https://nvd.nist.gov/vuln/detail/CVE-2022-1941]
> [Github|https://github.com/advisories/GHSA-8gq9-2x98-w8hf]






[jira] [Assigned] (SPARK-40552) Upgrade protobuf-python from 4.21.5 to 4.21.6

2022-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40552:


Assignee: Bjørn Jørgensen

> Upgrade protobuf-python from 4.21.5 to 4.21.6
> -
>
> Key: SPARK-40552
> URL: https://issues.apache.org/jira/browse/SPARK-40552
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> [CVE-2022-1941|https://nvd.nist.gov/vuln/detail/CVE-2022-1941]
> [Github|https://github.com/advisories/GHSA-8gq9-2x98-w8hf]






[jira] [Resolved] (SPARK-40478) Add create datasource table options docs

2022-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40478.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37919
[https://github.com/apache/spark/pull/37919]

> Add create datasource table options docs
> 
>
> Key: SPARK-40478
> URL: https://issues.apache.org/jira/browse/SPARK-40478
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40478) Add create datasource table options docs

2022-09-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40478:


Assignee: XiDuo You

> Add create datasource table options docs
> 
>
> Key: SPARK-40478
> URL: https://issues.apache.org/jira/browse/SPARK-40478
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>







[jira] [Updated] (SPARK-40142) Make pyspark.sql.functions examples self-contained

2022-09-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40142:
-
Priority: Minor  (was: Major)

> Make pyspark.sql.functions examples self-contained
> --
>
> Key: SPARK-40142
> URL: https://issues.apache.org/jira/browse/SPARK-40142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-40476) Reduce the shuffle size of ALS

2022-09-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40476.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37918
[https://github.com/apache/spark/pull/37918]

> Reduce the shuffle size of ALS
> --
>
> Key: SPARK-40476
> URL: https://issues.apache.org/jira/browse/SPARK-40476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40476) Reduce the shuffle size of ALS

2022-09-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40476:


Assignee: Ruifeng Zheng

> Reduce the shuffle size of ALS
> --
>
> Key: SPARK-40476
> URL: https://issues.apache.org/jira/browse/SPARK-40476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Updated] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow

2022-09-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40175:
-
Priority: Minor  (was: Major)

> Converting Tuple2 to Scala Map via `.toMap` is slow
> ---
>
> Key: SPARK-40175
> URL: https://issues.apache.org/jira/browse/SPARK-40175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2022-08-22-14-58-26-491.png, 
> image-2022-08-22-14-58-53-046.png
>
>
> Converting Tuple2 to Scala Map via `.toMap` is slow
> !image-2022-08-22-14-58-53-046.png!
> !image-2022-08-22-14-58-26-491.png!
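
As a rough illustration of the pattern above (a minimal sketch with made-up names, 
not the change that was actually merged for SPARK-40175): the slow path builds an 
immutable Map from the pairs via {{.toMap}}, while one possible alternative fills a 
mutable HashMap in a single pass.

{code:scala}
import scala.collection.mutable

object ToMapSketch {
  def main(args: Array[String]): Unit = {
    // One million (key, value) pairs standing in for the Tuple2s in the report.
    val pairs: Array[(String, Int)] = Array.tabulate(1000000)(i => (s"key$i", i))

    // Pattern flagged as slow: .toMap inserts into an immutable Map one entry at a time.
    val viaToMap: Map[String, Int] = pairs.toMap

    // A possible alternative (hypothetical): fill a mutable HashMap in a single pass.
    val viaMutable = new mutable.HashMap[String, Int]()
    pairs.foreach { case (k, v) => viaMutable.put(k, v) }

    assert(viaToMap.size == viaMutable.size)
  }
}
{code}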






[jira] [Updated] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40491:
-
Issue Type: Task  (was: New Feature)
  Priority: Trivial  (was: Major)

This didn't need a JIRA - it was not Major. Please set the fields appropriately

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Trivial
>
> According to the legacy documentation of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit whether it is still 
> necessary. Since Spark SQL is the new core, I'm not sure anyone is interested 
> in a new API for creating a JDBC RDD.






[jira] [Updated] (SPARK-40456) PartitionIterator.hasNext should be cheap to call repeatedly

2022-09-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40456:
-
Priority: Minor  (was: Major)

> PartitionIterator.hasNext should be cheap to call repeatedly
> 
>
> Key: SPARK-40456
> URL: https://issues.apache.org/jira/browse/SPARK-40456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>







[jira] [Resolved] (SPARK-40294) Repeat calls to `PartitionIterator.hasNext` can timeout

2022-09-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40294.
--
Resolution: Duplicate

> Repeat calls to `PartitionIterator.hasNext` can timeout
> ---
>
> Key: SPARK-40294
> URL: https://issues.apache.org/jira/browse/SPARK-40294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Richard Chen
>Priority: Major
>
> Repeat calls to {{PartitionIterator.hasNext}} where both calls return 
> {{false}} can result in timeouts. For example, 
> {{{}KafkaBatchPartitionReader.next(){}}}, which calls {{consumer.get}} (which 
> can potentially timeout with repeat calls), is called by 
> {{{}PartitionIterator.hasNext{}}}. Thus, repeat calls to 
> {{PartitionIterator.hasNext}} by its parent could timeout.
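
To make the failure mode concrete, below is a small self-contained Scala sketch 
(not the real PartitionIterator or Kafka reader code; {{readNextRecordOrNone}} is a 
hypothetical placeholder) of the memoization idea later tracked in SPARK-40456: 
cache the result of an expensive {{hasNext}} so that repeated calls, including 
repeated calls at end of input, stay cheap.

{code:scala}
// Sketch of an iterator whose hasNext caches the outcome of an expensive probe,
// so calling hasNext repeatedly (even after the source is exhausted) is cheap.
class MemoizingIterator[T](expensiveNext: () => Option[T]) extends Iterator[T] {
  private var fetched: Option[T] = None   // element fetched but not yet returned
  private var finished: Boolean = false   // set once the source reports end of input

  override def hasNext: Boolean = {
    if (finished) return false
    if (fetched.isEmpty) {
      // Only the first call after a next() pays the cost (e.g. a broker poll);
      // subsequent hasNext calls reuse the cached result.
      expensiveNext() match {
        case Some(v) => fetched = Some(v)
        case None    => finished = true
      }
    }
    fetched.isDefined
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("end of partition")
    val v = fetched.get
    fetched = None
    v
  }
}

// Usage with a hypothetical, possibly slow source:
//   val it = new MemoizingIterator(() => readNextRecordOrNone())
{code}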






[jira] [Resolved] (SPARK-40398) Use Loop instead of Arrays.stream api

2022-09-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40398.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37843
[https://github.com/apache/spark/pull/37843]

> Use Loop instead of Arrays.stream api
> -
>
> Key: SPARK-40398
> URL: https://issues.apache.org/jira/browse/SPARK-40398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> When the stream pipeline logic is relatively simple, using Arrays.stream is 
> consistently slower than using a plain loop directly.
>  
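
As a generic illustration of that claim (a sketch only, not one of the call sites 
touched by the PR), compare a trivial pipeline written with {{Arrays.stream}} 
against a plain loop; for simple logic like a sum, the loop avoids the stream setup 
and per-element lambda dispatch:

{code:scala}
import java.util.Arrays

object LoopVsStreamSketch {
  def main(args: Array[String]): Unit = {
    val data: Array[Int] = Array.tabulate(1000000)(identity)

    // Stream version: spliterator creation and lambda dispatch add overhead
    // when the pipeline itself is trivial.
    val viaStream: Long = Arrays.stream(data).asLongStream().sum()

    // Plain loop doing the same work.
    var viaLoop = 0L
    var i = 0
    while (i < data.length) {
      viaLoop += data(i)
      i += 1
    }

    assert(viaStream == viaLoop)
  }
}
{code}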






[jira] [Assigned] (SPARK-40398) Use Loop instead of Arrays.stream api

2022-09-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40398:


Assignee: Yang Jie

> Use Loop instead of Arrays.stream api
> -
>
> Key: SPARK-40398
> URL: https://issues.apache.org/jira/browse/SPARK-40398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> When the stream pipeline logic is relatively simple, using Arrays.stream is 
> consistently slower than using a plain loop directly.
>  






[jira] [Assigned] (SPARK-40376) `np.bool` will be deprecated

2022-09-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40376:


Assignee: Elhoussine Talab

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Assignee: Elhoussine Talab
>Priority: Trivial
>
> Using `np.bool` generates this warning: 
> {quote}
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> {quote}
>  
> See Numpy's deprecation statement here: 
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]






[jira] [Resolved] (SPARK-40376) `np.bool` will be deprecated

2022-09-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40376.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37817
[https://github.com/apache/spark/pull/37817]

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Assignee: Elhoussine Talab
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Using `np.bool` generates this warning: 
> {quote}
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> {quote}
>  
> See Numpy's deprecation statement here: 
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]






[jira] [Resolved] (SPARK-40331) Java 11 should be used as the recommended running environment

2022-09-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40331.
--
Resolution: Won't Fix

> Java 11 should be used as the recommended running environment
> -
>
> Key: SPARK-40331
> URL: https://issues.apache.org/jira/browse/SPARK-40331
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> Cases similar to those described in SPARK-40303 will not have negative effects 
> if Java 11+ is used as the runtime.
>  
>  






[jira] [Updated] (SPARK-40376) `np.bool` will be deprecated

2022-09-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40376:
-
Issue Type: Improvement  (was: Bug)
  Priority: Trivial  (was: Major)

This is not a bug

> `np.bool` will be deprecated
> 
>
> Key: SPARK-40376
> URL: https://issues.apache.org/jira/browse/SPARK-40376
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Elhoussine Talab
>Priority: Trivial
>
> Using `np.bool` generates this warning: 
> {quote}
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
> 3070E                     `np.bool` is a deprecated alias for the builtin 
> `bool`. To silence this warning, use `bool` by itself. Doing this will not 
> modify any behavior and is safe. If you specifically wanted the numpy scalar 
> type, use `np.bool_` here.
> 3071E                   Deprecated in NumPy 1.20; for more details and 
> guidance: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> {quote}
>  
> See Numpy's deprecation statement here: 
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]






[jira] [Resolved] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4

2022-09-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40326.
--
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37796
[https://github.com/apache/spark/pull/37796]

> upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 
> to 2.13.4
> --
>
> Key: SPARK-40326
> URL: https://issues.apache.org/jira/browse/SPARK-40326
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
>
> [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857]
> [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]






[jira] [Updated] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4

2022-09-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40326:
-
Priority: Minor  (was: Major)

> upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 
> to 2.13.4
> --
>
> Key: SPARK-40326
> URL: https://issues.apache.org/jira/browse/SPARK-40326
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
>
> [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857]
> [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]






[jira] [Assigned] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4

2022-09-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40326:


Assignee: Bjørn Jørgensen

> upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 
> to 2.13.4
> --
>
> Key: SPARK-40326
> URL: https://issues.apache.org/jira/browse/SPARK-40326
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857]
> [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]






[jira] [Assigned] (SPARK-40321) Upgrade rocksdbjni to 7.5.3

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40321:


Assignee: Yang Jie

> Upgrade rocksdbjni to 7.5.3
> ---
>
> Key: SPARK-40321
> URL: https://issues.apache.org/jira/browse/SPARK-40321
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> https://github.com/facebook/rocksdb/releases






[jira] [Resolved] (SPARK-40321) Upgrade rocksdbjni to 7.5.3

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40321.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37783
[https://github.com/apache/spark/pull/37783]

> Upgrade rocksdbjni to 7.5.3
> ---
>
> Key: SPARK-40321
> URL: https://issues.apache.org/jira/browse/SPARK-40321
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/facebook/rocksdb/releases






[jira] [Assigned] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39996:


Assignee: Bjørn Jørgensen

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  






[jira] [Resolved] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39996.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37762
[https://github.com/apache/spark/pull/37762]

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  






[jira] [Updated] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-39996:
-
Component/s: Tests

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0
>
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  






[jira] [Updated] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-39996:
-
Priority: Minor  (was: Major)

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0
>
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  






[jira] [Commented] (SPARK-40316) Upgrading to Spark 3 is giving NullPointerException

2022-09-03 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599960#comment-17599960
 ] 

Sean R. Owen commented on SPARK-40316:
--

It's possible this was handled differently in earlier Spark/Scala versions 
(resulting in 0s?), but it still points to an error in your UDF. Why not pursue 
that?

> Upgrading to Spark 3 is giving NullPointerException
> ---
>
> Key: SPARK-40316
> URL: https://issues.apache.org/jira/browse/SPARK-40316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: Sachit
>Priority: Major
>
> Getting below error while upgrading to Spark3
>  
> java.lang.RuntimeException: Error while decoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> - array element class: "scala.Long"
> - root class: "scala.collection.Seq"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> mapobjects(lambdavariable(MapObject, LongType, true, -1), 
> assertnotnull(lambdavariable(MapObject, LongType, true, -1)), input[0, 
> array, true], Some(interface scala.collection.Seq))
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1047)
>     at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:184)
>     at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$scalaConverter$2(ScalaUDF.scala:164)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)






[jira] [Resolved] (SPARK-40316) Upgrading to Spark 3 is giving NullPointerException

2022-09-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40316.
--
Resolution: Not A Problem

> Upgrading to Spark 3 is giving NullPointerException
> ---
>
> Key: SPARK-40316
> URL: https://issues.apache.org/jira/browse/SPARK-40316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: Sachit
>Priority: Major
>
> Getting below error while upgrading to Spark3
>  
> java.lang.RuntimeException: Error while decoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> - array element class: "scala.Long"
> - root class: "scala.collection.Seq"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> mapobjects(lambdavariable(MapObject, LongType, true, -1), 
> assertnotnull(lambdavariable(MapObject, LongType, true, -1)), input[0, 
> array, true], Some(interface scala.collection.Seq))
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1047)
>     at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:184)
>     at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$scalaConverter$2(ScalaUDF.scala:164)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)






[jira] [Commented] (SPARK-40316) Upgrading to Spark 3 is giving NullPointerException

2022-09-03 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599930#comment-17599930
 ] 

Sean R. Owen commented on SPARK-40316:
--

This says your UDF returns a Seq containing null, but the signature says it's 
going to be a Seq of primitive longs which can't be null
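
For context, here is a minimal self-contained Scala sketch of that kind of mismatch 
(illustrative only, not the reporter's code, and shown on the input side for 
brevity): an array column that can contain null elements has to be declared with a 
nullable element type such as {{java.lang.Long}} (or an {{Option}}), rather than 
primitive {{Long}}.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NullableArrayUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()

    // An array<bigint> column where one array contains a null element.
    val df = spark.sql(
      "SELECT array(1L, CAST(NULL AS BIGINT)) AS xs UNION ALL SELECT array(3L, 4L) AS xs")

    // Declared with primitive Long elements: decoding the row containing null
    // raises the "Null value appeared in non-nullable field" error quoted above.
    val primitiveUdf = udf((xs: Seq[Long]) => xs.sum)

    // Declared with boxed (nullable) elements, so nulls can be handled explicitly.
    val boxedUdf = udf((xs: Seq[java.lang.Long]) => xs.filter(_ != null).map(_.longValue).sum)

    df.select(boxedUdf(df("xs")).as("sum")).show()   // works
    // df.select(primitiveUdf(df("xs"))).show()      // would fail on the null element
    spark.stop()
  }
}
{code}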

> Upgrading to Spark 3 is giving NullPointerException
> ---
>
> Key: SPARK-40316
> URL: https://issues.apache.org/jira/browse/SPARK-40316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: Sachit
>Priority: Major
>
> Getting below error while upgrading to Spark3
>  
> java.lang.RuntimeException: Error while decoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> - array element class: "scala.Long"
> - root class: "scala.collection.Seq"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> mapobjects(lambdavariable(MapObject, LongType, true, -1), 
> assertnotnull(lambdavariable(MapObject, LongType, true, -1)), input[0, 
> array, true], Some(interface scala.collection.Seq))
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1047)
>     at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:184)
>     at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$scalaConverter$2(ScalaUDF.scala:164)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)






[jira] [Commented] (SPARK-40316) Upgrading to Spark 3 is giving NullPointerException

2022-09-02 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599629#comment-17599629
 ] 

Sean R. Owen commented on SPARK-40316:
--

There's no info about what you're doing that triggers this, or what you've done 
to debug

> Upgrading to Spark 3 is giving NullPointerException
> ---
>
> Key: SPARK-40316
> URL: https://issues.apache.org/jira/browse/SPARK-40316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: Sachit
>Priority: Major
>
> Getting below error while upgrading to Spark3
>  
> java.lang.RuntimeException: Error while decoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> - array element class: "scala.Long"
> - root class: "scala.collection.Seq"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> mapobjects(lambdavariable(MapObject, LongType, true, -1), 
> assertnotnull(lambdavariable(MapObject, LongType, true, -1)), input[0, 
> array, true], Some(interface scala.collection.Seq))
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1047)
>     at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:184)
>     at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$scalaConverter$2(ScalaUDF.scala:164)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)






[jira] [Resolved] (SPARK-40251) Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2

2022-09-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40251.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37700
[https://github.com/apache/spark/pull/37700]

> Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2
> --
>
> Key: SPARK-40251
> URL: https://issues.apache.org/jira/browse/SPARK-40251
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/luhenry/netlib/compare/v2.2.1...v3.0.2






[jira] [Assigned] (SPARK-40251) Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2

2022-09-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40251:


Assignee: BingKun Pan

> Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2
> --
>
> Key: SPARK-40251
> URL: https://issues.apache.org/jira/browse/SPARK-40251
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>
> https://github.com/luhenry/netlib/compare/v2.2.1...v3.0.2






[jira] [Assigned] (SPARK-40279) Document spark.yarn.report.interval

2022-09-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40279:


Assignee: Luca Canali

> Document spark.yarn.report.interval
> ---
>
> Key: SPARK-40279
> URL: https://issues.apache.org/jira/browse/SPARK-40279
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>
> This proposes to document the configuration parameter 
> spark.yarn.report.interval -> Interval between reports of the current Spark 
> job status in cluster mode.






[jira] [Resolved] (SPARK-40279) Document spark.yarn.report.interval

2022-09-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40279.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37731
[https://github.com/apache/spark/pull/37731]

> Document spark.yarn.report.interval
> ---
>
> Key: SPARK-40279
> URL: https://issues.apache.org/jira/browse/SPARK-40279
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.4.0
>
>
> This proposes to document the configuration parameter 
> spark.yarn.report.interval -> Interval between reports of the current Spark 
> job status in cluster mode.






[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598574#comment-17598574
 ] 

Sean R. Owen commented on SPARK-40286:
--

I could be completely wrong, but I'd be just as surprised as you are if that's 
how this is meant to work. If so, it needs to be in the docs.

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using Spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a Hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file gets wiped from the directory. The file is found and 
> populates the table with data. I also tried to add the `LOCAL` clause, but 
> that throws an error when looking for the file. The documentation doesn't 
> explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}






[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598568#comment-17598568
 ] 

Sean R. Owen commented on SPARK-40286:
--

No, LOAD DATA does not delete source data. I'm not sure what's happening here, 
but I suspect something else is removing those files

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using Spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a Hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file gets wiped from the directory. The file is found and 
> populates the table with data. I also tried to add the `LOCAL` clause, but 
> that throws an error when looking for the file. The documentation doesn't 
> explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}






[jira] [Resolved] (SPARK-39916) Merge SchemaUtils from mlib to SQL

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39916.
--
Resolution: Won't Fix

> Merge SchemaUtils from mlib to SQL
> --
>
> Key: SPARK-39916
> URL: https://issues.apache.org/jira/browse/SPARK-39916
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Priority: Minor
>
> Today we have two SchemaUtils classes: the SQL SchemaUtils and the MLlib 
> SchemaUtils. The MLlib SchemaUtils has a TODO tag to merge it into the SQL 
> one. Let's do this!






[jira] [Resolved] (SPARK-39269) spark3.2.0 commit tmp file is not found when rename

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39269.
--
Resolution: Invalid

> spark3.2.0 commit tmp file is not found when rename 
> 
>
> Key: SPARK-39269
> URL: https://issues.apache.org/jira/browse/SPARK-39269
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 3.2.0
> Environment: spark 3.2.0
> yarn
> 2 executors and 1 driver
> a job consisting of 4 streaming queries
>Reporter: cxb
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The job includes 4 streaming queries, and one of the queries throws an 
> "offset tmp file is not found" error while running, which causes the job to 
> exit.
> This never happened when I was using Spark 3.0.0.
> I looked at the implementation in Spark 3.2 and it is not very different from 
> Spark 3.0.
> Maybe it is a problem with the newer Jackson version?
>  
> {code:java}
> java.io.FileNotFoundException: rename source 
> /tmp/chenxiaobin/regist_gp_bmhb_v2/commits/.35362.b4684b94-c0bb-4d87-baf0-cd1a508d7be7.tmp
>  is not found.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.validateRenameSource(FSDirRenameOp.java:561)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.unprotectedRenameTo(FSDirRenameOp.java:361)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.renameTo(FSDirRenameOp.java:300)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.renameToInt(FSDirRenameOp.java:247)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3931)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rename2(NameNodeRpcServer.java:1039)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.rename2(ClientNamenodeProtocolServerSideTranslatorPB.java:610)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:1991)
>   at org.apache.hadoop.fs.Hdfs.renameInternal(Hdfs.java:341)
>   at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:690)
>   at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
>   at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:346)
>   at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:154)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.$anonfun$addNewBatchByStream$2(HDFSMetadataLog.scala:176)
>   at 
> scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.addNewBatchByStream(HDFSMetadataLog.scala:171)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:116)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$18(MicroBatchExecution.scala:615)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:627)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:612)
>   at 
> 

[jira] [Commented] (SPARK-40233) Unable to load large pandas dataframe to pyspark

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598550#comment-17598550
 ] 

Sean R. Owen commented on SPARK-40233:
--

That's what happens, right?
Spark is of course meant to read data sets directly in parallel. You don't read 
them single-node and then send them to Spark in general.
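
For illustration only, a minimal sketch (mine, not from the report; the row count, seeds and column names are assumptions) of generating the benchmark data with Spark itself, so the executors do the work instead of the driver:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-benchmark").getOrCreate()

r = 10_000_000                 # hypothetical row count; scale up as needed
max_val = int(r * 0.9)

# rand() is evaluated on the executors, so nothing is serialized from the driver.
sdf0 = spark.range(r).select(
    (F.rand(seed=1) * max_val).cast("long").alias("col0"),
    (F.rand(seed=2) * max_val).cast("long").alias("col1"),
)
sdf1 = spark.range(r).select(
    (F.rand(seed=3) * max_val).cast("long").alias("col0"),
    (F.rand(seed=4) * max_val).cast("long").alias("col1"),
)

print(sdf0.join(sdf1, on="col0", how="inner").count())
{code}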

> Unable to load large pandas dataframe to pyspark
> 
>
> Key: SPARK-40233
> URL: https://issues.apache.org/jira/browse/SPARK-40233
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Niranda Perera
>Priority: Major
>
> I've been trying to join two large pandas dataframes using pyspark using the 
> following code. I'm trying to vary executor cores allocated for the 
> application and measure scalability of pyspark (strong scaling).
> {code:java}
> r = 10 # 1Bn rows 
> it = 10
> w = 256
> unique = 0.9
> TOTAL_MEM = 240
> TOTAL_NODES = 14
> max_val = r * unique
> rng = default_rng()
> frame_data = rng.integers(0, max_val, size=(r, 2)) 
> frame_data1 = rng.integers(0, max_val, size=(r, 2)) 
> print(f"data generated", flush=True)
> df_l = pd.DataFrame(frame_data).add_prefix("col")
> df_r = pd.DataFrame(frame_data1).add_prefix("col")
> print(f"data loaded", flush=True)
> procs = int(math.ceil(w / TOTAL_NODES))
> mem = int(TOTAL_MEM*0.9)
> print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", 
> flush=True)
> spark = SparkSession\
> .builder\
> .appName(f'join {r} {w}')\
> .master('spark://node:7077')\
> .config('spark.executor.memory', f'{int(mem*0.6)}g')\
> .config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\
> .config('spark.cores.max', w)\
> .config('spark.driver.memory', '100g')\
> .config('sspark.sql.execution.arrow.pyspark.enabled', 'true')\
> .getOrCreate()
> sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
> sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
> print(f"data loaded to spark", flush=True)
> try:   
> for i in range(it):
> t1 = time.time()
> out = sdf0.join(sdf1, on='col0', how='inner')
> count = out.count()
> t2 = time.time()
> print(f"timings {r} 1 {i} {(t2 - t1) * 1000:.0f} ms, {count}", 
> flush=True)
> 
> del out
> del count
> gc.collect()
> finally:
> spark.stop() {code}
> {*}Cluster{*}: I am using standalone spark cluster in a 15 node cluster with 
> 48 cores and 240GB RAM each. I've spawned master and the driver code in 
> node1, while other 14 nodes have spawned workers allocating maximum memory. 
> In the spark context, I am reserving 90% of total memory to executor, 
> splitting 60% to jvm and 40% to pyspark.
> {*}Issue{*}: When I run the above program, I can see that the executors are 
> being assigned to the app. But it doesn't move forward, even after 60 mins. 
> For smaller row count (10M), this was working without a problem. Driver output
> {code:java}
> world sz 256 procs per worker 19 mem 216 iter 8
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> /N/u/d/dnperera/.conda/envs/env/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425:
>  UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Negative initial size: -589934400
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warn(msg) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39895) pyspark drop doesn't accept *cols

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598545#comment-17598545
 ] 

Sean R. Owen commented on SPARK-39895:
--

Not a big deal, but the example doesn't make sense to me. It's multiple cols in 
one string, not multiple strings or cols, right? That doesn't seem like the 
right example.
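
For what it's worth, a minimal sketch (column names assumed, not taken from the ticket) contrasting the two call styles; on the affected versions the string form works and the Column form raises the TypeError:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("id_1", 5, 9), ("id_2", 4, 3)], "id string, point int, count int"
)

# Multiple column names as strings: accepted by drop().
df.drop("point", "count").printSchema()

# Multiple Column objects: raises "each col in the param list should be a
# string" on the affected versions. df["count"] is used instead of df.count,
# which would resolve to the DataFrame.count method rather than the column.
df.drop(df["point"], df["count"]).printSchema()
{code}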

> pyspark drop doesn't accept *cols 
> --
>
> Key: SPARK-39895
> URL: https://issues.apache.org/jira/browse/SPARK-39895
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.3, 3.3.0, 3.2.2
>Reporter: Santosh Pingale
>Assignee: Santosh Pingale
>Priority: Minor
> Fix For: 3.4.0
>
>
> Pyspark dataframe drop has following signature:
> {color:#4c9aff}{{def drop(self, *cols: "ColumnOrName") -> 
> "DataFrame":}}{color}
> However, when we try to pass multiple Column types to the drop function, it 
> raises a TypeError:
> {{each col in the param list should be a string}}
> *Minimal reproducible example:*
> {color:#4c9aff}values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), 
> ("id_1", 3, 3), ("id_2", 4, 3)]{color}
> {color:#4c9aff}df = spark.createDataFrame(values, "id string, point int, 
> count int"){color}
> |– id: string (nullable = true)|
> |– point: integer (nullable = true)|
> |– count: integer (nullable = true)|
> {color:#4c9aff}{{df.drop(df.point, df.count)}}{color}
> {quote}{color:#505f79}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py 
> in drop(self, *cols){color}
> {color:#505f79}2537 for col in cols:{color}
> {color:#505f79}2538 if not isinstance(col, str):{color}
> {color:#505f79}-> 2539 raise TypeError("each col in the param list should be 
> a string"){color}
> {color:#505f79}2540 jdf = self._jdf.drop(self._jseq(cols)){color}
> {color:#505f79}2541{color}
> {color:#505f79}TypeError: each col in the param list should be a string{color}
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598542#comment-17598542
 ] 

Sean R. Owen commented on SPARK-40286:
--

Where is src stored? LOAD DATA should not affect the source, but you are 
OVERWRITEing whatever is in src's storage.
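
As a rough way to check this (sketch only; the table name matches the report, everything else is illustrative), DESCRIBE FORMATTED shows where the managed table keeps its data, which is what OVERWRITE replaces:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) STORED AS textfile")

# The 'Location' row shows the directory backing the managed table; LOAD DATA
# ... OVERWRITE replaces whatever is under that path.
spark.sql("DESCRIBE FORMATTED src") \
    .filter("col_name = 'Location'") \
    .show(truncate=False)
{code}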


> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file is getting wiped from the directory. The file is found, and 
> is populating the table with data. I also tried to add the `Local` clause but 
> that throws an error when looking for the file. When looking through the 
> documentation it doesn't explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39948) exclude velocity 1.5 jar

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598540#comment-17598540
 ] 

Sean R. Owen commented on SPARK-39948:
--

Do any of them affect Spark?

> exclude velocity 1.5 jar
> 
>
> Key: SPARK-39948
> URL: https://issues.apache.org/jira/browse/SPARK-39948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: melin
>Priority: Major
>
> hive-exec pulls in a Velocity dependency. The Velocity version is old and has 
> many security issues.
> https://issues.apache.org/jira/browse/HIVE-25726
>  
> !image-2022-08-02-14-05-55-756.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598539#comment-17598539
 ] 

Sean R. Owen commented on SPARK-39995:
--

Would the Scala version generally matter to Python users who download from 
PyPI? I imagine they're doing something local and probably never care about 
the JVM side.

> PySpark installation doesn't support Scala 2.13 binaries
> 
>
> Key: SPARK-39995
> URL: https://issues.apache.org/jira/browse/SPARK-39995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Oleksandr Shevchenko
>Priority: Major
>
> [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary 
> [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi]
>  for Scala 2.13.
> Currently, the setup 
> [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py]
>  allows to set versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror 
> (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always 
> Scala 2.12 compatible binaries. There isn't any parameter to download 
> "spark-3.3.0-bin-hadoop3-scala2.13.tgz".
> It's possible to download Spark manually and set the needed SPARK_HOME, but 
> it's hard to use with pip or Poetry.
> Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI 
> but not possible with package managers like Poetry.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40023) Issue with Spark Core version 3.3.0

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40023.
--
Resolution: Invalid

> Issue with Spark Core  version 3.3.0
> 
>
> Key: SPARK-40023
> URL: https://issues.apache.org/jira/browse/SPARK-40023
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0
>Reporter: shamim
>Priority: Major
>
> Hi,
> While creating an object in Java with SparkConf myConf = new SparkConf(); it 
> is showing a NoClassDefFoundError.
>  
> Not able to set conf properties. Spark version 3.3.0.
>  
> I have also imported org.apache.spark.SparkConf; but it is still not getting 
> initialized. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40122) py4j-0.10.9.5 often produces "Connection reset by peer" in Spark 3.3.0

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40122.
--
Resolution: Invalid

This by itself doesn't mean anything - it means the Python process died. It 
would have to be more specific to be actionable.

> py4j-0.10.9.5 often produces "Connection reset by peer"  in Spark 3.3.0
> ---
>
> Key: SPARK-40122
> URL: https://issues.apache.org/jira/browse/SPARK-40122
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Ihor Bobak
>Priority: Major
>
> Without any visible reason I am getting this error in my Jupyter notebook 
> (see stacktrace below) with pyspark kernel. Often it occurs even if no Spark 
> operations are made, e.g. when I am working with multiprocessing Pool for a 
> local piece of code that should parallelize on the cores of the driver, with 
> no spark transformations/actions done in that jupyter cell.
> INFO:py4j.clientserver:Error while sending or receiving.
> Traceback (most recent call last):
>   File 
> "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
>  line 503, in send_command
> self.socket.sendall(command.encode("utf-8"))
> ConnectionResetError: [Errno 104] Connection reset by peer
> INFO:py4j.clientserver:Closing down clientserver connection
> INFO:root:Exception while sending command.
> Traceback (most recent call last):
>   File 
> "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
>  line 503, in send_command
> self.socket.sendall(command.encode("utf-8"))
> ConnectionResetError: [Errno 104] Connection reset by peer
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1038, in send_command
> response = connection.send_command(command)
>   File 
> "/opt/spark-3.3.0-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py",
>  line 506, in send_command
> raise Py4JNetworkError(
> py4j.protocol.Py4JNetworkError: Error while sending
> INFO:py4j.clientserver:Closing down clientserver connection



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40123) Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598536#comment-17598536
 ] 

Sean R. Owen commented on SPARK-40123:
--

Mesos is deprecated, but if you want you can open a PR to update this. I think 
this is going away in Spark 4, but it's probably fine to just update the lib now.

> Security Vulnerability CVE-2018-11793 due to mesos-1.4.3-shaded-protobuf.jar
> 
>
> Key: SPARK-40123
> URL: https://issues.apache.org/jira/browse/SPARK-40123
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.3.0
>Reporter: manohar
>Priority: Major
>  Labels: security-issue
>
> Hello Team,
> We are facing this vulnerability on the Spark 3.3.3 installation. Can we 
> please upgrade the version of Mesos in our installation to address this 
> vulnerability?
> ||Package||cve||cvss||severity||pkg_version||fixed_in_pkg||pkg_path||
> |1|org.apache.mesos_mesos|CVE-2018-11793|7|high|1.4.0|1.7.1, 1.6.2, 1.5.2, 
> 1.4.3|/opt/domino/spark/python/build/lib/pyspark/jars/mesos-1.4.0-shaded-protobuf.jar|
> In our source code I found that the dependent version of the Mesos jar is 1.4.3 
> user@ThinkPad-E14-02:~/Downloads/spark-master$ grep -ir mesos- * 
> core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala: * 
> TaskSchedulerImpl. We assume a Mesos-like model where the application gets 
> resource offers as
> *dev/deps/spark-deps-hadoop-2-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
> dev/deps/spark-deps-hadoop-3-hive-2.3:mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
> *



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40126) Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical vulnerability

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40126.
--
Resolution: Invalid

> Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical 
> vulnerability
> 
>
> Key: SPARK-40126
> URL: https://issues.apache.org/jira/browse/SPARK-40126
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Jason Tan
>Priority: Major
>
> Dear Spark Team,
> Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I 
> discovered the following vulnerability/scan results within the image :
>       Type:            VULNERABILITY
>       Name:            DSA-5169-1
>       CVSS Score v3:   9.8
>       Severity:        critical
> The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to 
> upgrade the version of openssl to 1.1.1n-0+deb11u3
> Steps to reproduce:
> Install trivy [https://aquasecurity.github.io/trivy/v0.18.3/installation/]
> trivy image docker.io/apache/spark:v3.3.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40126) Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical vulnerability

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598534#comment-17598534
 ] 

Sean R. Owen commented on SPARK-40126:
--

This isn't part of Spark. You're looking at some convenience distribution in a 
Docker image

> Security scanning spark v3.3.0 docker image results in DSA-5169-1 critical 
> vulnerability
> 
>
> Key: SPARK-40126
> URL: https://issues.apache.org/jira/browse/SPARK-40126
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Jason Tan
>Priority: Major
>
> Dear Spark Team,
> Whilst security scanning the docker image: docker.io/apache/spark:v3.3.0, I 
> discovered the following vulnerability/scan results within the image :
>       Type:            VULNERABILITY
>       Name:            DSA-5169-1
>       CVSS Score v3:   9.8
>       Severity:        critical
> The advice from [https://www.debian.org/security/2022/dsa-5169] suggests to 
> upgrade the version of openssl to 1.1.1n-0+deb11u3
> Steps to reproduce:
> Install trivy [https://aquasecurity.github.io/trivy/v0.18.3/installation/]
> trivy image docker.io/apache/spark:v3.3.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40170) StringCoding UTF8 decode slowly

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40170.
--
Resolution: Invalid

> StringCoding UTF8 decode slowly
> ---
>
> Key: SPARK-40170
> URL: https://issues.apache.org/jira/browse/SPARK-40170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Priority: Major
> Attachments: image-2022-08-22-10-56-54-768.png, 
> image-2022-08-22-10-57-11-744.png
>
>
> When `UnsafeRow` is converted to `Row` at 
> `org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow
>  `, the UTF8String decoding and copyMemory processes are very slow.
> Does anyone have any ideas for optimization?
> !image-2022-08-22-10-56-54-768.png!
>  
> !image-2022-08-22-10-57-11-744.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40200) unpersist cascades with Kryo, MEMORY_AND_DISK_SER and monotonically_increasing_id

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598532#comment-17598532
 ] 

Sean R. Owen commented on SPARK-40200:
--

I can't make out what this is reporting. Please start over and explain the 
expected vs. actual behavior clearly, with as simple an example as possible.

> unpersist cascades with Kryo, MEMORY_AND_DISK_SER and 
> monotonically_increasing_id
> -
>
> Key: SPARK-40200
> URL: https://issues.apache.org/jira/browse/SPARK-40200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1, 3.3.0
> Environment: spark-3.3.0
>Reporter: Calvin Pietersen
>Priority: Major
>
> Unpersist of a parent dataset which has a column from 
> _*monotonically_increasing_id*_ cascades to a child dataset when
>  * joined on another dataset
>  * kryo serialization is enabled
>  * storage level is MEMORY_AND_DISK_SER
>  * not all rows join
>  
> {code:java}
> import org.apache.spark.sql.functions.monotonically_increasing_id
> import org.apache.spark.storage.StorageLevel
> case class a(value: String, id: Long)
> val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
> //val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade
> val acc = sc.longAccumulator("acc")
> val parent1DS = spark.createDataset(Seq("a", "b", "c"))
>  .withColumn("id", monotonically_increasing_id)
>  .as[a]
>  .persist(storageLevel)
> val parent2DS = spark.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
>  .persist(storageLevel)
> val childDS = parent1DS.joinWith(parent2DS, parent1DS("id") === 
> parent2DS("value"))
>.map(i =>{ 
>   acc.add(1) 
>   i
> }).persist(storageLevel)
> childDS.count
> parent1DS.unpersist
> childDS.count
> acc.value should be(2) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40200) unpersist cascades with Kryo, MEMORY_AND_DISK_SER and monotonically_increasing_id

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40200.
--
Resolution: Invalid

> unpersist cascades with Kryo, MEMORY_AND_DISK_SER and 
> monotonically_increasing_id
> -
>
> Key: SPARK-40200
> URL: https://issues.apache.org/jira/browse/SPARK-40200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1, 3.3.0
> Environment: spark-3.3.0
>Reporter: Calvin Pietersen
>Priority: Major
>
> Unpersist of a parent dataset which has a column from 
> _*monotonically_increasing_id*_ cascades to a child dataset when
>  * joined on another dataset
>  * kryo serialization is enabled
>  * storage level is MEMORY_AND_DISK_SER
>  * not all rows join
>  
> {code:java}
> import org.apache.spark.sql.functions.monotonically_increasing_id
> import org.apache.spark.storage.StorageLevel
> case class a(value: String, id: Long)
> val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
> //val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade
> val acc = sc.longAccumulator("acc")
> val parent1DS = spark.createDataset(Seq("a", "b", "c"))
>  .withColumn("id", monotonically_increasing_id)
>  .as[a]
>  .persist(storageLevel)
> val parent2DS = spark.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
>  .persist(storageLevel)
> val childDS = parent1DS.joinWith(parent2DS, parent1DS("id") === 
> parent2DS("value"))
>.map(i =>{ 
>   acc.add(1) 
>   i
> }).persist(storageLevel)
> childDS.count
> parent1DS.unpersist
> childDS.count
> acc.value should be(2) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40232) KMeans: high variability in results despite high initSteps parameter value

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40232.
--
Resolution: Not A Problem

No, initSteps controls an aspect of the initialization; I don't think you want 
to change it. I would expect potentially different results with different seeds 
and initializations. Maybe not really different results, but I don't know if 
your maxIter is high enough or whether the comparison to sklearn is apples to 
apples. Too many variables.
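
If the goal is sklearn-style restarts, a minimal sketch (assuming the `assembled_data` frame from the report) is to fit several times with different seeds and keep the model with the lowest trainingCost:

{code:python}
from pyspark.ml.clustering import KMeans

best_model, best_cost = None, float("inf")
for seed in range(10):
    # initSteps is left at its default; only the seed changes between runs.
    model = KMeans(featuresCol="features", k=5, maxIter=300, seed=seed).fit(assembled_data)
    cost = model.summary.trainingCost
    if cost < best_cost:
        best_model, best_cost = model, cost

print(best_cost)
{code}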

> KMeans: high variability in results despite high initSteps parameter value
> --
>
> Key: SPARK-40232
> URL: https://issues.apache.org/jira/browse/SPARK-40232
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Patryk Piekarski
>Priority: Major
> Attachments: sample_data.csv
>
>
> I'm running KMeans on a sample dataset using PySpark. I want the results to 
> be fairly stable, so I play with the _initSteps_ parameter. My understanding 
> is that the higher the number of steps for k-means|| initialization mode, the 
> higher the number of iterations the algorithm runs and in the end selects the 
> best model out of all iterations. And that's the behavior I observe when 
> running sklearn implementation with _n_init_ >= 10. However, when running 
> PySpark implementation, regardless of the number of partitions of underlying 
> data frame (tested on 1, 4, 8 number of partitions), even when setting 
> _initSteps_ to 10, 50, or let's say 500, the results I get with different 
> seeds are different and trainingCost value I observe is sometimes far from 
> the lowest.
> As a workaround, to force the algorithm to iterate and select the best model 
> I used a loop with dynamic seed.
> SKlearn in each iteration gets the trainingCost near 276655.
> PySpark implementation of KMeans gets there in the 2nd, 5th and 6th 
> iteration, but all the remaining iterations yield higher values.
> Does the _initSteps_ parameter work as expected? Because my findings suggest 
> that something might be off here.
> Let me know where I could upload this sample dataset (2MB)
>  
> {code:java}
> import pandas as pd
> from sklearn.cluster import KMeans as KMeansSKlearn
> df = pd.read_csv('sample_data.csv')
> minimum = 
> for i in range(1,10):
>     kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, 
> random_state=i)
>     model = kmeans.fit(df)
> print(f'Sklearn iteration {i}: {round(model.inertia_)}')
> from pyspark.sql import SparkSession
> spark= SparkSession.builder \
>     .appName("kmeans-test") \
>     .config('spark.driver.memory', '2g') \
>     .master("local[2]") \
>     .getOrCreate()
> df1 = spark.createDataFrame(df)
> from pyspark.ml.clustering import KMeans
> from pyspark.ml.feature import VectorAssembler
> assemble=VectorAssembler(inputCols=df1.columns, outputCol='features')
> assembled_data=assemble.transform(df1)
> minimum = 
> for i in range(1,10):
>     kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, 
> seed=i, tol=0.0001)
>     model = kmeans.fit(assembled_data)
>     summary = model.summary
>     print(f'PySpark iteration {i}: {round(summary.trainingCost)}'){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40233) Unable to load large pandas dataframe to pyspark

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40233.
--
Resolution: Not A Problem

> Unable to load large pandas dataframe to pyspark
> 
>
> Key: SPARK-40233
> URL: https://issues.apache.org/jira/browse/SPARK-40233
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Niranda Perera
>Priority: Major
>
> I've been trying to join two large pandas dataframes using pyspark using the 
> following code. I'm trying to vary executor cores allocated for the 
> application and measure scalability of pyspark (strong scaling).
> {code:java}
> r = 10 # 1Bn rows 
> it = 10
> w = 256
> unique = 0.9
> TOTAL_MEM = 240
> TOTAL_NODES = 14
> max_val = r * unique
> rng = default_rng()
> frame_data = rng.integers(0, max_val, size=(r, 2)) 
> frame_data1 = rng.integers(0, max_val, size=(r, 2)) 
> print(f"data generated", flush=True)
> df_l = pd.DataFrame(frame_data).add_prefix("col")
> df_r = pd.DataFrame(frame_data1).add_prefix("col")
> print(f"data loaded", flush=True)
> procs = int(math.ceil(w / TOTAL_NODES))
> mem = int(TOTAL_MEM*0.9)
> print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", 
> flush=True)
> spark = SparkSession\
> .builder\
> .appName(f'join {r} {w}')\
> .master('spark://node:7077')\
> .config('spark.executor.memory', f'{int(mem*0.6)}g')\
> .config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\
> .config('spark.cores.max', w)\
> .config('spark.driver.memory', '100g')\
> .config('sspark.sql.execution.arrow.pyspark.enabled', 'true')\
> .getOrCreate()
> sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
> sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
> print(f"data loaded to spark", flush=True)
> try:   
> for i in range(it):
> t1 = time.time()
> out = sdf0.join(sdf1, on='col0', how='inner')
> count = out.count()
> t2 = time.time()
> print(f"timings {r} 1 {i} {(t2 - t1) * 1000:.0f} ms, {count}", 
> flush=True)
> 
> del out
> del count
> gc.collect()
> finally:
> spark.stop() {code}
> {*}Cluster{*}: I am using standalone spark cluster in a 15 node cluster with 
> 48 cores and 240GB RAM each. I've spawned master and the driver code in 
> node1, while other 14 nodes have spawned workers allocating maximum memory. 
> In the spark context, I am reserving 90% of total memory to executor, 
> splitting 60% to jvm and 40% to pyspark.
> {*}Issue{*}: When I run the above program, I can see that the executors are 
> being assigned to the app. But it doesn't move forward, even after 60 mins. 
> For smaller row count (10M), this was working without a problem. Driver output
> {code:java}
> world sz 256 procs per worker 19 mem 216 iter 8
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> /N/u/d/dnperera/.conda/envs/env/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425:
>  UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Negative initial size: -589934400
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warn(msg) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40233) Unable to load large pandas dataframe to pyspark

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598524#comment-17598524
 ] 

Sean R. Owen commented on SPARK-40233:
--

This is more a problem with trying to send a huge amount of data from the 
driver and failing to serialize it from pandas; it is not a Spark issue, nor 
really a good use case for Spark.
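
One hedged workaround sketch (the paths are placeholders, and to_parquet needs pyarrow installed): hand the data to Spark through shared storage instead of serializing the pandas frames from the driver:

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the report's billion-row frame.
df_l = pd.DataFrame({"col0": [1, 2, 3], "col1": [4, 5, 6]})
df_l.to_parquet("/shared/path/left.parquet")      # one-off write outside Spark

# Executors read the files in parallel; nothing is shipped from the driver.
sdf0 = spark.read.parquet("/shared/path/left.parquet")
sdf0.show()
{code}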

> Unable to load large pandas dataframe to pyspark
> 
>
> Key: SPARK-40233
> URL: https://issues.apache.org/jira/browse/SPARK-40233
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Niranda Perera
>Priority: Major
>
> I've been trying to join two large pandas dataframes using pyspark using the 
> following code. I'm trying to vary executor cores allocated for the 
> application and measure scalability of pyspark (strong scaling).
> {code:java}
> r = 10 # 1Bn rows 
> it = 10
> w = 256
> unique = 0.9
> TOTAL_MEM = 240
> TOTAL_NODES = 14
> max_val = r * unique
> rng = default_rng()
> frame_data = rng.integers(0, max_val, size=(r, 2)) 
> frame_data1 = rng.integers(0, max_val, size=(r, 2)) 
> print(f"data generated", flush=True)
> df_l = pd.DataFrame(frame_data).add_prefix("col")
> df_r = pd.DataFrame(frame_data1).add_prefix("col")
> print(f"data loaded", flush=True)
> procs = int(math.ceil(w / TOTAL_NODES))
> mem = int(TOTAL_MEM*0.9)
> print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", 
> flush=True)
> spark = SparkSession\
> .builder\
> .appName(f'join {r} {w}')\
> .master('spark://node:7077')\
> .config('spark.executor.memory', f'{int(mem*0.6)}g')\
> .config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\
> .config('spark.cores.max', w)\
> .config('spark.driver.memory', '100g')\
> .config('sspark.sql.execution.arrow.pyspark.enabled', 'true')\
> .getOrCreate()
> sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
> sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
> print(f"data loaded to spark", flush=True)
> try:   
> for i in range(it):
> t1 = time.time()
> out = sdf0.join(sdf1, on='col0', how='inner')
> count = out.count()
> t2 = time.time()
> print(f"timings {r} 1 {i} {(t2 - t1) * 1000:.0f} ms, {count}", 
> flush=True)
> 
> del out
> del count
> gc.collect()
> finally:
> spark.stop() {code}
> {*}Cluster{*}: I am using standalone spark cluster in a 15 node cluster with 
> 48 cores and 240GB RAM each. I've spawned master and the driver code in 
> node1, while other 14 nodes have spawned workers allocating maximum memory. 
> In the spark context, I am reserving 90% of total memory to executor, 
> splitting 60% to jvm and 40% to pyspark.
> {*}Issue{*}: When I run the above program, I can see that the executors are 
> being assigned to the app. But it doesn't move forward, even after 60 mins. 
> For smaller row count (10M), this was working without a problem. Driver output
> {code:java}
> world sz 256 procs per worker 19 mem 216 iter 8
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> /N/u/d/dnperera/.conda/envs/env/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425:
>  UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Negative initial size: -589934400
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warn(msg) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40237) Can't get JDBC type for map in Spark 3.3.0 and PostgreSQL

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40237:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> Can't get JDBC type for map in Spark 3.3.0 and PostgreSQL
> 
>
> Key: SPARK-40237
> URL: https://issues.apache.org/jira/browse/SPARK-40237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: Linux 5.15.0-46-generic #49~20.04.1-Ubuntu SMP Thu Aug 4 
> 19:15:44 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> 22/08/27 00:30:01 INFO SparkContext: Running Spark version 3.3.0
>Reporter: Igor Suhorukov
>Priority: Minor
>
>  
> The exception 'Can't get JDBC type for map' happens when I try to 
> save a dataset into PostgreSQL via JDBC.
> {code:java}
> Dataset ds = ...;
> ds.write().mode(SaveMode.Overwrite)
> .option("truncate", true).format("jdbc")
> .option("url", "jdbc:postgresql://127.0.0.1:5432/???")
> .option("dbtable", "t1")
> .option("isolationLevel", "NONE")
> .option("user", "???")
> .option("password", "???")
> .save();
> {code}
>  
> This issue is related to the unimplemented PostgresDialect#getJDBCType, and it 
> is strange, as PostgreSQL supports hstore/json/jsonb types suitable for 
> storing a map. The PostgreSQL JDBC driver supports writing hstore by using 
> statement.setObject(parameterIndex, map) for hstore, and 
> statement.setString(parameterIndex, map) with cast(? as JSON). 
>  
>  
>  
> {code:java}
> java.lang.IllegalArgumentException: Can't get JDBC type for map
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGetJdbcTypeError(QueryExecutionErrors.scala:810)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getJdbcType$2(JdbcUtils.scala:162)
>     at scala.Option.getOrElse(Option.scala:201)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getJdbcType(JdbcUtils.scala:162)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$schemaString$4(JdbcUtils.scala:782)
>     at scala.collection.immutable.Map$EmptyMap$.getOrElse(Map.scala:228)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$schemaString$3(JdbcUtils.scala:782)
>     at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(JdbcUtils.scala:779)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:883)
>     at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:81)
>     at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>     at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>     at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>     at 
> 

[jira] [Commented] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598522#comment-17598522
 ] 

Sean R. Owen commented on SPARK-40274:
--

Yes, it is at least not clear it's due to you using a different version of 
Jackson than Spark does. Try not using your own version

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux
>Reporter: 张刘强
>Priority: Major
> Attachments: code.scala, error.txt, pom.txt
>
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> When using dataframe.count, it throws this exception:
>  
> The stacktrace looks like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>         at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>         at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174)
>         at scala.collection.immutable.List.flatMap(List.scala:366)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:20)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:28)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:80)
>         at 
> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:490)
>         at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:380)
>         at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:308)
>         at 
> 

[jira] [Resolved] (SPARK-40277) Use DataFrame's column for referring to DDL schema for from_csv() and from_json()

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40277.
--
Resolution: Invalid

This doesn't state any problem or specific change

> Use DataFrame's column for referring to DDL schema for from_csv() and 
> from_json()
> -
>
> Key: SPARK-40277
> URL: https://issues.apache.org/jira/browse/SPARK-40277
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jayant Kumar
>Priority: Major
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> With Spark's DataFrame API one has to explicitly pass the StructType to 
> functions like from_csv and from_json. This works okay in general.
> In certain circumstances, when the schema depends on one of the DataFrame's 
> fields, it gets complicated and one has to switch to RDDs. This requires 
> additional libraries to be added, with additional parsing logic.
> I am trying to explore a way to enable such use cases with the DataFrame API 
> and the functions themselves. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40282.
--
Resolution: Not A Problem

> DataType argument in StructType.add is incorrectly throwing scala.MatchError
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Major
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/resources}}   
>  # Compile the project with Java 11
>  # Run - it will give you error.
> {code:java}
> Exception in thread "main" scala.MatchError: 
> org.apache.spark.sql.types.IntegerType@363fe35a (of class 
> org.apache.spark.sql.types.IntegerType)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236)
>     at 
> org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.(objects.scala:1890)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
>     at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
>     at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
>     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
>     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
>     at 
> org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444)
>     at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) 
> {code}
>  # Now change line (commented as HERE) - to have a String value i.e. "Integer"
>  # It works
> *Ask*
>  # Why does it not accept IntegerType, StringType as DataType as part of the 
> parameters supplied through {{add}} function in {{StructType}} ?
>  # If this is a bug, do we know when the fix can come?
>    



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598515#comment-17598515
 ] 

Sean R. Owen commented on SPARK-40282:
--

Try just IntegerType (no parens) as in Scala; otherwise this is a Kotlin thing

> DataType argument in StructType.add is incorrectly throwing scala.MatchError
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Major
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/resources}}   
>  # Compile the project with Java 11
>  # Run - it will give you error.
> {code:java}
> Exception in thread "main" scala.MatchError: 
> org.apache.spark.sql.types.IntegerType@363fe35a (of class 
> org.apache.spark.sql.types.IntegerType)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236)
>     at 
> org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.(objects.scala:1890)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
>     at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
>     at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
>     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
>     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
>     at 
> org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444)
>     at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) 
> {code}
>  # Now change line (commented as HERE) - to have a String value i.e. "Integer"
>  # It works
> *Ask*
>  # Why does it not accept IntegerType, StringType as DataType as part of the 
> parameters supplied through {{add}} function in {{StructType}} ?
>  # If this is a bug, do we know when the fix can come?
>    



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598511#comment-17598511
 ] 

Sean R. Owen commented on SPARK-40284:
--

You have a race condition where two requests try to delete then write. I don't 
think this is a Spark issue.
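
One way around the race, sketched only (the paths are placeholders and the publish step is not shown): give each request its own staging directory and promote it afterwards, rather than letting two jobs overwrite the same path concurrently:

{code:python}
import uuid
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each request writes to a unique staging path, so concurrent jobs never delete
# each other's output or _temporary directories.
staging = f"/data/output/_staging/{uuid.uuid4()}"
spark.range(10).write.mode("overwrite").parquet(staging)

# A single coordinator then promotes the staging directory to the final
# location (filesystem rename or a metastore location swap).
{code}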

> spark  concurrent overwrite mode writes data to files in HDFS format, all 
> request data write success
> 
>
> Key: SPARK-40284
> URL: https://issues.apache.org/jira/browse/SPARK-40284
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Liu
>Priority: Major
>
> We use Spark as a service. The same Spark service needs to handle multiple 
> requests, but I have a problem with this
> When multiple requests overwrite the same directory at the same time, the 
> results of both overwrite requests may be written successfully. I think this 
> does not meet the definition of an overwrite write.
> First I ran write SQL1, then I ran write SQL2, and I found that both sets of 
> data were written in the end, which I thought was unreasonable.
> {code:java}
> sparkSession.udf.register("sleep",  (time: Long) => Thread.sleep(time))
> -- write sql1
> sparkSession.sql("select 1 as id, sleep(4) as 
> time").write.mode(SaveMode.Overwrite).parquet("path")
> -- write sql2
>  sparkSession.sql("select 2 as id, 1 as 
> time").write.mode(SaveMode.Overwrite).parquet("path") {code}
> When the spark source, and I saw that all these logic in 
> InsertIntoHadoopFsRelationCommand this class.
>  
> When the target directory already exists, Spark directly deletes the target 
> directory and writes to the _temporary directory that it requests. However, 
> when multiple requests are written, the data will all append in; For example, 
> in Write SQL above, this procedure occurs
> 1. excute write sql1, spark  create the _temporary directory for SQL1, and 
> continue
> 2. excute write sql2 , spark will  delete the entire target directory and 
> create its own 
> _temporary
> 3. sql2 writes  its data
> 4. sql1 complete the calculation, The corresponding _temporary /0/attemp_id 
> directory does not exist and so the request fail. However, the task is 
> retried, but the _temporary  directory is not deleted when the task is 
> retried. Therefore, the execution result of sql1  result is append to the 
> target directory 
>  
> Based on the above process, the write process, can  spark do a directory 
> check before the write task or some other way to avoid this kind of problem?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40285:
-
Priority: Minor  (was: Major)

> Simplify the roundTo[Numeric] for Decimal
> -
>
> Key: SPARK-40285
> URL: https://issues.apache.org/jira/browse/SPARK-40285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Minor
>
> Spark Decimal has a lot of methods named roundTo*.
> Except for roundToLong, the rest are redundant.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598510#comment-17598510
 ] 

Sean R. Owen commented on SPARK-40286:
--

There is no delete here. Why do you think Spark is deleting something, rather 
than something else you're doing? What files, and where?

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using Spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a Hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file gets wiped from the directory. The file is found, and it 
> populates the table with data. I also tried adding the `LOCAL` clause, but 
> that throws an error when looking for the file. The documentation doesn't 
> explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40290) Uncatchable exceptions in SparkSession Java API

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40290.
--
Resolution: Won't Fix

> Uncatchable exceptions in SparkSession Java API
> ---
>
> Key: SPARK-40290
> URL: https://issues.apache.org/jira/browse/SPARK-40290
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Gérald Quintana
>Priority: Major
>
> SparkSession#sql may raise exceptions extending 
> org.apache.spark.sql.AnalysisException, such as 
> org.apache.spark.sql.catalyst.analysis.NoSuchPartitionException. 
> These exceptions extend java.lang.Exception.
> As a result, they are treated as *checked* exceptions by the Java compiler, 
> and they cannot be caught from Java code because the compiler considers that 
> SparkSession#sql does not throw this kind of exception (there is no throws 
> AnalysisException in the signature).
> The only workaround is to catch java.lang.Exception, which produces a very 
> wide catch.
> A cleaner solution would be to make org.apache.spark.sql.AnalysisException 
> extend java.lang.RuntimeException.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40290) Uncatchable exceptions in SparkSession Java API

2022-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598506#comment-17598506
 ] 

Sean R. Owen commented on SPARK-40290:
--

It doesn't make sense to consider it a RuntimeException. You're identifying a 
general issue with how Scala code looks from Java: yes, it is sometimes 
inconvenient, but it is not specific to Spark.
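
For reference, below is a minimal Java sketch of the wide-catch workaround described in the issue; the session settings and the failing query are assumptions made for the example.

{code:java}
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.SparkSession;

public class CatchAnalysisExceptionDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("catch-analysis-exception")
        .getOrCreate();
    try {
      // Fails analysis: the table does not exist.
      spark.sql("SELECT * FROM table_that_does_not_exist").show();
    } catch (Exception e) {
      // As described in the issue, catch (AnalysisException e) does not compile here,
      // because sql() declares no "throws AnalysisException"; the checked exception
      // has to be narrowed manually after a wide catch.
      if (e instanceof AnalysisException) {
        System.err.println("Analysis error: " + e.getMessage());
      } else {
        throw new RuntimeException(e);
      }
    } finally {
      spark.stop();
    }
  }
}
{code}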

> Uncatchable exceptions in SparkSession Java API
> ---
>
> Key: SPARK-40290
> URL: https://issues.apache.org/jira/browse/SPARK-40290
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Gérald Quintana
>Priority: Major
>
> SparkSession#sql may raise exceptions extending 
> org.apache.spark.sql.AnalysisException, such as 
> org.apache.spark.sql.catalyst.analysis.NoSuchPartitionException. 
> These exceptions extend java.lang.Exception.
> As a result, they are treated as *checked* exceptions by the Java compiler, 
> and they cannot be caught from Java code because the compiler considers that 
> SparkSession#sql does not throw this kind of exception (there is no throws 
> AnalysisException in the signature).
> The only workaround is to catch java.lang.Exception, which produces a very 
> wide catch.
> A cleaner solution would be to make org.apache.spark.sql.AnalysisException 
> extend java.lang.RuntimeException.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39708) ALS Model Loading

2022-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39708.
--
Resolution: Not A Problem

> ALS Model Loading
> -
>
> Key: SPARK-39708
> URL: https://issues.apache.org/jira/browse/SPARK-39708
> Project: Spark
>  Issue Type: Question
>  Components: PySpark, Spark Submit
>Affects Versions: 3.2.0
>Reporter: zehra
>Priority: Critical
>  Labels: model, pyspark
>
> I have an ALS model and saved it with this code: 
> {code:java}
>                 als_path = "saved_models/best"
>                 best_model.save(sc, path= als_path){code}
> However, when I try to load this model, it gives this error message:
>  
> {code:java}
>     ---> 10 model2 = ALS.load(als_path)
>     
>     File /usr/local/spark/python/pyspark/ml/util.py:332, in 
> MLReadable.load(cls, path)
>         329 @classmethod
>         330 def load(cls, path):
>         331     """Reads an ML instance from the input path, a shortcut of 
> `read().load(path)`."""
>     --> 332     return cls.read().load(path)
>     
>     File /usr/local/spark/python/pyspark/ml/util.py:282, in 
> JavaMLReader.load(self, path)
>         280 if not isinstance(path, str):
>         281     raise TypeError("path should be a string, got type %s" % 
> type(path))
>     --> 282 java_obj = self._jread.load(path)
>         283 if not hasattr(self._clazz, "_from_java"):
>         284     raise NotImplementedError("This Java ML type cannot be loaded 
> into Python currently: %r"
>         285                               % self._clazz)
>     
>     File 
> /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py:1321, 
> in JavaMember.__call__(self, *args)
>        1315 command = proto.CALL_COMMAND_NAME +\
>        1316     self.command_header +\
>        1317     args_command +\
>        1318     proto.END_COMMAND_PART
>        1320 answer = self.gateway_client.send_command(command)
>     -> 1321 return_value = get_return_value(
>        1322     answer, self.gateway_client, self.target_id, self.name)
>        1324 for temp_arg in temp_args:
>        1325     temp_arg._detach()
>     
>     File /usr/local/spark/python/pyspark/sql/utils.py:111, in 
> capture_sql_exception..deco(*a, **kw)
>         109 def deco(*a, **kw):
>         110     try:
>     --> 111         return f(*a, **kw)
>         112     except py4j.protocol.Py4JJavaError as e:
>         113         converted = convert_exception(e.java_exception)
>     
>     File 
> /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py:326, in 
> get_return_value(answer, gateway_client, target_id, name)
>         324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
>         325 if answer[1] == REFERENCE_TYPE:
>     --> 326     raise Py4JJavaError(
>         327         "An error occurred while calling {0}{1}{2}.\n".
>         328         format(target_id, ".", name), value)
>         329 else:
>         330     raise Py4JError(
>         331         "An error occurred while calling {0}{1}{2}. 
> Trace:\n{3}\n".
>         332         format(target_id, ".", name, value))
>     
>     Py4JJavaError: An error occurred while calling o372.load.
>     : org.json4s.MappingException: Did not find value which can be converted 
> into java.lang.String
>         at org.json4s.reflect.package$.fail(package.scala:53)
>         at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:881)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.json4s.Extraction$.convert(Extraction.scala:881)
>         at org.json4s.Extraction$.$anonfun$extract$10(Extraction.scala:456)
>         at 
> org.json4s.Extraction$.$anonfun$customOrElse$1(Extraction.scala:780)
>  
> {code}
>  
> I tried both `ALS.load` and `ALSModel.load`, as shown in the Apache 
> Spark documentation:
> [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#:~:text=als_path%20%3D%20temp_path%20%2B%20%22/als%22%0A%3E%3E%3E]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40252) Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` Api

2022-08-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40252:


Assignee: Yang Jie

> Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` 
> Api
> ---
>
> Key: SPARK-40252
> URL: https://issues.apache.org/jira/browse/SPARK-40252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Stream.collect(Collectors.joining(delimiter)) is slower than the plain 
> StringJoiner API.
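
For illustration only, a small self-contained Java sketch (not the actual Spark patch) contrasting the two approaches:

{code:java}
import java.util.List;
import java.util.StringJoiner;
import java.util.stream.Collectors;

public class JoinStyles {
  public static void main(String[] args) {
    List<String> parts = List.of("a", "b", "c");

    // Stream pipeline: concise, but pays for the Stream/Collector machinery.
    String viaStream = parts.stream().collect(Collectors.joining(", "));

    // Plain StringJoiner: a simple loop with no intermediate Stream.
    StringJoiner joiner = new StringJoiner(", ");
    for (String part : parts) {
      joiner.add(part);
    }
    String viaJoiner = joiner.toString();

    // Both produce "a, b, c".
    System.out.println(viaStream + " | " + viaJoiner);
  }
}
{code}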



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40252) Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` Api

2022-08-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40252.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37701
[https://github.com/apache/spark/pull/37701]

> Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` 
> Api
> ---
>
> Key: SPARK-40252
> URL: https://issues.apache.org/jira/browse/SPARK-40252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> Stream.collect(Collectors.joining(delimiter)) is slower than the plain 
> StringJoiner API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40192) Remove redundant groupby

2022-08-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40192.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37628
[https://github.com/apache/spark/pull/37628]

> Remove redundant groupby
> 
>
> Key: SPARK-40192
> URL: https://issues.apache.org/jira/browse/SPARK-40192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Assignee: deshanxiao
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40192) Remove redundant groupby

2022-08-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40192:


Assignee: deshanxiao

> Remove redundant groupby
> 
>
> Key: SPARK-40192
> URL: https://issues.apache.org/jira/browse/SPARK-40192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Assignee: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40192) Remove redundant groupby

2022-08-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40192:
-
Priority: Trivial  (was: Minor)

> Remove redundant groupby
> 
>
> Key: SPARK-40192
> URL: https://issues.apache.org/jira/browse/SPARK-40192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Assignee: deshanxiao
>Priority: Trivial
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40152) Codegen compilation error when using split_part

2022-08-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40152.
--
Fix Version/s: 3.4.0
   3.3.1
 Assignee: Yuming Wang
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37589

> Codegen compilation error when using split_part
> ---
>
> Key: SPARK-40152
> URL: https://issues.apache.org/jira/browse/SPARK-40152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> The following query throws an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> ('11.12.13', '.', 3)
> as v1(col1, col2, col3);
> cache table v1;
> SELECT split_part(col1, col2, col3)
> from v1;
> {noformat}
> The error is:
> {noformat}
> 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
>   at 
> org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934)
>   at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887)
>   at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811)
>   at org.codehaus.janino.Parser.parseBlock(Parser.java:1792)
>   at 
> {noformat}
> In the end, {{split_part}} does successfully execute, although in interpreted 
> mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40163) [SPARK][SQL] feat: SparkSession.confing(Map)

2022-08-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40163.
--
Fix Version/s: 3.4.0
 Assignee: seunggabi
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37478

> [SPARK][SQL] feat: SparkSession.confing(Map)
> 
>
> Key: SPARK-40163
> URL: https://issues.apache.org/jira/browse/SPARK-40163
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: seunggabi
>Assignee: seunggabi
>Priority: Trivial
> Fix For: 3.4.0
>
>
> [https://github.com/apache/spark/pull/37478] 
> - as-is
> {code:java}
> private fun config(builder: SparkSession.Builder): SparkSession.Builder {
> val map = YamlUtils.read(this::class.java, "spark", Extension.YAML)
> var b = builder
> map.keys.forEach {
> val k = it
> val v = map[k]
> b = when (v) {
> is Long -> b.config(k, v)
> is String -> b.config(k, v)
> is Double -> b.config(k, v)
> is Boolean -> b.config(k, v)
> else -> b
> }
> }
> return b
> }
> } {code}
> - to-be
> {code:java}
> private fun config(builder: SparkSession.Builder): SparkSession.Builder {
> val map = YamlUtils.read(this::class.java, "spark", Extension.YAML)
> return builder.config(map)
> }
> } {code}
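
As a usage illustration only, here is a minimal Java sketch assuming the java.util.Map overload of SparkSession.Builder#config introduced by the pull request above; the settings shown are arbitrary examples.

{code:java}
import java.util.Map;

import org.apache.spark.sql.SparkSession;

public class ConfigMapDemo {
  public static void main(String[] args) {
    // Mixed value types go through a single call instead of per-type config(...) overloads.
    Map<String, Object> settings = Map.of(
        "spark.app.name", "config-map-demo",
        "spark.sql.shuffle.partitions", 4,
        "spark.sql.adaptive.enabled", true);

    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config(settings)   // assumed overload: config(java.util.Map<String, Object>)
        .getOrCreate();

    System.out.println(spark.conf().get("spark.app.name"));
    spark.stop();
  }
}
{code}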



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40162) Upgrade RoaringBitmap from 0.9.30 to 0.9.31

2022-08-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40162.
--
  Assignee: BingKun Pan
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37597

> Upgrade RoaringBitmap from 0.9.30 to 0.9.31
> ---
>
> Key: SPARK-40162
> URL: https://issues.apache.org/jira/browse/SPARK-40162
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.30...0.9.31
> [simplify BatchIterators, fix bug in 
> advanceIfNeeded|https://github.com/RoaringBitmap/RoaringBitmap/commit/56b1bba400e9f91c682648fa90b890f3a0bb561c]
>  ([#573|https://github.com/RoaringBitmap/RoaringBitmap/pull/573])
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40165) Update test plugins to latest versions

2022-08-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40165.
--
  Assignee: BingKun Pan
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37598

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40165) Update test plugins to latest versions

2022-08-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40165:
-
Priority: Trivial  (was: Minor)

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39975) Upgrade rocksdbjni to 7.4.5

2022-08-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39975.
--
Fix Version/s: 3.4.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37543

> Upgrade rocksdbjni to 7.4.5
> ---
>
> Key: SPARK-39975
> URL: https://issues.apache.org/jira/browse/SPARK-39975
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> [https://github.com/facebook/rocksdb/releases/tag/v7.4.5]
>  
> {code:java}
> Fix a bug starting in 7.4.0 in which some fsync operations might be skipped 
> in a DB after any DropColumnFamily on that DB, until it is re-opened. This 
> can lead to data loss on power loss. (For custom FileSystem implementations, 
> this could lead to FSDirectory::Fsync or FSDirectory::Close after the first 
> FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.) 
> {code}
>  
> Fixed a bug that caused data loss
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams

2022-08-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40132.
--
Fix Version/s: 3.4.0
   3.3.1
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37561

> MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
> 
>
> Key: SPARK-40132
> URL: https://issues.apache.org/jira/browse/SPARK-40132
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
>
> https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in 
> Pyspark ML's classification.py but inadvertently removed the parameter 
> rawPredictionCol from MultilayerPerceptronClassifier's setParams. This causes 
> its constructor to fail when this param is set in the constructor, as it 
> isn't recognized by setParams, called by the constructor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40132) MultilayerPerceptronClassifier rawPredictionCol param missing from setParams

2022-08-17 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-40132:


 Summary: MultilayerPerceptronClassifier rawPredictionCol param 
missing from setParams
 Key: SPARK-40132
 URL: https://issues.apache.org/jira/browse/SPARK-40132
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.3.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen


https://issues.apache.org/jira/browse/SPARK-37398 inlined type hints in Pyspark 
ML's classification.py but inadvertently removed the parameter rawPredictionCol 
from MultilayerPerceptronClassifier's setParams. This causes its constructor to 
fail when this param is set in the constructor, as it isn't recognized by 
setParams, called by the constructor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36511) Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13

2022-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36511.
--
Fix Version/s: 3.4.0
 Assignee: BingKun Pan
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37529

> Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13
> -
>
> Key: SPARK-36511
> URL: https://issues.apache.org/jira/browse/SPARK-36511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> {{ColumnIO}} doesn't expose methods to get the repetition and definition levels, 
> so Spark has to use a hack. This should be removed once PARQUET-2050 is released.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40036) LevelDB/RocksDBIterator.next should return false after iterator or db close

2022-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40036.
--
Fix Version/s: 3.4.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37471

> LevelDB/RocksDBIterator.next should return false after iterator or db close
> ---
>
> Key: SPARK-40036
> URL: https://issues.apache.org/jira/browse/SPARK-40036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> {code:java}
> @Test
> public void testHasNextAndNextAfterIteratorClose() throws Exception {
>   String prefix = "test_db_iter_close.";
>   String suffix = ".ldb";
>   File path = File.createTempFile(prefix, suffix);
>   path.delete();
>   LevelDB db = new LevelDB(path);
>   // Write one record for test
>   db.write(createCustomType1(0));
>   KVStoreIterator<CustomType1> iter =
> db.view(CustomType1.class).closeableIterator();
>   // iter.hasNext should be true
>   assertTrue(iter.hasNext());
>   // close iter
>   iter.close();
>   // iter.hasNext should be false after iter close
>   assertFalse(iter.hasNext());
>   // iter.next should throw NoSuchElementException after iter close
>   assertThrows(NoSuchElementException.class, iter::next);
>   db.close();
>   assertTrue(path.exists());
>   FileUtils.deleteQuietly(path);
>   assertFalse(path.exists());
> }
> @Test
> public void testHasNextAndNextAfterDBClose() throws Exception {
>   String prefix = "test_db_db_close.";
>   String suffix = ".ldb";
>   File path = File.createTempFile(prefix, suffix);
>   path.delete();
>   LevelDB db = new LevelDB(path);
>   // Write one record for test
>   db.write(createCustomType1(0));
>   KVStoreIterator<CustomType1> iter =
> db.view(CustomType1.class).closeableIterator();
>   // iter.hasNext should be true
>   assertTrue(iter.hasNext());
>   // close db
>   db.close();
>   // iter.hasNext should be false after db close
>   assertFalse(iter.hasNext());
>   // iter.next should throw NoSuchElementException after db close
>   assertThrows(NoSuchElementException.class, iter::next);
>   assertTrue(path.exists());
>   FileUtils.deleteQuietly(path);
>   assertFalse(path.exists());
> } {code}
>  
> In both cases above, after the iterator/db is closed, `hasNext` still returns 
> true, and `next` still returns values that were not retrieved before the close.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40042) Make pyspark.sql.streaming.query examples self-contained

2022-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40042.
--
Fix Version/s: 3.4.0
 Assignee: Qian Sun
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37482

> Make pyspark.sql.streaming.query examples self-contained
> 
>
> Key: SPARK-40042
> URL: https://issues.apache.org/jira/browse/SPARK-40042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40042) Make pyspark.sql.streaming.query examples self-contained

2022-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40042:
-
Priority: Minor  (was: Major)

> Make pyspark.sql.streaming.query examples self-contained
> 
>
> Key: SPARK-40042
> URL: https://issues.apache.org/jira/browse/SPARK-40042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40009) Add missing doc string info to DataFrame API

2022-08-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40009.
--
Fix Version/s: 3.4.0
 Assignee: Khalid Mammadov
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37441

> Add missing doc string info to DataFrame API
> 
>
> Key: SPARK-40009
> URL: https://issues.apache.org/jira/browse/SPARK-40009
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Minor
> Fix For: 3.4.0
>
>
> Some of the docstrings in the Python DataFrame API are incomplete; for example, 
> some are missing the Parameters, Returns, or Examples sections. It would help 
> users if we provided this missing information for all methods/functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40058) Avoid filter twice in HadoopFSUtils

2022-08-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40058:


Assignee: ZiyueGuan

> Avoid filter twice in HadoopFSUtils
> ---
>
> Key: SPARK-40058
> URL: https://issues.apache.org/jira/browse/SPARK-40058
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: ZiyueGuan
>Assignee: ZiyueGuan
>Priority: Minor
> Fix For: 3.4.0
>
>
> In HadoopFSUtils, listLeafFiles applies the filter more than once during its 
> recursive calls. This wastes time when the filter logic is heavy. Would like 
> to refactor this.
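
As a generic illustration (not the actual HadoopFSUtils code), here is a Java sketch of the difference between re-applying a filter at every level of the recursion and applying it exactly once per entry:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class LeafListing {
  // Redundant: entries returned by the recursive call are filtered a second time.
  static List<File> listTwiceFiltered(File dir, Predicate<File> filter) {
    List<File> leaves = new ArrayList<>();
    File[] children = dir.listFiles();
    if (children == null) return leaves;
    for (File child : children) {
      if (!filter.test(child)) continue;                 // first application
      if (child.isDirectory()) {
        for (File leaf : listTwiceFiltered(child, filter)) {
          if (filter.test(leaf)) leaves.add(leaf);       // second, redundant application
        }
      } else {
        leaves.add(child);
      }
    }
    return leaves;
  }

  // Each entry is tested exactly once; the recursive result is trusted as already filtered.
  static List<File> listOnceFiltered(File dir, Predicate<File> filter) {
    List<File> leaves = new ArrayList<>();
    File[] children = dir.listFiles();
    if (children == null) return leaves;
    for (File child : children) {
      if (!filter.test(child)) continue;
      if (child.isDirectory()) {
        leaves.addAll(listOnceFiltered(child, filter));
      } else {
        leaves.add(child);
      }
    }
    return leaves;
  }
}
{code}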



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40058) Avoid filter twice in HadoopFSUtils

2022-08-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40058.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37498

> Avoid filter twice in HadoopFSUtils
> ---
>
> Key: SPARK-40058
> URL: https://issues.apache.org/jira/browse/SPARK-40058
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: ZiyueGuan
>Priority: Minor
> Fix For: 3.4.0
>
>
> In HadoopFSUtils, listLeafFiles applies the filter more than once during its 
> recursive calls. This wastes time when the filter logic is heavy. Would like 
> to refactor this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40035) Avoid apply filter twice when listing files

2022-08-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40035.
--
Resolution: Duplicate

> Avoid apply filter twice when listing files
> ---
>
> Key: SPARK-40035
> URL: https://issues.apache.org/jira/browse/SPARK-40035
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: EdisonWang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39982) StructType.fromJson method missing documentation

2022-08-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39982.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37408

> StructType.fromJson method missing documentation
> 
>
> Key: SPARK-39982
> URL: https://issues.apache.org/jira/browse/SPARK-39982
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Trivial
> Fix For: 3.4.0
>
>
> The StructType.fromJson method does not have any documentation. It would be 
> good to add a docstring that explains how to use it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39982) StructType.fromJson method missing documentation

2022-08-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39982:


Assignee: Khalid Mammadov

> StructType.fromJson method missing documentation
> 
>
> Key: SPARK-39982
> URL: https://issues.apache.org/jira/browse/SPARK-39982
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Trivial
>
> The StructType.fromJson method does not have any documentation. It would be 
> good to add a docstring that explains how to use it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40072) MAVEN_OPTS in make-distributions.sh is different from one specified in pom.xml

2022-08-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40072:


  Assignee: YUBI LEE
Issue Type: Improvement  (was: Bug)

> MAVEN_OPTS in make-distributions.sh is different from one specified in pom.xml
> --
>
> Key: SPARK-40072
> URL: https://issues.apache.org/jira/browse/SPARK-40072
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.2
>Reporter: YUBI LEE
>Assignee: YUBI LEE
>Priority: Minor
> Fix For: 3.4.0
>
>
> Building Spark with make-distribution.sh fails with the default settings 
> because the default MAVEN_OPTS differs from the one in pom.xml.
>  It is related to 
> [SPARK-35825|https://issues.apache.org/jira/browse/SPARK-35825].
> PR: https://github.com/apache/spark/pull/37510



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40072) MAVEN_OPTS in make-distributions.sh is different from one specified in pom.xml

2022-08-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40072.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37510

> MAVEN_OPTS in make-distributions.sh is different from one specified in pom.xml
> --
>
> Key: SPARK-40072
> URL: https://issues.apache.org/jira/browse/SPARK-40072
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.2
>Reporter: YUBI LEE
>Priority: Minor
> Fix For: 3.4.0
>
>
> Building Spark with make-distribution.sh fails with the default settings 
> because the default MAVEN_OPTS differs from the one in pom.xml.
>  It is related to 
> [SPARK-35825|https://issues.apache.org/jira/browse/SPARK-35825].
> PR: https://github.com/apache/spark/pull/37510



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40071) Update plugins to latest versions

2022-08-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40071.
--
  Assignee: BingKun Pan
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37506

> Update plugins to latest versions
> -
>
> Key: SPARK-40071
> URL: https://issues.apache.org/jira/browse/SPARK-40071
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> include:
>  # maven-compiler-plugin (from 3.8.1 to 3.10.1)
>  # maven-jar-plugin (from 3.1.2 to 3.2.2)
>  # maven-javadoc-plugin (from 3.1.1 to 3.4.0)
>  # checkstyle (from 8.43 to 9.3)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0

2022-08-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40037.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/37473

> Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
> ---
>
> Key: SPARK-40037
> URL: https://issues.apache.org/jira/browse/SPARK-40037
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0
>
>
> [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647]
> [Info at 
> SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327]
> [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569]
> [Info at 
> SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703]
> [releases log|https://github.com/google/tink/releases/tag/v1.7.0]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0

2022-08-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40037:
-
Priority: Minor  (was: Major)

> Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
> ---
>
> Key: SPARK-40037
> URL: https://issues.apache.org/jira/browse/SPARK-40037
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Minor
>
> [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647]
> [Info at 
> SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327]
> [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569]
> [Info at 
> SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703]
> [releases log|https://github.com/google/tink/releases/tag/v1.7.0]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40037) Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0

2022-08-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40037:


Assignee: Bjørn Jørgensen

> Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
> ---
>
> Key: SPARK-40037
> URL: https://issues.apache.org/jira/browse/SPARK-40037
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0
>
>
> [CVE-2022-25647|https://www.cve.org/CVERecord?id=CVE-2022-25647]
> [Info at 
> SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLECODEGSON-1730327]
> [CVE-2021-22569|https://www.cve.org/CVERecord?id=CVE-2021-22569]
> [Info at 
> SNYK|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-2331703]
> [releases log|https://github.com/google/tink/releases/tag/v1.7.0]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40056:


Assignee: BingKun Pan

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40056:
-
Priority: Trivial  (was: Minor)

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40056.
--
Resolution: Fixed

Issue resolved by pull request 37489
[https://github.com/apache/spark/pull/37489]

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


