[jira] [Comment Edited] (SPARK-36229) conv() inconsistently handles invalid strings with > 64 invalid characters

2021-07-20 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384666#comment-17384666
 ] 

dgd_contributor edited comment on SPARK-36229 at 7/21/21, 5:44 AM:
---

After looking more closely, I found that the overflow check in encode is also 
wrong and needs to be fixed too.

For example:
{code:java}
scala> spark.sql("select conv('aaa0aaa0a', 16, 10)").show

+-----------------------+
|conv(aaa0aaa0a, 16, 10)|
+-----------------------+
|   12297828695278266890|
+-----------------------+{code}
which should be 18446744073709551615.

I will raise a pull request soon.
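
For reference, a minimal sketch (plain Scala, not the actual NumberConverter code) 
of the kind of unsigned-overflow clamp that the expected result above implies; the 
helper name is made up:

{code:scala}
// Sketch of an unsigned-overflow check for the accumulate step
// `acc = acc * radix + digit`. -1L is 0xFFFFFFFFFFFFFFFF, i.e. 2^64 - 1,
// when treated as unsigned.
def accumulateUnsigned(acc: Long, digit: Int, radix: Int): Long = {
  // acc * radix + digit overflows iff acc > (2^64 - 1 - digit) / radix (unsigned math)
  val limit = java.lang.Long.divideUnsigned(-1L - digit, radix.toLong)
  if (java.lang.Long.compareUnsigned(acc, limit) > 0) -1L // clamp to 2^64 - 1
  else acc * radix + digit
}

// java.lang.Long.toUnsignedString(-1L) == "18446744073709551615"
{code}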


was (Author: dc-heros):
After looking more closely, I found that the overflow check in encode is also 
wrong and needs to be fixed too.

For example:
{code:java}
scala> spark.sql("select conv('aaa0aaa0a', 16, 10)").show

+-----------------------+
|conv(aaa0aaa0a, 16, 10)|
+-----------------------+
|   12297828695278266890|
+-----------------------+

which should be 18446744073709551615{code}
I will raise a pull request soon.

> conv() inconsistently handles invalid strings with > 64 invalid characters
> --
>
> Key: SPARK-36229
> URL: https://issues.apache.org/jira/browse/SPARK-36229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tim Armstrong
>Priority: Major
>
> SPARK-33428 fixed ArrayIndexOutOfBoundsException but introduced a new 
> inconsistency in behaviour where the returned value is different above the 
> 64-character threshold.
>  
> {noformat}
> scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
> +---------------------------+
> |conv(repeat(?, 64), 10, 16)|
> +---------------------------+
> |                          0|
> +---------------------------+
> scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
> +---------------------------+
> |conv(repeat(?, 65), 10, 16)|
> +---------------------------+
> |           FFFFFFFFFFFFFFFF|
> +---------------------------+
> scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
> +----------------------------+
> |conv(repeat(?, 65), 10, -16)|
> +----------------------------+
> |                          -1|
> +----------------------------+
> scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
> +----------------------------+
> |conv(repeat(?, 64), 10, -16)|
> +----------------------------+
> |                           0|
> +----------------------------+{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36229) conv() inconsistently handles invalid strings with > 64 invalid characters

2021-07-20 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384666#comment-17384666
 ] 

dgd_contributor commented on SPARK-36229:
-

After looking more closely, I found that the overflow check in encode is also 
wrong and needs to be fixed too.

For example:
{code:java}
scala> spark.sql("select conv('aaa0aaa0a', 16, 10)").show

+-----------------------+
|conv(aaa0aaa0a, 16, 10)|
+-----------------------+
|   12297828695278266890|
+-----------------------+

which should be 18446744073709551615{code}
I will raise a pull request soon.

> conv() inconsistently handles invalid strings with > 64 invalid characters
> --
>
> Key: SPARK-36229
> URL: https://issues.apache.org/jira/browse/SPARK-36229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tim Armstrong
>Priority: Major
>
> SPARK-33428 fixed ArrayIndexOutOfBoundsException but introduced a new 
> inconsistency in behaviour where the returned value is different above the 
> 64-character threshold.
>  
> {noformat}
> scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
> +---------------------------+
> |conv(repeat(?, 64), 10, 16)|
> +---------------------------+
> |                          0|
> +---------------------------+
> scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
> +---------------------------+
> |conv(repeat(?, 65), 10, 16)|
> +---------------------------+
> |           FFFFFFFFFFFFFFFF|
> +---------------------------+
> scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
> +----------------------------+
> |conv(repeat(?, 65), 10, -16)|
> +----------------------------+
> |                          -1|
> +----------------------------+
> scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
> +----------------------------+
> |conv(repeat(?, 64), 10, -16)|
> +----------------------------+
> |                           0|
> +----------------------------+{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2021-07-20 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384660#comment-17384660
 ] 

Gengliang Wang commented on SPARK-25075:


[~dongjoon] Thanks for the ping. Yes, we should have both Scala 2.12/2.13 
artifacts in Spark 3.2.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36030) Support DS v2 metrics at writing path

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384659#comment-17384659
 ] 

Apache Spark commented on SPARK-36030:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33454

> Support DS v2 metrics at writing path
> -
>
> Key: SPARK-36030
> URL: https://issues.apache.org/jira/browse/SPARK-36030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We added the interface for DS v2 metrics in SPARK-34366, but only for the 
> read path. We should add write path support as well.
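
For context, a sketch of the DS v2 metric interfaces introduced by SPARK-34366 
that this ticket extends to the write path; the class and metric names below are 
made up:

{code:scala}
import org.apache.spark.sql.connector.metric.{CustomMetric, CustomTaskMetric}

// Driver-side metric: aggregates the per-task values reported by tasks.
class BytesWrittenMetric extends CustomMetric {
  override def name(): String = "bytesWritten"
  override def description(): String = "total bytes written by the data source"
  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String =
    taskMetrics.sum.toString
}

// Task-side metric value a writer would report.
class BytesWrittenTaskMetric(bytes: Long) extends CustomTaskMetric {
  override def name(): String = "bytesWritten"
  override def value(): Long = bytes
}
{code}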



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36236) RocksDB state store: Add additional metrics for better observability into state store operations

2021-07-20 Thread Venki Korukanti (Jira)
Venki Korukanti created SPARK-36236:
---

 Summary: RocksDB state store: Add additional metrics for better 
observability into state store operations
 Key: SPARK-36236
 URL: https://issues.apache.org/jira/browse/SPARK-36236
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.1.2
Reporter: Venki Korukanti


Proposing to add the following new metrics to {{customMetrics}} under 
{{stateOperators}} in the {{StreamingQueryProgress}} event. These metrics give 
better visibility into the RocksDB-based state store in streaming jobs (see the 
sketch after the list).
 * {{rocksdbGetCount}}: number of get calls to the DB (doesn't include gets from 
WriteBatch, the in-memory batch used for staging writes)
 * {{rocksdbPutCount}}: number of put calls to the DB (doesn't include puts to 
WriteBatch, the in-memory batch used for staging writes)
 * {{rocksdbTotalBytesReadByGet}}/{{rocksdbTotalBytesWrittenByPut}}: number of 
uncompressed bytes read/written by get/put operations
 * {{rocksdbReadBlockCacheHitCount}}/{{rocksdbReadBlockCacheMissCount}}: indicate 
how useful the RocksDB block cache is in avoiding local disk reads
 * {{rocksdbTotalBytesReadByCompaction}}/{{rocksdbTotalBytesWrittenByCompaction}}: 
how many bytes the compaction process reads from and writes to disk
 * {{rocksdbTotalCompactionTime}}: time (in ns) taken by compactions (both 
background and the optional compaction initiated during the commit)
 * {{rocksdbWriterStallDuration}}: time (in ns) the writer has stalled due to a 
background compaction or flushing of the immutable memtables to disk
 * {{rocksdbTotalBytesReadThroughIterator}}: some stateful operations (such as 
timeout processing in FlatMapGroupsWithState and watermarking) require reading 
all data in the DB through an iterator; this metric gives the total size of 
uncompressed data read via the iterator
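
A sketch of how the proposed metrics would surface to users: they would appear in 
the customMetrics map of each state operator in StreamingQueryProgress (the metric 
names below are the proposed names from this ticket, not yet in any released Spark 
version):

{code:scala}
import org.apache.spark.sql.streaming.StreamingQuery

def reportRocksDBMetrics(query: StreamingQuery): Unit = {
  Option(query.lastProgress).toSeq.flatMap(_.stateOperators).foreach { op =>
    val m = op.customMetrics // java.util.Map[String, java.lang.Long]
    println(s"gets=${m.get("rocksdbGetCount")} " +
      s"puts=${m.get("rocksdbPutCount")} " +
      s"writerStallNs=${m.get("rocksdbWriterStallDuration")}")
  }
}
{code}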



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36030) Support DS v2 metrics at writing path

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384658#comment-17384658
 ] 

Apache Spark commented on SPARK-36030:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33454

> Support DS v2 metrics at writing path
> -
>
> Key: SPARK-36030
> URL: https://issues.apache.org/jira/browse/SPARK-36030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We added the interface for DS v2 metrics in SPARK-34366, but only for the 
> read path. We should add write path support as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36030) Support DS v2 metrics at writing path

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384657#comment-17384657
 ] 

Apache Spark commented on SPARK-36030:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33453

> Support DS v2 metrics at writing path
> -
>
> Key: SPARK-36030
> URL: https://issues.apache.org/jira/browse/SPARK-36030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We added the interface for DS v2 metrics in SPARK-34366, but only for the 
> read path. We should add write path support as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36030) Support DS v2 metrics at writing path

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384654#comment-17384654
 ] 

Apache Spark commented on SPARK-36030:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33452

> Support DS v2 metrics at writing path
> -
>
> Key: SPARK-36030
> URL: https://issues.apache.org/jira/browse/SPARK-36030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We added the interface for DS v2 metrics in SPARK-34366, but only for the 
> read path. We should add write path support as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36030) Support DS v2 metrics at writing path

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384653#comment-17384653
 ] 

Apache Spark commented on SPARK-36030:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33452

> Support DS v2 metrics at writing path
> -
>
> Key: SPARK-36030
> URL: https://issues.apache.org/jira/browse/SPARK-36030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We added the interface for DS v2 metrics in SPARK-34366, but only for the 
> read path. We should add write path support as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36206) Diagnose shuffle data corruption by checksum

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384643#comment-17384643
 ] 

Apache Spark commented on SPARK-36206:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/33451

> Diagnose shuffle data corruption by checksum
> 
>
> Key: SPARK-36206
> URL: https://issues.apache.org/jira/browse/SPARK-36206
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: wuyi
>Priority: Major
>
> After adding checksums in SPARK-35276, we can now leverage them to diagnose 
> shuffle data corruption.
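
As a generic illustration of the diagnosis idea (not Spark's internal code): 
recompute a checksum over the fetched shuffle block and compare it with the 
checksum recorded at write time to decide whether the data got corrupted:

{code:scala}
import java.util.zip.CRC32

def crc32(bytes: Array[Byte]): Long = {
  val c = new CRC32()
  c.update(bytes)
  c.getValue
}

val original = Array.fill[Byte](1024)(1)
val writeTimeChecksum = crc32(original)          // stored alongside the shuffle block
val fetched = original.clone(); fetched(0) = 42  // simulate corruption in transit or on disk
assert(crc32(fetched) != writeTimeChecksum)      // mismatch => corruption detected
{code}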



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36206) Diagnose shuffle data corruption by checksum

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36206:


Assignee: Apache Spark

> Diagnose shuffle data corruption by checksum
> 
>
> Key: SPARK-36206
> URL: https://issues.apache.org/jira/browse/SPARK-36206
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> After adding checksums in SPARK-35276, we can now leverage them to diagnose 
> shuffle data corruption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36206) Diagnose shuffle data corruption by checksum

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384642#comment-17384642
 ] 

Apache Spark commented on SPARK-36206:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/33451

> Diagnose shuffle data corruption by checksum
> 
>
> Key: SPARK-36206
> URL: https://issues.apache.org/jira/browse/SPARK-36206
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: wuyi
>Priority: Major
>
> After adding checksums in SPARK-35276, we can now leverage them to diagnose 
> shuffle data corruption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36206) Diagnose shuffle data corruption by checksum

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36206:


Assignee: (was: Apache Spark)

> Diagnose shuffle data corruption by checksum
> 
>
> Key: SPARK-36206
> URL: https://issues.apache.org/jira/browse/SPARK-36206
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: wuyi
>Priority: Major
>
> After adding checksums in SPARK-35276, we can now leverage them to diagnose 
> shuffle data corruption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35809) Add `index_col` argument for ps.sql.

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384639#comment-17384639
 ] 

Apache Spark commented on SPARK-35809:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33450

> Add `index_col` argument for ps.sql.
> 
>
> Key: SPARK-35809
> URL: https://issues.apache.org/jira/browse/SPARK-35809
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The current behavior of 
> [ps.sql|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.sql.html] 
> always loses the index, so we should add an `index_col` argument to this API 
> so that we can preserve the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35809) Add `index_col` argument for ps.sql.

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35809:


Assignee: Apache Spark

> Add `index_col` argument for ps.sql.
> 
>
> Key: SPARK-35809
> URL: https://issues.apache.org/jira/browse/SPARK-35809
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> The current behavior of 
> [ps.sql|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.sql.html] 
> always loses the index, so we should add an `index_col` argument to this API 
> so that we can preserve the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35809) Add `index_col` argument for ps.sql.

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35809:


Assignee: (was: Apache Spark)

> Add `index_col` argument for ps.sql.
> 
>
> Key: SPARK-35809
> URL: https://issues.apache.org/jira/browse/SPARK-35809
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The current behavior of 
> [ps.sql|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.sql.html] 
> always loses the index, so we should add an `index_col` argument to this API 
> so that we can preserve the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35809) Add `index_col` argument for ps.sql.

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384636#comment-17384636
 ] 

Apache Spark commented on SPARK-35809:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33450

> Add `index_col` argument for ps.sql.
> 
>
> Key: SPARK-35809
> URL: https://issues.apache.org/jira/browse/SPARK-35809
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The current behavior of 
> [ps.sql|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.sql.html] 
> always loses the index, so we should add an `index_col` argument to this API 
> so that we can preserve the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36235) Supported datatype check should check inner field

2021-07-20 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-36235.
---
Resolution: Not A Bug

> Supported datatype check should check inner field
> -
>
> Key: SPARK-36235
> URL: https://issues.apache.org/jira/browse/SPARK-36235
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10816) EventTime based sessionization (session window)

2021-07-20 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384626#comment-17384626
 ] 

Jungtaek Lim commented on SPARK-10816:
--

I don't think I could have finalized this one without your dedicated help. 
Thanks [~viirya] [~XuanYuan]!

> EventTime based sessionization (session window)
> ---
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>
> Currently Structured Streaming supports two kinds of windows: tumbling window 
> and sliding window. Another useful window function is the session window, 
> which is not supported by SS.
> Unlike time windows (tumbling and sliding), a session window doesn't have a 
> static window begin and end time. Session window creation depends on a 
> defined session gap, which can be static or dynamic.
> With a static session gap, events that fall within a certain period of time 
> (the gap) are considered one session window. A session window closes when it 
> does not receive events for the gap. With a dynamic gap, the gap can change 
> from event to event.
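
For illustration, a minimal sketch of the feature described above using the 
session_window function added for Spark 3.2 (the source, column names, and gap 
below are made up):

{code:scala}
import org.apache.spark.sql.functions.{col, session_window}

// Stand-in streaming source with an event-time column and a key.
val events = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTime")
  .withColumn("userId", col("value") % 10)

// Static 5-minute gap: a session closes after 5 minutes without new events.
val sessions = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(session_window(col("eventTime"), "5 minutes"), col("userId"))
  .count()
{code}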



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36030) Support DS v2 metrics at writing path

2021-07-20 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-36030.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33239
[https://github.com/apache/spark/pull/33239]

> Support DS v2 metrics at writing path
> -
>
> Key: SPARK-36030
> URL: https://issues.apache.org/jira/browse/SPARK-36030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We added the interface for DS v2 metrics in SPARK-34366, but only for the 
> read path. We should add write path support as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36235) Supported datatype check should check inner field

2021-07-20 Thread angerszhu (Jira)
angerszhu created SPARK-36235:
-

 Summary: Supported datatype check should check inner field
 Key: SPARK-36235
 URL: https://issues.apache.org/jira/browse/SPARK-36235
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36229) conv() inconsistently handles invalid strings with > 64 invalid characters

2021-07-20 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384610#comment-17384610
 ] 

dgd_contributor commented on SPARK-36229:
-

Thanks, I will look into this.

 

> conv() inconsistently handles invalid strings with > 64 invalid characters
> --
>
> Key: SPARK-36229
> URL: https://issues.apache.org/jira/browse/SPARK-36229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tim Armstrong
>Priority: Major
>
> SPARK-33428 fixed ArrayIndexOutOfBoundsException but introduced a new 
> inconsistency in behaviour where the returned value is different above the 
> 64-character threshold.
>  
> {noformat}
> scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
> +---------------------------+
> |conv(repeat(?, 64), 10, 16)|
> +---------------------------+
> |                          0|
> +---------------------------+
> scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
> +---------------------------+
> |conv(repeat(?, 65), 10, 16)|
> +---------------------------+
> |           FFFFFFFFFFFFFFFF|
> +---------------------------+
> scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
> +----------------------------+
> |conv(repeat(?, 65), 10, -16)|
> +----------------------------+
> |                          -1|
> +----------------------------+
> scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
> +----------------------------+
> |conv(repeat(?, 64), 10, -16)|
> +----------------------------+
> |                           0|
> +----------------------------+{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36153) Add SQL doc about transform for current behavior

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-36153:
-
Summary: Add SQL doc about transform for current behavior  (was: Add SLQ 
doc about transform for current behavior)

> Add SQL doc about transform for current behavior
> 
>
> Key: SPARK-36153
> URL: https://issues.apache.org/jira/browse/SPARK-36153
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36153) Add SLQ doc about transform for current behavior

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36153.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33362
[https://github.com/apache/spark/pull/33362]

> Add SLQ doc about transform for current behavior
> 
>
> Key: SPARK-36153
> URL: https://issues.apache.org/jira/browse/SPARK-36153
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36153) Add SLQ doc about transform for current behavior

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-36153:


Assignee: angerszhu

> Add SLQ doc about transform for current behavior
> 
>
> Key: SPARK-36153
> URL: https://issues.apache.org/jira/browse/SPARK-36153
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36153) Add SLQ doc about transform for current behavior

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-36153:
-
Priority: Minor  (was: Major)

> Add SLQ doc about transform for current behavior
> 
>
> Key: SPARK-36153
> URL: https://issues.apache.org/jira/browse/SPARK-36153
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35658) Document Parquet encryption feature in Spark

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35658:


Assignee: Gidon Gershinsky

> Document Parquet encryption feature in Spark
> 
>
> Key: SPARK-35658
> URL: https://issues.apache.org/jira/browse/SPARK-35658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the 
> encryption feature that can be called from Spark SQL. The aim of this Jira 
> is to document the use of Parquet encryption in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35658) Document Parquet encryption feature in Spark

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-35658:
-
Priority: Minor  (was: Major)

> Document Parquet encryption feature in Spark
> 
>
> Key: SPARK-35658
> URL: https://issues.apache.org/jira/browse/SPARK-35658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
> Fix For: 3.3.0
>
>
> Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the 
> encryption feature that can be called from Spark SQL. The aim of this Jira 
> is to document the use of Parquet encryption in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35658) Document Parquet encryption feature in Spark

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35658.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 32895
[https://github.com/apache/spark/pull/32895]

> Document Parquet encryption feature in Spark
> 
>
> Key: SPARK-35658
> URL: https://issues.apache.org/jira/browse/SPARK-35658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 3.3.0
>
>
> Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the 
> encryption feature that can be called from Spark SQL. The aim of this Jira 
> is to document the use of Parquet encryption in Spark.
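
The feature being documented is driven by Hadoop/Parquet options from parquet-mr's 
property-driven crypto factory; a minimal sketch is below (the keys, key IDs, and 
path are made up, and the in-memory KMS is for local testing only):

{code:scala}
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hadoopConf.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
hadoopConf.set("parquet.encryption.key.list",
  "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")

// Encrypt column "square" with keyA and the footer with keyB.
spark.range(100).selectExpr("id", "id * 2 AS square")
  .write
  .option("parquet.encryption.column.keys", "keyA:square")
  .option("parquet.encryption.footer.key", "keyB")
  .parquet("/tmp/encrypted_table")
{code}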



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10816) EventTime based sessionization (session window)

2021-07-20 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384592#comment-17384592
 ] 

L. C. Hsieh commented on SPARK-10816:
-

Yeah, excited to have this great work in the next 3.2 release! Thanks [~kabhwan] 
[~XuanYuan]!

 

> EventTime based sessionization (session window)
> ---
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>
> Currently Structured Streaming supports two kinds of windows: tumbling window 
> and sliding window. Another useful window function is the session window, 
> which is not supported by SS.
> Unlike time windows (tumbling and sliding), a session window doesn't have a 
> static window begin and end time. Session window creation depends on a 
> defined session gap, which can be static or dynamic.
> With a static session gap, events that fall within a certain period of time 
> (the gap) are considered one session window. A session window closes when it 
> does not receive events for the gap. With a dynamic gap, the gap can change 
> from event to event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35027) Close the inputStream in FileAppender when writing the logs failure

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-35027:
-
Priority: Minor  (was: Major)

Resolved by https://github.com/apache/spark/pull/33263

> Close the inputStream in FileAppender when writing the logs failure
> ---
>
> Key: SPARK-35027
> URL: https://issues.apache.org/jira/browse/SPARK-35027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Jack Hu
>Assignee: Jack Hu
>Priority: Minor
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> In a Spark cluster, the ExecutorRunner uses FileAppender to redirect the 
> stdout/stderr of executors to files. When writing fails for some reason 
> (e.g. disk full), the FileAppender only closes the stream to the file but 
> leaves the pipe's stdout/stderr open, so subsequent write operations on the 
> executor side may hang.
> Should the inputStream be closed in FileAppender?
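
A generic illustration of the fix being asked for (not Spark's actual FileAppender 
code): make sure the source stream, i.e. the child process's pipe, is closed even 
if writing the log file fails, so the process doesn't block on a full, 
never-drained pipe:

{code:scala}
import java.io.{InputStream, OutputStream}

def appendLoop(in: InputStream, out: OutputStream): Unit = {
  val buf = new Array[Byte](8192)
  try {
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n) // may throw, e.g. when the disk is full
      n = in.read(buf)
    }
  } finally {
    try in.close()         // close the pipe even when writing failed
    finally out.close()
  }
}
{code}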



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35027) Close the inputStream in FileAppender when writing the logs failure

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35027.
--
Fix Version/s: 3.0.4
   3.1.3
   3.2.0
   Resolution: Fixed

> Close the inputStream in FileAppender when writing the logs failure
> ---
>
> Key: SPARK-35027
> URL: https://issues.apache.org/jira/browse/SPARK-35027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Jack Hu
>Assignee: Jack Hu
>Priority: Major
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> In a Spark cluster, the ExecutorRunner uses FileAppender to redirect the 
> stdout/stderr of executors to files. When writing fails for some reason 
> (e.g. disk full), the FileAppender only closes the stream to the file but 
> leaves the pipe's stdout/stderr open, so subsequent write operations on the 
> executor side may hang.
> Should the inputStream be closed in FileAppender?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35027) Close the inputStream in FileAppender when writing the logs failure

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35027:


Assignee: Jack Hu

> Close the inputStream in FileAppender when writing the logs failure
> ---
>
> Key: SPARK-35027
> URL: https://issues.apache.org/jira/browse/SPARK-35027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Jack Hu
>Assignee: Jack Hu
>Priority: Major
>
> In a Spark cluster, the ExecutorRunner uses FileAppender to redirect the 
> stdout/stderr of executors to files. When writing fails for some reason 
> (e.g. disk full), the FileAppender only closes the stream to the file but 
> leaves the pipe's stdout/stderr open, so subsequent write operations on the 
> executor side may hang.
> Should the inputStream be closed in FileAppender?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-35027) Close the inputStream in FileAppender when writing the logs failure

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reopened SPARK-35027:
--

> Close the inputStream in FileAppender when writing the logs failure
> ---
>
> Key: SPARK-35027
> URL: https://issues.apache.org/jira/browse/SPARK-35027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Jack Hu
>Priority: Major
>
> In a Spark cluster, the ExecutorRunner uses FileAppender to redirect the 
> stdout/stderr of executors to files. When writing fails for some reason 
> (e.g. disk full), the FileAppender only closes the stream to the file but 
> leaves the pipe's stdout/stderr open, so subsequent write operations on the 
> executor side may hang.
> Should the inputStream be closed in FileAppender?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10816) EventTime based sessionization (session window)

2021-07-20 Thread Yuanjian Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384586#comment-17384586
 ] 

Yuanjian Li commented on SPARK-10816:
-

Thrilled to see this issue got resolved finally! Thank you all! [~kabhwan] 
[~viirya] 

> EventTime based sessionization (session window)
> ---
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>
> Currently Structured Streaming supports two kinds of windows: tumbling window 
> and sliding window. Another useful window function is the session window, 
> which is not supported by SS.
> Unlike time windows (tumbling and sliding), a session window doesn't have a 
> static window begin and end time. Session window creation depends on a 
> defined session gap, which can be static or dynamic.
> With a static session gap, events that fall within a certain period of time 
> (the gap) are considered one session window. A session window closes when it 
> does not receive events for the gap. With a dynamic gap, the gap can change 
> from event to event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36088) 'spark.archives' does not extract the archive file into the driver under client mode

2021-07-20 Thread rickcheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384576#comment-17384576
 ] 

rickcheng commented on SPARK-36088:
---

Hi, [~srowen]

Thanks for the comment. I agree that in client mode the user can access the file 
in some cases. However, my original reason for raising this question is that I 
want to distribute a conda-packaged environment (a tar.gz file) through 
*spark.archives* and have it extracted on both the driver and the executors, so 
that they share the same Python environment. In K8s, the driver may run in a pod 
and the tar.gz file may be in a remote place (e.g., HDFS), so I think it's also 
necessary to extract the archive file on the driver through *spark.archives*. Or 
maybe there is a better way to achieve this goal?
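
A sketch of the intended setup (archive path and environment name are made up): 
ship a conda-packed environment via spark.archives; as described in this ticket, 
in client mode the archive is currently unpacked only on the executors, not in 
the driver's working directory:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.archives", "hdfs:///envs/pyspark_conda_env.tar.gz#environment")
  .config("spark.pyspark.python", "./environment/bin/python")
  .getOrCreate()
{code}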

> 'spark.archives' does not extract the archive file into the driver under 
> client mode
> 
>
> Key: SPARK-36088
> URL: https://issues.apache.org/jira/browse/SPARK-36088
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.1.2
>Reporter: rickcheng
>Priority: Major
>
> When running Spark on a K8s cluster, there are two deploy modes: cluster and 
> client. In my tests, in cluster mode *spark.archives* extracts the archive 
> file to the working directory of both the executors and the driver, but in 
> client mode it only extracts the archive to the working directory of the 
> executors.
>  
> However, I need *spark.archives* to ship the conda-packaged virtual 
> environment tar file to both the driver and the executors in client mode (so 
> that the executors and the driver have the same Python environment).
>  
> Why does *spark.archives* not extract the archive file into the working 
> directory of the driver in client mode?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35310) Bump Breeze from 1.0 to 1.2

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35310:


Assignee: Apache Spark

> Bump Breeze from 1.0 to 1.2
> ---
>
> Key: SPARK-35310
> URL: https://issues.apache.org/jira/browse/SPARK-35310
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Minor
>
> Breeze 1.1 release notes
> [https://groups.google.com/g/scala-breeze/c/PwGpObQNE8Q]
> Breeze 1.2 release notes
> [https://groups.google.com/g/scala-breeze/c/u5Ge52B36u0]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35310) Bump Breeze from 1.0 to 1.2

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35310:


Assignee: (was: Apache Spark)

> Bump Breeze from 1.0 to 1.2
> ---
>
> Key: SPARK-35310
> URL: https://issues.apache.org/jira/browse/SPARK-35310
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Priority: Minor
>
> Breeze 1.1 release notes
> [https://groups.google.com/g/scala-breeze/c/PwGpObQNE8Q]
> Breeze 1.2 release notes
> [https://groups.google.com/g/scala-breeze/c/u5Ge52B36u0]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35310) Bump Breeze from 1.0 to 1.2

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384570#comment-17384570
 ] 

Apache Spark commented on SPARK-35310:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/33449

> Bump Breeze from 1.0 to 1.2
> ---
>
> Key: SPARK-35310
> URL: https://issues.apache.org/jira/browse/SPARK-35310
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Priority: Minor
>
> Breeze 1.1 release notes
> [https://groups.google.com/g/scala-breeze/c/PwGpObQNE8Q]
> Breeze 1.2 release notes
> [https://groups.google.com/g/scala-breeze/c/u5Ge52B36u0]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35310) Bump Breeze from 1.0 to 1.2

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384571#comment-17384571
 ] 

Apache Spark commented on SPARK-35310:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/33449

> Bump Breeze from 1.0 to 1.2
> ---
>
> Key: SPARK-35310
> URL: https://issues.apache.org/jira/browse/SPARK-35310
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Priority: Minor
>
> Breeze 1.1 release notes
> [https://groups.google.com/g/scala-breeze/c/PwGpObQNE8Q]
> Breeze 1.2 release notes
> [https://groups.google.com/g/scala-breeze/c/u5Ge52B36u0]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36088) 'spark.archives' does not extract the archive file into the driver under client mode

2021-07-20 Thread rickcheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384569#comment-17384569
 ] 

rickcheng commented on SPARK-36088:
---

Hi, [~hyukjin.kwon]

Thanks for the comment. In my tests, under client mode the archive file is not 
extracted to the driver's working directory regardless of whether the driver 
runs in a pod or not. Thanks for pointing out the code; maybe I will consider 
making a PR.

> 'spark.archives' does not extract the archive file into the driver under 
> client mode
> 
>
> Key: SPARK-36088
> URL: https://issues.apache.org/jira/browse/SPARK-36088
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.1.2
>Reporter: rickcheng
>Priority: Major
>
> When running Spark on a K8s cluster, there are two deploy modes: cluster and 
> client. In my tests, in cluster mode *spark.archives* extracts the archive 
> file to the working directory of both the executors and the driver, but in 
> client mode it only extracts the archive to the working directory of the 
> executors.
>  
> However, I need *spark.archives* to ship the conda-packaged virtual 
> environment tar file to both the driver and the executors in client mode (so 
> that the executors and the driver have the same Python environment).
>  
> Why does *spark.archives* not extract the archive file into the working 
> directory of the driver in client mode?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10816) EventTime based sessionization (session window)

2021-07-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-10816.
--
Fix Version/s: 3.2.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

I'm resolving this issue as all sub-issues are completed.

> EventTime based sessionization (session window)
> ---
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>
> Currently Structured Streaming supports two kinds of windows: tumbling window 
> and sliding window. Another useful window function is the session window, 
> which is not supported by SS.
> Unlike time windows (tumbling and sliding), a session window doesn't have a 
> static window begin and end time. Session window creation depends on a 
> defined session gap, which can be static or dynamic.
> With a static session gap, events that fall within a certain period of time 
> (the gap) are considered one session window. A session window closes when it 
> does not receive events for the gap. With a dynamic gap, the gap can change 
> from event to event.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36188) Add categories setter to CategoricalAccessor and CategoricalIndex.

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36188:


Assignee: Apache Spark

> Add categories setter to CategoricalAccessor and CategoricalIndex.
> --
>
> Key: SPARK-36188
> URL: https://issues.apache.org/jira/browse/SPARK-36188
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36188) Add categories setter to CategoricalAccessor and CategoricalIndex.

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384567#comment-17384567
 ] 

Apache Spark commented on SPARK-36188:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33448

> Add categories setter to CategoricalAccessor and CategoricalIndex.
> --
>
> Key: SPARK-36188
> URL: https://issues.apache.org/jira/browse/SPARK-36188
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36188) Add categories setter to CategoricalAccessor and CategoricalIndex.

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36188:


Assignee: (was: Apache Spark)

> Add categories setter to CategoricalAccessor and CategoricalIndex.
> --
>
> Key: SPARK-36188
> URL: https://issues.apache.org/jira/browse/SPARK-36188
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36172) Document session window in Structured Streaming guide doc

2021-07-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-36172.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33433
[https://github.com/apache/spark/pull/33433]

> Document session window in Structured Streaming guide doc
> -
>
> Key: SPARK-36172
> URL: https://issues.apache.org/jira/browse/SPARK-36172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
> Fix For: 3.2.0
>
>
> Given we ship the new feature "native support of session window", we should 
> also document a new feature in Structured Streaming guide doc so that end 
> users can leverage it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36172) Document session window in Structured Streaming guide doc

2021-07-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-36172:


Assignee: Jungtaek Lim

> Document session window in Structured Streaming guide doc
> -
>
> Key: SPARK-36172
> URL: https://issues.apache.org/jira/browse/SPARK-36172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
>
> Given we ship the new feature "native support of session window", we should 
> also document a new feature in Structured Streaming guide doc so that end 
> users can leverage it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.

2021-07-20 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36186.
---
Fix Version/s: 3.2.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Issue resolved by pull request 33400
https://github.com/apache/spark/pull/33400

> Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
> 
>
> Key: SPARK-36186
> URL: https://issues.apache.org/jira/browse/SPARK-36186
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36145) Remove Python 3.6 support in codebase and CI

2021-07-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384544#comment-17384544
 ] 

Hyukjin Kwon commented on SPARK-36145:
--

Oh, this JIRA actually targets to remove Python 3.9, e.g. in the documentation 
too. I will track it and make sure to resolve it before Spark 3.3.

> Remove Python 3.6 support in codebase and CI
> 
>
> Key: SPARK-36145
> URL: https://issues.apache.org/jira/browse/SPARK-36145
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> Python 3.6 was deprecated via SPARK-35938 in Apache Spark 3.2. We should 
> remove it in Spark 3.3.
> This JIRA also targets all the changes in CI and development, not only 
> user-facing changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35848) Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError

2021-07-20 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384540#comment-17384540
 ] 

Sean R. Owen commented on SPARK-35848:
--

I think you'd get "Requested array size exceeds VM limit" if you were exceeding 
the 2GB limit; it should be close in this case (~1.8GB serialized according to 
the size of the bitmap it would allocate at 2B elements, 0.03 FPP) and could 
somehow be bigger, I suppose.

I don't think there is a general fix for OOM, as you can cause this to allocate 
100GB of bloom filter with enough elements, for example. The change I'm copying 
just avoids one instance of the copying, not all of them. I haven't thought it 
through, but if the copy that's avoided (copying 'zero' from driver to workers) 
is the only thing that goes through the JavaSerializer, it might avoid at least 
the 2GB limit, but I'm not sure.

It's an improvement in any event, one that's at least easy to make.
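
A minimal sketch of the Kryo workaround described in the issue (a smaller 
expectedNumItems is used here so it runs locally; whether Kryo fully avoids the 
limit for a >2GB zero value is exactly the open question above):
{code:java}
// Configure Kryo as the serializer before building the bloom filter, as the
// issue description suggests. This is an illustrative repro setup, not a fix.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// The issue uses expectedNumItems = 2 billion (a >2GB BitArray); a smaller
// value is used here so the sketch runs on a single machine.
val df = spark.range(0L, 1000000L).toDF("Id")
val bf = df.stat.bloomFilter("Id", 1000000L, 0.03)
println(bf.mightContain(42L))
{code}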

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -
>
> Key: SPARK-35848
> URL: https://issues.apache.org/jira/browse/SPARK-35848
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Sai Polisetty
>Assignee: Sean R. Owen
>Priority: Minor
>
> When the Bloom filter stat function is invoked on a large dataframe that 
> requires a BitArray of size >2GB, it will result in a 
> java.lang.OutOfMemoryError. As mentioned in a similar 
> bug, this is due to the zero value passed to treeAggregate. Irrespective of the 
> spark.serializer value, this will be serialized using JavaSerializer, which 
> has a hard limit of 2GB. Using a solution similar to SPARK-26228 and setting 
> spark.serializer to KryoSerializer can avoid this error.
>  
> Steps to reproduce:
> {{val df = List.range(0, 10).toDF("Id")}}{{val expectedNumItems = 20L 
> // 2 billion}}
> {{val fpp = 0.03}}
> {{val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)}}
> Stack trace:
> java.lang.OutOfMemoryError
>  at 
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at 
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at 
> org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>  at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2604) at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>  at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207) 
> at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> 

[jira] [Commented] (SPARK-35848) Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError

2021-07-20 Thread Sai Polisetty (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384527#comment-17384527
 ] 

Sai Polisetty commented on SPARK-35848:
---

Thanks for taking a look at it, Sean. I am using an [Azure 
Standard_E8s_v3|https://docs.microsoft.com/en-us/azure/virtual-machines/ev3-esv3-series#esv3-series]
 host that has 64GB of memory and 8 cores. While the cluster can handle 
serialization of data beyond 2GB in size, the error in this particular case 
comes from the hardcoded use of JavaSerializer for the zeroValue in 
treeAggregate, which has a 2GB limit. I believe this happens 
[here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L215],
 but I am not sure if I am pointing my finger at the right place.

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -
>
> Key: SPARK-35848
> URL: https://issues.apache.org/jira/browse/SPARK-35848
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Sai Polisetty
>Assignee: Sean R. Owen
>Priority: Minor
>
> When the Bloom filter stat function is invoked on a large dataframe that 
> requires a BitArray of size >2GB, it will result in a 
> java.lang.OutOfMemoryError. As mentioned in a similar 
> bug, this is due to the zero value passed to treeAggregate. Irrespective of the 
> spark.serializer value, this will be serialized using JavaSerializer, which 
> has a hard limit of 2GB. Using a solution similar to SPARK-26228 and setting 
> spark.serializer to KryoSerializer can avoid this error.
>  
> Steps to reproduce:
> {{val df = List.range(0, 10).toDF("Id")}}{{val expectedNumItems = 20L 
> // 2 billion}}
> {{val fpp = 0.03}}
> {{val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)}}
> Stack trace:
> java.lang.OutOfMemoryError
>  at 
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at 
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at 
> org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>  at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2604) at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>  at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207) 
> at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203) at 
> org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>  at 
> 

[jira] [Created] (SPARK-36234) Consider mapper location and shuffle block size in OptimizeLocalShuffleReader

2021-07-20 Thread Michael Zhang (Jira)
Michael Zhang created SPARK-36234:
-

 Summary: Consider mapper location and shuffle block size in 
OptimizeLocalShuffleReader
 Key: SPARK-36234
 URL: https://issues.apache.org/jira/browse/SPARK-36234
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: Michael Zhang


This is a follow-up to SPARK-36105 (OptimizeLocalShuffleReader support for reading 
data of multiple mappers in one task). We should consider using the mapper 
locations along with the shuffle block size when coalescing mappers, specifically 
in cases where there are more mappers than there is available parallelism.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-20 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384495#comment-17384495
 ] 

Dongjoon Hyun commented on SPARK-36218:
---

Oh, thank you for sharing, [~hyukjin.kwon].

> Flaky Test: TPC-DS in PR builder
> 
>
> Key: SPARK-36218
> URL: https://issues.apache.org/jira/browse/SPARK-36218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> [info] - q1 (9 seconds, 603 milliseconds)
> [info] - q2 (5 seconds, 860 milliseconds)
> [info] - q3 (1 second, 777 milliseconds)
> [info] - q4 (31 seconds, 951 milliseconds)
> [info] - q5 (4 seconds, 561 milliseconds)
> [info] - q7 (2 seconds, 471 milliseconds)
> [info] - q8 (2 seconds, 74 milliseconds)
> [info] - q9 (4 seconds, 402 milliseconds)
> [info] - q10 (4 seconds, 618 milliseconds)
> /home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 
> Killed  "$@"
> Error: Process completed with exit code 137.
> {code}
> It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35310) Bump Breeze from 1.0 to 1.2

2021-07-20 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384491#comment-17384491
 ] 

Sean R. Owen commented on SPARK-35310:
--

FWIW it's easy to work around the compile error; I did that.
However, a few MLlib tests fail even on Breeze 1.1, like the weighted least 
squares ones. The answers are fairly different.
I looked through the Breeze changes and there are optimizer changes, but nothing 
super big. It could be a problem in Spark. 
I don't know if I can fix it, but I will look.

> Bump Breeze from 1.0 to 1.2
> ---
>
> Key: SPARK-35310
> URL: https://issues.apache.org/jira/browse/SPARK-35310
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Priority: Minor
>
> Breeze 1.1 release notes
> [https://groups.google.com/g/scala-breeze/c/PwGpObQNE8Q]
> Breeze 1.2 release notes
> [https://groups.google.com/g/scala-breeze/c/u5Ge52B36u0]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36000) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled

2021-07-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36000:
-
Description: 
The creation and operations of ps.Series/Index with Decimal('NaN') don't work 
as expected.

That might be due to an underlying PySpark limitation.

Please refer to the sub-tasks for the issues detected.

  was:
The creation and operations of ps.Series/Index have bugs.

Please refer to sub-tasks for details.


> Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
> 
>
> Key: SPARK-36000
> URL: https://issues.apache.org/jira/browse/SPARK-36000
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> The creation and operations of ps.Series/Index with Decimal('NaN') don't 
> work as expected.
> That might be due to an underlying PySpark limitation.
> Please refer to the sub-tasks for the issues detected.
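
One plausible underlying limitation, sketched below as an assumption rather than 
a confirmed root cause: Spark SQL's DecimalType has no NaN value, so a decimal 
"NaN" ends up as NULL on the JVM side.
{code:java}
// Illustration only: casting the string "NaN" to a decimal yields NULL
// (with ANSI mode off), since DecimalType cannot represent NaN.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.ansi.enabled", "false")
val df = Seq("1.0", "NaN").toDF("s")
  .selectExpr("CAST(s AS DECIMAL(38, 18)) AS d")
df.show()  // the "NaN" row is shown as null
{code}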



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36000) Support creation and operations of ps.Series/Index with Decimal('NaN')

2021-07-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36000:
-
Summary: Support creation and operations of ps.Series/Index with 
Decimal('NaN')  (was: Support creating a ps.Series/Index with `Decimal('NaN')` 
with Arrow disabled)

> Support creation and operations of ps.Series/Index with Decimal('NaN')
> --
>
> Key: SPARK-36000
> URL: https://issues.apache.org/jira/browse/SPARK-36000
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> The creation and operations of ps.Series/Index with Decimal('NaN') don't 
> work as expected.
> That might be due to an underlying PySpark limitation.
> Please refer to the sub-tasks for the issues detected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36000) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled

2021-07-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36000:
-
Description: 
The creation and operations of ps.Series/Index have bugs.

Please refer to sub-tasks for details.

  was:
 
{code:java}
>>> import decimal as d
>>> import pyspark.pandas as ps
>>> import numpy as np
>>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
>>>  True)
>>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
0   1
1   2
2None
dtype: object
>>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
>>>  False)
>>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51)
net.razorvine.pickle.PickleException: problem construction object: 
java.lang.reflect.InvocationTargetException
...

{code}
As shown in the code above, we cannot create a Series with `Decimal('NaN')` 
when Arrow is disabled. We ought to fix that.

 


> Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
> 
>
> Key: SPARK-36000
> URL: https://issues.apache.org/jira/browse/SPARK-36000
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> The creation and operations of ps.Series/Index have bugs.
> Please refer to sub-tasks for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled

2021-07-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36232:
-
Description: 
 
{code:java}
>>> import decimal as d
>>> import pyspark.pandas as ps
>>> import numpy as np
>>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
>>>  True)
>>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
0   1
1   2
2None
dtype: object
>>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
>>>  False)
>>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51)
net.razorvine.pickle.PickleException: problem construction object: 
java.lang.reflect.InvocationTargetException
...

{code}
As shown in the code above, we cannot create a Series with `Decimal('NaN')` 
when Arrow is disabled. We ought to fix that.

 

> Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
> 
>
> Key: SPARK-36232
> URL: https://issues.apache.org/jira/browse/SPARK-36232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> >>> import decimal as d
> >>> import pyspark.pandas as ps
> >>> import numpy as np
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  True)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 0   1
> 1   2
> 2None
> dtype: object
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  False)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51)
> net.razorvine.pickle.PickleException: problem construction object: 
> java.lang.reflect.InvocationTargetException
> ...
> {code}
> As shown in the code above, we cannot create a Series with `Decimal('NaN')` 
> when Arrow is disabled. We ought to fix that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36233) Spark Kube Integration tests must be run from project root

2021-07-20 Thread Holden Karau (Jira)
Holden Karau created SPARK-36233:


 Summary: Spark Kube Integration tests must be run from project root
 Key: SPARK-36233
 URL: https://issues.apache.org/jira/browse/SPARK-36233
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.3.0
Reporter: Holden Karau


Otherwise they fail to resolve various configuration files. We should check 
that the PWD is in the project root and error otherwise.
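
A hypothetical guard along those lines (the marker paths are assumptions about 
the repository layout, not the actual test-harness code):
{code:java}
// Fail fast when the Kubernetes integration tests are not launched from the
// Spark project root; the marker paths below are illustrative assumptions.
import java.nio.file.{Files, Paths}

object ProjectRootCheck {
  def main(args: Array[String]): Unit = {
    val markers = Seq("pom.xml", "resource-managers/kubernetes")
    val cwd = Paths.get("").toAbsolutePath
    require(
      markers.forall(m => Files.exists(cwd.resolve(m))),
      s"Run the Kubernetes integration tests from the Spark project root; current directory is $cwd")
  }
}
{code}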



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36141) Support arithmetic operations of Series containing Decimal(np.nan)

2021-07-20 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384485#comment-17384485
 ] 

Xinrong Meng commented on SPARK-36141:
--

It is closed because it is a duplicate of 
https://issues.apache.org/jira/browse/SPARK-36231.

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36141
> URL: https://issues.apache.org/jira/browse/SPARK-36141
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Arithmetic operations of Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> 

[jira] [Created] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled

2021-07-20 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36232:


 Summary: Support creating a ps.Series/Index with `Decimal('NaN')` 
with Arrow disabled
 Key: SPARK-36232
 URL: https://issues.apache.org/jira/browse/SPARK-36232
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-07-20 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36231:


 Summary: Support arithmetic operations of Series containing 
Decimal(np.nan) 
 Key: SPARK-36231
 URL: https://issues.apache.org/jira/browse/SPARK-36231
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


Arithmetic operations of Series containing Decimal(np.nan) raise 
java.lang.NullPointerException in the driver. An example is shown below:
{code:java}
>>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
>>> decimal.Decimal(np.nan)])
>>> psser = ps.from_pandas(pser)
>>> pser + 1
0 2
 1 3
 2 NaN
>>> psser + 1
 Driver stacktrace:
 at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
 at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
 at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
 at scala.Option.foreach(Option.scala:407)
 at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
 at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
 at 
org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
 at 
org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
 at 
org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
 at 
org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
 at 
org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
 at 
org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
 at scala.util.Try$.apply(Try.scala:213)
 at 
org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
 Caused by: java.lang.NullPointerException
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
 at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:131)
 at 

[jira] [Resolved] (SPARK-36141) Support arithmetic operations of Series containing Decimal(np.nan)

2021-07-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-36141.
--
Resolution: Duplicate

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36141
> URL: https://issues.apache.org/jira/browse/SPARK-36141
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Arithmetic operations of Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> 

[jira] [Created] (SPARK-36230) hasnans for Series of Decimal(`NaN`)

2021-07-20 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36230:


 Summary: hasnans for Series of Decimal(`NaN`)
 Key: SPARK-36230
 URL: https://issues.apache.org/jira/browse/SPARK-36230
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


{code:java}
>>> import pandas as pd
>>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')])
>>> pser
00.1
1NaN
dtype: object
>>> psser = ps.from_pandas(pser)
>>> psser
0 0.1
1None
dtype: object
>>> psser.hasnans
False

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2021-07-20 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384469#comment-17384469
 ] 

Dongjoon Hyun commented on SPARK-25075:
---

>  It looks like scala 2.12 will still be the default, correct?
Yes, [~tgraves]. Our MiMa check only verifies 2.12. After we have the 
official 2.13 bits, we will be able to switch the default Scala version.

> Are we planning on releasing the Spark tgz artifacts for 2.13 and 2.12 or 
> only 2.12?
Both. SPARK-34218 added Scala 2.13 tgz artifacts and Maven publishing. We already 
have both Scala 2.12 and 2.13 snapshot Maven artifacts.

cc [~Gengliang.Wang] since he is the Apache Spark 3.2.0 release manager. The 
above is the plan, and if something goes out of order, we should fix it.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-20 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-36210:

Affects Version/s: 3.2.0
   3.0.3

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: koert kuipers
>Assignee: koert kuipers
>Priority: Minor
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features. So it seems simpler to use a Seq instead, which also preserves the 
> insertion order.
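
A plain-collections sketch of that rationale (illustrative only, not the Spark 
internals themselves): a Seq of (name, value) pairs keeps insertion order, while 
a hash-based Map gives no ordering guarantee, and find/filter work on either.
{code:java}
import scala.collection.mutable

val asSeq = Seq("c" -> 3, "a" -> 1, "b" -> 2)
val asMap = mutable.HashMap("c" -> 3, "a" -> 1, "b" -> 2)

println(asSeq.map(_._1))         // List(c, a, b): insertion order preserved
println(asMap.keys.toList)       // order depends on hashing, not on insertion
println(asSeq.find(_._1 == "b")) // Some((b,2)): lookup by name still works on a Seq
{code}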



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36229) conv() inconsistently handles invalid strings with > 64 invalid characters

2021-07-20 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated SPARK-36229:
--
Environment: (was: SPARK-33428 fixed ArrayIndexOutofBoundsException but 
introduced a new inconsistency in behaviour where the returned value is 
different above the 64 char threshold.

 
{noformat}
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---+
|conv(repeat(?, 64), 10, 16)|
+---+
|                          0|
+---+




scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
+---+
|conv(repeat(?, 65), 10, 16)|
+---+
|           |
+---+




scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
++
|conv(repeat(?, 65), 10, -16)|
++
|                          -1|
++




scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
++
|conv(repeat(?, 64), 10, -16)|
++
|                           0|
++{noformat})

> conv() inconsistently handles invalid strings with > 64 invalid characters
> --
>
> Key: SPARK-36229
> URL: https://issues.apache.org/jira/browse/SPARK-36229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tim Armstrong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36229) conv() inconsistently handles invalid strings with > 64 invalid characters

2021-07-20 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384441#comment-17384441
 ] 

Tim Armstrong commented on SPARK-36229:
---

[~dgd_contributor] [~wenchen]

> conv() inconsistently handles invalid strings with > 64 invalid characters
> --
>
> Key: SPARK-36229
> URL: https://issues.apache.org/jira/browse/SPARK-36229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tim Armstrong
>Priority: Major
>
> SPARK-33428 fixed ArrayIndexOutofBoundsException but introduced a new 
> inconsistency in behaviour where the returned value is different above the 64 
> char threshold.
>  
> {noformat}
> scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
> +---+
> |conv(repeat(?, 64), 10, 16)|
> +---+
> |                          0|
> +---+
> scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
> +---+
> |conv(repeat(?, 65), 10, 16)|
> +---+
> |           |
> +---+
> scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
> ++
> |conv(repeat(?, 65), 10, -16)|
> ++
> |                          -1|
> ++
> scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
> ++
> |conv(repeat(?, 64), 10, -16)|
> ++
> |                           0|
> ++{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36229) conv() inconsistently handles invalid strings with > 64 invalid characters

2021-07-20 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated SPARK-36229:
--
Description: 
SPARK-33428 fixed ArrayIndexOutofBoundsException but introduced a new 
inconsistency in behaviour where the returned value is different above the 64 
char threshold.

 
{noformat}
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---+
|conv(repeat(?, 64), 10, 16)|
+---+
|                          0|
+---+




scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
+---+
|conv(repeat(?, 65), 10, 16)|
+---+
|           |
+---+




scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
++
|conv(repeat(?, 65), 10, -16)|
++
|                          -1|
++




scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
++
|conv(repeat(?, 64), 10, -16)|
++
|                           0|
++{noformat}

> conv() inconsistently handles invalid strings with > 64 invalid characters
> --
>
> Key: SPARK-36229
> URL: https://issues.apache.org/jira/browse/SPARK-36229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tim Armstrong
>Priority: Major
>
> SPARK-33428 fixed ArrayIndexOutofBoundsException but introduced a new 
> inconsistency in behaviour where the returned value is different above the 64 
> char threshold.
>  
> {noformat}
> scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
> +---+
> |conv(repeat(?, 64), 10, 16)|
> +---+
> |                          0|
> +---+
> scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
> +---+
> |conv(repeat(?, 65), 10, 16)|
> +---+
> |           |
> +---+
> scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
> ++
> |conv(repeat(?, 65), 10, -16)|
> ++
> |                          -1|
> ++
> scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
> ++
> |conv(repeat(?, 64), 10, -16)|
> ++
> |                           0|
> ++{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36229) conv() inconsistently handles invalid strings with > 64 invalid characters

2021-07-20 Thread Tim Armstrong (Jira)
Tim Armstrong created SPARK-36229:
-

 Summary: conv() inconsistently handles invalid strings with > 64 
invalid characters
 Key: SPARK-36229
 URL: https://issues.apache.org/jira/browse/SPARK-36229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
 Environment: SPARK-33428 fixed ArrayIndexOutofBoundsException but 
introduced a new inconsistency in behaviour where the returned value is 
different above the 64 char threshold.

 
{noformat}
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---+
|conv(repeat(?, 64), 10, 16)|
+---+
|                          0|
+---+




scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
+---+
|conv(repeat(?, 65), 10, 16)|
+---+
|           |
+---+




scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
++
|conv(repeat(?, 65), 10, -16)|
++
|                          -1|
++




scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
++
|conv(repeat(?, 64), 10, -16)|
++
|                           0|
++{noformat}
Reporter: Tim Armstrong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36229) conv() inconsistently handles invalid strings with > 64 invalid characters

2021-07-20 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated SPARK-36229:
--
Affects Version/s: (was: 3.1.2)
   3.2.0

> conv() inconsistently handles invalid strings with > 64 invalid characters
> --
>
> Key: SPARK-36229
> URL: https://issues.apache.org/jira/browse/SPARK-36229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: SPARK-33428 fixed ArrayIndexOutofBoundsException but 
> introduced a new inconsistency in behaviour where the returned value is 
> different above the 64 char threshold.
>  
> {noformat}
> scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
> +---+
> |conv(repeat(?, 64), 10, 16)|
> +---+
> |                          0|
> +---+
> scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show
> +---+
> |conv(repeat(?, 65), 10, 16)|
> +---+
> |           |
> +---+
> scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show
> ++
> |conv(repeat(?, 65), 10, -16)|
> ++
> |                          -1|
> ++
> scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
> ++
> |conv(repeat(?, 64), 10, -16)|
> ++
> |                           0|
> ++{noformat}
>Reporter: Tim Armstrong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35546) Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better way

2021-07-20 Thread Ye Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou resolved SPARK-35546.
-
   Fix Version/s: 3.2.0
Target Version/s: 3.2.0
  Resolution: Fixed

Issue resolved by pull request 33078 and merged into Branch 3.2 and Master

https://github.com/apache/spark/pull/33078

> Enable push-based shuffle when multiple app attempts are enabled and manage 
> concurrent access to the state in a better way 
> ---
>
> Key: SPARK-35546
> URL: https://issues.apache.org/jira/browse/SPARK-35546
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Ye Zhou
>Priority: Major
> Fix For: 3.2.0
>
>
> In the current implementation of RemoteBlockPushResolver, two 
> ConcurrentHashMaps are used to store #1 applicationId -> 
> mergedShuffleLocalDirPath and #2 applicationId+attemptId+shuffleID -> 
> mergedShufflePartitionInfo. As the four types of messages 
> (ExecutorRegister, PushBlocks, FinalizeShuffleMerge and ApplicationRemove) 
> trigger different types of operations on these two hashmaps, it is 
> required to maintain strong consistency of the information stored in 
> them. Otherwise, there will be either data 
> corruption/correctness issues or a memory leak in the shuffle server. 
> We should come up with a systematic way to resolve this, other than spot fixing 
> the potential issues.
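
An illustrative sketch of the two pieces of state described above (the names and 
value types are placeholders, not the actual RemoteBlockPushResolver fields):
{code:java}
import java.util.concurrent.ConcurrentHashMap

final case class AppShuffleKey(appId: String, attemptId: Int, shuffleId: Int)

// #1: applicationId -> merged shuffle local dir path
val appLocalDirs = new ConcurrentHashMap[String, String]()
// #2: applicationId + attemptId + shuffleId -> merged shuffle partition info
val mergedPartitions = new ConcurrentHashMap[AppShuffleKey, AnyRef]()

// Any message that touches both maps (e.g. application removal) must keep them
// consistent with each other, which is the concern this issue raises.
appLocalDirs.put("app-1", "/tmp/merged/app-1")
mergedPartitions.put(AppShuffleKey("app-1", 1, 0), new Object)
{code}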



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36143) Adjust astype of Series with missing values to follow pandas

2021-07-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36143:
-
Summary: Adjust astype of Series with missing values to follow pandas  
(was: Adjust astype of Series of ExtensionDtype to follow pandas)

> Adjust astype of Series with missing values to follow pandas
> 
>
> Key: SPARK-36143
> URL: https://issues.apache.org/jira/browse/SPARK-36143
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:java}
> >>> pser = pd.Series([1, 2, np.nan], dtype=float)
> >>> psser = ps.from_pandas(pser)
> >>> pser.astype(int)
> ...
>  ValueError: Cannot convert non-finite values (NA or inf) to integer
> >>> psser.astype(int)
>  0 1.0
>  1 2.0
>  2 NaN
>  dtype: float64
> {code}
> As shown above, astype of a Series of ExtensionDtype doesn't behave the same as 
> in pandas; we ought to adjust that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36215) Add logging for slow fetches to diagnose external shuffle service issues

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36215:


Assignee: (was: Apache Spark)

> Add logging for slow fetches to diagnose external shuffle service issues
> 
>
> Key: SPARK-36215
> URL: https://issues.apache.org/jira/browse/SPARK-36215
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Trivial
>
> Currently we can see from the metrics that a task or stage has slow fetches, 
> and the logs indicate _all_ of the shuffle servers those tasks were fetching 
> from, but often this is a big set (dozens or even hundreds) and narrowing 
> down which one caused issues can be very difficult. We should add some 
> logging when a fetch is "slow" as determined by some preconfigured thresholds.
>  
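
A sketch of the proposed check only (the threshold and log format are 
assumptions, not an existing Spark configuration):
{code:java}
object SlowFetchLogging {
  // Assumed preconfigured threshold; the real config name and default are TBD.
  private val slowFetchThresholdMs = 30000L

  def maybeLogSlowFetch(host: String, port: Int, bytes: Long, durationMs: Long): Unit = {
    if (durationMs > slowFetchThresholdMs) {
      println(s"WARN: slow fetch of $bytes bytes from $host:$port took $durationMs ms")
    }
  }
}

// Example: a 45-second fetch from one shuffle server gets singled out in the logs.
SlowFetchLogging.maybeLogSlowFetch("shuffle-host-17", 7337, 256L * 1024 * 1024, 45000L)
{code}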



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36215) Add logging for slow fetches to diagnose external shuffle service issues

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36215:


Assignee: Apache Spark

> Add logging for slow fetches to diagnose external shuffle service issues
> 
>
> Key: SPARK-36215
> URL: https://issues.apache.org/jira/browse/SPARK-36215
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently we can see from the metrics that a task or stage has slow fetches, 
> and the logs indicate _all_ of the shuffle servers those tasks were fetching 
> from, but often this is a big set (dozens or even hundreds) and narrowing 
> down which one caused issues can be very difficult. We should add some 
> logging when a fetch is "slow" as determined by some preconfigured thresholds.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36215) Add logging for slow fetches to diagnose external shuffle service issues

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384428#comment-17384428
 ] 

Apache Spark commented on SPARK-36215:
--

User 'shardulm94' has created a pull request for this issue:
https://github.com/apache/spark/pull/33446

> Add logging for slow fetches to diagnose external shuffle service issues
> 
>
> Key: SPARK-36215
> URL: https://issues.apache.org/jira/browse/SPARK-36215
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Trivial
>
> Currently we can see from the metrics that a task or stage has slow fetches, 
> and the logs indicate _all_ of the shuffle servers those tasks were fetching 
> from, but often this is a big set (dozens or even hundreds) and narrowing 
> down which one caused issues can be very difficult. We should add some 
> logging when a fetch is "slow" as determined by some preconfigured thresholds.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36227) Remove TimestampNTZ type support in Spark 3.2

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36227:


Assignee: Gengliang Wang  (was: Apache Spark)

> Remove TimestampNTZ type support in Spark 3.2
> -
>
> Key: SPARK-36227
> URL: https://issues.apache.org/jira/browse/SPARK-36227
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> As of now, there are some blockers for delivering the TimestampNTZ project in 
> Spark 3.2:
> # In the Hive Thrift server, both TimestampType and TimestampNTZType are 
> mapped to the same timestamp type, which can cause confusion for users. 
> # For the Parquet data source, newly written TimestampNTZType Parquet 
> columns will be read back as TimestampType by old Spark releases. Also, we need 
> to decide on the merged schema for files that mix TimestampType and 
> TimestampNTZType.
> # The type coercion rules for TimestampNTZType are incomplete. For example, 
> what should the data type of the IN clause IN(Timestamp'2020-01-01 
> 00:00:00', TimestampNtz'2020-01-01 00:00:00') be?
> # It is tricky to support TimestampNTZType in the JSON/CSV data readers. We 
> need to avoid regressions as much as we can.
> There are 10 days left before the expected 3.2 RC date. So, I propose to 
> release the TimestampNTZ type in Spark 3.3 instead of Spark 3.2, so that we 
> have enough time to make well-considered designs for these issues.
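To make the type-coercion question above concrete, here is a small, Spark-independent illustration using java.time: comparing a zone-less timestamp with an instant forces a choice of time zone, which is exactly the decision the coercion rules would have to encode. The session time zone used here (UTC) is an assumption.

{code:java}
import java.time.{Instant, LocalDateTime, ZoneId}

object NtzCoercionExample {
  def main(args: Array[String]): Unit = {
    // TIMESTAMP (with local time zone) is conceptually a point on the time line.
    val tsLtz: Instant = Instant.parse("2020-01-01T00:00:00Z")
    // TIMESTAMP_NTZ is a wall-clock value with no time zone attached.
    val tsNtz: LocalDateTime = LocalDateTime.parse("2020-01-01T00:00:00")

    // To compare the two, some time zone must be chosen; here we assume UTC.
    // A different choice changes the result, which is why the coercion rule
    // for IN(Timestamp'...', TimestampNtz'...') matters.
    val sessionZone = ZoneId.of("UTC")
    val ntzAsInstant = tsNtz.atZone(sessionZone).toInstant
    println(tsLtz == ntzAsInstant) // true only under the assumed session zone
  }
}
{code}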



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35848) Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35848:


Assignee: Sean R. Owen  (was: Apache Spark)

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -
>
> Key: SPARK-35848
> URL: https://issues.apache.org/jira/browse/SPARK-35848
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Sai Polisetty
>Assignee: Sean R. Owen
>Priority: Minor
>
> When the Bloom filter stat function is invoked on a large dataframe that 
> requires a BitArray larger than 2GB, it results in a 
> java.lang.OutOfMemoryError. As mentioned in a similar 
> bug, this is due to the zero value passed to treeAggregate. Irrespective of 
> the spark.serializer value, the zero value is serialized using JavaSerializer, 
> which has a hard limit of 2GB. Using a solution similar to SPARK-26228 and 
> setting spark.serializer to KryoSerializer can avoid this error.
>  
> Steps to reproduce:
> {{val df = List.range(0, 10).toDF("Id")}}{{val expectedNumItems = 20L 
> // 2 billion}}
> {{val fpp = 0.03}}
> {{val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)}}
> Stack trace:
> java.lang.OutOfMemoryError
>  at 
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at 
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at 
> org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>  at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2604) at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>  at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207) 
> at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203) at 
> org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>  at 
> org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:541)
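A minimal sketch of the workaround mentioned in the description: switching spark.serializer to Kryo so the large zero value is not forced through the Java serializer's 2GB limit. The dataset size and expectedNumItems below are small placeholders, not the reporter's original figures, which were too large to reproduce here.

{code:java}
import org.apache.spark.sql.SparkSession

object BloomFilterKryoWorkaround {
  def main(args: Array[String]): Unit = {
    // Workaround from the report: use Kryo instead of the default Java serializer,
    // which has a hard 2GB limit on a single serialized object.
    val spark = SparkSession.builder()
      .appName("bloom-filter-kryo-workaround")
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    val df = List.range(0, 1000).toDF("Id")  // small placeholder dataset
    val expectedNumItems = 1000L             // placeholder; the OOM needs ~2 billion items
    val fpp = 0.03
    val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)
    println(bf.bitSize())
    spark.stop()
  }
}
{code}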



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36227) Remove TimestampNTZ type support in Spark 3.2

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36227:


Assignee: Apache Spark  (was: Gengliang Wang)

> Remove TimestampNTZ type support in Spark 3.2
> -
>
> Key: SPARK-36227
> URL: https://issues.apache.org/jira/browse/SPARK-36227
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> As of now, there are some blockers for delivering the TimestampNTZ project in 
> Spark 3.2:
> # In the Hive Thrift server, both TimestampType and TimestampNTZType are 
> mapped to the same timestamp type, which can cause confusion for users. 
> # For the Parquet data source, newly written TimestampNTZType Parquet 
> columns will be read back as TimestampType by old Spark releases. Also, we need 
> to decide on the merged schema for files that mix TimestampType and 
> TimestampNTZType.
> # The type coercion rules for TimestampNTZType are incomplete. For example, 
> what should the data type of the IN clause IN(Timestamp'2020-01-01 
> 00:00:00', TimestampNtz'2020-01-01 00:00:00') be?
> # It is tricky to support TimestampNTZType in the JSON/CSV data readers. We 
> need to avoid regressions as much as we can.
> There are 10 days left before the expected 3.2 RC date. So, I propose to 
> release the TimestampNTZ type in Spark 3.3 instead of Spark 3.2, so that we 
> have enough time to make well-considered designs for these issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36227) Remove TimestampNTZ type support in Spark 3.2

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384403#comment-17384403
 ] 

Apache Spark commented on SPARK-36227:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/33444

> Remove TimestampNTZ type support in Spark 3.2
> -
>
> Key: SPARK-36227
> URL: https://issues.apache.org/jira/browse/SPARK-36227
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> As of now, there are some blockers for delivering the TimestampNTZ project in 
> Spark 3.2:
> # In the Hive Thrift server, both TimestampType and TimestampNTZType are 
> mapped to the same timestamp type, which can cause confusion for users. 
> # For the Parquet data source, newly written TimestampNTZType Parquet 
> columns will be read back as TimestampType by old Spark releases. Also, we need 
> to decide on the merged schema for files that mix TimestampType and 
> TimestampNTZType.
> # The type coercion rules for TimestampNTZType are incomplete. For example, 
> what should the data type of the IN clause IN(Timestamp'2020-01-01 
> 00:00:00', TimestampNtz'2020-01-01 00:00:00') be?
> # It is tricky to support TimestampNTZType in the JSON/CSV data readers. We 
> need to avoid regressions as much as we can.
> There are 10 days left before the expected 3.2 RC date. So, I propose to 
> release the TimestampNTZ type in Spark 3.3 instead of Spark 3.2, so that we 
> have enough time to make well-considered designs for these issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35848) Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35848:


Assignee: Apache Spark  (was: Sean R. Owen)

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -
>
> Key: SPARK-35848
> URL: https://issues.apache.org/jira/browse/SPARK-35848
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Sai Polisetty
>Assignee: Apache Spark
>Priority: Minor
>
> When the Bloom filter stat function is invoked on a large dataframe that 
> requires a BitArray larger than 2GB, it results in a 
> java.lang.OutOfMemoryError. As mentioned in a similar 
> bug, this is due to the zero value passed to treeAggregate. Irrespective of 
> the spark.serializer value, the zero value is serialized using JavaSerializer, 
> which has a hard limit of 2GB. Using a solution similar to SPARK-26228 and 
> setting spark.serializer to KryoSerializer can avoid this error.
>  
> Steps to reproduce:
> {{val df = List.range(0, 10).toDF("Id")}}{{val expectedNumItems = 20L 
> // 2 billion}}
> {{val fpp = 0.03}}
> {{val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)}}
> Stack trace:
> java.lang.OutOfMemoryError
>  at 
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at 
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at 
> org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>  at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2604) at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>  at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207) 
> at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203) at 
> org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>  at 
> org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:541)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35848) Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384402#comment-17384402
 ] 

Apache Spark commented on SPARK-35848:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/33443

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -
>
> Key: SPARK-35848
> URL: https://issues.apache.org/jira/browse/SPARK-35848
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Sai Polisetty
>Assignee: Sean R. Owen
>Priority: Minor
>
> When the Bloom filter stat function is invoked on a large dataframe that 
> requires a BitArray larger than 2GB, it results in a 
> java.lang.OutOfMemoryError. As mentioned in a similar 
> bug, this is due to the zero value passed to treeAggregate. Irrespective of 
> the spark.serializer value, the zero value is serialized using JavaSerializer, 
> which has a hard limit of 2GB. Using a solution similar to SPARK-26228 and 
> setting spark.serializer to KryoSerializer can avoid this error.
>  
> Steps to reproduce:
> {{val df = List.range(0, 10).toDF("Id")}}{{val expectedNumItems = 20L 
> // 2 billion}}
> {{val fpp = 0.03}}
> {{val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)}}
> Stack trace:
> java.lang.OutOfMemoryError
>  at 
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at 
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at 
> org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>  at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2604) at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>  at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207) 
> at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203) at 
> org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>  at 
> org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:541)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36228) Skip splitting a reducer partition when some mapStatus is null

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384400#comment-17384400
 ] 

Apache Spark commented on SPARK-36228:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33445

> Skip splitting a reducer partition when some mapStatus is null
> --
>
> Key: SPARK-36228
> URL: https://issues.apache.org/jira/browse/SPARK-36228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36228) Skip splitting a reducer partition when some mapStatus is null

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36228:


Assignee: Apache Spark

> Skip splitting a reducer partition when some mapStatus is null
> --
>
> Key: SPARK-36228
> URL: https://issues.apache.org/jira/browse/SPARK-36228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36228) Skip splitting a reducer partition when some mapStatus is null

2021-07-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36228:


Assignee: (was: Apache Spark)

> Skip splitting a reducer partition when some mapStatus is null
> --
>
> Key: SPARK-36228
> URL: https://issues.apache.org/jira/browse/SPARK-36228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35353) Cross-building docker images to ARM64 is failing (with Ubuntu host)

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35353.
--
Resolution: Not A Problem

OK, that doesn't seem directly related to Spark as distributed by the project, 
then - we don't make ARM builds, and it sounds like the cross-build can be 
achieved, just not on Ubuntu for some reason. If there's a specific change that 
enables this in Spark, we can reopen.

> Cross-building docker images to ARM64 is failing (with Ubuntu host)
> ---
>
> Key: SPARK-35353
> URL: https://issues.apache.org/jira/browse/SPARK-35353
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Andy Grove
>Priority: Minor
>
> I was trying to cross-build Spark 3.1.1 for ARM64 so that I could deploy to a 
> Raspberry Pi Kubernetes cluster this weekend, but the Docker build fails.
> Here are the commands I used:
> {code:java}
> docker buildx create --use
> ./bin/docker-image-tool.sh -n -r andygrove -t 3.1.1 -X build {code}
> The Docker build for ARM64 fails on the following command:
> {code:java}
>  apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps{code}
> The install fails with "Error while loading /usr/sbin/dpkg-split: No such 
> file or directory".
> Here is a fragment of the output showing the relevant error message.
> {code:java}
> #6 6.034 Get:35 https://deb.debian.org/debian buster/main arm64 libnss3 arm64 
> 2:3.42.1-1+deb10u3 [1082 kB]
> #6 6.102 Get:36 https://deb.debian.org/debian buster/main arm64 psmisc arm64 
> 23.2-1 [122 kB]
> #6 6.109 Get:37 https://deb.debian.org/debian buster/main arm64 tini arm64 
> 0.18.0-1 [194 kB]
> #6 6.767 debconf: delaying package configuration, since apt-utils is not 
> installed
> #6 6.883 Fetched 18.1 MB in 1s (13.4 MB/s)
> #6 6.956 Error while loading /usr/sbin/dpkg-split: No such file or directory
> #6 6.959 Error while loading /usr/sbin/dpkg-deb: No such file or directory
> #6 6.961 dpkg: error processing archive 
> /tmp/apt-dpkg-install-NdOR40/00-libncurses6_6.1+20181013-2+deb10u2_arm64.deb 
> (--unpack):
>  {code}
> My host environment details:
>  * Ubuntu 18.04.5 LTS
>  * Docker version 20.10.6, build 370c289
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35365.
--
Resolution: Invalid

This is more of a question for the user@ list than a specific issue report.

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
> Attachments: spark2.4report, spark3.1.1_report_originalsql, 
> spark3.11report
>
>
> I have a big SQL query that joins a few wide tables with complex logic. When I 
> run it in Spark 2.4, the analysis phase takes about 20 minutes; when I use 
> Spark 3.1.1, it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1,
> or spark.sql.optimizer.maxIterations=1000 in Spark 2.4.
> There are no other special settings for this.
> Checking the Spark UI, I find that no job is generated and no executors 
> have active tasks. When I set the log level to debug, I find that the job 
> is in the analysis phase, resolving the field references.
> This phase takes too long.
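For reference, a minimal sketch of how the two settings mentioned above can be applied when building a session; this is only an illustration of the reporter's configuration, and whether 1000 is an appropriate value depends on the query.

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: raise the analyzer iteration limit the reporter mentions for Spark 3.x.
// spark.sql.optimizer.maxIterations is the analogous knob referenced for Spark 2.4.
val spark = SparkSession.builder()
  .appName("analyzer-max-iterations")
  .master("local[*]")
  .config("spark.sql.analyzer.maxIterations", "1000")
  .getOrCreate()
{code}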



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35517) Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35517.
--
Resolution: Not A Problem

> Critical Vulnerabilities: jackson-databind 2.4.0 shipped with 
> htrace-core4-4.1.0-incubating.jar
> ---
>
> Key: SPARK-35517
> URL: https://issues.apache.org/jira/browse/SPARK-35517
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> A vulnerability scanner is highlighting the following CRITICAL vulnerabilities 
> in {{spark-3.0.2-bin-hadoop3.2}}, coming from the obsolete {{jackson-databind}} 
> {{2.4.0}}:
>  * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489]
>  * 
> [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718]
> This package is shipped within {{jars/htrace-core4-4.1.0-incubating.jar}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35517) Critical Vulnerabilities: jackson-databind 2.4.0 shipped with htrace-core4-4.1.0-incubating.jar

2021-07-20 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384395#comment-17384395
 ] 

Sean R. Owen commented on SPARK-35517:
--

Spark 3.2 is coming in a month or so. To my knowledge, Spark doesn't use it 
directly anyway, but it's best to update, and that has already been done.

> Critical Vulnerabilities: jackson-databind 2.4.0 shipped with 
> htrace-core4-4.1.0-incubating.jar
> ---
>
> Key: SPARK-35517
> URL: https://issues.apache.org/jira/browse/SPARK-35517
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> A vulnerability scanner is highlighting the following CRITICAL vulnerabilities 
> in {{spark-3.0.2-bin-hadoop3.2}}, coming from the obsolete {{jackson-databind}} 
> {{2.4.0}}:
>  * [CVE-2018-7489|https://nvd.nist.gov/vuln/detail/CVE-2018-7489]
>  * 
> [CVE-2018-14718|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-14718]
> This package is shipped within {{jars/htrace-core4-4.1.0-incubating.jar}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35519) Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped

2021-07-20 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384394#comment-17384394
 ] 

Sean R. Owen commented on SPARK-35519:
--

We generally do not accept reports like "my static analyzer flagged this" 
without more info. Does this affect Spark? This dependency also does not come in 
from Spark itself, so typically it means another library we depend on needs it - 
the update should ideally go there. We can manually manage dependency versions 
upward, but we would do so only if there were a plausible theory that the 
vulnerability affects Spark.

> Critical Vulnerabilities: nimbusds_nimbus-jose-jwt 4.41.1 shipped
> -
>
> Key: SPARK-35519
> URL: https://issues.apache.org/jira/browse/SPARK-35519
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> A vulnerability scanner is highlighting the following CRITICAL vulnerabilities 
> in {{spark-3.0.2-bin-hadoop3.2}}, coming from the obsolete {{nimbus-jose-jwt}} 
> {{4.41.1}}:
> *  [CVE-2019-17195|https://nvd.nist.gov/vuln/detail/CVE-2019-17195]
> This package is shipped within {{jars/nimbus-jose-jwt-4.41.1.jar}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35518) Critical Vulnerabilities: log4j_log4j 1.2.17 shipped

2021-07-20 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384387#comment-17384387
 ] 

Sean R. Owen commented on SPARK-35518:
--

I tried this a looong time ago and it was very hard. Spark doesn't really use 
it; other libraries we depend on do. I don't think we can fix this as a result.

> Critical Vulnerabilities: log4j_log4j 1.2.17 shipped
> 
>
> Key: SPARK-35518
> URL: https://issues.apache.org/jira/browse/SPARK-35518
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Louis DEFLANDRE
>Priority: Major
>
> A vulnerability scanner is highlighting the following CRITICAL vulnerabilities 
> in {{spark-3.0.2-bin-hadoop3.2}}, coming from the obsolete {{log4j_log4j}} 
> {{1.2.17}}:
>  * [CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571]
> This package is shipped within {{jars/log4j-1.2.17.jar}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36228) Skip splitting a reducer partition when some mapStatus is null

2021-07-20 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-36228:
---

 Summary: Skip splitting a reducer partition when some mapStatus is 
null
 Key: SPARK-36228
 URL: https://issues.apache.org/jira/browse/SPARK-36228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35837) Recommendations for Common Query Problems

2021-07-20 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384375#comment-17384375
 ] 

Sean R. Owen commented on SPARK-35837:
--

What would this look like in Spark, though? How do you surface recommendations?

> Recommendations for Common Query Problems
> -
>
> Key: SPARK-35837
> URL: https://issues.apache.org/jira/browse/SPARK-35837
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Teradata supports [Recommendations for Common Query 
> Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].
> We can implement a similar feature.
>  1. Detect the most skewed values for a join. The user decides whether these 
> are needed.
>  2. Detect the most skewed values for a window function. The user decides 
> whether these are needed.
>  3. Detect issues with bucketed reads, for example, when the Analyzer adds a 
> cast to the bucket column.
>  4. Recommend that the user add a filter condition on the partition column of 
> a partitioned table.
>  5. Check the join condition; for example, the result of casting a string to 
> double may be incorrect.
> For example:
> {code:sql}
> 0: jdbc:hive2://localhost:1/default> EXPLAIN RECOMMENDATION
> 0: jdbc:hive2://localhost:1/default> SELECT a.*,
> 0: jdbc:hive2://localhost:1/default>CASE
> 0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
> 0: jdbc:hive2://localhost:1/default>   AND a.cobrand 
> = 6
> 0: jdbc:hive2://localhost:1/default>   AND 
> a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
> 0: jdbc:hive2://localhost:1/default>   AND ( 
> a.valid_page_count = 1 ) THEN 1
> 0: jdbc:hive2://localhost:1/default>  ELSE 0
> 0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
> 0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
> 0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
> 0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
> 0: jdbc:hive2://localhost:1/default>'VI' AS 
> page_type
> 0: jdbc:hive2://localhost:1/default> FROM   tbl1
> 0: jdbc:hive2://localhost:1/default> UNION
> 0: jdbc:hive2://localhost:1/default> SELECT *,
> 0: jdbc:hive2://localhost:1/default>'SRP' AS 
> page_type
> 0: jdbc:hive2://localhost:1/default> FROM   tbl2
> 0: jdbc:hive2://localhost:1/default> UNION
> 0: jdbc:hive2://localhost:1/default> SELECT *,
> 0: jdbc:hive2://localhost:1/default>'My Garage' 
> AS page_type
> 0: jdbc:hive2://localhost:1/default> FROM   tbl3
> 0: jdbc:hive2://localhost:1/default> UNION
> 0: jdbc:hive2://localhost:1/default> SELECT *,
> 0: jdbc:hive2://localhost:1/default>'Motors 
> Homepage' AS page_type
> 0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
> 0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
> BETWEEN ( '2020-01-01' ) AND (
> 0: jdbc:hive2://localhost:1/default>  
>CURRENT_DATE() - 10 )) a
> 0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
> 0: jdbc:hive2://localhost:1/default>  
> item_site_id,
> 0: jdbc:hive2://localhost:1/default>  auct_end_dt,
> 0: jdbc:hive2://localhost:1/default>  
> leaf_categ_id
> 0: jdbc:hive2://localhost:1/default>   FROM   tbl5
> 0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
> >= ( '2020-01-01' )) itm
> 0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
> itm.item_id
> 0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
> 0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
> ca.leaf_categ_id
> 0: jdbc:hive2://localhost:1/default>  AND 
> itm.item_site_id = ca.site_id;
> +-+--+
> | result  
> |
> +-+--+
> | 1. Detect the most skew values for join 
> |
> |   

[jira] [Assigned] (SPARK-35848) Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError

2021-07-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35848:


  Component/s: ML
 Target Version/s: 3.3.0
Affects Version/s: (was: 3.0.0)
   (was: 2.0.0)
   3.2.0
   3.0.3
   3.1.2
 Assignee: Sean R. Owen
  Summary: Spark Bloom Filter, others using treeAggregate can throw 
OutOfMemoryError  (was: Spark Bloom Filter throws OutOfMemoryError)

(I'm expanding this to include other uses of treeAggregate that could benefit 
from the same treatment)

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -
>
> Key: SPARK-35848
> URL: https://issues.apache.org/jira/browse/SPARK-35848
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Sai Polisetty
>Assignee: Sean R. Owen
>Priority: Minor
>
> When the Bloom filter stat function is invoked on a large dataframe that 
> requires a BitArray larger than 2GB, it results in a 
> java.lang.OutOfMemoryError. As mentioned in a similar 
> bug, this is due to the zero value passed to treeAggregate. Irrespective of 
> the spark.serializer value, the zero value is serialized using JavaSerializer, 
> which has a hard limit of 2GB. Using a solution similar to SPARK-26228 and 
> setting spark.serializer to KryoSerializer can avoid this error.
>  
> Steps to reproduce:
> {{val df = List.range(0, 10).toDF("Id")}}{{val expectedNumItems = 20L 
> // 2 billion}}
> {{val fpp = 0.03}}
> {{val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)}}
> Stack trace:
> java.lang.OutOfMemoryError
>  at 
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at 
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at 
> org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>  at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>  at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2604) at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>  at 
> org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207) 
> at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:395) at 
> org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203) at 
> org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>  at 
> org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:541)



--
This message was sent by Atlassian Jira

[jira] [Created] (SPARK-36227) Remove TimestampNTZ type support in Spark 3.2

2021-07-20 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-36227:
--

 Summary: Remove TimestampNTZ type support in Spark 3.2
 Key: SPARK-36227
 URL: https://issues.apache.org/jira/browse/SPARK-36227
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


As of now, there are some blockers for delivering the TimestampNTZ project in 
Spark 3.2:
# In the Hive Thrift server, both TimestampType and TimestampNTZType are mapped 
to the same timestamp type, which can cause confusion for users. 
# For the Parquet data source, newly written TimestampNTZType Parquet columns 
will be read back as TimestampType by old Spark releases. Also, we need to decide 
on the merged schema for files that mix TimestampType and TimestampNTZType.
# The type coercion rules for TimestampNTZType are incomplete. For example, 
what should the data type of the IN clause IN(Timestamp'2020-01-01 00:00:00', 
TimestampNtz'2020-01-01 00:00:00') be?
# It is tricky to support TimestampNTZType in the JSON/CSV data readers. We need 
to avoid regressions as much as we can.

There are 10 days left before the expected 3.2 RC date. So, I propose to release 
the TimestampNTZ type in Spark 3.3 instead of Spark 3.2, so that we have enough 
time to make well-considered designs for these issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36222) Step by days in the Sequence expression for dates

2021-07-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36222.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33439
[https://github.com/apache/spark/pull/33439]

> Step by days in the Sequence expression for dates
> -
>
> Key: SPARK-36222
> URL: https://issues.apache.org/jira/browse/SPARK-36222
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> Allow generating a sequence of dates with a day step within a range of dates. 
> For instance:
> {code:sql}
> spark-sql> select sequence(date'2021-07-01', date'2021-07-10', interval '3' 
> day);
> Error in query: cannot resolve 'sequence(DATE '2021-07-01', DATE 
> '2021-07-10', INTERVAL '3' DAY)' due to data type mismatch:
> sequence uses the wrong parameter type. The parameter type must conform to:
> 1. The start and stop expressions must resolve to the same type.
> 2. If start and stop expressions resolve to the 'date' or 'timestamp' type
> then the step expression must resolve to the 'interval' or
> 'interval year to month' or 'interval day to second' type,
> otherwise to the same type as the start and stop expressions.
>  ; line 1 pos 7;
> 'Project [unresolvedalias(sequence(2021-07-01, 2021-07-10, Some(INTERVAL '3' 
> DAY), Some(Europe/Moscow)), None)]
> +- OneRowRelation
> {code}
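With the fix merged for 3.2.0, the same call should resolve; a hedged illustration of the intended behavior follows, where the exact column header and display formatting are assumptions rather than captured output.

{code:java}
// Expected behavior after the fix, per the issue description: a date sequence
// stepped by a day-time interval. The printed table is illustrative only.
spark.sql("SELECT sequence(date'2021-07-01', date'2021-07-10', interval '3' day)").show(false)
// |[2021-07-01, 2021-07-04, 2021-07-07, 2021-07-10]|
{code}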



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36222) Step by days in the Sequence expression for dates

2021-07-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-36222:


Assignee: jiaan.geng

> Step by days in the Sequence expression for dates
> -
>
> Key: SPARK-36222
> URL: https://issues.apache.org/jira/browse/SPARK-36222
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: jiaan.geng
>Priority: Major
>
> Allow generating a sequence of dates with a day step within a range of dates. 
> For instance:
> {code:sql}
> spark-sql> select sequence(date'2021-07-01', date'2021-07-10', interval '3' 
> day);
> Error in query: cannot resolve 'sequence(DATE '2021-07-01', DATE 
> '2021-07-10', INTERVAL '3' DAY)' due to data type mismatch:
> sequence uses the wrong parameter type. The parameter type must conform to:
> 1. The start and stop expressions must resolve to the same type.
> 2. If start and stop expressions resolve to the 'date' or 'timestamp' type
> then the step expression must resolve to the 'interval' or
> 'interval year to month' or 'interval day to second' type,
> otherwise to the same type as the start and stop expressions.
>  ; line 1 pos 7;
> 'Project [unresolvedalias(sequence(2021-07-01, 2021-07-10, Some(INTERVAL '3' 
> DAY), Some(Europe/Moscow)), None)]
> +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-20 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-36210.
-
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 33423
[https://github.com/apache/spark/pull/33423]

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Assignee: koert kuipers
>Priority: Minor
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefits from the map's lookup 
> features. So it seems simpler to use a Seq instead, which also preserves the 
> insertion order.
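A small, Spark-independent illustration of why the Map-based bookkeeping described above can scramble column order while a Seq preserves it; the column names here are made up.

{code:java}
object ColumnOrderExample {
  def main(args: Array[String]): Unit = {
    val names = Seq("year", "month", "day", "hour", "minute", "second")

    // Scala's default immutable Map only preserves insertion order for up to
    // four entries; beyond that it becomes a hash map and the order is lost.
    val asMap = names.map(n => n -> n.toUpperCase).toMap
    println(asMap.keys.mkString(", "))      // order not guaranteed to match `names`

    // A Seq of pairs keeps insertion order, which is the behavior the fix
    // wants for Dataset.withColumns.
    val asSeq = names.map(n => n -> n.toUpperCase)
    println(asSeq.map(_._1).mkString(", ")) // year, month, day, hour, minute, second
  }
}
{code}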



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-20 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-36210:
---

Assignee: koert kuipers

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Assignee: koert kuipers
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefits from the map's lookup 
> features. So it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36020) Check logical link in remove redundant projects

2021-07-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384360#comment-17384360
 ] 

Apache Spark commented on SPARK-36020:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33442

> Check logical link in remove redundant projects
> ---
>
> Key: SPARK-36020
> URL: https://issues.apache.org/jira/browse/SPARK-36020
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


