[jira] [Assigned] (SPARK-40436) Upgrade Scala to 2.12.17

2022-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40436:
-

Assignee: Yang Jie

> Upgrade Scala to 2.12.17
> 
>
> Key: SPARK-40436
> URL: https://issues.apache.org/jira/browse/SPARK-40436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> https://github.com/scala/scala/releases/tag/v2.12.17



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40436) Upgrade Scala to 2.12.17

2022-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40436.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37892
[https://github.com/apache/spark/pull/37892]

> Upgrade Scala to 2.12.17
> 
>
> Key: SPARK-40436
> URL: https://issues.apache.org/jira/browse/SPARK-40436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/scala/scala/releases/tag/v2.12.17



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40471) Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40471.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37914
[https://github.com/apache/spark/pull/37914]

> Upgrade RoaringBitmap to 0.9.32
> ---
>
> Key: SPARK-40471
> URL: https://issues.apache.org/jira/browse/SPARK-40471
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> [https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32]
> Two bug fixes:
>  * [#574|https://github.com/RoaringBitmap/RoaringBitmap/issues/574] Bug in 
> RoaringBatchIterator.clone(), fixed by 
> [#575|https://github.com/RoaringBitmap/RoaringBitmap/pull/575] 
> ([commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/6a57f8cdcef3694a0724f597e36534f31e27d8ac])
>  * Roaring64Bitmap.forInRange correctness issues, fixed by 
> [#578|https://github.com/RoaringBitmap/RoaringBitmap/pull/578] 
> ([commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/d53aeb5af83e15d8a841d9448c3ecc60d2402f76])
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40471) Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40471:
-

Assignee: Yang Jie

> Upgrade RoaringBitmap to 0.9.32
> ---
>
> Key: SPARK-40471
> URL: https://issues.apache.org/jira/browse/SPARK-40471
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> [https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32]
> Two bug fixes:
>  * [#574|https://github.com/RoaringBitmap/RoaringBitmap/issues/574] Bug in 
> RoaringBatchIterator.clone(), fixed by 
> [#575|https://github.com/RoaringBitmap/RoaringBitmap/pull/575] 
> ([commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/6a57f8cdcef3694a0724f597e36534f31e27d8ac])
>  * Roaring64Bitmap.forInRange correctness issues, fixed by 
> [#578|https://github.com/RoaringBitmap/RoaringBitmap/pull/578] 
> ([commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/d53aeb5af83e15d8a841d9448c3ecc60d2402f76])
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40461) Set upperbound for pyzmq 24.0.0 for Python linter

2022-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-40461:
---

Assignee: Hyukjin Kwon

> Set upperbound for pyzmq 24.0.0 for Python linter
> -
>
> Key: SPARK-40461
> URL: https://issues.apache.org/jira/browse/SPARK-40461
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.3, 3.3.0, 3.2.2, 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>
> Error: https://github.com/apache/spark/actions/runs/3063515551/jobs/4947782771
> New release (10 hours ago): 
> https://github.com/zeromq/pyzmq/commit/2d3327d2e50c2510d45db2fc51488578a737b79b



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-40149:
---

Assignee: Wenchen Fan

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> When star expansion is used on the left side of a join, the result will 
> include the joining key, while on the right side of the join it doesn't. I 
> would expect the behaviour to be symmetric (either included on both sides or 
> on neither). Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  
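
A possible interim workaround (my sketch, not from the ticket; it reuses 
df_left/df_right from the repro above and assumes PySpark 3.2+): build both 
structs from explicit column lists instead of star expansion, so the joining 
key is excluded on both sides by construction, matching the pre-3.2 output.

{code:python}
import pyspark.sql.functions as f

# Hedged workaround sketch: enumerate each side's non-key columns explicitly
# rather than relying on 'left.*' / 'right.*', which expand asymmetrically.
key = ["id"]
left_cols = [f.col(f"left.{c}") for c in df_left.columns if c not in key]
right_cols = [f.col(f"right.{c}") for c in df_right.columns if c not in key]

df_sym = (
    df_left.alias("left")
    .join(df_right.alias("right"), on="id", how="full_outer")
    .withColumn("left_all", f.struct(*left_cols))
    .withColumn("right_all", f.struct(*right_cols))
)
df_sym.show()
{code}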



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39656) Fix wrong namespace in DescribeNamespaceExec

2022-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-39656:

Fix Version/s: (was: 3.1.4)
   (was: 3.4.0)
   (was: 3.3.1)
   (was: 3.2.3)

> Fix wrong namespace in DescribeNamespaceExec
> 
>
> Key: SPARK-39656
> URL: https://issues.apache.org/jira/browse/SPARK-39656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
>
> DescribeNamespaceExec should show the whole namespace rather than only its 
> last name part
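
A minimal illustration of the symptom (my sketch, not from the ticket; it 
assumes a v2 catalog registered as {{testcat}} that supports namespaces):

{code:python}
# Hedged sketch: with a multi-level namespace, DESCRIBE NAMESPACE was reported
# to show only the last name part ("ns2") rather than the full "ns1.ns2".
spark.sql("CREATE NAMESPACE testcat.ns1.ns2")
spark.sql("DESCRIBE NAMESPACE testcat.ns1.ns2").show(truncate=False)
{code}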



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'

2022-09-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606023#comment-17606023
 ] 

Dongjoon Hyun commented on SPARK-39854:
---

It turns out SPARK-35194 caused this regression. I tested at the commit of 
SPARK-35194 and at its parent commit, and verified that the regression appears 
only after SPARK-35194.

> Catalyst 'ColumnPruning' Optimizer does not play well with sql function 
> 'explode'
> -
>
> Key: SPARK-39854
> URL: https://issues.apache.org/jira/browse/SPARK-39854
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Environment: Spark version: the latest (3.4.0-SNAPSHOT)
> OS: Ubuntu 20.04
> JDK: Amazon corretto-11.0.14.1
>Reporter: Jiaji Wu
>Priority: Major
>
> The *ColumnPruning* optimizer batch does not always work with the *explode* 
> sql function.
>  * Here's a code snippet to repro the issue:
>  
> {code:java}
> import spark.implicits._
> val testJson =
>   """{
> | "b": {
> |  "id": "id00",
> |  "data": [{
> |   "b1": "vb1",
> |   "b2": 101,
> |   "ex2": [
> |{ "fb1": false, "fb2": 11, "fb3": "t1" },
> |{ "fb1": true, "fb2": 12, "fb3": "t2" }
> |   ]}, {
> |   "b1": "vb2",
> |   "b2": 102,
> |   "ex2": [
> |{ "fb1": false, "fb2": 13, "fb3": "t3" },
> |{ "fb1": true, "fb2": 14, "fb3": "t4" }
> |   ]}
> |  ],
> |  "fa": "tes",
> |  "v": "1.5"
> | }
> |}
> |""".stripMargin
> val df = spark.read.json((testJson :: Nil).toDS())
>   .withColumn("ex_b", explode($"b.data.ex2"))
>   .withColumn("ex_b2", explode($"ex_b"))
> val df1 = df
>   .withColumn("rt", struct(
> $"b.fa".alias("rt_fa"),
> $"b.v".alias("rt_v")
>   ))
>   .drop("b", "ex_b")
> df1.show(false){code}
>  * the result exception:
> {code:java}
> Exception in thread "main" java.lang.IllegalStateException: Couldn't find 
> _extract_v#35 in [_extract_fa#36,ex_b2#13]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
>     at 
> 

[jira] [Commented] (SPARK-40476) Reduce the shuffle size of ALS

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606020#comment-17606020
 ] 

Apache Spark commented on SPARK-40476:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37918

> Reduce the shuffle size of ALS
> --
>
> Key: SPARK-40476
> URL: https://issues.apache.org/jira/browse/SPARK-40476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40476) Reduce the shuffle size of ALS

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40476:


Assignee: (was: Apache Spark)

> Reduce the shuffle size of ALS
> --
>
> Key: SPARK-40476
> URL: https://issues.apache.org/jira/browse/SPARK-40476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40476) Reduce the shuffle size of ALS

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40476:


Assignee: Apache Spark

> Reduce the shuffle size of ALS
> --
>
> Key: SPARK-40476
> URL: https://issues.apache.org/jira/browse/SPARK-40476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32059) Nested Schema Pruning not Working in Window Functions

2022-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32059.
---
Fix Version/s: 3.1.0
 Assignee: Frank Yin
   Resolution: Fixed

This was resolved via [https://github.com/apache/spark/pull/28898]

 

cc [~maropu] since this was merged via 
https://github.com/apache/spark/pull/28898#issuecomment-664715448

> Nested Schema Pruning not Working in Window Functions
> -
>
> Key: SPARK-32059
> URL: https://issues.apache.org/jira/browse/SPARK-32059
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Frank Yin
>Assignee: Frank Yin
>Priority: Major
> Fix For: 3.1.0
>
>
> Using tables and data structures in `SchemaPruningSuite.scala`
>  
> {code:java}
> // code placeholder
> case class FullName(first: String, middle: String, last: String)
> case class Company(name: String, address: String)
> case class Employer(id: Int, company: Company)
> case class Contact(
>   id: Int,
>   name: FullName,
>   address: String,
>   pets: Int,
>   friends: Array[FullName] = Array.empty,
>   relatives: Map[String, FullName] = Map.empty,
>   employer: Employer = null,
>   relations: Map[FullName, String] = Map.empty)
> case class Department(
>   depId: Int,
>   depName: String,
>   contactId: Int,
>   employer: Employer)
> {code}
>  
> The query to run:
> {code:java}
> // code placeholder
> select a.name.first from (select row_number() over (partition by address 
> order by id desc) as __rank, contacts.* from contacts) a where a.name.first = 
> 'A' AND a.__rank = 1
> {code}
>  
> The current physical plan:
> {code:java}
> // code placeholder
> == Physical Plan ==
> *(3) Project [name#46.first AS first#74]
> +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND 
> (name#46.first = A)) AND (__rank#71 = 1))
>+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS 
> LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS __rank#71], [address#47], [id#45 DESC NULLS LAST]
>   +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], 
> false, 0
>  +- Exchange hashpartitioning(address#47, 5), true, [id=#52]
> +- *(1) Project [id#45, name#46, address#47]
>+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: 
> false, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<id:int,name:struct<first:string,middle:string,last:string>,address:string>
> {code}
>  
> The desired physical plan:
>  
> {code:java}
> // code placeholder
> == Physical Plan ==
> *(3) Project [_gen_alias_77#77 AS first#74]
> +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND 
> (_gen_alias_77#77 = A)) AND (__rank#71 = 1))
>+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS 
> LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS __rank#71], [address#47], [id#45 DESC NULLS LAST]
>   +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], 
> false, 0
>  +- Exchange hashpartitioning(address#47, 5), true, [id=#52]
> +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, 
> address#47]
>+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: 
> false, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<id:int,name:struct<first:string>,address:string>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40476) Reduce the shuffle size of ALS

2022-09-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-40476:
--
Summary: Reduce the shuffle size of ALS  (was: Optimize the shuffle size of 
ALS)

> Reduce the shuffle size of ALS
> --
>
> Key: SPARK-40476
> URL: https://issues.apache.org/jira/browse/SPARK-40476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`

2022-09-16 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-40477:
-

 Summary: Support `NullType` in `ColumnarBatchRow`
 Key: SPARK-40477
 URL: https://issues.apache.org/jira/browse/SPARK-40477
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


`ColumnarBatchRow.get()` does not currently support `NullType`. Support 
`NullType` in `ColumnarBatchRow` so that `NullType` can be used as a partition 
column type.
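
A rough repro sketch (my assumption, not from the ticket: it relies on 
partition inference typing an all-null partition column as {{NullType}}, and 
on the scan going through the vectorized path that materializes 
{{ColumnarBatchRow}}):

{code:python}
from pyspark.sql import functions as F

# Hedged sketch: write a table partitioned by an all-null column, then read it
# back; the partition column can be inferred as NullType, which
# ColumnarBatchRow.get() reportedly could not handle before this change.
spark.range(3).withColumn("p", F.lit(None)) \
    .write.mode("overwrite").partitionBy("p").parquet("/tmp/nulltype_part")
spark.read.parquet("/tmp/nulltype_part").show()
{code}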



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40476) Optimize the shuffle size of ALS

2022-09-16 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40476:
-

 Summary: Optimize the shuffle size of ALS
 Key: SPARK-40476
 URL: https://issues.apache.org/jira/browse/SPARK-40476
 Project: Spark
  Issue Type: Improvement
  Components: ML, SQL
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'

2022-09-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606016#comment-17606016
 ] 

Dongjoon Hyun commented on SPARK-39854:
---

Thank you, [~jiajiwu] .

The recommended workaround would be to disable only nested schema pruning, 
instead of disabling the whole ColumnPruning rule:
{code:java}
spark.sql.optimizer.expression.nestedPruning.enabled=false 
spark.sql.optimizer.nestedSchemaPruning.enabled=false{code}
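
The same pair of settings can also be applied to a running session (a PySpark 
sketch of the workaround above; the config keys are exactly the ones listed):

{code:python}
# Disable only nested pruning, leaving the rest of the ColumnPruning batch
# active for the current session.
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", "false")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
{code}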
 

For example, 
{code:java}
$ bin/spark-shell -c spark.sql.optimizer.expression.nestedPruning.enabled=false 
-c spark.sql.optimizer.nestedSchemaPruning.enabled=false
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
22/09/16 17:02:01 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1663372921582).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.16)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import spark.implicits._

val testJson =
  """{
    | "b": {
    |  "id": "id00",
    |  "data": [{
    |   "b1": "vb1",
    |   "b2": 101,
    |   "ex2": [
    |    { "fb1": false, "fb2": 11, "fb3": "t1" },
    |    { "fb1": true, "fb2": 12, "fb3": "t2" }
    |   ]}, {
    |   "b1": "vb2",
    |   "b2": 102,
    |   "ex2": [
    |    { "fb1": false, "fb2": 13, "fb3": "t3" },
    |    { "fb1": true, "fb2": 14, "fb3": "t4" }
    |   ]}
    |  ],
    |  "fa": "tes",
    |  "v": "1.5"
    | }
    |}
    |""".stripMargin
val df = spark.read.json((testJson :: Nil).toDS())
  .withColumn("ex_b", explode($"b.data.ex2"))
  .withColumn("ex_b2", explode($"ex_b"))
val df1 = df
  .withColumn("rt", struct(
    $"b.fa".alias("rt_fa"),
    $"b.v".alias("rt_v")
  ))
  .drop("b", "ex_b")
df1.show(false)

// Exiting paste mode, now interpreting.

+---------------+----------+
|ex_b2          |rt        |
+---------------+----------+
|{false, 11, t1}|{tes, 1.5}|
|{true, 12, t2} |{tes, 1.5}|
|{false, 13, t3}|{tes, 1.5}|
|{true, 14, t4} |{tes, 1.5}|
+---------------+----------+

import spark.implicits._
testJson: String =
"{
 "b": {
  "id": "id00",
  "data": [{
   "b1": "vb1",
   "b2": 101,
   "ex2": [
    { "fb1": false, "fb2": 11, "fb3": "t1" },
    { "fb1": true, "fb2": 12, "fb3": "t2" }
   ]}, {
   "b1": "vb2",
   "b2": 102,
   "ex2": [
    { "fb1": false, "fb2": 13, "fb3": "t3" },
    { "fb1": true, "fb2": 14, "fb3": "t4" }
   ]}
  ],
  "fa": "tes",
  "v": "1.5"
 }
}
"
df: org.apache.spark.sql.DataFrame = [b: struct<data: array<struct<b1:string,b2:bigint,ex2:array<struct<fb1:boolean,fb2:bigint,fb3:string>>>>, fa: string ... 2 more fields>, ex_b: array<struct<fb1:boolean,fb2:bigint,fb3:string>> ... 1 more field]
df1: org.apache.spark.sql.DataFrame = [ex_b2: struct<fb1:boolean,fb2:bigint,fb3:string>, rt: struct<rt_fa:string,rt_v:string>]

> Catalyst 'ColumnPruning' Optimizer does not play well with sql function 
> 'explode'
> -
>
> Key: SPARK-39854
> URL: https://issues.apache.org/jira/browse/SPARK-39854
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Environment: Spark version: the latest (3.4.0-SNAPSHOT)
> OS: Ubuntu 20.04
> JDK: Amazon corretto-11.0.14.1
>Reporter: Jiaji Wu
>Priority: Major
>
> The *ColumnPruning* optimizer batch does not always work with the *explode* 
> sql function.
>  * Here's a code snippet to repro the issue:
>  
> {code:java}
> import spark.implicits._
> val testJson =
>   """{
> | "b": {
> |  "id": "id00",
> |  "data": [{
> |   "b1": "vb1",
> |   "b2": 101,
> |   "ex2": [
> |{ "fb1": false, "fb2": 11, "fb3": "t1" },
> |{ "fb1": true, "fb2": 12, "fb3": "t2" }
> |   ]}, {
> |   "b1": "vb2",
> |   "b2": 102,
> |   "ex2": [
> |{ "fb1": false, "fb2": 13, "fb3": "t3" },
> |{ "fb1": true, "fb2": 14, "fb3": "t4" }
> |   ]}
> |  ],
> |  "fa": "tes",
> |  "v": "1.5"
> | }
> |}
> |""".stripMargin
> val df = spark.read.json((testJson :: Nil).toDS())
>   .withColumn("ex_b", explode($"b.data.ex2"))
>   .withColumn("ex_b2", explode($"ex_b"))
> val df1 = df
>   .withColumn("rt", struct(
> $"b.fa".alias("rt_fa"),
> $"b.v".alias("rt_v")
>   ))
>   .drop("b", "ex_b")
> df1.show(false){code}
>  * the result exception:
> {code:java}
> Exception in thread "main" java.lang.IllegalStateException: Couldn't find 
> _extract_v#35 in [_extract_fa#36,ex_b2#13]
>     at 
> 

[jira] [Updated] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'

2022-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39854:
--
Affects Version/s: 3.2.2
   3.2.0

> Catalyst 'ColumnPruning' Optimizer does not play well with sql function 
> 'explode'
> -
>
> Key: SPARK-39854
> URL: https://issues.apache.org/jira/browse/SPARK-39854
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Environment: Spark version: the latest (3.4.0-SNAPSHOT)
> OS: Ubuntu 20.04
> JDK: Amazon corretto-11.0.14.1
>Reporter: Jiaji Wu
>Priority: Major
>
> The *ColumnPruning* optimizer batch does not always work with the *explode* 
> sql function.
>  * Here's a code snippet to repro the issue:
>  
> {code:java}
> import spark.implicits._
> val testJson =
>   """{
> | "b": {
> |  "id": "id00",
> |  "data": [{
> |   "b1": "vb1",
> |   "b2": 101,
> |   "ex2": [
> |{ "fb1": false, "fb2": 11, "fb3": "t1" },
> |{ "fb1": true, "fb2": 12, "fb3": "t2" }
> |   ]}, {
> |   "b1": "vb2",
> |   "b2": 102,
> |   "ex2": [
> |{ "fb1": false, "fb2": 13, "fb3": "t3" },
> |{ "fb1": true, "fb2": 14, "fb3": "t4" }
> |   ]}
> |  ],
> |  "fa": "tes",
> |  "v": "1.5"
> | }
> |}
> |""".stripMargin
> val df = spark.read.json((testJson :: Nil).toDS())
>   .withColumn("ex_b", explode($"b.data.ex2"))
>   .withColumn("ex_b2", explode($"ex_b"))
> val df1 = df
>   .withColumn("rt", struct(
> $"b.fa".alias("rt_fa"),
> $"b.v".alias("rt_v")
>   ))
>   .drop("b", "ex_b")
> df1.show(false){code}
>  * the result exception:
> {code:java}
> Exception in thread "main" java.lang.IllegalStateException: Couldn't find 
> _extract_v#35 in [_extract_fa#36,ex_b2#13]
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196)
>     at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
>     at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>     at 
> 

[jira] [Resolved] (SPARK-40447) Implement `kendall` correlation in `DataFrame.corr`

2022-09-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40447.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37913
[https://github.com/apache/spark/pull/37913]

> Implement `kendall` correlation in `DataFrame.corr`
> ---
>
> Key: SPARK-40447
> URL: https://issues.apache.org/jira/browse/SPARK-40447
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
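
A minimal usage sketch (my illustration, assuming a Spark 3.4 build that 
includes this change; the sample values are made up):

{code:python}
import pyspark.pandas as ps

# pandas API on Spark: DataFrame.corr with the newly supported method.
psdf = ps.DataFrame({"a": [1, 2, 3, 4], "b": [1, 3, 2, 4]})
print(psdf.corr(method="kendall"))
{code}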




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40445) Refactor Resampler

2022-09-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40445:
-

Assignee: Ruifeng Zheng

> Refactor Resampler
> --
>
> Key: SPARK-40445
> URL: https://issues.apache.org/jira/browse/SPARK-40445
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40445) Refactor Resampler

2022-09-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40445.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37897
[https://github.com/apache/spark/pull/37897]

> Refactor Resampler
> --
>
> Key: SPARK-40445
> URL: https://issues.apache.org/jira/browse/SPARK-40445
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2022-09-16 Thread David Morin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605996#comment-17605996
 ] 

David Morin edited comment on SPARK-39375 at 9/16/22 9:23 PM:
--

[https://lists.apache.org/thread/0w5kplnht26bg8bncdbnngknhdz5ko8m]

+1

Thx !


was (Author: aldu29):
[https://lists.apache.org/thread/0w5kplnht26bg8bncdbnngknhdz5ko8m]

Thx !

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2022-09-16 Thread David Morin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605996#comment-17605996
 ] 

David Morin edited comment on SPARK-39375 at 9/16/22 9:23 PM:
--

[https://lists.apache.org/thread/0w5kplnht26bg8bncdbnngknhdz5ko8m]

Thx !


was (Author: aldu29):
Got it [~hyukjin.kwon]  
[https://lists.apache.org/thread/0w5kplnht26bg8bncdbnngknhdz5ko8m]

Thx !

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2022-09-16 Thread David Morin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605996#comment-17605996
 ] 

David Morin commented on SPARK-39375:
-

Got it [~hyukjin.kwon]  
[https://lists.apache.org/thread/0w5kplnht26bg8bncdbnngknhdz5ko8m]

Thx !

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40475) Allow job status tracking with jobGroupId

2022-09-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605981#comment-17605981
 ] 

Dongjoon Hyun commented on SPARK-40475:
---

Thank you for filing a JIRA, [~anuragmantri] . Go for it!

> Allow job status tracking with jobGroupId
> -
>
> Key: SPARK-40475
> URL: https://issues.apache.org/jira/browse/SPARK-40475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.3.0
>Reporter: Anurag Mantripragada
>Priority: Major
>
> Spark lets us group jobs together by setting a job group id. This is useful 
> for checking the job group in the web UI. For example:
> {{spark.sparkContext().setJobGroup("mygroup_id", "my group description")}}
> We have a use-case where we would like to have a long-running Spark 
> application and have jobs submitted to it. We would like to programmatically 
> check the status of the jobs created by this group id. For example, 
> [SQLStatusStore|#L41] has `executionList()` which returns a map of jobs to 
> their status. There is no way to filter this based on jobGroupId. 
> This Jira is to add the ability to get fine-grained job statuses by 
> jobGroupId.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40475) Allow job status tracking with jobGroupId

2022-09-16 Thread Anurag Mantripragada (Jira)
Anurag Mantripragada created SPARK-40475:


 Summary: Allow job status tracking with jobGroupId
 Key: SPARK-40475
 URL: https://issues.apache.org/jira/browse/SPARK-40475
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 3.3.0
Reporter: Anurag Mantripragada


Spark lets us group jobs together by setting a job group id. This is useful for 
checking the job group in the web UI. For example:

{{spark.sparkContext().setJobGroup("mygroup_id", "my group description")}}

We have a use-case where we would like to have a long-running Spark application 
and have jobs submitted to it. We would like to programmatically check the 
status of the jobs created by this group id. For example, 
[SQLStatusStore|#L41] has `executionList()` which returns a map of jobs to 
their status. There is no way to filter this based on jobGroupId. 

This Jira is to add the ability to get fine-grained job statuses by jobGroupId.
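
For contrast, job-level (as opposed to SQL-execution-level) statuses can 
already be polled per group via the status tracker; a PySpark sketch of that 
existing API (method names as in Spark 3.3):

{code:python}
# Existing job-level tracking by group id; this Jira asks for the analogous
# filtering for SQL execution statuses.
sc = spark.sparkContext
sc.setJobGroup("mygroup_id", "jobs for my long-running app")
# ... trigger some actions here ...
tracker = sc.statusTracker()
for job_id in tracker.getJobIdsForGroup("mygroup_id"):
    info = tracker.getJobInfo(job_id)
    if info is not None:
        print(job_id, info.status)
{code}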



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1

2022-09-16 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-40169:
-
Fix Version/s: 3.2.3

> Fix the issue with Parquet column index and predicate pushdown in Data source 
> V1
> 
>
> Key: SPARK-40169
> URL: https://issues.apache.org/jira/browse/SPARK-40169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Ivan Sadikov
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> This is a follow-up for SPARK-39833. In 
> [https://github.com/apache/spark/pull/37419], we disabled the column index 
> for Parquet due to correctness issues that we found when filtering data on a 
> partition column overlapping with the data schema.
>  
> This ticket is for a permanent and thorough fix for the issue and the 
> re-enablement of the column index. See more details in the PR linked above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1

2022-09-16 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-40169:


Assignee: Chao Sun

> Fix the issue with Parquet column index and predicate pushdown in Data source 
> V1
> 
>
> Key: SPARK-40169
> URL: https://issues.apache.org/jira/browse/SPARK-40169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Ivan Sadikov
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.0
>
>
> This is a follow-up for SPARK-39833. In 
> [https://github.com/apache/spark/pull/37419], we disabled the column index 
> for Parquet due to correctness issues that we found when filtering data on a 
> partition column overlapping with the data schema.
>  
> This ticket is for a permanent and thorough fix for the issue and the 
> re-enablement of the column index. See more details in the PR linked above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1

2022-09-16 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-40169:
-
Fix Version/s: 3.3.1

> Fix the issue with Parquet column index and predicate pushdown in Data source 
> V1
> 
>
> Key: SPARK-40169
> URL: https://issues.apache.org/jira/browse/SPARK-40169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Ivan Sadikov
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> This is a follow-up for SPARK-39833. In 
> [https://github.com/apache/spark/pull/37419], we disabled the column index 
> for Parquet due to correctness issues that we found when filtering data on a 
> partition column overlapping with the data schema.
>  
> This ticket is for a permanent and thorough fix for the issue and the 
> re-enablement of the column index. See more details in the PR linked above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1

2022-09-16 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-40169.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37881
[https://github.com/apache/spark/pull/37881]

> Fix the issue with Parquet column index and predicate pushdown in Data source 
> V1
> 
>
> Key: SPARK-40169
> URL: https://issues.apache.org/jira/browse/SPARK-40169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Ivan Sadikov
>Priority: Major
> Fix For: 3.4.0
>
>
> This is a follow-up for SPARK-39833. In 
> [https://github.com/apache/spark/pull/37419], we disabled the column index 
> for Parquet due to correctness issues that we found when filtering data on a 
> partition column overlapping with the data schema.
>  
> This ticket is for a permanent and thorough fix for the issue and the 
> re-enablement of the column index. See more details in the PR linked above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-16 Thread Xiaonan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaonan Yang updated SPARK-40474:
-
Description: 
In this ticket, we introduced support for the date type in CSV schema 
inference. The schema inference behavior on date-time columns is now:
 * For a column only containing dates, we will infer it as Date type
 * For a column only containing timestamps, we will infer it as Timestamp type
 * For a column containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we were too ambitious with the last scenario: 
supporting it introduced significant code complexity and raised serious 
performance concerns. Thus, we want to simplify the behavior of the last 
scenario as follows:
 * For a column containing a mixture of dates and timestamps, we will infer it 
as String type

  was:
In this ticket, we introduced support for the date type in CSV schema 
inference. The schema inference behavior on date-time columns is now:
 * For a column only containing dates, we will infer it as Date type
 * For a column only containing timestamps, we will infer it as Timestamp type
 * For a column containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we were too ambitious with the last scenario: 
supporting it introduced significant code complexity and raised serious 
performance concerns. Thus, we want to simplify the behavior of the last 
scenario as follows:
 * For columns containing a mixture of dates and timestamps, we will infer it 
as String type


> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
> Fix For: 3.4.0
>
>
> In this ticket, we introduced support for the date type in CSV schema 
> inference. The schema inference behavior on date-time columns is now:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we were too ambitious with the last scenario: 
> supporting it introduced much complexity in the code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario to:
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-16 Thread Xiaonan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaonan Yang updated SPARK-40474:
-
Description: 
In this ticket, we introduced support for the date type in CSV schema inference. 
The schema inference behavior on date-time columns is now:
 * For a column only containing dates, we will infer it as Date type
 * For a column only containing timestamps, we will infer it as Timestamp type
 * For a column containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we were too ambitious with the last scenario: supporting 
it introduced much complexity in the code and caused a lot of performance 
concerns. Thus, we want to simplify the behavior of the last scenario to:
 * For columns containing a mixture of dates and timestamps, we will infer them 
as String type

  was:
In this [ticket|https://issues.apache.org/jira/browse/SPARK-39469], we 
introduced the support of date type in CSV schema inference. The schema 
inference behavior on date time columns now is:
 * For columns only containing dates, we will infer it as Date type
 * For columns only containing timestamps, we will infer it as Timestamp type
 * For columns containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we are too ambitious on the last scenario, to support 
which we have introduced much complexity in code and caused a lot of 
performance concerns. Thus, we want to simplify the behavior of the last 
scenario as:
 * For columns containing a mixture of dates and timestamps, we will infer it 
as String type


> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
> Fix For: 3.4.0
>
>
> In this ticket, we introduced support for the date type in CSV schema 
> inference. The schema inference behavior on date-time columns is now:
>  * For a column only containing dates, we will infer it as Date type
>  * For a column only containing timestamps, we will infer it as Timestamp type
>  * For a column containing a mixture of dates and timestamps, we will infer 
> it as Timestamp type
> However, we found that we were too ambitious with the last scenario: 
> supporting it introduced much complexity in the code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario to:
>  * For columns containing a mixture of dates and timestamps, we will infer 
> them as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-16 Thread Xiaonan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaonan Yang updated SPARK-40474:
-
Description: 
In this [ticket|https://issues.apache.org/jira/browse/SPARK-39469], we 
introduced support for the date type in CSV schema inference. The schema 
inference behavior on date-time columns is now:
 * For columns only containing dates, we will infer them as Date type
 * For columns only containing timestamps, we will infer them as Timestamp type
 * For columns containing a mixture of dates and timestamps, we will infer them 
as Timestamp type

However, we found that we were too ambitious with the last scenario: supporting 
it introduced much complexity in the code and caused a lot of performance 
concerns. Thus, we want to simplify the behavior of the last scenario to:
 * For columns containing a mixture of dates and timestamps, we will infer them 
as String type

  was:
In ticket, we introduced the support of date type in CSV schema inference. The 
schema inference behavior on date time columns now is:
 * For columns only containing dates, we will infer it as Date type
 * For columns only containing timestamps, we will infer it as Timestamp type
 * For columns containing a mixture of dates and timestamps, we will infer it 
as Timestamp type

However, we found that we are too ambitious on the last scenario, to support 
which we have introduced much complexity in code and caused a lot of 
performance concerns. Thus, we want to simplify the behavior of the last 
scenario as:
 * For columns containing a mixture of dates and timestamps, we will infer it 
as String type


> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
> Fix For: 3.4.0
>
>
> In this [ticket|https://issues.apache.org/jira/browse/SPARK-39469], we 
> introduced the support of date type in CSV schema inference. The schema 
> inference behavior on date time columns now is:
>  * For columns only containing dates, we will infer it as Date type
>  * For columns only containing timestamps, we will infer it as Timestamp type
>  * For columns containing a mixture of dates and timestamps, we will infer it 
> as Timestamp type
> However, we found that we are too ambitious on the last scenario, to support 
> which we have introduced much complexity in code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario as:
>  * For columns containing a mixture of dates and timestamps, we will infer it 
> as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-16 Thread Xiaonan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaonan Yang updated SPARK-40474:
-
Shepherd:   (was: Xiaonan Yang)

> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
> Fix For: 3.4.0
>
>
> In this ticket, we introduced support for the date type in CSV schema 
> inference. The schema inference behavior on date-time columns is now:
>  * For columns only containing dates, we will infer them as Date type
>  * For columns only containing timestamps, we will infer them as Timestamp 
> type
>  * For columns containing a mixture of dates and timestamps, we will infer 
> them as Timestamp type
> However, we found that we were too ambitious with the last scenario: 
> supporting it introduced much complexity in the code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario to:
>  * For columns containing a mixture of dates and timestamps, we will infer 
> them as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-16 Thread Xiaonan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaonan Yang updated SPARK-40474:
-
Shepherd: Xiaonan Yang

> Infer columns with mixed date and timestamp as String in CSV schema inference
> -
>
> Key: SPARK-40474
> URL: https://issues.apache.org/jira/browse/SPARK-40474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xiaonan Yang
>Priority: Major
> Fix For: 3.4.0
>
>
> In this ticket, we introduced support for the date type in CSV schema 
> inference. The schema inference behavior on date-time columns is now:
>  * For columns only containing dates, we will infer them as Date type
>  * For columns only containing timestamps, we will infer them as Timestamp 
> type
>  * For columns containing a mixture of dates and timestamps, we will infer 
> them as Timestamp type
> However, we found that we were too ambitious with the last scenario: 
> supporting it introduced much complexity in the code and caused a lot of 
> performance concerns. Thus, we want to simplify the behavior of the last 
> scenario to:
>  * For columns containing a mixture of dates and timestamps, we will infer 
> them as String type



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-16 Thread Xiaonan Yang (Jira)
Xiaonan Yang created SPARK-40474:


 Summary: Infer columns with mixed date and timestamp as String in 
CSV schema inference
 Key: SPARK-40474
 URL: https://issues.apache.org/jira/browse/SPARK-40474
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Xiaonan Yang
 Fix For: 3.4.0


In this ticket, we introduced support for the date type in CSV schema inference. 
The schema inference behavior on date-time columns is now:
 * For columns only containing dates, we will infer them as Date type
 * For columns only containing timestamps, we will infer them as Timestamp type
 * For columns containing a mixture of dates and timestamps, we will infer them 
as Timestamp type

However, we found that we were too ambitious with the last scenario: supporting 
it introduced much complexity in the code and caused a lot of performance 
concerns. Thus, we want to simplify the behavior of the last scenario to:
 * For columns containing a mixture of dates and timestamps, we will infer them 
as String type
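
A small sketch of the simplified rule (illustrative only; the exact inferred 
types depend on the Spark version and the inference options in effect):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One CSV column whose rows mix a date and a timestamp.
rows = spark.sparkContext.parallelize(["2022-09-16", "2022-09-16 10:00:00"])
df = spark.read.csv(rows, inferSchema=True)

# Under the simplified behavior described above, the mixed column is
# inferred as string rather than timestamp.
df.printSchema()
{code}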



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40466) Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40466:


Assignee: (was: Apache Spark)

> Improve the error message if the DSv2 source is disabled but DSv1 streaming 
> source is not available
> ---
>
> Key: SPARK-40466
> URL: https://issues.apache.org/jira/browse/SPARK-40466
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Huanli Wang
>Priority: Minor
>
> If the V2 data source is disabled, the current behavior falls back to the V1 
> data source, but it throws an error when the DSv1 source is not available. 
> Update the error message to indicate which config variable 
> (spark.sql.streaming.disabledV2MicroBatchReaders) needs to be modified in 
> order to re-enable the V2 data source.
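
For reference, the config named above takes a comma-separated list of fully 
qualified class names of DSv2 micro-batch readers to disable; the class name 
below is hypothetical:

{code:python}
# Disabling a DSv2 micro-batch reader forces the fallback to its DSv1
# counterpart; removing the entry re-enables the DSv2 path.
spark.conf.set(
    "spark.sql.streaming.disabledV2MicroBatchReaders",
    "org.example.sources.MyMicroBatchReader",  # hypothetical class name
)
{code}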



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40466) Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40466:


Assignee: Apache Spark

> Improve the error message if the DSv2 source is disabled but DSv1 streaming 
> source is not available
> ---
>
> Key: SPARK-40466
> URL: https://issues.apache.org/jira/browse/SPARK-40466
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Huanli Wang
>Assignee: Apache Spark
>Priority: Minor
>
> If the V2 data source is disabled, the current behavior falls back to the V1 
> data source, but it throws an error when the DSv1 source is not available. 
> Update the error message to indicate which config variable 
> (spark.sql.streaming.disabledV2MicroBatchReaders) needs to be modified in 
> order to re-enable the V2 data source.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40466) Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605921#comment-17605921
 ] 

Apache Spark commented on SPARK-40466:
--

User 'huanliwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37917

> Improve the error message if the DSv2 source is disabled but DSv1 streaming 
> source is not available
> ---
>
> Key: SPARK-40466
> URL: https://issues.apache.org/jira/browse/SPARK-40466
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Huanli Wang
>Priority: Minor
>
> If the V2 data source is disabled, the current behavior falls back to the V1 
> data source, but it throws an error when the DSv1 source is not available. 
> Update the error message to indicate which config variable 
> (spark.sql.streaming.disabledV2MicroBatchReaders) needs to be modified in 
> order to re-enable the V2 data source.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40466) Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605920#comment-17605920
 ] 

Apache Spark commented on SPARK-40466:
--

User 'huanliwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37917

> Improve the error message if the DSv2 source is disabled but DSv1 streaming 
> source is not available
> ---
>
> Key: SPARK-40466
> URL: https://issues.apache.org/jira/browse/SPARK-40466
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Huanli Wang
>Priority: Minor
>
> If the V2 data source is disabled, the current behavior falls back to the V1 
> data source, but it throws an error when the DSv1 source is not available. 
> Update the error message to indicate which config variable 
> (spark.sql.streaming.disabledV2MicroBatchReaders) needs to be modified in 
> order to re-enable the V2 data source.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40473) Migrate parsing errors onto error classes

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40473:


Assignee: Apache Spark  (was: Max Gekk)

> Migrate parsing errors onto error classes
> -
>
> Key: SPARK-40473
> URL: https://issues.apache.org/jira/browse/SPARK-40473
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Use temporary error classes in ParseException.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40473) Migrate parsing errors onto error classes

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605874#comment-17605874
 ] 

Apache Spark commented on SPARK-40473:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37916

> Migrate parsing errors onto error classes
> -
>
> Key: SPARK-40473
> URL: https://issues.apache.org/jira/browse/SPARK-40473
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Use temporary error classes in ParseException.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40473) Migrate parsing errors onto error classes

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40473:


Assignee: Max Gekk  (was: Apache Spark)

> Migrate parsing errors onto error classes
> -
>
> Key: SPARK-40473
> URL: https://issues.apache.org/jira/browse/SPARK-40473
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Use temporary error classes in ParseException.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40473) Migrate parsing errors onto error classes

2022-09-16 Thread Max Gekk (Jira)
Max Gekk created SPARK-40473:


 Summary: Migrate parsing errors onto error classes
 Key: SPARK-40473
 URL: https://issues.apache.org/jira/browse/SPARK-40473
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


Use temporary error classes in ParseException.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40398) Use Loop instead of Arrays.stream api

2022-09-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40398.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37843
[https://github.com/apache/spark/pull/37843]

> Use Loop instead of Arrays.stream api
> -
>
> Key: SPARK-40398
> URL: https://issues.apache.org/jira/browse/SPARK-40398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> When the logic of the stream pipeline is relatively simple, using 
> Arrays.stream is always slower than using a loop directly
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40398) Use Loop instead of Arrays.stream api

2022-09-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40398:


Assignee: Yang Jie

> Use Loop instead of Arrays.stream api
> -
>
> Key: SPARK-40398
> URL: https://issues.apache.org/jira/browse/SPARK-40398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> When the logic of the stream pipeline is relatively simple, using 
> Arrays.stream is always slower than using a loop directly
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40470) arrays_zip output unexpected alias column names when using GetMapValue and GetArrayStructFields

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40470.
--
Fix Version/s: 3.3.1
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37911
[https://github.com/apache/spark/pull/37911]

> arrays_zip output unexpected alias column names when using GetMapValue and 
> GetArrayStructFields
> ---
>
> Key: SPARK-40470
> URL: https://issues.apache.org/jira/browse/SPARK-40470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.4.0, 3.3.2
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.3.1, 3.2.3, 3.4.0
>
>
> This is a follow-up for https://issues.apache.org/jira/browse/SPARK-40292. 
> I forgot to fix the case when GetMapValue and GetArrayStructFields are used 
> in the arrays_zip function instead of structs.
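
A sketch of the kind of query involved, with hypothetical data (the exact 
pre-fix field names are not reproduced here):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [({"a": [1, 2]}, [{"b": 3}, {"b": 4}])],
    "m map<string, array<int>>, s array<struct<b:int>>")

# arrays_zip over a map value (GetMapValue) and over a field pulled out of
# an array of structs (GetArrayStructFields); these are the two extractors
# whose zipped-struct field names came out wrong before the fix.
df.select(F.arrays_zip(df.m["a"], df.s.b).alias("z")).printSchema()
{code}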



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40470) arrays_zip output unexpected alias column names when using GetMapValue and GetArrayStructFields

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40470:


Assignee: Ivan Sadikov

> arrays_zip output unexpected alias column names when using GetMapValue and 
> GetArrayStructFields
> ---
>
> Key: SPARK-40470
> URL: https://issues.apache.org/jira/browse/SPARK-40470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.4.0, 3.3.2
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
>
> This is a follow-up for https://issues.apache.org/jira/browse/SPARK-40292. 
> I forgot to fix the case when GetMapValue and GetArrayStructFields are used 
> in the arrays_zip function instead of structs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40453) Improve error handling for GRPC server

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40453:
-
Summary: Improve error handling for GRPC server  (was: Improve Error 
handling for GRPC server)

> Improve error handling for GRPC server
> --
>
> Key: SPARK-40453
> URL: https://issues.apache.org/jira/browse/SPARK-40453
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> Right now the errors are handled in a very rudimentary way and do not produce 
> proper GRPC errors. This issue addresses the work needed to return proper 
> errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40452) Developer documentation

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40452:
-
Summary: Developer documentation  (was: Developer Documentation)

> Developer documentation
> ---
>
> Key: SPARK-40452
> URL: https://issues.apache.org/jira/browse/SPARK-40452
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> Extend the existing sparse developer documentation for Spark Connect so that 
> other developers can more easily contribute.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40449) Extend test coverage of Catalyst optimizer

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40449:
-
Summary: Extend test coverage of Catalyst optimizer  (was: Extend test 
coverage of Planner)

> Extend test coverage of Catalyst optimizer
> --
>
> Key: SPARK-40449
> URL: https://issues.apache.org/jira/browse/SPARK-40449
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.3.0
>Reporter: Martin Grund
>Priority: Major
>
> Extend the coverage of the proto -> Spark Logical Plan translation to cover 
> all cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40449) Extend test coverage of Analyzer

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40449:
-
Summary: Extend test coverage of Analyzer  (was: Extend test coverage of 
Catalyst optimizer)

> Extend test coverage of Analyzer
> 
>
> Key: SPARK-40449
> URL: https://issues.apache.org/jira/browse/SPARK-40449
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.3.0
>Reporter: Martin Grund
>Priority: Major
>
> Extend the coverage of the proto -> Spark Logical Plan translation to cover 
> all cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39673) High-Level design doc for Spark Connect

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39673:
-
Summary: High-Level design doc for Spark Connect  (was: High-Level Design 
Doc for Spark Connect)

> High-Level design doc for Spark Connect
> ---
>
> Key: SPARK-39673
> URL: https://issues.apache.org/jira/browse/SPARK-39673
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Documentation
>Affects Versions: 3.0.0
>Reporter: Martin Grund
>Priority: Major
>
> Please find the HLD for Spark Connect here: 
> https://s.apache.org/hld-spark-connect



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40448) Prototype implementation

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40448:
-
Summary: Prototype implementation  (was: Prototype Implementation)

> Prototype implementation
> 
>
> Key: SPARK-40448
> URL: https://issues.apache.org/jira/browse/SPARK-40448
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> In [https://github.com/apache/spark/pull/37710] we created a prototype that 
> shows the end-to-end integration of Spark Connect with the rest of the system.
>  
> Since the PR is quite large, we will track follow-up items as children of 
> SPARK-39375.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40449) Extend test coverage of Planner

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40449:
-
Summary: Extend test coverage of Planner  (was: Extend Test Coverage of 
Planner)

> Extend test coverage of Planner
> ---
>
> Key: SPARK-40449
> URL: https://issues.apache.org/jira/browse/SPARK-40449
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.3.0
>Reporter: Martin Grund
>Priority: Major
>
> Extend the coverage of the proto -> Spark Logical Plan translation to cover 
> all cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39674) Initial protobuf definition for Spark Connect API

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39674:
-
Summary: Initial protobuf definition for Spark Connect API  (was: Initial 
Protobuf Definition for Spark Connect API)

> Initial protobuf definition for Spark Connect API
> -
>
> Key: SPARK-39674
> URL: https://issues.apache.org/jira/browse/SPARK-39674
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.0.0
>Reporter: Martin Grund
>Priority: Major
>
> Provide the initial revision of the Spark Connect protobuf definitions of the 
> service and the relation types based on the first prototype.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40448) Prototype Implementation

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40448:


Assignee: (was: Apache Spark)

> Prototype Implementation
> 
>
> Key: SPARK-40448
> URL: https://issues.apache.org/jira/browse/SPARK-40448
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> In [https://github.com/apache/spark/pull/37710] we created a prototype that 
> shows the end-to-end integration of Spark Connect with the rest of the system.
>  
> Since the PR is quite large, we will track follow-up items as children of 
> SPARK-39375.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40448) Prototype Implementation

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605781#comment-17605781
 ] 

Apache Spark commented on SPARK-40448:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/37710

> Prototype Implementation
> 
>
> Key: SPARK-40448
> URL: https://issues.apache.org/jira/browse/SPARK-40448
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> In [https://github.com/apache/spark/pull/37710] we created a prototype that 
> shows the end-to-end integration of Spark Connect with the rest of the system.
>  
> Since the PR is quite large, we will track follow-up items as children of 
> SPARK-39375.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40448) Prototype Implementation

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40448:


Assignee: Apache Spark

> Prototype Implementation
> 
>
> Key: SPARK-40448
> URL: https://issues.apache.org/jira/browse/SPARK-40448
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Assignee: Apache Spark
>Priority: Major
>
> In [https://github.com/apache/spark/pull/37710] we created a prototype that 
> shows the end-to-end integration of Spark Connect with the rest of the system.
>  
> Since the PR is quite large, we will track follow-up items as children of 
> SPARK-39375.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40448) Prototype Implementation

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605782#comment-17605782
 ] 

Apache Spark commented on SPARK-40448:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/37710

> Prototype Implementation
> 
>
> Key: SPARK-40448
> URL: https://issues.apache.org/jira/browse/SPARK-40448
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> In [https://github.com/apache/spark/pull/37710] we created a prototype that 
> shows the end-to-end integration of Spark Connect with the rest of the system.
>  
> Since the PR is quite large, we will track follow-up items as children of 
> SPARK-39375.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40472) Improve pyspark.sql.function example experience

2022-09-16 Thread deshanxiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deshanxiao updated SPARK-40472:
---
Description: 
There are many examples in pyspark.sql.functions:
{code:java}
    Examples
    --------
    >>> df = spark.range(1)
    >>> df.select(lit(5).alias('height'), df.id).show()
    +------+---+
    |height| id|
    +------+---+
    |     5|  0|
    +------+---+ {code}
We can add import statements so that users can run them directly.
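
For example, the snippet above becomes copy-paste runnable once the imports and 
session setup are included (a sketch of the proposed style):

{code:python}
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import lit
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.range(1)
>>> df.select(lit(5).alias('height'), df.id).show()
+------+---+
|height| id|
+------+---+
|     5|  0|
+------+---+
{code}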

> Improve pyspark.sql.function example experience
> ---
>
> Key: SPARK-40472
> URL: https://issues.apache.org/jira/browse/SPARK-40472
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Priority: Minor
>
> There are many examples in pyspark.sql.functions:
> {code:java}
>     Examples
>     --------
>     >>> df = spark.range(1)
>     >>> df.select(lit(5).alias('height'), df.id).show()
>     +------+---+
>     |height| id|
>     +------+---+
>     |     5|  0|
>     +------+---+ {code}
> We can add import statements so that users can run them directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40472) Improve pyspark.sql.function example experience

2022-09-16 Thread deshanxiao (Jira)
deshanxiao created SPARK-40472:
--

 Summary: Improve pyspark.sql.function example experience
 Key: SPARK-40472
 URL: https://issues.apache.org/jira/browse/SPARK-40472
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: deshanxiao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40441) With PANDAS_UDF, data from tasks on the same physical node is aggregated into one task execution, resulting in concurrency not being fully utilized

2022-09-16 Thread SimonAries (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605755#comment-17605755
 ] 

SimonAries commented on SPARK-40441:


Let me try that

> With PANDAS_UDF, data from tasks on the same physical node is aggregated into 
> one task execution, resulting in concurrency not being fully utilized
> ---
>
> Key: SPARK-40441
> URL: https://issues.apache.org/jira/browse/SPARK-40441
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: SimonAries
>Priority: Major
> Attachments: image-2022-09-15-14-28-04-332.png, 
> image-2022-09-15-14-29-35-004.png
>
>
> {code:java}
> # code placeholder
> import json
> import pandas as pd
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> import torch
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> import argparse
> torch.set_num_threads(1)
> schema = T.StructType([T.StructField("topic_id", T.StringType(), True),
>T.StructField("topic_ht", T.StringType(), True)
>])
> def parse_args():
> parser = argparse.ArgumentParser()
> parser.add_argument('--input_path', help='input path',
> default="./*", type=str)
> parser.add_argument('--output_path', help='output path',
> default="./tmp_output", type=str)
> parser.add_argument('--project_name', help='project name',
> default="tmp", type=str)
> parser.add_argument('--calc_date', help='data partition date',
> default="2022-06-21", type=str)
> parser.add_argument('--edition_codes', help='code',
> default="01,19", type=str)
> parser.add_argument('--subject_code', help='code',
> default="02", type=str)
> parser.add_argument('--phase_code', help='code',
> default="03", type=str)
> return parser.parse_args()
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def generate_topic_ht(df):
> from pycc.topic_ht_extractor import model_inference_engine
> torch.set_num_threads(1)
> engine = model_inference_engine("./pycc/model/", "./pycc/resource/")
> df_res = pd.DataFrame(columns=["topic_id", "topic_ht"])
> for i in range(len(df)):
> topic_json_str = df.iloc[i:i + 1]["question_info"].values[0]
> topic = json.loads(topic_json_str.strip())
> topic_ht = engine.predict(topic)
> df_res = df_res.append({"topic_id": topic["id"], "topic_ht": 
> str(topic_ht)}, ignore_index=True)
> return df_res
> if "__main__" == __name__:
> conf = SparkConf() \
> .setAppName("generate_topic_ht") \
> .set("spark.sql.execution.arrow.enabled", "true")
> args = parse_args()
> spark = 
> SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
> spark.sql("select * from {}.dwd_rencently_month_topic_theme_incr where 
> part = '{}' "
>   "and 
> business_type='xxj_map_topic_ht'".format(args.project_name, args.calc_date)) \
> .repartition(20).groupby(F.spark_partition_id()) \
> .apply(generate_topic_ht) \
> .write.mode("overwrite") \
> 
> .parquet("/project/{}/{}/db/dws/dws_topic_ht_incr/part={}/business_type=xxj_map_topic_ht"
>  .format(args.project_name, args.project_name, 
> args.calc_date))
> spark.sql("alter table {}.dws_topic_ht_incr drop if exists partition 
> (part='{}',business_type='xxj_map_topic_ht')".format(args.project_name, 
> args.calc_date))
> spark.sql("alter table {}.dws_topic_ht_incr add partition 
> (part='{}',business_type='xxj_map_topic_ht')".format(args.project_name, 
> args.calc_date)) {code}
>  
> This causes very serious data skew, even though I performed a repartition 
> operation before running the computation.
> !image-2022-09-15-14-29-35-004.png!
> !image-2022-09-15-14-28-04-332.png!
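
Not from the ticket, but a common workaround sketch for this pattern: grouping 
by spark_partition_id() keys all rows of a partition (and any partition IDs 
that hash together) into few groups, so grouping by a random salt column 
spreads the GROUPED_MAP work over many tasks. The column name and bucket count 
below are hypothetical:

{code:python}
import pyspark.sql.functions as F

# `df` stands for the DataFrame read by spark.sql(...) in the script above,
# and generate_topic_ht is the GROUPED_MAP UDF defined there.
num_buckets = 200  # hypothetical; tune to the desired parallelism

result = (df
          .withColumn("bucket", (F.rand() * num_buckets).cast("int"))
          .groupby("bucket")
          .apply(generate_topic_ht))
{code}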



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40467) Split FlatMapGroupsWithState down to multiple test suites

2022-09-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-40467.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37907
[https://github.com/apache/spark/pull/37907]

> Split FlatMapGroupsWithState down to multiple test suites
> -
>
> Key: SPARK-40467
> URL: https://issues.apache.org/jira/browse/SPARK-40467
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.4.0
>
>
> The suite now exceeds 1,800 lines, which is too large to maintain further.
> After a quick glance, it looks like the suite can be broken down into three 
> parts (suites), as the parts don't couple with one another:
> 1. GroupState tests
> 2. E2E tests which don't test initial state
> 3. E2E tests which test initial state



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40467) Split FlatMapGroupsWithState down to multiple test suites

2022-09-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-40467:


Assignee: Jungtaek Lim

> Split FlatMapGroupsWithState down to multiple test suites
> -
>
> Key: SPARK-40467
> URL: https://issues.apache.org/jira/browse/SPARK-40467
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
>
> The suite now exceeds 1,800 lines, which is too large to maintain further.
> After a quick glance, it looks like the suite can be broken down into three 
> parts (suites), as the parts don't couple with one another:
> 1. GroupState tests
> 2. E2E tests which don't test initial state
> 3. E2E tests which test initial state



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40465) Refactor Decimal so as we can use Int128 as underlying implementation

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605751#comment-17605751
 ] 

Apache Spark commented on SPARK-40465:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37915

> Refactor Decimal so as we can use Int128 as underlying implementation
> -
>
> Key: SPARK-40465
> URL: https://issues.apache.org/jira/browse/SPARK-40465
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40465) Refactor Decimal so as we can use Int128 as underlying implementation

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40465:


Assignee: Apache Spark

> Refactor Decimal so as we can use Int128 as underlying implementation
> -
>
> Key: SPARK-40465
> URL: https://issues.apache.org/jira/browse/SPARK-40465
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40465) Refactor Decimal so as we can use Int128 as underlying implementation

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40465:


Assignee: (was: Apache Spark)

> Refactor Decimal so as we can use Int128 as underlying implementation
> -
>
> Key: SPARK-40465
> URL: https://issues.apache.org/jira/browse/SPARK-40465
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40465) Refactor Decimal so as we can use Int128 as underlying implementation

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605749#comment-17605749
 ] 

Apache Spark commented on SPARK-40465:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37915

> Refactor Decimal so as we can use Int128 as underlying implementation
> -
>
> Key: SPARK-40465
> URL: https://issues.apache.org/jira/browse/SPARK-40465
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38618) Implement JDBCDataSourceV2

2022-09-16 Thread shengkui leng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605700#comment-17605700
 ] 

shengkui leng commented on SPARK-38618:
---

Any update on this task?

> Implement JDBCDataSourceV2
> --
>
> Key: SPARK-38618
> URL: https://issues.apache.org/jira/browse/SPARK-38618
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Andrew Murphy
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40471) Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605693#comment-17605693
 ] 

Apache Spark commented on SPARK-40471:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37914

> Upgrade RoaringBitmap to 0.9.32
> ---
>
> Key: SPARK-40471
> URL: https://issues.apache.org/jira/browse/SPARK-40471
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32]
> Two bug fixes:
>  * Bug in RoaringBatchIterator.clone() 
> ([#574|https://github.com/RoaringBitmap/RoaringBitmap/issues/574], 
> [#575|https://github.com/RoaringBitmap/RoaringBitmap/pull/575], 
> [commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/6a57f8cdcef3694a0724f597e36534f31e27d8ac])
>  * Roaring64Bitmap.forInRange correctness issues 
> ([#578|https://github.com/RoaringBitmap/RoaringBitmap/pull/578], 
> [commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/d53aeb5af83e15d8a841d9448c3ecc60d2402f76])
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40471) Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40471:


Assignee: Apache Spark

> Upgrade RoaringBitmap to 0.9.32
> ---
>
> Key: SPARK-40471
> URL: https://issues.apache.org/jira/browse/SPARK-40471
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> [https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32]
> Two bug fixes:
>  * Bug in RoaringBatchIterator.clone() 
> ([#574|https://github.com/RoaringBitmap/RoaringBitmap/issues/574], 
> [#575|https://github.com/RoaringBitmap/RoaringBitmap/pull/575], 
> [commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/6a57f8cdcef3694a0724f597e36534f31e27d8ac])
>  * Roaring64Bitmap.forInRange correctness issues 
> ([#578|https://github.com/RoaringBitmap/RoaringBitmap/pull/578], 
> [commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/d53aeb5af83e15d8a841d9448c3ecc60d2402f76])
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40471) Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605692#comment-17605692
 ] 

Apache Spark commented on SPARK-40471:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37914

> Upgrade RoaringBitmap to 0.9.32
> ---
>
> Key: SPARK-40471
> URL: https://issues.apache.org/jira/browse/SPARK-40471
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32]
> Two bug fixes:
>  * Bug in RoaringBatchIterator.clone() 
> ([#574|https://github.com/RoaringBitmap/RoaringBitmap/issues/574], 
> [#575|https://github.com/RoaringBitmap/RoaringBitmap/pull/575], 
> [commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/6a57f8cdcef3694a0724f597e36534f31e27d8ac])
>  * Roaring64Bitmap.forInRange correctness issues 
> ([#578|https://github.com/RoaringBitmap/RoaringBitmap/pull/578], 
> [commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/d53aeb5af83e15d8a841d9448c3ecc60d2402f76])
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40471) Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40471:


Assignee: (was: Apache Spark)

> Upgrade RoaringBitmap to 0.9.32
> ---
>
> Key: SPARK-40471
> URL: https://issues.apache.org/jira/browse/SPARK-40471
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32]
> Two bug fixes:
>  * Bug in RoaringBatchIterator.clone() 
> ([#574|https://github.com/RoaringBitmap/RoaringBitmap/issues/574], 
> [#575|https://github.com/RoaringBitmap/RoaringBitmap/pull/575], 
> [commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/6a57f8cdcef3694a0724f597e36534f31e27d8ac])
>  * Roaring64Bitmap.forInRange correctness issues 
> ([#578|https://github.com/RoaringBitmap/RoaringBitmap/pull/578], 
> [commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/d53aeb5af83e15d8a841d9448c3ecc60d2402f76])
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40471) Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread Yang Jie (Jira)
Yang Jie created SPARK-40471:


 Summary: Upgrade RoaringBitmap to 0.9.32
 Key: SPARK-40471
 URL: https://issues.apache.org/jira/browse/SPARK-40471
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie


[https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32]

Two bug fixes:
 * Bug in RoaringBatchIterator.clone() 
([#574|https://github.com/RoaringBitmap/RoaringBitmap/issues/574], 
[#575|https://github.com/RoaringBitmap/RoaringBitmap/pull/575], 
[commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/6a57f8cdcef3694a0724f597e36534f31e27d8ac])
 * Roaring64Bitmap.forInRange correctness issues 
([#578|https://github.com/RoaringBitmap/RoaringBitmap/pull/578], 
[commit|https://github.com/RoaringBitmap/RoaringBitmap/commit/d53aeb5af83e15d8a841d9448c3ecc60d2402f76])
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40447) Implement `kendall` correlation in `DataFrame.corr`

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40447:


Assignee: Apache Spark  (was: Ruifeng Zheng)

> Implement `kendall` correlation in `DataFrame.corr`
> ---
>
> Key: SPARK-40447
> URL: https://issues.apache.org/jira/browse/SPARK-40447
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>
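
For reference (an editorial note, not part of the ticket): Kendall's rank 
correlation over n paired observations counts concordant (C) and discordant (D) 
pairs. With n_0 = n(n-1)/2 and T_x, T_y the numbers of pairs tied on x and on y 
respectively, the two common variants are:

{noformat}
\tau_a = \frac{C - D}{n_0}, \qquad
\tau_b = \frac{C - D}{\sqrt{(n_0 - T_x)(n_0 - T_y)}}
{noformat}

The tie-corrected \tau_b is what pandas' method='kendall' is generally 
understood to compute, so a pandas-consistent implementation in `DataFrame.corr` 
would need the same tie handling.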




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40447) Implement `kendall` correlation in `DataFrame.corr`

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40447:


Assignee: Ruifeng Zheng  (was: Apache Spark)

> Implement `kendall` correlation in `DataFrame.corr`
> ---
>
> Key: SPARK-40447
> URL: https://issues.apache.org/jira/browse/SPARK-40447
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40447) Implement `kendall` correlation in `DataFrame.corr`

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605683#comment-17605683
 ] 

Apache Spark commented on SPARK-40447:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37913

> Implement `kendall` correlation in `DataFrame.corr`
> ---
>
> Key: SPARK-40447
> URL: https://issues.apache.org/jira/browse/SPARK-40447
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40463) Update gpg's keyserver

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40463:


Assignee: Yuming Wang

> Update gpg's keyserver
> --
>
> Key: SPARK-40463
> URL: https://issues.apache.org/jira/browse/SPARK-40463
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {noformat}
> yumwang@LM-SHC-16508156 create-release % gpg --keyserver keyserver.ubuntu.com 
> --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9
> gpg: keyserver receive failed: End of file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40463) Update gpg's keyserver

2022-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40463.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37906
[https://github.com/apache/spark/pull/37906]

> Update gpg's keyserver
> --
>
> Key: SPARK-40463
> URL: https://issues.apache.org/jira/browse/SPARK-40463
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> {noformat}
> yumwang@LM-SHC-16508156 create-release % gpg --keyserver keyserver.ubuntu.com 
> --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9
> gpg: keyserver receive failed: End of file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605658#comment-17605658
 ] 

Apache Spark commented on SPARK-40196:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37912

> Consolidate `lit` function with NumPy scalar in sql and pandas module
> -
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
> `lit` function handles NumPy scalars with different implementations in the sql 
> and pandas modules; as a result, sql produces a less precise result than pandas.
> We should make their results consistent, preferring the more precise behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605657#comment-17605657
 ] 

Apache Spark commented on SPARK-40196:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37912

> Consolidate `lit` function with NumPy scalar in sql and pandas module
> -
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
> `lit` function handles NumPy scalars with different implementations in the sql 
> and pandas modules; as a result, sql produces a less precise result than pandas.
> We should make their results consistent, preferring the more precise behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40470) arrays_zip output unexpected alias column names when using GetMapValue and GetArrayStructFields

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40470:


Assignee: Apache Spark

> arrays_zip output unexpected alias column names when using GetMapValue and 
> GetArrayStructFields
> ---
>
> Key: SPARK-40470
> URL: https://issues.apache.org/jira/browse/SPARK-40470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.4.0, 3.3.2
>Reporter: Ivan Sadikov
>Assignee: Apache Spark
>Priority: Major
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-40292. 
> That fix missed the case where GetMapValue and GetArrayStructFields, rather 
> than struct fields, are used as inputs to the arrays_zip function.
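
For illustration, a hedged Scala sketch of the query shape involved (not taken 
from the PR; the object name is illustrative, and the exact field names printed 
vary by Spark version). Here m['k'] resolves to a GetMapValue expression and 
s.a to GetArrayStructFields; the fix concerns the field names these contribute 
to the zipped struct.

{noformat}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_zip, col}

object ArraysZipAliasSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("arrays-zip-sketch").getOrCreate()

    val df = spark.sql(
      """SELECT map('k', array(1, 2)) AS m,
        |       array(named_struct('a', 10), named_struct('a', 20)) AS s""".stripMargin)

    // getItem on a map column becomes GetMapValue; getField on an array of
    // structs becomes GetArrayStructFields. The printed schema shows the
    // field names of the zipped struct entries.
    df.select(arrays_zip(col("m").getItem("k"), col("s").getField("a"))).printSchema()

    spark.stop()
  }
}
{noformat}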



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40470) arrays_zip output unexpected alias column names when using GetMapValue and GetArrayStructFields

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605651#comment-17605651
 ] 

Apache Spark commented on SPARK-40470:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37911

> arrays_zip output unexpected alias column names when using GetMapValue and 
> GetArrayStructFields
> ---
>
> Key: SPARK-40470
> URL: https://issues.apache.org/jira/browse/SPARK-40470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.4.0, 3.3.2
>Reporter: Ivan Sadikov
>Priority: Major
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-40292. 
> That fix missed the case where GetMapValue and GetArrayStructFields, rather 
> than struct fields, are used as inputs to the arrays_zip function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40470) arrays_zip output unexpected alias column names when using GetMapValue and GetArrayStructFields

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40470:


Assignee: (was: Apache Spark)

> arrays_zip output unexpected alias column names when using GetMapValue and 
> GetArrayStructFields
> ---
>
> Key: SPARK-40470
> URL: https://issues.apache.org/jira/browse/SPARK-40470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.4.0, 3.3.2
>Reporter: Ivan Sadikov
>Priority: Major
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-40292. 
> That fix missed the case where GetMapValue and GetArrayStructFields, rather 
> than struct fields, are used as inputs to the arrays_zip function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40470) arrays_zip output unexpected alias column names when using GetMapValue and GetArrayStructFields

2022-09-16 Thread Ivan Sadikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov updated SPARK-40470:
-
Summary: arrays_zip output unexpected alias column names when using 
GetMapValue and GetArrayStructFields  (was: arrays_zip output unexpected alias 
column names when using Map)

> arrays_zip output unexpected alias column names when using GetMapValue and 
> GetArrayStructFields
> ---
>
> Key: SPARK-40470
> URL: https://issues.apache.org/jira/browse/SPARK-40470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.4.0, 3.3.2
>Reporter: Ivan Sadikov
>Priority: Major
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-40292. 
> That fix missed the case where GetMapValue and GetArrayStructFields, rather 
> than struct fields, are used as inputs to the arrays_zip function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2022-09-16 Thread David Morin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605637#comment-17605637
 ] 

David Morin commented on SPARK-39375:
-

Hi [~hyukjin.kwon], good idea to take the discussion on this topic to the dev 
mailing list!

Just to be sure: are you talking about 
[d...@spark.apache.org|mailto:d...@spark.apache.org]?

 

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and the fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40469) Avoid creating directory failures

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40469:


Assignee: Apache Spark

> Avoid creating directory failures
> -
>
> Key: SPARK-40469
> URL: https://issues.apache.org/jira/browse/SPARK-40469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {noformat}
> java.nio.file.NoSuchFileException: 
> /hadoop/3/yarn/local/usercache/b_carmel/appcache/application_1654776504115_37917/blockmgr-e18b484f-8c49-4c7d-b649-710439b0e4c3/3c
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
>   at java.nio.file.Files.createDirectory(Files.java:674)
>   at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:123)
>   at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:146)
>   at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:147)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:853)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:250)
>   at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:245)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1383)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:245)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:109)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:132)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:487)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1417)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:490)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
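
As an editorial aside, here is a minimal Scala sketch of one common way to make 
directory creation tolerant of races and pre-existing paths; the actual change 
in the linked PR may differ, and the names below are illustrative.

{noformat}
import java.io.File
import java.nio.file.{FileAlreadyExistsException, Files}

object SafeMkdirSketch {
  /** Create `dir` and any missing parents; true iff a directory exists afterwards. */
  def ensureDirectory(dir: File): Boolean =
    try {
      // createDirectories is a no-op for an existing directory and tolerates
      // concurrent creation of parents, unlike the bare createDirectory call
      // seen in the stack trace above.
      Files.createDirectories(dir.toPath)
      true
    } catch {
      case _: FileAlreadyExistsException => dir.isDirectory // non-directory in the way
    }

  def main(args: Array[String]): Unit = {
    val target = new File(System.getProperty("java.io.tmpdir"), "blockmgr-sketch/3c")
    assert(ensureDirectory(target))
  }
}
{noformat}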



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40469) Avoid creating directory failures

2022-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605633#comment-17605633
 ] 

Apache Spark commented on SPARK-40469:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37910

> Avoid creating directory failures
> -
>
> Key: SPARK-40469
> URL: https://issues.apache.org/jira/browse/SPARK-40469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> java.nio.file.NoSuchFileException: 
> /hadoop/3/yarn/local/usercache/b_carmel/appcache/application_1654776504115_37917/blockmgr-e18b484f-8c49-4c7d-b649-710439b0e4c3/3c
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
>   at java.nio.file.Files.createDirectory(Files.java:674)
>   at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:123)
>   at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:146)
>   at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:147)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:853)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:250)
>   at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:245)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1383)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:245)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:109)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:132)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:487)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1417)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:490)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40469) Avoid creating directory failures

2022-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40469:


Assignee: (was: Apache Spark)

> Avoid creating directory failures
> -
>
> Key: SPARK-40469
> URL: https://issues.apache.org/jira/browse/SPARK-40469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> java.nio.file.NoSuchFileException: 
> /hadoop/3/yarn/local/usercache/b_carmel/appcache/application_1654776504115_37917/blockmgr-e18b484f-8c49-4c7d-b649-710439b0e4c3/3c
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
>   at java.nio.file.Files.createDirectory(Files.java:674)
>   at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:123)
>   at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:146)
>   at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:147)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:853)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:250)
>   at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:245)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1383)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:245)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:109)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:132)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:487)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1417)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:490)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40469) Avoid creating directory failures

2022-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40469:

Component/s: Spark Core
 (was: Build)

> Avoid creating directory failures
> -
>
> Key: SPARK-40469
> URL: https://issues.apache.org/jira/browse/SPARK-40469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/scala/scala/releases/tag/v2.12.17



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40469) Avoid creating directory failures

2022-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40469:

Description: 
{noformat}
java.nio.file.NoSuchFileException: 
/hadoop/3/yarn/local/usercache/b_carmel/appcache/application_1654776504115_37917/blockmgr-e18b484f-8c49-4c7d-b649-710439b0e4c3/3c
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
at java.nio.file.Files.createDirectory(Files.java:674)
at 
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:123)
at 
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:146)
at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:147)
at 
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:853)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:250)
at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:245)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1383)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:245)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:109)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:132)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:487)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1417)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:490)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}


  was:https://github.com/scala/scala/releases/tag/v2.12.17


> Avoid creating directory failures
> -
>
> Key: SPARK-40469
> URL: https://issues.apache.org/jira/browse/SPARK-40469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> java.nio.file.NoSuchFileException: 
> /hadoop/3/yarn/local/usercache/b_carmel/appcache/application_1654776504115_37917/blockmgr-e18b484f-8c49-4c7d-b649-710439b0e4c3/3c
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
>   at java.nio.file.Files.createDirectory(Files.java:674)
>   at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:123)
>   at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:146)
>   at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:147)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:853)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:250)
>   at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:245)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1383)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:245)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:109)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
>   at 
> 

[jira] [Updated] (SPARK-40469) Avoid creating directory failures

2022-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40469:

Summary: Avoid creating directory failures  (was: Upgrade Scala to 2.12.17)

> Avoid creating directory failures
> -
>
> Key: SPARK-40469
> URL: https://issues.apache.org/jira/browse/SPARK-40469
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/scala/scala/releases/tag/v2.12.17



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40469) Upgrade Scala to 2.12.17

2022-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reopened SPARK-40469:
-

> Upgrade Scala to 2.12.17
> 
>
> Key: SPARK-40469
> URL: https://issues.apache.org/jira/browse/SPARK-40469
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/scala/scala/releases/tag/v2.12.17



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org