[jira] [Resolved] (SPARK-43502) DataFrame.drop should support empty column

2023-05-15 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43502.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41180
[https://github.com/apache/spark/pull/41180]

> DataFrame.drop should support empty column
> --
>
> Key: SPARK-43502
> URL: https://issues.apache.org/jira/browse/SPARK-43502
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
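The report body is empty; for context, a minimal sketch of the intended no-op semantics, shown here with the Scala API (where dropping zero columns already leaves the DataFrame unchanged):

{code:scala}
val df = spark.range(3).toDF("id")   // `spark` is the active SparkSession
val unchanged = df.drop()            // zero column names: should be a no-op, not an error
assert(unchanged.columns.sameElements(df.columns))
{code}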




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43504) [K8S] Mounts the hadoop config map on the executor pod

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Summary: [K8S] Mounts the hadoop config map on the executor pod  (was: 
[K8S] Mount hadoop config map in executor side)

> [K8S] Mounts the hadoop config map on the executor pod
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map is not mounted on the executor side.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still needs the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot, so we still need to mount the hadoop config map on the 
> executor side.
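For illustration, a minimal sketch (with hypothetical HA settings for the `zeus` nameservice) of why executors need the Hadoop configuration to resolve the logical nameservice:

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// With an empty Configuration (no hdfs-site.xml mounted on the pod), "zeus" is
// treated as a hostname, so resolving hdfs://zeus fails: it is only a logical
// HA nameservice defined in the Hadoop configuration.
val conf = new Configuration(false)
// FileSystem.get(new URI("hdfs://zeus"), conf)   // would fail: unknown host "zeus"

// The executor needs the same entries the mounted hdfs-site.xml provides
// (values below are hypothetical):
conf.set("dfs.nameservices", "zeus")
conf.set("dfs.ha.namenodes.zeus", "nn1,nn2")
conf.set("dfs.namenode.rpc-address.zeus.nn1", "namenode1.example.com:8020")
conf.set("dfs.namenode.rpc-address.zeus.nn2", "namenode2.example.com:8020")
conf.set("dfs.client.failover.proxy.provider.zeus",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

val fs = FileSystem.get(new URI("hdfs://zeus"), conf)   // now resolvable
{code}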






[jira] [Updated] (SPARK-43504) [K8S] Mounts the hadoop config map on the executor pod

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Description: 
Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
config map is no longer mounted on the executor pod.

Per the [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still needs the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
executor cannot.

So we still need to mount the hadoop config map on the executor side.

  was:
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.


> [K8S] Mounts the hadoop config map on the executor pod
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map is no longer mounted on the executor pod.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still needs the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot, so we still need to mount the hadoop config map on the 
> executor side.






[jira] [Created] (SPARK-43520) Upgrade mysql-connector-java from 8.0.32 to 8.0.33

2023-05-15 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43520:
---

 Summary: Upgrade mysql-connector-java from 8.0.32 to 8.0.33
 Key: SPARK-43520
 URL: https://issues.apache.org/jira/browse/SPARK-43520
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-43488) bitmap function

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-43488:
-
Labels:   (was: patch)

> bitmap function
> ---
>
> Key: SPARK-43488
> URL: https://issues.apache.org/jira/browse/SPARK-43488
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yiku123
>Priority: Major
>
> Maybe Spark needs some bitmap functions, for example bitmapBuild, 
> bitmapAnd, and bitmapAndCardinality as in ClickHouse or other OLAP engines.
> These are often used in user-profiling applications, but I can't find them in Spark.
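For comparison, a rough approximation of what {{bitmapAndCardinality}} would compute, expressed with existing array functions (a sketch only, not an existing bitmap API):

{code:scala}
import org.apache.spark.sql.functions.{array_intersect, size}
import spark.implicits._   // `spark` is the active SparkSession

// Two "bitmaps" modelled as integer arrays; bitmapAndCardinality(a, b) would
// return the number of set bits present in both, here: 2.
val df = Seq((Seq(1, 2, 3), Seq(2, 3, 4))).toDF("a", "b")
df.select(size(array_intersect($"a", $"b")).as("andCardinality")).show()
{code}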






[jira] [Commented] (SPARK-43501) `test_ops_on_diff_frames_*` are much slower on Spark Connect

2023-05-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722969#comment-17722969
 ] 

Hyukjin Kwon commented on SPARK-43501:
--

cc [~itholic]

> `test_ops_on_diff_frames_*` are much slower on Spark Connect
> 
>
> Key: SPARK-43501
> URL: https://issues.apache.org/jira/browse/SPARK-43501
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-43483) Adds SQL references for OFFSET clause.

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43483.
--
Fix Version/s: 3.5.0
 Assignee: jiaan.geng
   Resolution: Fixed

> Adds SQL references for OFFSET clause.
> --
>
> Key: SPARK-43483
> URL: https://issues.apache.org/jira/browse/SPARK-43483
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> Spark 3.4.0 released the new OFFSET clause syntax,
> but the SQL reference is missing a description of it.
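For reference, a minimal example of the clause the new page needs to document:

{code:scala}
// OFFSET skips the first N rows of the ordered result; available since Spark 3.4.
spark.sql("SELECT id FROM range(10) ORDER BY id OFFSET 7").show()   // `spark` is the active SparkSession
// returns the rows with id 7, 8, 9
{code}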






[jira] [Commented] (SPARK-43483) Adds SQL references for OFFSET clause.

2023-05-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722968#comment-17722968
 ] 

Hyukjin Kwon commented on SPARK-43483:
--

Fixed in https://github.com/apache/spark/pull/41151

> Adds SQL references for OFFSET clause.
> --
>
> Key: SPARK-43483
> URL: https://issues.apache.org/jira/browse/SPARK-43483
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark 3.4.0 released the new OFFSET clause syntax,
> but the SQL reference is missing a description of it.






[jira] [Resolved] (SPARK-43482) Expand QueryTerminatedEvent to contain error class if it exists in exception

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43482.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41150
[https://github.com/apache/spark/pull/41150]

> Expand QueryTerminatedEvent to contain error class if it exists in exception
> 
>
> Key: SPARK-43482
> URL: https://issues.apache.org/jira/browse/SPARK-43482
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.5.0
>
>
> As of now, the event QueryTerminatedEvent has an exception field that is 
> just a string representation of the underlying exception, mostly the stack trace. 
> As a result, people relying on the streaming query listener get no benefit 
> from the effort to adopt the error class framework. This hurts 
> observability once we migrate exceptions in streaming queries to the error class 
> framework, where exceptions are well categorized.
> We can expand QueryTerminatedEvent: if the exception contains 
> error class information, we can ship it along with the existing string version 
> of the exception (the message).
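A minimal listener sketch showing where the new information would surface; the error-class field name below is hypothetical, not the final API:

{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val listener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    // Today: only the string form of the exception (mostly a stack trace).
    event.exception.foreach(msg => println(s"query failed: $msg"))
    // With the proposed change, something like this would become possible:
    // event.errorClassOnException.foreach(cls => println(s"error class: $cls"))
  }
}
spark.streams.addListener(listener)   // `spark` is the active SparkSession
{code}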






[jira] [Assigned] (SPARK-43482) Expand QueryTerminatedEvent to contain error class if it exists in exception

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43482:


Assignee: Jungtaek Lim

> Expand QueryTerminatedEvent to contain error class if it exists in exception
> 
>
> Key: SPARK-43482
> URL: https://issues.apache.org/jira/browse/SPARK-43482
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> As of now, the event QueryTerminatedEvent has an exception field that is 
> just a string representation of the underlying exception, mostly the stack trace. 
> As a result, people relying on the streaming query listener get no benefit 
> from the effort to adopt the error class framework. This hurts 
> observability once we migrate exceptions in streaming queries to the error class 
> framework, where exceptions are well categorized.
> We can expand QueryTerminatedEvent: if the exception contains 
> error class information, we can ship it along with the existing string version 
> of the exception (the message).






[jira] [Created] (SPARK-43519) Bump Parquet 1.13.1

2023-05-15 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-43519:
-

 Summary: Bump Parquet 1.13.1
 Key: SPARK-43519
 URL: https://issues.apache.org/jira/browse/SPARK-43519
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Cheng Pan









[jira] [Resolved] (SPARK-43517) Add a migration guide for namedtuple monkey patch

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43517.
--
Fix Version/s: 3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 41177
[https://github.com/apache/spark/pull/41177]

> Add a migration guide for namedtuple monkey patch
> -
>
> Key: SPARK-43517
> URL: https://issues.apache.org/jira/browse/SPARK-43517
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.5.0, 3.4.1
>
>
> Should add a migration guide for SPARK-41189 workaround.






[jira] [Assigned] (SPARK-43517) Add a migration guide for namedtuple monkey patch

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43517:


Assignee: Hyukjin Kwon

> Add a migration guide for namedtuple monkey patch
> -
>
> Key: SPARK-43517
> URL: https://issues.apache.org/jira/browse/SPARK-43517
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> Should add a migration guide for SPARK-41189 workaround.






[jira] [Assigned] (SPARK-43473) Support struct type in createDataFrame from pandas DataFrame

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43473:


Assignee: Takuya Ueshin

> Support struct type in createDataFrame from pandas DataFrame
> 
>
> Key: SPARK-43473
> URL: https://issues.apache.org/jira/browse/SPARK-43473
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> Support struct type in createDataFrame from pandas DataFrame with {{Row}} 
> object or {{dict}}.






[jira] [Resolved] (SPARK-43473) Support struct type in createDataFrame from pandas DataFrame

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43473.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41149
[https://github.com/apache/spark/pull/41149]

> Support struct type in createDataFrame from pandas DataFrame
> 
>
> Key: SPARK-43473
> URL: https://issues.apache.org/jira/browse/SPARK-43473
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>
> Support struct type in createDataFrame from pandas DataFrame with {{Row}} 
> object or {{dict}}.






[jira] [Created] (SPARK-43518) Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR

2023-05-15 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43518:
---

 Summary: Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR
 Key: SPARK-43518
 URL: https://issues.apache.org/jira/browse/SPARK-43518
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Description: 
We designed a function that joins two DFs on a common column using approximate 
similarity. All of the following code is in Scala 2.12.

I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, MinHashLSHModel, NGram, RegexTokenizer}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

/**
 * Joins two data frames on a string column using the LSH algorithm
 * for similarity computation.
 *
 * If the input data frames have columns with identical names,
 * the resulting data frame will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both data frames have a column named `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
    df: DataFrame,
    anotherDf: DataFrame,
    joinExpr: String,
    threshold: Double = 0.8,
): DataFrame = {
  df.show(false)
  anotherDf.show(false)

  val pipeline = new Pipeline().setStages(Array(
    new RegexTokenizer()
      .setPattern("")
      .setMinTokenLength(1)
      .setInputCol(joinExpr)
      .setOutputCol("tokens"),
    new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
    new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
    new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
  ))

  val model = pipeline.fit(df)

  val storedHashed = model.transform(df)
  val landedHashed = model.transform(anotherDf)

  val commonColumns = df.columns.toSet & anotherDf.columns.toSet

  /** Converts a column name from an input data frame to the corresponding column of the resulting dataset. */
  def convertColumn(datasetName: String)(columnName: String): Column = {
    val newName =
      if (commonColumns.contains(columnName)) s"$datasetName${columnName.capitalize}"
      else columnName

    col(s"$datasetName.$columnName") as newName
  }

  val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
    anotherDf.columns.map(convertColumn("datasetB"))

  val result = model
    .stages
    .last
    .asInstanceOf[MinHashLSHModel]
    .approxSimilarityJoin(storedHashed, landedHashed, threshold, "confidence")
    .select(columnsToSelect.toSeq: _*)

  result.show(false)

  result
}
{code}


Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the 
{{distinct}} method to one data frame:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. 
Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
defined function (LSHModel$$Lambda$3769/0x000101804840: 
(struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => 
array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>).
  ... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at 
least 1 non zero entry.
  at scala.Predef$.require(Predef.scala:281)
  at 
org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
  at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
  ... many more
{code}



Now let's take a look at an example that is close to our application code. 
Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions._


def process1(df: DataFrame): Unit = {
val companies = df.select($"id", $"name")

val directors = df
.select(explode($"directors"))
.select($"col.name", $"col.id")
.dropDuplicates("id")

val toBeMatched1 = companies
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "sourceLegalEntityId",
)

val toBeMatched2 = directors
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "directorId",
)

similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6)
}

def process2(df: DataFrame): Unit = {
def process_financials(column: Column): Column = {
transform(
column,
x => x.withField("date", to_timestamp(x("date"), "dd MMM ")),
)
}

   

[jira] [Resolved] (SPARK-43300) Cascade failure in Guava cache due to fate-sharing

2023-05-15 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-43300.

Fix Version/s: 3.5.0
   Resolution: Fixed

> Cascade failure in Guava cache due to fate-sharing
> --
>
> Key: SPARK-43300
> URL: https://issues.apache.org/jira/browse/SPARK-43300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Assignee: Ziqi Liu
>Priority: Major
> Fix For: 3.5.0
>
>
> Guava cache is widely used in Spark; however, it suffers from fate-sharing 
> behavior: if multiple requests try to access the same key in the 
> {{cache}} at the same time while the key is not yet cached, Guava cache will 
> block all requests and create the object only once. If the creation fails, 
> all requests fail immediately without retry, so we might see a task 
> failure caused by an unrelated failure in another query due to fate sharing.
> This fate-sharing behavior might lead to unexpected results in some situations.
> We can wrap the Guava cache with a KeyLock that synchronizes all requests 
> for the same key, so they run individually and fail as if they arrived one 
> at a time.
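A minimal sketch of the proposed pattern (an illustration, not Spark's actual KeyLock implementation): loads are serialized per key, so a failed load only fails the caller that performed it, and later callers recompute instead of inheriting the failure.

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import com.google.common.cache.{Cache, CacheBuilder}

class PerKeyLock[K] {
  private val locks = new ConcurrentHashMap[K, Object]()
  def withLock[T](key: K)(body: => T): T = {
    locks.putIfAbsent(key, new Object)   // one lock object per key
    locks.get(key).synchronized(body)
  }
}

val cache: Cache[String, String] =
  CacheBuilder.newBuilder().maximumSize(1000).build[String, String]()
val keyLock = new PerKeyLock[String]

def getOrCompute(key: String)(compute: => String): String =
  keyLock.withLock(key) {
    Option(cache.getIfPresent(key)).getOrElse {
      val value = compute   // if this throws, only the current caller fails
      cache.put(key, value)
      value
    }
  }
{code}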






[jira] [Commented] (SPARK-43300) Cascade failure in Guava cache due to fate-sharing

2023-05-15 Thread Josh Rosen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722959#comment-17722959
 ] 

Josh Rosen commented on SPARK-43300:


Fixed in https://github.com/apache/spark/pull/40982

> Cascade failure in Guava cache due to fate-sharing
> --
>
> Key: SPARK-43300
> URL: https://issues.apache.org/jira/browse/SPARK-43300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> Guava cache is widely used in Spark; however, it suffers from fate-sharing 
> behavior: if multiple requests try to access the same key in the 
> {{cache}} at the same time while the key is not yet cached, Guava cache will 
> block all requests and create the object only once. If the creation fails, 
> all requests fail immediately without retry, so we might see a task 
> failure caused by an unrelated failure in another query due to fate sharing.
> This fate-sharing behavior might lead to unexpected results in some situations.
> We can wrap the Guava cache with a KeyLock that synchronizes all requests 
> for the same key, so they run individually and fail as if they arrived one 
> at a time.






[jira] [Assigned] (SPARK-43300) Cascade failure in Guava cache due to fate-sharing

2023-05-15 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-43300:
--

Assignee: Ziqi Liu

> Cascade failure in Guava cache due to fate-sharing
> --
>
> Key: SPARK-43300
> URL: https://issues.apache.org/jira/browse/SPARK-43300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Assignee: Ziqi Liu
>Priority: Major
>
> Guava cache is widely used in Spark; however, it suffers from fate-sharing 
> behavior: if multiple requests try to access the same key in the 
> {{cache}} at the same time while the key is not yet cached, Guava cache will 
> block all requests and create the object only once. If the creation fails, 
> all requests fail immediately without retry, so we might see a task 
> failure caused by an unrelated failure in another query due to fate sharing.
> This fate-sharing behavior might lead to unexpected results in some situations.
> We can wrap the Guava cache with a KeyLock that synchronizes all requests 
> for the same key, so they run individually and fail as if they arrived one 
> at a time.






[jira] [Resolved] (SPARK-43281) Fix concurrent writer does not update file metrics

2023-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43281.
-
Fix Version/s: 3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 40952
[https://github.com/apache/spark/pull/40952]

> Fix concurrent writer does not update file metrics
> --
>
> Key: SPARK-43281
> URL: https://issues.apache.org/jira/browse/SPARK-43281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.5.0, 3.4.1
>
>
> It uses the temp file path to get the file status after the commit task. However, 
> the temp file has already been moved to its new path during the commit task.
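A generic illustration of the bug pattern (not Spark's actual writer code), using Hadoop's {{FileSystem}} API:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Commit renames the temp file to its final location; metric collection that
// still uses the temp path afterwards sees a missing file instead of the
// committed file's real status.
def commitAndMeasure(fs: FileSystem, tempPath: Path, finalPath: Path): Long = {
  fs.rename(tempPath, finalPath)                 // the commit moves the file
  // Buggy variant: fs.getFileStatus(tempPath).getLen  -> FileNotFoundException
  fs.getFileStatus(finalPath).getLen             // measure the committed path instead
}
{code}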






[jira] [Assigned] (SPARK-43281) Fix concurrent writer does not update file metrics

2023-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43281:
---

Assignee: XiDuo You

> Fix concurrent writer does not update file metrics
> --
>
> Key: SPARK-43281
> URL: https://issues.apache.org/jira/browse/SPARK-43281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>
> It uses the temp file path to get the file status after the commit task. However, 
> the temp file has already been moved to its new path during the commit task.






[jira] [Assigned] (SPARK-43413) IN subquery ListQuery has wrong nullability

2023-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43413:
---

Assignee: Jack Chen

> IN subquery ListQuery has wrong nullability
> ---
>
> Key: SPARK-43413
> URL: https://issues.apache.org/jira/browse/SPARK-43413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>
> IN subquery expressions are incorrectly marked as non-nullable, even when 
> they are actually nullable. They correctly check the nullability of the 
> left-hand side, but the right-hand side of an IN subquery, the ListQuery, is 
> currently always defined with nullability = false. This is incorrect and can 
> lead to incorrect query transformations.
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
> expression returns NULL when nullable_col is null, but our code marks it 
> as non-nullable, and therefore SimplifyBinaryComparison transforms away the 
> <=> TRUE, leaving non_nullable_col IN (select 
> nullable_col), which is an incorrect transformation because NULL values of 
> nullable_col now cause the expression to yield NULL instead of FALSE.
> This bug can potentially lead to wrong results, but in most cases this 
> doesn't directly cause wrong results end-to-end, because IN subqueries are 
> almost always transformed to semi/anti/existence joins in 
> RewritePredicateSubquery, and this rewrite can also incorrectly discard 
> NULLs, which is another bug. But we can observe it causing wrong behavior in 
> unit tests, and it could easily lead to incorrect query results if there are 
> changes to the surrounding context, so it should be fixed regardless.
> This is a long-standing bug that has existed at least since 2016, as long as 
> the ListQuery class has existed.
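The three-valued behavior the example relies on can be reproduced with a literal IN list (a plain illustration of the semantics, independent of the ListQuery code path):

{code:scala}
// 1 IN (NULL, 2): no match, but a NULL candidate exists, so the predicate is
// NULL rather than FALSE; <=> TRUE must therefore yield FALSE and cannot be
// simplified away when the predicate is nullable.
spark.sql("SELECT 1 IN (NULL, 2) AS p, (1 IN (NULL, 2)) <=> TRUE AS q").show()
// p = NULL, q = false   (`spark` is the active SparkSession)
{code}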






[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Summary: Unexpected NullPointerException or IllegalArgumentException inside 
UDFs of ML features caused by certain SQL functions  (was: Unexpected 
NullPointerException or IllegalArgumentException inside UDFs of ML features on 
empty data frames caused by certain SQL functions)

> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features caused by certain SQL functions
> --
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.1, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.1 deployed on cluster was used to check the issue on real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on common column with some 
> similarity. All next code will be on Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct,values:array>) => 
> array,values:array>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: Must have 
> at least 1 non zero entry.
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> 

[jira] [Resolved] (SPARK-43413) IN subquery ListQuery has wrong nullability

2023-05-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43413.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41094
[https://github.com/apache/spark/pull/41094]

> IN subquery ListQuery has wrong nullability
> ---
>
> Key: SPARK-43413
> URL: https://issues.apache.org/jira/browse/SPARK-43413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> IN subquery expressions are incorrectly marked as non-nullable, even when 
> they are actually nullable. They correctly check the nullability of the 
> left-hand side, but the right-hand side of an IN subquery, the ListQuery, is 
> currently always defined with nullability = false. This is incorrect and can 
> lead to incorrect query transformations.
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
> expression returns NULL when nullable_col is null, but our code marks it 
> as non-nullable, and therefore SimplifyBinaryComparison transforms away the 
> <=> TRUE, leaving non_nullable_col IN (select 
> nullable_col), which is an incorrect transformation because NULL values of 
> nullable_col now cause the expression to yield NULL instead of FALSE.
> This bug can potentially lead to wrong results, but in most cases this 
> doesn't directly cause wrong results end-to-end, because IN subqueries are 
> almost always transformed to semi/anti/existence joins in 
> RewritePredicateSubquery, and this rewrite can also incorrectly discard 
> NULLs, which is another bug. But we can observe it causing wrong behavior in 
> unit tests, and it could easily lead to incorrect query results if there are 
> changes to the surrounding context, so it should be fixed regardless.
> This is a long-standing bug that has existed at least since 2016, as long as 
> the ListQuery class has existed.






[jira] [Updated] (SPARK-43517) Add a migration guide for namedtuple monkey patch

2023-05-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-43517:
-
Priority: Minor  (was: Major)

> Add a migration guide for namedtuple monkey patch
> -
>
> Key: SPARK-43517
> URL: https://issues.apache.org/jira/browse/SPARK-43517
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Should add a migration guide for SPARK-41189 workaround.






[jira] [Created] (SPARK-43517) Add a migration guide for namedtuple monkey patch

2023-05-15 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-43517:


 Summary: Add a migration guide for namedtuple monkey patch
 Key: SPARK-43517
 URL: https://issues.apache.org/jira/browse/SPARK-43517
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


Should add a migration guide for SPARK-41189 workaround.






[jira] [Assigned] (SPARK-43516) Basic estimator / transformer / model / evaluator interfaces

2023-05-15 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-43516:
--

Assignee: Weichen Xu

> Basic estimator / transformer / model / evaluator interfaces
> 
>
> Key: SPARK-43516
> URL: https://issues.apache.org/jira/browse/SPARK-43516
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>







[jira] [Created] (SPARK-43516) Basic estimator / transformer / model / evaluator interfaces

2023-05-15 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-43516:
--

 Summary: Basic estimator / transformer / model / evaluator 
interfaces
 Key: SPARK-43516
 URL: https://issues.apache.org/jira/browse/SPARK-43516
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML, PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu









[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Environment: 
Scala version: 2.12.17

Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0.

Spark 3.3.1 deployed on cluster was used to check the issue on real data.

  was:
Scala version: 2.12.17

Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0.

Spark 3.3.1 deployed on cluster was used to check the issue with real data.


> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features on empty data frames caused by certain SQL functions
> ---
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.1, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.1 deployed on cluster was used to check the issue on real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on common column with some 
> similarity. All next code will be on Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct,values:array>) => 
> array,values:array>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement 

[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Environment: 
Scala version: 2.12.17

Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.

Spark 3.3.1 deployed on cluster was used to check the issue on real data.

  was:
Scala version: 2.12.17

Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0.

Spark 3.3.1 deployed on cluster was used to check the issue on real data.


> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features on empty data frames caused by certain SQL functions
> ---
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.1, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.1 deployed on cluster was used to check the issue on real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on common column with some 
> similarity. All next code will be on Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct,values:array>) => 
> array,values:array>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement 

[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Environment: 
Scala version: 2.12.17

Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0.

Spark 3.3.1 deployed on cluster was used to check the issue with real data.

  was:
Scala version: 2.12.17

Test examples was executed inside Zeppelin 0.10.1; Spark 3.3.1 deployed on 
cluster was used to check the issue with real data.


> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features on empty data frames caused by certain SQL functions
> ---
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.1, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.1 deployed on cluster was used to check the issue with real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on common column with some 
> similarity. All next code will be on Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct,values:array>) => 
> array,values:array>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: Must 

[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Summary: Unexpected NullPointerException or IllegalArgumentException inside 
UDFs of ML features on empty data frames caused by certain SQL functions  (was: 
Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
features caused by certain SQL functions)

> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features on empty data frames caused by certain SQL functions
> ---
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.1, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples was executed inside Zeppelin 0.10.1; Spark 3.3.1 deployed on 
> cluster was used to check the issue with real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on common column with some 
> similarity. All next code will be on Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider this simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add the
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct,values:array>) => 
> array,values:array>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: Must have 
> at least 1 non zero entry.
>   at scala.Predef$.require(Predef.scala:281)
>  

[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Description: 
We designed a function that joins two DFs on a common column with some
similarity. All code below is in Scala 2.12.

I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}

/**
 * Joins two data frames on a string column using LSH algorithm
 * for similarity computation.
 *
 * If input data frames have columns with identical names,
 * the resulting dataframe will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both dataframes have a column with name `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
df: DataFrame,
anotherDf: DataFrame,
joinExpr: String,
threshold: Double = 0.8,
): DataFrame = {
df.show(false)
anotherDf.show(false)

val pipeline = new Pipeline().setStages(Array(
new RegexTokenizer()
.setPattern("")
.setMinTokenLength(1)
.setInputCol(joinExpr)
.setOutputCol("tokens"),
new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
)
)

val model = pipeline.fit(df)

val storedHashed = model.transform(df)
val landedHashed = model.transform(anotherDf)

val commonColumns = df.columns.toSet & anotherDf.columns.toSet

/**
 * Converts column name from a data frame to the column of resulting 
dataset.
 */
def convertColumn(datasetName: String)(columnName: String): Column = {
val newName =
if (commonColumns.contains(columnName)) 
s"$datasetName${columnName.capitalize}"
else columnName

col(s"$datasetName.$columnName") as newName
}

val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
  anotherDf.columns.map(convertColumn("datasetB"))

val result = model
.stages
.last
.asInstanceOf[MinHashLSHModel]
.approxSimilarityJoin(storedHashed, landedHashed, threshold, 
"confidence")
.select(columnsToSelect.toSeq: _*)

result.show(false)

result
}
{code}


Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the
{{distinct}} method to one data frame:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. 
Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
defined function (LSHModel$$Lambda$3769/0x000101804840: 
(struct,values:array>) => 
array,values:array>>).
  ... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at 
least 1 non zero entry.
  at scala.Predef$.require(Predef.scala:281)
  at 
org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
  at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
  ... many more
{code}
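
For reference, the requirement that fails here can be reproduced in isolation. The sketch below is standalone and not part of the pipeline above (it assumes a plain spark-shell/Zeppelin session where {{spark}} and its implicits are available): it feeds a single all-zero sparse vector, i.e. what HashingTF produces when there are no n-grams at all, to a fitted MinHashLSH model and hits the same exception. This suggests the UDF ends up being evaluated on a row whose hashed n-gram vector has no non-zero entries.

{code:scala}
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors

import spark.implicits._

// One row whose "vectors" column is an all-zero sparse vector.
val zeroVectorDf = Seq(Tuple1(Vectors.sparse(3, Seq.empty[(Int, Double)]))).toDF("vectors")

val lshModel = new MinHashLSH()
  .setInputCol("vectors")
  .setOutputCol("lsh")
  .fit(zeroVectorDf)

// Throws java.lang.IllegalArgumentException:
//   requirement failed: Must have at least 1 non zero entry.
lshModel.transform(zeroVectorDf).show(false)
{code}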



Now let's take a look at an example which is close to our application code.
Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions.{transform, to_timestamp}


def process1(df: DataFrame): Unit = {
val companies = df.select($"id", $"name")

val directors = df
.select(explode($"directors"))
.select($"col.name", $"col.id")
.dropDuplicates("id")

val toBeMatched1 = companies
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "sourceLegalEntityId",
)

val toBeMatched2 = directors
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "directorId",
)

similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6)
}

def process2(df: DataFrame): Unit = {
def process_financials(column: Column): Column = {
transform(
column,
x => x.withField("date", to_timestamp(x("date"), "dd MMM 

[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Description: 
We designed a function that joins two DFs on a common column with some
similarity. All code below is in Scala 2.12.

I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}

/**
 * Joins two data frames on a string column using LSH algorithm
 * for similarity computation.
 *
 * If input data frames have columns with identical names,
 * the resulting dataframe will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both dataframes have a column with name `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
df: DataFrame,
anotherDf: DataFrame,
joinExpr: String,
threshold: Double = 0.8,
): DataFrame = {
df.show(false)
anotherDf.show(false)

val pipeline = new Pipeline().setStages(Array(
new RegexTokenizer()
.setPattern("")
.setMinTokenLength(1)
.setInputCol(joinExpr)
.setOutputCol("tokens"),
new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
)
)

val model = pipeline.fit(df)

val storedHashed = model.transform(df)
val landedHashed = model.transform(anotherDf)

val commonColumns = df.columns.toSet & anotherDf.columns.toSet

/**
 * Converts column name from a data frame to the column of resulting 
dataset.
 */
def convertColumn(datasetName: String)(columnName: String): Column = {
val newName =
if (commonColumns.contains(columnName)) 
s"$datasetName${columnName.capitalize}"
else columnName

col(s"$datasetName.$columnName") as newName
}

val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
  anotherDf.columns.map(convertColumn("datasetB"))

val result = model
.stages
.last
.asInstanceOf[MinHashLSHModel]
.approxSimilarityJoin(storedHashed, landedHashed, threshold, 
"confidence")
.select(columnsToSelect.toSeq: _*)

result.show(false)

result
}
{code}


Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the
{{distinct}} method to one data frame:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. 
Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
defined function (LSHModel$$Lambda$3769/0x000101804840: 
(struct,values:array>) => 
array,values:array>>).
  ... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at 
least 1 non zero entry.
  at scala.Predef$.require(Predef.scala:281)
  at 
org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
  at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
  ... many more
{code}



Now let's take a look at an example which is close to our application code.
Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions.{transform, to_timestamp}


def process1(df: DataFrame): Unit = {
val companies = df.select($"id", $"name")

val directors = df
.select(explode($"directors"))
.select($"col.name", $"col.id")
.dropDuplicates("id")

val toBeMatched1 = companies
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "sourceLegalEntityId",
)

val toBeMatched2 = directors
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "directorId",
)

similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6)
}

def process2(df: DataFrame): Unit = {
def process_financials(column: Column): Column = {
transform(
column,
x => x.withField("date", to_timestamp(x("date"), "dd MMM 

[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Description: 
We designed a function that joins two DFs on a common column with some
similarity. All code below is in Scala 2.12.

I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}

/**
 * Joins two data frames on a string column using LSH algorithm
 * for similarity computation.
 *
 * If input data frames have columns with identical names,
 * the resulting dataframe will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both dataframes have a column with name `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
df: DataFrame,
anotherDf: DataFrame,
joinExpr: String,
threshold: Double = 0.8,
): DataFrame = {
df.show(false)
anotherDf.show(false)

val pipeline = new Pipeline().setStages(Array(
new RegexTokenizer()
.setPattern("")
.setMinTokenLength(1)
.setInputCol(joinExpr)
.setOutputCol("tokens"),
new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
)
)

val model = pipeline.fit(df)

val storedHashed = model.transform(df)
val landedHashed = model.transform(anotherDf)

val commonColumns = df.columns.toSet & anotherDf.columns.toSet

/**
 * Converts column name from a data frame to the column of resulting 
dataset.
 */
def convertColumn(datasetName: String)(columnName: String): Column = {
val newName =
if (commonColumns.contains(columnName)) 
s"$datasetName${columnName.capitalize}"
else columnName

col(s"$datasetName.$columnName") as newName
}

val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
  anotherDf.columns.map(convertColumn("datasetB"))

val result = model
.stages
.last
.asInstanceOf[MinHashLSHModel]
.approxSimilarityJoin(storedHashed, landedHashed, threshold, 
"confidence")
.select(columnsToSelect.toSeq: _*)

result.show(false)

result
}
{code}


Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the
{{distinct}} method to one data frame:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. 
Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
defined function (LSHModel$$Lambda$3769/0x000101804840: 
(struct,values:array>) => 
array,values:array>>).
  ... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at 
least 1 non zero entry.
  at scala.Predef$.require(Predef.scala:281)
  at 
org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
  at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
  ... many more
{code}



Now let's take a look at an example which is close to our application code.
Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions.{transform, to_timestamp}


def process1(df: DataFrame): Unit = {
val companies = df.select($"id", $"name")

val directors = df
.select(explode($"directors"))
.select($"col.name", $"col.id")
.dropDuplicates("id")

val toBeMatched1 = companies
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "sourceLegalEntityId",
)

val toBeMatched2 = directors
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "directorId",
)

similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6)
}

def process2(df: DataFrame): Unit = {
def process_financials(column: Column): Column = {
transform(
column,
x => x.withField("date", to_timestamp(x("date"), "dd MMM 

[jira] [Created] (SPARK-43515) Add a benchmark to evaluate the performance of migrating a large number of tiny blocks during decommission

2023-05-15 Thread Xingbo Jiang (Jira)
Xingbo Jiang created SPARK-43515:


 Summary: Add a benchmark to evaluate the performance of migrating 
a large number of tiny blocks during decommission
 Key: SPARK-43515
 URL: https://issues.apache.org/jira/browse/SPARK-43515
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Xingbo Jiang
Assignee: Xingbo Jiang


This is a follow-up to SPARK-43043: we should add a benchmark to evaluate
the performance of migrating a large number of tiny blocks during decommission.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Description: 
We designed a function that joins two DFs on a common column with some
similarity. All code below is in Scala 2.12.

I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}

/**
 * Joins two data frames on a string column using LSH algorithm
 * for similarity computation.
 *
 * If input dataframes have columns with identical names,
 * the resulting dataframe will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both dataframes have a column with name `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
df: DataFrame,
anotherDf: DataFrame,
joinExpr: String,
threshold: Double = 0.8,
): DataFrame = {
df.show(false)
anotherDf.show(false)

val pipeline = new Pipeline().setStages(Array(
new RegexTokenizer()
.setPattern("")
.setMinTokenLength(1)
.setInputCol(joinExpr)
.setOutputCol("tokens"),
new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
)
)

val model = pipeline.fit(df)

val storedHashed = model.transform(df)
val landedHashed = model.transform(anotherDf)

val commonColumns = df.columns.toSet & anotherDf.columns.toSet

/**
 * Converts column name from a dataframe to the column of resulting dataset.
 */
def convertColumn(datasetName: String)(columnName: String): Column = {
val newName =
if (commonColumns.contains(columnName)) 
s"$datasetName${columnName.capitalize}"
else columnName

col(s"$datasetName.$columnName") as newName
}

val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
  anotherDf.columns.map(convertColumn("datasetB"))

val result = model
.stages
.last
.asInstanceOf[MinHashLSHModel]
.approxSimilarityJoin(storedHashed, landedHashed, threshold, 
"confidence")
.select(columnsToSelect.toSeq: _*)

result.show(false)

result
}
{code}


Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the
{{distinct}} method to one dataframe:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. 
Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
defined function (LSHModel$$Lambda$3769/0x000101804840: 
(struct,values:array>) => 
array,values:array>>).
  ... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at 
least 1 non zero entry.
  at scala.Predef$.require(Predef.scala:281)
  at 
org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
  at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
  ... many more
{code}



Now let's take a look at an example which is close to our application code.
Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions.{transform, to_timestamp}


def process1(df: DataFrame): Unit = {
val companies = df.select($"id", $"name")

val directors = df
.select(explode($"directors"))
.select($"col.name", $"col.id")
.dropDuplicates("id")

val toBeMatched1 = companies
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "sourceLegalEntityId",
)

val toBeMatched2 = directors
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "directorId",
)

similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6)
}

def process2(df: DataFrame): Unit = {
def process_financials(column: Column): Column = {
transform(
column,
x => x.withField("date", to_timestamp(x("date"), "dd MMM ")),

[jira] [Created] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-15 Thread Svyatoslav Semenyuk (Jira)
Svyatoslav Semenyuk created SPARK-43514:
---

 Summary: Unexpected NullPointerException or 
IllegalArgumentException inside UDFs of ML features caused by certain SQL 
functions
 Key: SPARK-43514
 URL: https://issues.apache.org/jira/browse/SPARK-43514
 Project: Spark
  Issue Type: Bug
  Components: ML, SQL
Affects Versions: 3.4.0, 3.3.1
 Environment: Scala version: 2.12.17

Test examples were executed inside Zeppelin 0.10.1; Spark 3.3.1 deployed on a
cluster was used to check the issue with real data.
Reporter: Svyatoslav Semenyuk


We designed a function that joins two DFs on a common column with some
similarity. All code below is in Scala 2.12.

I've added `show` calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}

/**
 * Joins two data frames on a string column using LSH algorithm
 * for similarity computation.
 *
 * If input dataframes have columns with identical names,
 * the resulting dataframe will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both dataframes have a column with name `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
df: DataFrame,
anotherDf: DataFrame,
joinExpr: String,
threshold: Double = 0.8,
): DataFrame = {
df.show(false)
anotherDf.show(false)

val pipeline = new Pipeline().setStages(Array(
new RegexTokenizer()
.setPattern("")
.setMinTokenLength(1)
.setInputCol(joinExpr)
.setOutputCol("tokens"),
new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
)
)

val model = pipeline.fit(df)

val storedHashed = model.transform(df)
val landedHashed = model.transform(anotherDf)

val commonColumns = df.columns.toSet & anotherDf.columns.toSet

/**
 * Converts column name from a dataframe to the column of resulting dataset.
 */
def convertColumn(datasetName: String)(columnName: String): Column = {
val newName =
if (commonColumns.contains(columnName)) 
s"$datasetName${columnName.capitalize}"
else columnName

col(s"$datasetName.$columnName") as newName
}

val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
  anotherDf.columns.map(convertColumn("datasetB"))

val result = model
.stages
.last
.asInstanceOf[MinHashLSHModel]
.approxSimilarityJoin(storedHashed, landedHashed, threshold, 
"confidence")
.select(columnsToSelect.toSeq: _*)

result.show(false)

result
}
{code}


Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the
{{distinct}} method to one dataframe:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. 
Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
defined function (LSHModel$$Lambda$3769/0x000101804840: 
(struct,values:array>) => 
array,values:array>>).
  ... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at 
least 1 non zero entry.
  at scala.Predef$.require(Predef.scala:281)
  at 
org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
  at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
  ... many more
{code}



Now let's take a look at an example which is close to our application code.
Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions.{transform, to_timestamp}


def process1(df: DataFrame): Unit = {
val companies = df.select($"id", $"name")

val directors = df
.select(explode($"directors"))
.select($"col.name", $"col.id")
.dropDuplicates("id")

val toBeMatched1 = companies
.filter(length($"name") > 2)
.select(
$"name",
$"id" as "sourceLegalEntityId",
)

val 

[jira] [Updated] (SPARK-43513) withColumnRenamed duplicates columns if new column already exists

2023-05-15 Thread Frederik Paradis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederik Paradis updated SPARK-43513:
-
Summary: withColumnRenamed duplicates columns if new column already exists  
(was: withColumnRenamed duplicates columns if new column already exist)

> withColumnRenamed duplicates columns if new column already exists
> -
>
> Key: SPARK-43513
> URL: https://issues.apache.org/jira/browse/SPARK-43513
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Frederik Paradis
>Priority: Major
>
> withColumnRenamed should either replace the column when the new column already
> exists or document this behavior explicitly. See the code
> below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", 
> "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r)  # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could 
> be: score, score.
> print(r.select("score"))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43513) withColumnRenamed duplicate columns if new column already exist

2023-05-15 Thread Frederik Paradis (Jira)
Frederik Paradis created SPARK-43513:


 Summary: withColumnRenamed duplicate columns if new column already 
exist
 Key: SPARK-43513
 URL: https://issues.apache.org/jira/browse/SPARK-43513
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Frederik Paradis


withColumnRenamed should either replace the column when the new column already
exists or document this behavior explicitly. See the code
below as an example of the current state.

{code:python}
from pyspark.sql import SparkSession

spark = 
SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()

df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", 
"test_score"])
r = df.withColumnRenamed("test_score", "score")
print(r)  # DataFrame[id: bigint, score: double, score: double]

# pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could 
be: score, score.
print(r.select("score"))
{code}
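
The report is about PySpark, but the underlying Dataset API shows the same duplication, so as a stop-gap a defensive rename that drops any pre-existing target column avoids the ambiguity. A Scala sketch (the helper name is illustrative; it assumes discarding the old {{score}} column is acceptable):

{code:scala}
import org.apache.spark.sql.DataFrame

import spark.implicits._

// Rename `from` to `to`, dropping an existing `to` column first so the
// result never contains two columns with the same name.
def renameReplacing(df: DataFrame, from: String, to: String): DataFrame =
  df.drop(to).withColumnRenamed(from, to)

val df = Seq((1, 0.5, 0.4), (2, 0.5, 0.8)).toDF("id", "score", "test_score")

// Only one `score` column remains, so select("score") is unambiguous.
val r = renameReplacing(df, "test_score", "score")
r.select("score").show()
{code}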



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43513) withColumnRenamed duplicates columns if new column already exist

2023-05-15 Thread Frederik Paradis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederik Paradis updated SPARK-43513:
-
Summary: withColumnRenamed duplicates columns if new column already exist  
(was: withColumnRenamed duplicate columns if new column already exist)

> withColumnRenamed duplicates columns if new column already exist
> 
>
> Key: SPARK-43513
> URL: https://issues.apache.org/jira/browse/SPARK-43513
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Frederik Paradis
>Priority: Major
>
> withColumnRenamed should either replace the column when the new column already
> exists or document this behavior explicitly. See the code
> below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", 
> "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r)  # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could 
> be: score, score.
> print(r.select("score"))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade

2023-05-15 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-43512:
--

 Summary: Update stateStoreOperationsBenchmark to allow rocksdb jni 
upgrade
 Key: SPARK-43512
 URL: https://issues.apache.org/jira/browse/SPARK-43512
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Anish Shrigondekar


Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43205) Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier

2023-05-15 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722894#comment-17722894
 ] 

Hudson commented on SPARK-43205:


User 'srielau' has created a pull request for this issue:
https://github.com/apache/spark/pull/41007

> Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
> ---
>
> Key: SPARK-43205
> URL: https://issues.apache.org/jira/browse/SPARK-43205
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Serge Rielau
>Priority: Major
>
> There is a requirement for SQL templates, where the table and/or column names 
> are provided through substitution. This can be done today using variable 
> substitution:
> SET hivevar:tabname = mytab;
> SELECT * FROM ${ hivevar:tabname };
> A straight variable substitution is dangerous since it does allow for SQL 
> injection:
> SET hivevar:tabname = mytab, someothertab;
> SELECT * FROM ${ hivevar:tabname };
> A way to get around this problem is to wrap the variable substitution with a 
> clause that limits the scope to producing an identifier.
> This approach is taken by Snowflake:
>  
> [https://docs.snowflake.com/en/sql-reference/session-variables#using-variables-in-sql]
> SET hivevar:tabname = 'tabname';
> SELECT * FROM IDENTIFIER(${ hivevar:tabname })



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43511) Implemented State APIs for Spark Connect Scala

2023-05-15 Thread Bo Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722891#comment-17722891
 ] 

Bo Gao commented on SPARK-43511:


Created PR https://github.com/apache/spark/pull/40959

> Implemented State APIs for Spark Connect Scala
> --
>
> Key: SPARK-43511
> URL: https://issues.apache.org/jira/browse/SPARK-43511
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Bo Gao
>Priority: Major
>
> Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark 
> Connect Scala



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43511) Implemented State APIs for Spark Connect Scala

2023-05-15 Thread Bo Gao (Jira)
Bo Gao created SPARK-43511:
--

 Summary: Implemented State APIs for Spark Connect Scala
 Key: SPARK-43511
 URL: https://issues.apache.org/jira/browse/SPARK-43511
 Project: Spark
  Issue Type: Task
  Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Reporter: Bo Gao


Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark 
Connect Scala
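
For context, a minimal sketch of the API shape this ticket exposes through the Spark Connect Scala client. This is ordinary Dataset API usage in batch mode with illustrative class and column names, not code from the ticket itself:

{code:scala}
import org.apache.spark.sql.streaming.GroupState

import spark.implicits._

case class Click(user: String, count: Int)

val clicks = Seq(Click("alice", 1), Click("alice", 2), Click("bob", 5)).toDS()

// Sum clicks per user, keeping the running total in GroupState.
val totals = clicks
  .groupByKey(_.user)
  .mapGroupsWithState[Int, (String, Int)] { (user, events, state: GroupState[Int]) =>
    val total = state.getOption.getOrElse(0) + events.map(_.count).sum
    state.update(total)
    (user, total)
  }

totals.show()
{code}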



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43442) Split test module `pyspark_pandas_connect`

2023-05-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43442.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41127
[https://github.com/apache/spark/pull/41127]

> Split test module `pyspark_pandas_connect`
> --
>
> Key: SPARK-43442
> URL: https://issues.apache.org/jira/browse/SPARK-43442
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43442) Split test module `pyspark_pandas_connect`

2023-05-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43442:
-

Assignee: Ruifeng Zheng

> Split test module `pyspark_pandas_connect`
> --
>
> Key: SPARK-43442
> URL: https://issues.apache.org/jira/browse/SPARK-43442
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adds running executors after processing completed containers

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Summary: Spark application hangs when YarnAllocator adds running executors 
after processing completed containers  (was: Spark application hangs when 
YarnAllocator adding running executors after processing completed containers)

> Spark application hangs when YarnAllocator adds running executors after 
> processing completed containers
> ---
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see application hangs when containers are preempted immediately after 
> allocation as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find executorId for the 
> container when processing completed containers. The only plausible cause is that
> YarnAllocator added the running executor after processing completed 
> containers. The former happens in a separate thread after executor launch.
> YarnAllocator believes there are still running executors, although they are 
> already lost due to preemption. Hence, the application hangs without any 
> running executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Description: 
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted.{code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is that
YarnAllocator added the running executor after processing completed containers. 
The former happens in a separate thread after executor launch.

YarnAllocator believes there are still running executors, although they are 
already lost due to preemption. Hence, the application hangs without any 
running executors.

  was:
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted.{code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator processing completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.

YarnAllocator thought


> Spark application hangs when YarnAllocator adding running executors after 
> processing completed containers
> -
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see application hangs when containers are preempted immediately after 
> allocation as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find executorId for the 
> container when processing completed containers. The only plausible cause is that
> YarnAllocator added the running executor after processing completed 
> containers. The former happens in a separate thread after executor launch.
> YarnAllocator believes there are still running executors, although they are 
> already lost due to preemption. Hence, the application hangs without any 
> running executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Summary: Spark application hangs when YarnAllocator adding running 
executors after processing completed containers  (was: Spark application hangs 
when YarnAllocator processing completed containers before updating internal 
state)

> Spark application hangs when YarnAllocator adding running executors after 
> processing completed containers
> -
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see application hangs when containers are preempted immediately after 
> allocation as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find executorId for the 
> container when processing completed containers. The only plausible cause is 
> YarnAllocator processing completed container before updating internal state 
> and adding the executorId. The latter happens in a separate thread after 
> executor launch.
> YarnAllocator thought



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator processing completed containers before updating internal state

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Description: 
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted.{code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator processing completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.

YarnAllocator thought

  was:
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted. {code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator processing completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.


> Spark application hangs when YarnAllocator processing completed containers 
> before updating internal state
> -
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see application hangs when containers are preempted immediately after 
> allocation as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find executorId for the 
> container when processing completed containers. The only plausible cause is 
> YarnAllocator processing completed container before updating internal state 
> and adding the executorId. The latter happens in a separate thread after 
> executor launch.
> YarnAllocator thought



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43494) Directly call `replicate()` instead of reflection in `SparkHadoopUtil#createFile`

2023-05-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-43494:


Assignee: Yang Jie

> Directly call `replicate()` instead of reflection in 
> `SparkHadoopUtil#createFile`
> -
>
> Key: SPARK-43494
> URL: https://issues.apache.org/jira/browse/SPARK-43494
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43494) Directly call `replicate()` instead of reflection in `SparkHadoopUtil#createFile`

2023-05-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-43494.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41164
[https://github.com/apache/spark/pull/41164]

> Directly call `replicate()` instead of reflection in 
> `SparkHadoopUtil#createFile`
> -
>
> Key: SPARK-43494
> URL: https://issues.apache.org/jira/browse/SPARK-43494
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43510) Spark application hangs when YarnAllocator processing completed containers before updating internal state

2023-05-15 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-43510:
--

 Summary: Spark application hangs when YarnAllocator processing 
completed containers before updating internal state
 Key: SPARK-43510
 URL: https://issues.apache.org/jira/browse/SPARK-43510
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 3.4.0
Reporter: Manu Zhang


I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted. {code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is that
YarnAllocator processed the completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.
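
To make the suspected ordering concrete, below is a schematic sketch in plain Scala. It is not Spark's actual YarnAllocator code and the names are illustrative; it only shows how a completion event processed before the launcher thread registers the executor id leaves a phantom "running" executor behind:

{code:scala}
import java.util.concurrent.ConcurrentHashMap

object AllocatorRaceSketch {
  // Illustrative stand-ins for the allocator's internal bookkeeping.
  private val containerToExecutor = new ConcurrentHashMap[String, String]()
  private val runningExecutors = ConcurrentHashMap.newKeySet[String]()

  // Launcher thread: registers the executor some time after allocation.
  def onExecutorLaunched(containerId: String, executorId: String): Unit = {
    containerToExecutor.put(containerId, executorId)
    runningExecutors.add(executorId)
  }

  // Allocation thread: handles the completed (here, preempted) container.
  def onContainerCompleted(containerId: String): Unit =
    Option(containerToExecutor.get(containerId)) match {
      case Some(executorId) => runningExecutors.remove(executorId)
      case None =>
        // Same situation as the "Cannot find executorId for container" warning:
        // registration has not happened yet, so nothing gets removed.
        println(s"Cannot find executorId for container: $containerId")
    }

  def main(args: Array[String]): Unit = {
    onContainerCompleted("container_000382")         // completion processed first
    onExecutorLaunched("container_000382", "277")    // registration happens later
    println(s"running executors: $runningExecutors") // still contains 277
  }
}
{code}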



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43223) KeyValueGroupedDataset#agg

2023-05-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-43223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-43223.
---
Fix Version/s: 3.5.0
 Assignee: Zhen Li
   Resolution: Fixed

> KeyValueGroupedDataset#agg
> --
>
> Key: SPARK-43223
> URL: https://issues.apache.org/jira/browse/SPARK-43223
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
> Fix For: 3.5.0
>
>
> Adding missing agg functions in the KVGDS API



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43461) Skip compiling useless files when making distribution

2023-05-15 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43461:

Summary: Skip compiling useless files when making distribution  (was: Skip 
compiling javadoc.jar and sources.jar when making distribution)

> Skip compiling useless files when making distribution
> -
>
> Key: SPARK-43461
> URL: https://issues.apache.org/jira/browse/SPARK-43461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> -Dmaven.javadoc.skip=true to skip java doc
> -Dskip=true to skip scala doc. Please see: 
> https://davidb.github.io/scala-maven-plugin/doc-jar-mojo.html#skip
> -Dmaven.source.skip to skip building sources.jar
> -Dmaven.test.skip to skip building the test-jar
> -Dcyclonedx.skip=true to skip making bom. Please see: 
> https://cyclonedx.github.io/cyclonedx-maven-plugin/makeBom-mojo.html#skip



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43508) Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-43508:


Assignee: BingKun Pan

> Replace the link related to hadoop version 2 with hadoop version 3
> --
>
> Key: SPARK-43508
> URL: https://issues.apache.org/jira/browse/SPARK-43508
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43508) Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43508:
-
Priority: Trivial  (was: Minor)

> Replace the link related to hadoop version 2 with hadoop version 3
> --
>
> Key: SPARK-43508
> URL: https://issues.apache.org/jira/browse/SPARK-43508
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43508) Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43508.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41171
[https://github.com/apache/spark/pull/41171]

> Replace the link related to hadoop version 2 with hadoop version 3
> --
>
> Key: SPARK-43508
> URL: https://issues.apache.org/jira/browse/SPARK-43508
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43509) Support creating multiple sessions for Spark Connect in PySpark

2023-05-15 Thread Martin Grund (Jira)
Martin Grund created SPARK-43509:


 Summary: Support creating multiple sessions for Spark Connect in 
PySpark
 Key: SPARK-43509
 URL: https://issues.apache.org/jira/browse/SPARK-43509
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Martin Grund






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43495) Upgrade RoaringBitmap to 0.9.44

2023-05-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43495:
-
Priority: Minor  (was: Major)

> Upgrade RoaringBitmap to 0.9.44
> ---
>
> Key: SPARK-43495
> URL: https://issues.apache.org/jira/browse/SPARK-43495
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.40]
>  * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.41]
>  * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.44]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43495) Upgrade RoaringBitmap to 0.9.44

2023-05-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43495.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41165
[https://github.com/apache/spark/pull/41165]

> Upgrade RoaringBitmap to 0.9.44
> ---
>
> Key: SPARK-43495
> URL: https://issues.apache.org/jira/browse/SPARK-43495
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.40]
>  * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.41]
>  * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.44]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43495) Upgrade RoaringBitmap to 0.9.44

2023-05-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-43495:


Assignee: Yang Jie

> Upgrade RoaringBitmap to 0.9.44
> ---
>
> Key: SPARK-43495
> URL: https://issues.apache.org/jira/browse/SPARK-43495
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.40]
>  * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.41]
>  * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.44]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43359) DELETE from Hive table result in INTERNAL error

2023-05-15 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722759#comment-17722759
 ] 

BingKun Pan commented on SPARK-43359:
-

Let me fix it.

> DELETE from Hive table result in INTERNAL error
> ---
>
> Key: SPARK-43359
> URL: https://issues.apache.org/jira/browse/SPARK-43359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Minor
>
> spark-sql (default)> CREATE TABLE T1(c1 INT);
> spark-sql (default)> DELETE FROM T1 WHERE c1 = 1;
> [INTERNAL_ERROR] Unexpected table relation: HiveTableRelation 
> [`spark_catalog`.`default`.`t1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], 
> Partition Cols: []]
> org.apache.spark.SparkException: [INTERNAL_ERROR] Unexpected table relation: 
> HiveTableRelation [`spark_catalog`.`default`.`t1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], 
> Partition Cols: []]
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:77)
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:81)
>   at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy.apply(DataSourceV2Strategy.scala:310)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43508) Replace the link related to hadoop version 2 with hadoop version 3

2023-05-15 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43508:

Summary: Replace the link related to hadoop version 2 with hadoop version 3 
 (was: Replace the link related to hadoop version 2 with Hadoop version 3)

> Replace the link related to hadoop version 2 with hadoop version 3
> --
>
> Key: SPARK-43508
> URL: https://issues.apache.org/jira/browse/SPARK-43508
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43508) Replace the link related to hadoop version 2 with Hadoop version 3

2023-05-15 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43508:

Summary: Replace the link related to hadoop version 2 with Hadoop version 3 
 (was: Replace the link related to Hadoop version 2 with Hadoop version 3)

> Replace the link related to hadoop version 2 with Hadoop version 3
> --
>
> Key: SPARK-43508
> URL: https://issues.apache.org/jira/browse/SPARK-43508
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43508) Replace the link related to Hadoop version 2 with Hadoop version 3

2023-05-15 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43508:
---

 Summary: Replace the link related to Hadoop version 2 with Hadoop 
version 3
 Key: SPARK-43508
 URL: https://issues.apache.org/jira/browse/SPARK-43508
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43485) Confused errors from the DATEADD function

2023-05-15 Thread Nikita Awasthi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722749#comment-17722749
 ] 

Nikita Awasthi commented on SPARK-43485:


User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/41143

> Confused errors from the DATEADD function
> -
>
> Key: SPARK-43485
> URL: https://issues.apache.org/jira/browse/SPARK-43485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.5.0
>
>
> The code example illustrates the issue:
> {code:sql}
> spark-sql (default)> select dateadd('MONTH', 1, date'2023-05-11');
> [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `dateadd` requires 2 parameters but 
> the actual number is 3. Please, refer to 
> 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix.; 
> line 1 pos 7
> {code}
> The error complains about the number of arguments passed to DATEADD, but the 
> actual issue is the type of the first argument.
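For illustration only (run in spark-shell with the usual `spark` session; the expected results are my reading of the docs, not output from the ticket): the two-argument, day-based form of `dateadd` succeeds, while unit-based arithmetic like the query above goes through `timestampadd` instead.

{code:scala}
// dateadd(start_date, num_days): the two-argument form the error message refers to
spark.sql("SELECT dateadd(date'2023-05-11', 1)").show()             // expect 2023-05-12

// unit-based arithmetic uses timestampadd(unit, quantity, expr)
spark.sql("SELECT timestampadd(MONTH, 1, date'2023-05-11')").show() // expect 2023-06-11 00:00:00
{code}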



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43485) Confused errors from the DATEADD function

2023-05-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43485.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41143
[https://github.com/apache/spark/pull/41143]

> Confused errors from the DATEADD function
> -
>
> Key: SPARK-43485
> URL: https://issues.apache.org/jira/browse/SPARK-43485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.5.0
>
>
> The code example illustrates the issue:
> {code:sql}
> spark-sql (default)> select dateadd('MONTH', 1, date'2023-05-11');
> [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `dateadd` requires 2 parameters but 
> the actual number is 3. Please, refer to 
> 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix.; 
> line 1 pos 7
> {code}
> The error complains about the number of arguments passed to DATEADD, but the 
> actual issue is the type of the first argument.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43493) Add a max distance argument to the levenshtein() function

2023-05-15 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722720#comment-17722720
 ] 

BingKun Pan edited comment on SPARK-43493 at 5/15/23 9:46 AM:
--

Let me implement it in `sql` first; I will implement it in `connect` later.


was (Author: panbingkun):
Let me implements it at `sql` first, I will implements it ad `connect` later.

> Add a max distance argument to the levenshtein() function
> -
>
> Key: SPARK-43493
> URL: https://issues.apache.org/jira/browse/SPARK-43493
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Currently, Spark's levenshtein(str1, str2) function can be very inefficient 
> for long strings. Many other databases which support this type of built-in 
> function also take a third argument which signifies a maximum distance after 
> which it is okay to terminate the algorithm.
> For example something like
> {code:sql}
> levenshtein(str1, str2[, max_distance])
> {code}
> the function stops computing the distance once the max value is reached.
> See PostgreSQL for an example of a 3-argument 
> [levenshtein|https://www.postgresql.org/docs/current/fuzzystrmatch.html#id-1.11.7.26.7].
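A minimal sketch of the early-termination idea in plain Scala (illustration only; this is neither Spark's UTF8String implementation nor the patch for this ticket, and the function name is made up): every edit path from the start of both strings to the end crosses every DP row, so once the whole current row exceeds the threshold the final distance must exceed it too, and the loop can stop.

{code:scala}
// Returns the exact distance if it is <= maxDistance, otherwise maxDistance + 1.
def levenshteinWithMax(s: String, t: String, maxDistance: Int): Int = {
  var prev = Array.tabulate(t.length + 1)(identity) // distances for the empty prefix of s
  var curr = new Array[Int](t.length + 1)
  for (i <- 1 to s.length) {
    curr(0) = i
    for (j <- 1 to t.length) {
      val cost = if (s(i - 1) == t(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    // early exit: the final distance can never drop below the current row minimum
    if (curr.min > maxDistance) return maxDistance + 1
    val tmp = prev; prev = curr; curr = tmp
  }
  if (prev(t.length) > maxDistance) maxDistance + 1 else prev(t.length)
}
{code}

A banded variant that only fills cells within maxDistance of the diagonal would cut the work further, but the row-minimum check is enough to show the cutoff semantics suggested above.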



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43493) Add a max distance argument to the levenshtein() function

2023-05-15 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722720#comment-17722720
 ] 

BingKun Pan commented on SPARK-43493:
-

Let me implement it in `sql` first; I will implement it in `connect` later.

> Add a max distance argument to the levenshtein() function
> -
>
> Key: SPARK-43493
> URL: https://issues.apache.org/jira/browse/SPARK-43493
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Currently, Spark's levenshtein(str1, str2) function can be very inefficient 
> for long strings. Many other databases which support this type of built-in 
> function also take a third argument which signifies a maximum distance after 
> which it is okay to terminate the algorithm.
> For example something like
> {code:sql}
> levenshtein(str1, str2[, max_distance])
> {code}
> the function stops computing the distance once the max value is reached.
> See PostgreSQL for an example of a 3-argument 
> [levenshtein|https://www.postgresql.org/docs/current/fuzzystrmatch.html#id-1.11.7.26.7].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43493) Add a max distance argument to the levenshtein() function

2023-05-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722713#comment-17722713
 ] 

ASF GitHub Bot commented on SPARK-43493:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41169

> Add a max distance argument to the levenshtein() function
> -
>
> Key: SPARK-43493
> URL: https://issues.apache.org/jira/browse/SPARK-43493
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Currently, Spark's levenshtein(str1, str2) function can be very inefficient 
> for long strings. Many other databases which support this type of built-in 
> function also take a third argument which signifies a maximum distance after 
> which it is okay to terminate the algorithm.
> For example something like
> {code:sql}
> levenshtein(str1, str2[, max_distance])
> {code}
> the function stops computing the distance once the max value is reached.
> See PostgreSQL for an example of a 3-argument 
> [levenshtein|https://www.postgresql.org/docs/current/fuzzystrmatch.html#id-1.11.7.26.7].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43500) Test `DataFrame.drop` with empty column list and names containing dot

2023-05-15 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43500.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41167
[https://github.com/apache/spark/pull/41167]

> Test `DataFrame.drop` with empty column list and names containing dot
> -
>
> Key: SPARK-43500
> URL: https://issues.apache.org/jira/browse/SPARK-43500
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43500) Test `DataFrame.drop` with empty column list and names containing dot

2023-05-15 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43500:
-

Assignee: Ruifeng Zheng

> Test `DataFrame.drop` with empty column list and names containing dot
> -
>
> Key: SPARK-43500
> URL: https://issues.apache.org/jira/browse/SPARK-43500
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43507) The status of free pages in allocatedPages mismatch with pageTable after invoking TaskMemoryManager::cleanUpAllAllocatedMemory

2023-05-15 Thread qian heng (Jira)
qian heng created SPARK-43507:
-

 Summary: The status of free pages in allocatedPages mismatch with 
pageTable after invoking TaskMemoryManager::cleanUpAllAllocatedMemory
 Key: SPARK-43507
 URL: https://issues.apache.org/jira/browse/SPARK-43507
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: qian heng


In TaskMemoryManager, `allocatedPages` is a bitmap for tracking free pages in 
`pageTable`.

However, TaskMemoryManager::cleanUpAllAllocatedMemory fills `pageTable` with 
null but does not clear `allocatedPages`, so the two structures disagree about 
which pages are free.
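A self-contained sketch of the mismatch and one candidate fix (the field names mirror the issue description, but this is an illustration in Scala, not the real Java TaskMemoryManager):

{code:scala}
import java.util.BitSet

object PageTrackingSketch {
  val PAGE_TABLE_SIZE = 8192
  val pageTable = new Array[AnyRef](PAGE_TABLE_SIZE) // stands in for MemoryBlock[]
  val allocatedPages = new BitSet(PAGE_TABLE_SIZE)   // bitmap tracking which slots are taken

  def allocatePage(pageNumber: Int, page: AnyRef): Unit = {
    allocatedPages.set(pageNumber)
    pageTable(pageNumber) = page
  }

  def cleanUpAllAllocatedMemory(): Unit = {
    java.util.Arrays.fill(pageTable, null)
    // Without this line the bitmap keeps reporting the old page numbers as
    // allocated even though pageTable no longer holds them -- the mismatch
    // described above.
    allocatedPages.clear()
  }
}
{code}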



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43507) The status of free pages in allocatedPages mismatch with pageTable after invoking TaskMemoryManager::cleanUpAllAllocatedMemory

2023-05-15 Thread qian heng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qian heng updated SPARK-43507:
--
External issue URL:   (was: https://github.com/apache/spark/pull/40974)

> The status of free pages in allocatedPages mismatch with pageTable after 
> invoking TaskMemoryManager::cleanUpAllAllocatedMemory
> --
>
> Key: SPARK-43507
> URL: https://issues.apache.org/jira/browse/SPARK-43507
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: qian heng
>Priority: Major
>
> In TaskMemoryManager, `allocatedPages` is a bitmap for tracking free pages 
> in `pageTable`.
> However, TaskMemoryManager::cleanUpAllAllocatedMemory fills `pageTable` with 
> null but does not clear `allocatedPages`, so the two structures disagree 
> about which pages are free.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43506) Enable ArrowTests.test_toPandas_empty_columns for pandas 2.0.0.

2023-05-15 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43506:
---

 Summary: Enable ArrowTests.test_toPandas_empty_columns for pandas 
2.0.0.
 Key: SPARK-43506
 URL: https://issues.apache.org/jira/browse/SPARK-43506
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable ArrowTests.test_toPandas_empty_columns for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43505) spark on k8s supports spark.executor.extraLibraryPath

2023-05-15 Thread YE (Jira)
YE created SPARK-43505:
--

 Summary: spark on k8s supports spark.executor.extraLibraryPath 
 Key: SPARK-43505
 URL: https://issues.apache.org/jira/browse/SPARK-43505
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.4.0, 3.3.2, 3.2.3
Reporter: YE


For extraLibraryPath to work as expected, the executor JVM should be started 
with the library path prepended to its environment (assuming Linux), such as:

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${spark.executor.extraLibraryPath} bin/java

However, there are two problems when running on K8S:
 # Spark on K8S doesn't handle `spark.executor.extraLibraryPath` at all
 # K8S currently doesn't have a way to expand environment variables correctly



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Description: 
Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
config map is no longer mounted on the executor side.

Per the [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still needs the hadoop configuration.

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
executor cannot.

So we still need to mount the hadoop config map on the executor side.

  was:
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not mounted in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.


> [K8S] Mount hadoop config map in executor side
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map is no longer mounted on the executor side.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still needs the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot.
> So we still need to mount the hadoop config map on the executor side.
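A hypothetical way to observe the gap from spark-shell (the `zeus` nameservice comes from the screenshot, `sc` is the usual SparkContext, and this is a reproduction sketch rather than the proposed fix): a task that builds a fresh Hadoop Configuration on the executor only sees whatever hdfs-site.xml is on the executor pod, so the logical HA nameservice resolves on the driver but not inside the task.

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Driver side: works, because the driver pod has the Hadoop config map mounted.
FileSystem.get(new URI("hdfs://zeus"), new Configuration()).exists(new Path("/tmp"))

// Executor side: the same lookup inside a task uses a Configuration built from the
// executor's classpath, which has no hdfs-site.xml when the config map is not mounted.
sc.parallelize(Seq("/tmp"), numSlices = 1).foreach { p =>
  val fs = FileSystem.get(new URI("hdfs://zeus"), new Configuration())
  println(s"$p exists: " + fs.exists(new Path(p)))
}
{code}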



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Description: 
Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
config map is no longer mounted on the executor side.

Per the [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still needs the hadoop configuration.

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
executor cannot.

So we still need to mount the hadoop config map on the executor side.

  was:
Since SPARK-25815 [,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.

so we still need to mount the hadoop config map in executor side.


> [K8S] Mount hadoop config map in executor side
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map is no longer mounted on the executor side.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still needs the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot.
> So we still need to mount the hadoop config map on the executor side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722674#comment-17722674
 ] 

Fei Wang commented on SPARK-43504:
--

gentle ping [~vanzin]  [~dongjoon] [~ifilonenko] 

> [K8S] Mount hadoop config map in executor side
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map is no longer mounted on the executor side.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still needs the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot.
> So we still need to mount the hadoop config map on the executor side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-43504:
-
Description: 
Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
config map is no longer mounted on the executor side.

Per the [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still needs the hadoop configuration.

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
executor cannot.

So we still need to mount the hadoop config map on the executor side.

  was:
Since SPARK-25815[,|https://github.com/apache/spark/pull/22911,] the hadoop 
config map is not in executor side.

Per the  
[https://github.com/apache/spark/pull/22911|https://github.com/apache/spark/pull/22911,]
 description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still need the hadoop configuration.

 

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

 

As shown in above picture, the driver can resolve `hdfs://zeus`, but the 
executor can not.


> [K8S] Mount hadoop config map in executor side
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map is no longer mounted on the executor side.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still needs the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot.
> So we still need to mount the hadoop config map on the executor side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43504) [K8S] Mount hadoop config map in executor side

2023-05-15 Thread Fei Wang (Jira)
Fei Wang created SPARK-43504:


 Summary: [K8S] Mount hadoop config map in executor side
 Key: SPARK-43504
 URL: https://issues.apache.org/jira/browse/SPARK-43504
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Fei Wang


Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
config map is no longer mounted on the executor side.

Per the [https://github.com/apache/spark/pull/22911] description:
{code:java}
The main two things that don't need to happen in executors anymore are:
1. adding the Hadoop config to the executor pods: this is not needed
since the Spark driver will serialize the Hadoop config and send
it to executors when running tasks. {code}
But in fact, the executor still needs the hadoop configuration.

!https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!

As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
executor cannot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org