[jira] [Resolved] (SPARK-43502) DataFrame.drop should support empty column
[ https://issues.apache.org/jira/browse/SPARK-43502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43502. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41180 [https://github.com/apache/spark/pull/41180] > DataFrame.drop should support empty column > -- > > Key: SPARK-43502 > URL: https://issues.apache.org/jira/browse/SPARK-43502 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
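The fixed behavior (calling {{drop}} with zero columns should be a no-op rather than an error) can be sketched without Spark. The {{drop}} function below is a hypothetical stand-in operating on rows modeled as plain dicts, not the PySpark API:

```python
# Hypothetical stand-in for DataFrame.drop on rows modeled as dicts:
# with no columns given it returns the data unchanged (the intent of the
# SPARK-43502 fix), instead of raising.
def drop(rows, *columns):
    if not columns:  # empty column list: no-op
        return [dict(r) for r in rows]
    return [{k: v for k, v in r.items() if k not in columns} for r in rows]
```

With one column it behaves like a normal drop; with none, the input passes through untouched.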
[jira] [Updated] (SPARK-43504) [K8S] Mounts the hadoop config map on the executor pod
[ https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43504: - Summary: [K8S] Mounts the hadoop config map on the executor pod (was: [K8S] Mount hadoop config map in executor side) > [K8S] Mounts the hadoop config map on the executor pod > -- > > Key: SPARK-43504 > URL: https://issues.apache.org/jira/browse/SPARK-43504 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop > config map is not mounted on the executor side. > Per the [https://github.com/apache/spark/pull/22911] description: > {code:java} > The main two things that don't need to happen in executors anymore are: > 1. adding the Hadoop config to the executor pods: this is not needed > since the Spark driver will serialize the Hadoop config and send > it to executors when running tasks. {code} > But in fact, the executor still needs the hadoop configuration. > > !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! > > As shown in the picture above, the driver can resolve `hdfs://zeus`, but the > executor cannot, > so we still need to mount the hadoop config map on the executor side.
[jira] [Updated] (SPARK-43504) [K8S] Mounts the hadoop config map on the executor pod
[ https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43504: - Description: Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map will not be mounted on the executor pod. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot, so we still need to mount the hadoop config map on the executor side. was: Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map is not mounted on the executor side. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot, so we still need to mount the hadoop config map on the executor side. 
> [K8S] Mounts the hadoop config map on the executor pod > -- > > Key: SPARK-43504 > URL: https://issues.apache.org/jira/browse/SPARK-43504 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop > config map will not be mounted on the executor pod. > Per the [https://github.com/apache/spark/pull/22911] description: > {code:java} > The main two things that don't need to happen in executors anymore are: > 1. adding the Hadoop config to the executor pods: this is not needed > since the Spark driver will serialize the Hadoop config and send > it to executors when running tasks. {code} > But in fact, the executor still needs the hadoop configuration. > > !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! > > As shown in the picture above, the driver can resolve `hdfs://zeus`, but the > executor cannot, > so we still need to mount the hadoop config map on the executor side.
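Until the config map is mounted on executors, a common workaround is to ship the Hadoop configuration explicitly. A minimal sketch, assuming the documented `spark.kubernetes.hadoop.configMapName` setting; the ConfigMap name, conf directory, and API-server address below are placeholders, not values from the ticket:

```shell
# Placeholder sketch: pre-create a ConfigMap from a local Hadoop conf dir,
# then point Spark at it so HADOOP_CONF_DIR files are mounted on the pods.
kubectl create configmap hadoop-conf --from-file=/etc/hadoop/conf

spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.hadoop.configMapName=hadoop-conf \
  ...
```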
[jira] [Created] (SPARK-43520) Upgrade mysql-connector-java from 8.0.32 to 8.0.33
BingKun Pan created SPARK-43520: --- Summary: Upgrade mysql-connector-java from 8.0.32 to 8.0.33 Key: SPARK-43520 URL: https://issues.apache.org/jira/browse/SPARK-43520 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan
[jira] [Updated] (SPARK-43488) bitmap function
[ https://issues.apache.org/jira/browse/SPARK-43488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43488: - Labels: (was: patch) > bitmap function > --- > > Key: SPARK-43488 > URL: https://issues.apache.org/jira/browse/SPARK-43488 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: yiku123 >Priority: Major > > Maybe Spark needs some bitmap functions, for example bitmapBuild, > bitmapAnd, and bitmapAndCardinality as in ClickHouse or other OLAP engines. > These are often used in user-profiling applications, but I can't find them in Spark.
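The requested ClickHouse-style semantics can be sketched without any engine. This is illustration only, not a Spark API: Python ints serve as bitsets, where a set bit i means "id i is present":

```python
# Sketch only (not a Spark API): ClickHouse-style bitmap functions emulated
# with Python ints as bitsets; bit i set means "id i is present".
def bitmap_build(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i  # set bit i
    return bm

def bitmap_and(a, b):
    return a & b  # intersection of the two id sets

def bitmap_and_cardinality(a, b):
    return bin(a & b).count("1")  # number of ids in the intersection
```

For user profiling, each bitmap would represent the audience of one segment, and the AND-cardinality gives the overlap size.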
[jira] [Commented] (SPARK-43501) `test_ops_on_diff_frames_*` are much slower on Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722969#comment-17722969 ] Hyukjin Kwon commented on SPARK-43501: -- cc [~itholic] > `test_ops_on_diff_frames_*` are much slower on Spark Connect > > > Key: SPARK-43501 > URL: https://issues.apache.org/jira/browse/SPARK-43501 > Project: Spark > Issue Type: Test > Components: Connect, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Resolved] (SPARK-43483) Adds SQL references for OFFSET clause.
[ https://issues.apache.org/jira/browse/SPARK-43483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43483. -- Fix Version/s: 3.5.0 Assignee: jiaan.geng Resolution: Fixed > Adds SQL references for OFFSET clause. > -- > > Key: SPARK-43483 > URL: https://issues.apache.org/jira/browse/SPARK-43483 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0 > > > Spark 3.4.0 released new syntax: the OFFSET clause. > But the SQL reference is missing a description of it.
[jira] [Commented] (SPARK-43483) Adds SQL references for OFFSET clause.
[ https://issues.apache.org/jira/browse/SPARK-43483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722968#comment-17722968 ] Hyukjin Kwon commented on SPARK-43483: -- Fixed in https://github.com/apache/spark/pull/41151 > Adds SQL references for OFFSET clause. > -- > > Key: SPARK-43483 > URL: https://issues.apache.org/jira/browse/SPARK-43483 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > > Spark 3.4.0 released new syntax: the OFFSET clause. > But the SQL reference is missing a description of it.
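The semantics the new reference documents are simple: OFFSET skips a number of rows of the (ordered) result, and LIMIT then caps how many of the remaining rows are returned. A plain-Python sketch over a list, with a hypothetical helper name:

```python
# Plain-Python sketch of SQL LIMIT/OFFSET semantics over an already-ordered
# row list: skip `offset` rows, then return at most `limit` of the rest.
def limit_offset(rows, limit=None, offset=0):
    end = None if limit is None else offset + limit
    return rows[offset:end]
```

So `LIMIT 2 OFFSET 1` over rows 1..5 returns rows 2 and 3, and OFFSET without LIMIT returns everything after the skipped prefix.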
[jira] [Resolved] (SPARK-43482) Expand QueryTerminatedEvent to contain error class if it exists in exception
[ https://issues.apache.org/jira/browse/SPARK-43482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43482. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41150 [https://github.com/apache/spark/pull/41150] > Expand QueryTerminatedEvent to contain error class if it exists in exception > > > Key: SPARK-43482 > URL: https://issues.apache.org/jira/browse/SPARK-43482 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.5.0 > > > As of now, the QueryTerminatedEvent event has an exception field which > is just a string representation of the underlying exception, mostly the stack trace. > That said, people relying on the streaming query listener can't benefit > from the move to the error class framework. This hurts > observability once we migrate exceptions in streaming queries to the error class > framework, where exceptions are well categorized. > We can expand QueryTerminatedEvent: if the exception contains > error class information, we can ship it along with the existing string version > of the exception (message).
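The proposed shape of the expanded event can be sketched as a plain data record; the field names below are illustrative, not Spark's exact API:

```python
# Illustrative sketch (hypothetical field names, not Spark's API): the error
# class rides along only when the underlying exception carries one.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryTerminatedEvent:
    query_id: str
    exception: Optional[str] = None    # stringified exception, as today
    error_class: Optional[str] = None  # new: set when the exception has an error class

evt = QueryTerminatedEvent("q1", exception="stack trace...", error_class="DIVIDE_BY_ZERO")
```

Listeners that only know the old field keep working, while new listeners can branch on the categorized error class instead of parsing the stack trace.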
[jira] [Assigned] (SPARK-43482) Expand QueryTerminatedEvent to contain error class if it exists in exception
[ https://issues.apache.org/jira/browse/SPARK-43482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43482: Assignee: Jungtaek Lim > Expand QueryTerminatedEvent to contain error class if it exists in exception > > > Key: SPARK-43482 > URL: https://issues.apache.org/jira/browse/SPARK-43482 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > As of now, the QueryTerminatedEvent event has an exception field which > is just a string representation of the underlying exception, mostly the stack trace. > That said, people relying on the streaming query listener can't benefit > from the move to the error class framework. This hurts > observability once we migrate exceptions in streaming queries to the error class > framework, where exceptions are well categorized. > We can expand QueryTerminatedEvent: if the exception contains > error class information, we can ship it along with the existing string version > of the exception (message).
[jira] [Created] (SPARK-43519) Bump Parquet 1.13.1
Cheng Pan created SPARK-43519: - Summary: Bump Parquet 1.13.1 Key: SPARK-43519 URL: https://issues.apache.org/jira/browse/SPARK-43519 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Cheng Pan
[jira] [Resolved] (SPARK-43517) Add a migration guide for namedtuple monkey patch
[ https://issues.apache.org/jira/browse/SPARK-43517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43517. -- Fix Version/s: 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 41177 [https://github.com/apache/spark/pull/41177] > Add a migration guide for namedtuple monkey patch > - > > Key: SPARK-43517 > URL: https://issues.apache.org/jira/browse/SPARK-43517 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.5.0, 3.4.1 > > > Should add a migration guide for SPARK-41189 workaround.
[jira] [Assigned] (SPARK-43517) Add a migration guide for namedtuple monkey patch
[ https://issues.apache.org/jira/browse/SPARK-43517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43517: Assignee: Hyukjin Kwon > Add a migration guide for namedtuple monkey patch > - > > Key: SPARK-43517 > URL: https://issues.apache.org/jira/browse/SPARK-43517 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > Should add a migration guide for SPARK-41189 workaround.
[jira] [Assigned] (SPARK-43473) Support struct type in createDataFrame from pandas DataFrame
[ https://issues.apache.org/jira/browse/SPARK-43473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43473: Assignee: Takuya Ueshin > Support struct type in createDataFrame from pandas DataFrame > > > Key: SPARK-43473 > URL: https://issues.apache.org/jira/browse/SPARK-43473 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > Support struct type in createDataFrame from pandas DataFrame with {{Row}} > object or {{dict}}.
[jira] [Resolved] (SPARK-43473) Support struct type in createDataFrame from pandas DataFrame
[ https://issues.apache.org/jira/browse/SPARK-43473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43473. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41149 [https://github.com/apache/spark/pull/41149] > Support struct type in createDataFrame from pandas DataFrame > > > Key: SPARK-43473 > URL: https://issues.apache.org/jira/browse/SPARK-43473 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > > Support struct type in createDataFrame from pandas DataFrame with {{Row}} > object or {{dict}}.
[jira] [Created] (SPARK-43518) Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR
BingKun Pan created SPARK-43518: --- Summary: Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR Key: SPARK-43518 URL: https://issues.apache.org/jira/browse/SPARK-43518 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: BingKun Pan
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Description:

We designed a function that joins two DFs on a common column using some similarity measure. All code below is Scala 2.12. I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}

/**
 * Joins two data frames on a string column using the LSH algorithm
 * for similarity computation.
 *
 * If the input data frames have columns with identical names,
 * the resulting data frame will have columns from them both,
 * prefixed with `datasetA` and `datasetB` respectively.
 *
 * For example, if both data frames have a column named `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
    df: DataFrame,
    anotherDf: DataFrame,
    joinExpr: String,
    threshold: Double = 0.8,
): DataFrame = {
  df.show(false)
  anotherDf.show(false)

  val pipeline = new Pipeline().setStages(Array(
    new RegexTokenizer()
      .setPattern("")
      .setMinTokenLength(1)
      .setInputCol(joinExpr)
      .setOutputCol("tokens"),
    new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
    new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
    new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
  ))

  val model = pipeline.fit(df)
  val storedHashed = model.transform(df)
  val landedHashed = model.transform(anotherDf)
  val commonColumns = df.columns.toSet & anotherDf.columns.toSet

  /** Converts a column name from a data frame to the column of the resulting dataset. */
  def convertColumn(datasetName: String)(columnName: String): Column = {
    val newName =
      if (commonColumns.contains(columnName)) s"$datasetName${columnName.capitalize}"
      else columnName
    col(s"$datasetName.$columnName") as newName
  }

  val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
    anotherDf.columns.map(convertColumn("datasetB"))

  val result = model
    .stages
    .last
    .asInstanceOf[MinHashLSHModel]
    .approxSimilarityJoin(storedHashed, landedHashed, threshold, "confidence")
    .select(columnsToSelect.toSeq: _*)

  result.show(false)
  result
}
{code}

Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the {{distinct}} method to one data frame:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (LSHModel$$Lambda$3769/0x000101804840: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>).
  ... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at least 1 non zero entry.
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
  at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
  ... many more
{code}

Now let's take a look at an example which is close to our application code.

Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions._

def process1(df: DataFrame): Unit = {
  val companies = df.select($"id", $"name")
  val directors = df
    .select(explode($"directors"))
    .select($"col.name", $"col.id")
    .dropDuplicates("id")

  val toBeMatched1 = companies
    .filter(length($"name") > 2)
    .select(
      $"name",
      $"id" as "sourceLegalEntityId",
    )

  val toBeMatched2 = directors
    .filter(length($"name") > 2)
    .select(
      $"name",
      $"id" as "directorId",
    )

  similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6)
}

def process2(df: DataFrame): Unit = {
  def process_financials(column: Column): Column = {
    transform(
      column,
      x => x.withField("date", to_timestamp(x("date"), "dd MMM ")),
    )
  }
[jira] [Resolved] (SPARK-43300) Cascade failure in Guava cache due to fate-sharing
[ https://issues.apache.org/jira/browse/SPARK-43300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-43300. Fix Version/s: 3.5.0 Resolution: Fixed > Cascade failure in Guava cache due to fate-sharing > -- > > Key: SPARK-43300 > URL: https://issues.apache.org/jira/browse/SPARK-43300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ziqi Liu >Assignee: Ziqi Liu >Priority: Major > Fix For: 3.5.0 > > > The Guava cache is widely used in Spark; however, it suffers from fate-sharing > behavior: if there are multiple requests trying to access the same key in the > {{cache}} at the same time when the key is not in the cache, Guava cache will > block all requests and create the object only once. If the creation fails, > all requests fail immediately without retry. So we might see task > failures caused by unrelated failures in other queries due to fate sharing. > This fate-sharing behavior might lead to unexpected results in some situations. > We can wrap the Guava cache with a KeyLock to synchronize all requests > for the same key, so they will run individually and fail as if they came one > at a time.
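The KeyLock idea described above can be sketched in plain Python; the names (`KeyLock`, `get_or_load`) are illustrative, not Spark's implementation. Each caller attempts the load itself under a per-key lock, so one caller's failure is not shared with the next:

```python
# Sketch of per-key locking (hypothetical names, not Spark's code): serialize
# loads for the same key so each caller runs the loader itself and fails
# independently, instead of inheriting another caller's failure.
import threading
from collections import defaultdict

class KeyLock:
    def __init__(self):
        self._guard = threading.Lock()            # protects the lock table
        self._locks = defaultdict(threading.Lock) # one lock per key

    def with_lock(self, key, fn):
        with self._guard:
            lock = self._locks[key]
        with lock:
            return fn()

cache = {}
key_lock = KeyLock()

def get_or_load(key, loader):
    def attempt():
        if key not in cache:
            cache[key] = loader()  # a failing loader raises only for this caller
        return cache[key]
    return key_lock.with_lock(key, attempt)
```

A caller whose loader raises gets the exception itself, but the key is not poisoned: the next caller simply retries the load under the same lock.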
[jira] [Commented] (SPARK-43300) Cascade failure in Guava cache due to fate-sharing
[ https://issues.apache.org/jira/browse/SPARK-43300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722959#comment-17722959 ] Josh Rosen commented on SPARK-43300: Fixed in https://github.com/apache/spark/pull/40982 > Cascade failure in Guava cache due to fate-sharing > -- > > Key: SPARK-43300 > URL: https://issues.apache.org/jira/browse/SPARK-43300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ziqi Liu >Priority: Major > > The Guava cache is widely used in Spark; however, it suffers from fate-sharing > behavior: if there are multiple requests trying to access the same key in the > {{cache}} at the same time when the key is not in the cache, Guava cache will > block all requests and create the object only once. If the creation fails, > all requests fail immediately without retry. So we might see task > failures caused by unrelated failures in other queries due to fate sharing. > This fate-sharing behavior might lead to unexpected results in some situations. > We can wrap the Guava cache with a KeyLock to synchronize all requests > for the same key, so they will run individually and fail as if they came one > at a time.
[jira] [Assigned] (SPARK-43300) Cascade failure in Guava cache due to fate-sharing
[ https://issues.apache.org/jira/browse/SPARK-43300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-43300: -- Assignee: Ziqi Liu > Cascade failure in Guava cache due to fate-sharing > -- > > Key: SPARK-43300 > URL: https://issues.apache.org/jira/browse/SPARK-43300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ziqi Liu >Assignee: Ziqi Liu >Priority: Major > > The Guava cache is widely used in Spark; however, it suffers from fate-sharing > behavior: if there are multiple requests trying to access the same key in the > {{cache}} at the same time when the key is not in the cache, Guava cache will > block all requests and create the object only once. If the creation fails, > all requests fail immediately without retry. So we might see task > failures caused by unrelated failures in other queries due to fate sharing. > This fate-sharing behavior might lead to unexpected results in some situations. > We can wrap the Guava cache with a KeyLock to synchronize all requests > for the same key, so they will run individually and fail as if they came one > at a time.
[jira] [Resolved] (SPARK-43281) Fix concurrent writer does not update file metrics
[ https://issues.apache.org/jira/browse/SPARK-43281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43281. - Fix Version/s: 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 40952 [https://github.com/apache/spark/pull/40952] > Fix concurrent writer does not update file metrics > -- > > Key: SPARK-43281 > URL: https://issues.apache.org/jira/browse/SPARK-43281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.5.0, 3.4.1 > > > It uses the temp file path to get the file status after the commit task. However, the > temp file has already been moved to its new path during the commit task.
[jira] [Assigned] (SPARK-43281) Fix concurrent writer does not update file metrics
[ https://issues.apache.org/jira/browse/SPARK-43281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43281: --- Assignee: XiDuo You > Fix concurrent writer does not update file metrics > -- > > Key: SPARK-43281 > URL: https://issues.apache.org/jira/browse/SPARK-43281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > > It uses the temp file path to get the file status after the commit task. However, the > temp file has already been moved to its new path during the commit task.
[jira] [Assigned] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43413: --- Assignee: Jack Chen > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly marked as non-nullable, even when > they are actually nullable. They correctly check the nullability of the > left-hand side, but the right-hand side of an IN subquery, the ListQuery, is > currently defined with nullability = false always. This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, rewriting the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases it > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and that rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. 
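SQL's three-valued IN semantics, which make the ListQuery side nullable, can be sketched in plain Python with None modeling SQL NULL (the helper name is illustrative):

```python
# Three-valued-logic sketch of SQL's IN semantics (None models SQL NULL):
# shows why `x IN (subquery)` is nullable whenever the subquery column is.
def sql_in(value, column):
    if not column:
        return False  # IN over an empty result is FALSE, even for a NULL lhs
    if value is None:
        return None   # NULL lhs against a non-empty list yields NULL
    if value in column:
        return True
    # No match: NULL if the list contains NULL (the match is "unknown"), else FALSE.
    return None if None in column else False
```

This reproduces the bug's example: `1 IN (2, NULL)` is NULL, so `(1 IN (2, NULL)) <=> TRUE` is FALSE, while the simplified expression without `<=> TRUE` yields NULL instead.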
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Summary: Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions (was: Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions) > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features caused by certain SQL functions > -- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.1, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.1 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add a > {{distinct}} call to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => > array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero entry. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) > at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) > ... many more > {code}
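The failure mode reported above can be mimicked without Spark: an empty (or too-short) string yields no 3-grams, so HashingTF produces an all-zero vector, and the `require` inside `MinHashLSHModel.hashFunction` rejects it. A minimal sketch, where `SparseVec` is a hypothetical stand-in for Spark's sparse vector type:

```scala
// Stand-in for Spark's sparse vector: only non-zero entries are stored,
// so an all-zero vector has empty `indices` and `values`.
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

object LshFailureSketch {
  // Mimics the precondition that throws in MinHashLSHModel.hashFunction.
  def hashFunction(v: SparseVec): Unit =
    require(v.indices.nonEmpty, "Must have at least 1 non zero entry.")

  def main(args: Array[String]): Unit = {
    val ok    = SparseVec(262144, Array(7), Array(1.0))
    val empty = SparseVec(262144, Array.empty[Int], Array.empty[Double]) // "" -> no 3-grams

    hashFunction(ok) // passes
    try hashFunction(empty)
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```

This is why the filters (`length($"name") > 2`) matter: whether the all-zero rows ever reach the LSH UDF depends on how the optimizer orders the filter relative to the transform, which is what the `distinct()` variant above perturbs.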
[jira] [Resolved] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43413. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41094 [https://github.com/apache/spark/pull/41094] > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.5.0 > > > IN subquery expressions are incorrectly marked as non-nullable, even when > they are actually nullable. They correctly check the nullability of the > left-hand side, but the right-hand side of an IN subquery, the ListQuery, is > currently always defined with nullability = false. This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN > expression returns NULL when nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, rewriting the expression to non_nullable_col IN (select > nullable_col), which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases it > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and that rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, for as long > as the ListQuery class has existed. 
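The nullability problem can be illustrated with plain Scala (a sketch of SQL's three-valued semantics, not Spark's implementation): even when the left-hand side is non-nullable, `x IN (subquery)` can still evaluate to NULL when the subquery returns NULLs, which is why the ListQuery side must report nullable = true.

```scala
object InSemanticsSketch {
  // SQL three-valued semantics for `x IN (ys)`, with None standing for NULL:
  //  - IN over an empty set is FALSE;
  //  - a definite match is TRUE;
  //  - no match with NULLs present (or a NULL lhs) is NULL, not FALSE.
  def sqlIn(x: Option[Int], ys: Seq[Option[Int]]): Option[Boolean] =
    if (ys.isEmpty) Some(false)
    else if (x.isEmpty) None
    else if (ys.contains(x)) Some(true)
    else if (ys.exists(_.isEmpty)) None
    else Some(false)

  def main(args: Array[String]): Unit = {
    // Non-nullable lhs, nullable rhs: the result is still NULL (None),
    // so `<=> TRUE` must not be simplified away.
    println(sqlIn(Some(1), Seq(Some(2), None))) // None
    println(sqlIn(Some(1), Seq(Some(1), None))) // Some(true)
  }
}
```

The first case is exactly the one in the description: `<=> TRUE` maps None to false, while the bare IN expression leaves it as NULL, so dropping the `<=> TRUE` changes the result.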
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43517) Add a migration guide for namedtuple monkey patch
[ https://issues.apache.org/jira/browse/SPARK-43517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43517: - Priority: Minor (was: Major) > Add a migration guide for namedtuple monkey patch > - > > Key: SPARK-43517 > URL: https://issues.apache.org/jira/browse/SPARK-43517 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > We should add a migration guide for the SPARK-41189 workaround. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43517) Add a migration guide for namedtuple monkey patch
Hyukjin Kwon created SPARK-43517: Summary: Add a migration guide for namedtuple monkey patch Key: SPARK-43517 URL: https://issues.apache.org/jira/browse/SPARK-43517 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon We should add a migration guide for the SPARK-41189 workaround. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43516) Basic estimator / transformer / model / evaluator interfaces
[ https://issues.apache.org/jira/browse/SPARK-43516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-43516: -- Assignee: Weichen Xu > Basic estimator / transformer / model / evaluator interfaces > > > Key: SPARK-43516 > URL: https://issues.apache.org/jira/browse/SPARK-43516 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43516) Basic estimator / transformer / model / evaluator interfaces
Weichen Xu created SPARK-43516: -- Summary: Basic estimator / transformer / model / evaluator interfaces Key: SPARK-43516 URL: https://issues.apache.org/jira/browse/SPARK-43516 Project: Spark Issue Type: Sub-task Components: Connect, ML, PySpark Affects Versions: 3.5.0 Reporter: Weichen Xu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Environment: Scala version: 2.12.17 Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0. Spark 3.3.1 deployed on cluster was used to check the issue on real data. was: Scala version: 2.12.17 Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0. Spark 3.3.1 deployed on cluster was used to check the issue with real data. > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features on empty data frames caused by certain SQL functions > --- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.1, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.1 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement fail
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Environment: Scala version: 2.12.17 Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. Spark 3.3.1 deployed on cluster was used to check the issue on real data. was: Scala version: 2.12.17 Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0. Spark 3.3.1 deployed on cluster was used to check the issue on real data. > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features on empty data frames caused by certain SQL functions > --- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.1, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.1 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement fail
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Environment: Scala version: 2.12.17 Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0. Spark 3.3.1 deployed on cluster was used to check the issue with real data. was: Scala version: 2.12.17 Test examples was executed inside Zeppelin 0.10.1; Spark 3.3.1 deployed on cluster was used to check the issue with real data. > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features on empty data frames caused by certain SQL functions > --- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.1, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples was executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.1 deployed on cluster was used to check the issue with real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Summary: Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features on empty data frames caused by certain SQL functions (was: Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions) > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features on empty data frames caused by certain SQL functions > --- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.1, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples was executed inside Zeppelin 0.10.1; Spark 3.3.1 deployed on > cluster was used to check the issue with real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero entry. > at scala.Predef$.require(Predef.scala:281) >
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Description: We designed a function that joins two DFs on common column with some similarity. All next code will be on Scala 2.12. I've added {{show}} calls for demonstration purposes. {code:scala} import org.apache.spark.ml.Pipeline import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer, MinHashLSHModel} import org.apache.spark.sql.{DataFrame, Column} /** * Joins two data frames on a string column using LSH algorithm * for similarity computation. * * If input data frames have columns with identical names, * the resulting dataframe will have columns from them both * with prefixes `datasetA` and `datasetB` respectively. * * For example, if both dataframes have a column with name `myColumn`, * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`. */ def similarityJoin( df: DataFrame, anotherDf: DataFrame, joinExpr: String, threshold: Double = 0.8, ): DataFrame = { df.show(false) anotherDf.show(false) val pipeline = new Pipeline().setStages(Array( new RegexTokenizer() .setPattern("") .setMinTokenLength(1) .setInputCol(joinExpr) .setOutputCol("tokens"), new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), ) ) val model = pipeline.fit(df) val storedHashed = model.transform(df) val landedHashed = model.transform(anotherDf) val commonColumns = df.columns.toSet & anotherDf.columns.toSet /** * Converts column name from a data frame to the column of resulting dataset. 
*/ def convertColumn(datasetName: String)(columnName: String): Column = { val newName = if (commonColumns.contains(columnName)) s"$datasetName${columnName.capitalize}" else columnName col(s"$datasetName.$columnName") as newName } val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ anotherDf.columns.map(convertColumn("datasetB")) val result = model .stages .last .asInstanceOf[MinHashLSHModel] .approxSimilarityJoin(storedHashed, landedHashed, threshold, "confidence") .select(columnsToSelect.toSeq: _*) result.show(false) result } {code} Now consider such simple example: {code:scala} val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" similarityJoin(inputDF1, inputDF2, "name", 0.6) {code} This example runs with no errors and outputs 3 empty DFs. Let's add {{distinct}} method to one data frame: {code:scala} val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 2) as "df1" val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" similarityJoin(inputDF1, inputDF2, "name", 0.6) {code} This example outputs two empty DFs and then fails at {{result.show(false)}}. Error: {code:none} org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (LSHModel$$Lambda$3769/0x000101804840: (struct,values:array>) => array,values:array>>). ... many elided Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at least 1 non zero entry. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) ... many more {code} Now let's take a look on the example which is close to our application code. 
Define some helper functions: {code:scala} import org.apache.spark.sql.functions.{transform, to_timestamp} def process1(df: DataFrame): Unit = { val companies = df.select($"id", $"name") val directors = df .select(explode($"directors")) .select($"col.name", $"col.id") .dropDuplicates("id") val toBeMatched1 = companies .filter(length($"name") > 2) .select( $"name", $"id" as "sourceLegalEntityId", ) val toBeMatched2 = directors .filter(length($"name") > 2) .select( $"name", $"id" as "directorId", ) similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6) } def process2(df: DataFrame): Unit = { def process_financials(column: Column): Column = { transform( column, x => x.withField("date", to_timestamp(x("date"), "dd MMM "
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Description: We designed a function that joins two DFs on common column with some similarity. All next code will be on Scala 2.12. I've added {{show}} calls for demonstration purposes. {code:scala} import org.apache.spark.ml.Pipeline import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer, MinHashLSHModel} import org.apache.spark.sql.{DataFrame, Column} /** * Joins two data frames on a string column using LSH algorithm * for similarity computation. * * If input data frames have columns with identical names, * the resulting dataframe will have columns from them both * with prefixes `datasetA` and `datasetB` respectively. * * For example, if both dataframes have a column with name `myColumn`, * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`. */ def similarityJoin( df: DataFrame, anotherDf: DataFrame, joinExpr: String, threshold: Double = 0.8, ): DataFrame = { df.show(false) anotherDf.show(false) val pipeline = new Pipeline().setStages(Array( new RegexTokenizer() .setPattern("") .setMinTokenLength(1) .setInputCol(joinExpr) .setOutputCol("tokens"), new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), ) ) val model = pipeline.fit(df) val storedHashed = model.transform(df) val landedHashed = model.transform(anotherDf) val commonColumns = df.columns.toSet & anotherDf.columns.toSet /** * Converts column name from a data frame to the column of resulting dataset. 
*/ def convertColumn(datasetName: String)(columnName: String): Column = { val newName = if (commonColumns.contains(columnName)) s"$datasetName${columnName.capitalize}" else columnName col(s"$datasetName.$columnName") as newName } val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ anotherDf.columns.map(convertColumn("datasetB")) val result = model .stages .last .asInstanceOf[MinHashLSHModel] .approxSimilarityJoin(storedHashed, landedHashed, threshold, "confidence") .select(columnsToSelect.toSeq: _*) result.show(false) result } {code} Now consider such simple example: {code:scala} val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" similarityJoin(inputDF1, inputDF2, "name", 0.6) {code} This example runs with no errors and outputs 3 empty DFs. Let's add {{distinct}} method to one data frame: {code:scala} val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 2) as "df1" val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" similarityJoin(inputDF1, inputDF2, "name", 0.6) {code} This example outputs two empty DFs and then fails at {{result.show(false)}}. Error: {code:none} org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (LSHModel$$Lambda$3769/0x000101804840: (struct,values:array>) => array,values:array>>). ... many elided Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at least 1 non zero entry. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) ... many more {code} Now let's take a look on the example which is close to our application code. 
Define some helper functions: {code:scala} import org.apache.spark.sql.functions.{transform, to_timestamp} def process1(df: DataFrame): Unit = { val companies = df.select($"id", $"name") val directors = df .select(explode($"directors")) .select($"col.name", $"col.id") .dropDuplicates("id") val toBeMatched1 = companies .filter(length($"name") > 2) .select( $"name", $"id" as "sourceLegalEntityId", ) val toBeMatched2 = directors .filter(length($"name") > 2) .select( $"name", $"id" as "directorId", ) similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6) } def process2(df: DataFrame): Unit = { def process_financials(column: Column): Column = { transform( column, x => x.withField("date", to_timestamp(x("date"), "dd MMM "
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Description: We designed a function that joins two DFs on a common column using similarity matching. All code below is in Scala 2.12. I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.functions._
import spark.implicits._

/**
 * Joins two data frames on a string column using the LSH algorithm
 * for similarity computation.
 *
 * If the input data frames have columns with identical names,
 * the resulting data frame will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both data frames have a column named `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
    df: DataFrame,
    anotherDf: DataFrame,
    joinExpr: String,
    threshold: Double = 0.8,
): DataFrame = {
  df.show(false)
  anotherDf.show(false)

  val pipeline = new Pipeline().setStages(Array(
    new RegexTokenizer()
      .setPattern("")
      .setMinTokenLength(1)
      .setInputCol(joinExpr)
      .setOutputCol("tokens"),
    new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
    new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
    new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
  ))
  val model = pipeline.fit(df)

  val storedHashed = model.transform(df)
  val landedHashed = model.transform(anotherDf)
  val commonColumns = df.columns.toSet & anotherDf.columns.toSet

  /**
   * Converts a column name from an input data frame
   * to the corresponding column of the resulting dataset.
   */
  def convertColumn(datasetName: String)(columnName: String): Column = {
    val newName =
      if (commonColumns.contains(columnName)) s"$datasetName${columnName.capitalize}"
      else columnName
    col(s"$datasetName.$columnName") as newName
  }

  val columnsToSelect =
    df.columns.map(convertColumn("datasetA")) ++
    anotherDf.columns.map(convertColumn("datasetB"))

  val result = model
    .stages
    .last
    .asInstanceOf[MinHashLSHModel]
    .approxSimilarityJoin(storedHashed, landedHashed, threshold, "confidence")
    .select(columnsToSelect.toSeq: _*)

  result.show(false)
  result
}
{code}

Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the {{distinct}} method to one data frame:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}. Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (LSHModel$$Lambda$3769/0x000101804840: (struct,values:array>) => array,values:array>>).
... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at least 1 non zero entry.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
... many more
{code}

Now let's look at an example that is close to our application code. Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions.{transform, to_timestamp}

def process1(df: DataFrame): Unit = {
  val companies = df.select($"id", $"name")
  val directors = df
    .select(explode($"directors"))
    .select($"col.name", $"col.id")
    .dropDuplicates("id")

  val toBeMatched1 = companies
    .filter(length($"name") > 2)
    .select(
      $"name",
      $"id" as "sourceLegalEntityId",
    )

  val toBeMatched2 = directors
    .filter(length($"name") > 2)
    .select(
      $"name",
      $"id" as "directorId",
    )

  similarityJoin(toBeMatched1, toBeMatched2, "name", 0.6)
}

def process2(df: DataFrame): Unit = {
  def process_financials(column: Column): Column = {
    transform(
      column,
      x => x.withField("date", to_timestamp(x("date"), "dd MMM "
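The {{FAILED_EXECUTE_UDF}} failure above bottoms out in {{MinHashLSHModel.hashFunction}} (MinHashLSH.scala:61), which requires at least one non-zero entry in the input vector; a name that yields no n-gram tokens hashes to an all-zero feature vector. A plain-Python sketch of that invariant and of a defensive pre-filter (the helper names are illustrative, not Spark's API):

```python
def hash_function(vector):
    # Mirrors the require() at MinHashLSH.scala:61: the input vector
    # must contain at least one non-zero entry.
    if not any(v != 0 for v in vector):
        raise ValueError(
            "requirement failed: Must have at least 1 non zero entry.")
    # Toy stand-in for the real MinHash computation over non-zero indices.
    return min(i * 31 + 7 for i, v in enumerate(vector) if v != 0)

# Rows whose text produced no n-grams end up as all-zero feature vectors.
rows = {"": [0, 0, 0], "acme ltd": [1, 0, 2]}

# Pre-filtering such rows avoids the exception in the hash step.
non_empty = {name: vec for name, vec in rows.items() if any(vec)}
```

One possible mitigation in the Spark job itself is to filter out rows whose feature vector has no non-zero entries before calling {{approxSimilarityJoin}}, though that does not explain why {{distinct}} changes whether the UDF is ever invoked on such rows.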
[jira] [Created] (SPARK-43515) Add a benchmark to evaluate the performance of migrating a large number of tiny blocks during decommission
Xingbo Jiang created SPARK-43515: Summary: Add a benchmark to evaluate the performance of migrating a large number of tiny blocks during decommission Key: SPARK-43515 URL: https://issues.apache.org/jira/browse/SPARK-43515 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.4.0 Reporter: Xingbo Jiang Assignee: Xingbo Jiang This is a followup work of SPARK-43043, we should add a benchmark to evaluate the performance of migrating a large number of tiny blocks during decommission. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
Svyatoslav Semenyuk created SPARK-43514: --- Summary: Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions Key: SPARK-43514 URL: https://issues.apache.org/jira/browse/SPARK-43514 Project: Spark Issue Type: Bug Components: ML, SQL Affects Versions: 3.4.0, 3.3.1 Environment: Scala version: 2.12.17 Test examples were executed inside Zeppelin 0.10.1; Spark 3.3.1 deployed on a cluster was used to check the issue with real data. Reporter: Svyatoslav Semenyuk We designed a function that joins two DFs on a common column using similarity matching. All code below is in Scala 2.12. I've added {{show}} calls for demonstration purposes.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer, MinHashLSHModel}
import org.apache.spark.sql.{DataFrame, Column}

/**
 * Joins two data frames on a string column using the LSH algorithm
 * for similarity computation.
 *
 * If the input data frames have columns with identical names,
 * the resulting data frame will have columns from them both
 * with prefixes `datasetA` and `datasetB` respectively.
 *
 * For example, if both data frames have a column named `myColumn`,
 * then the result will have columns `datasetAMyColumn` and `datasetBMyColumn`.
 */
def similarityJoin(
    df: DataFrame,
    anotherDf: DataFrame,
    joinExpr: String,
    threshold: Double = 0.8,
): DataFrame = {
  df.show(false)
  anotherDf.show(false)

  val pipeline = new Pipeline().setStages(Array(
    new RegexTokenizer()
      .setPattern("")
      .setMinTokenLength(1)
      .setInputCol(joinExpr)
      .setOutputCol("tokens"),
    new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
    new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
    new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
  ))
  val model = pipeline.fit(df)

  val storedHashed = model.transform(df)
  val landedHashed = model.transform(anotherDf)
  val commonColumns = df.columns.toSet & anotherDf.columns.toSet

  /**
   * Converts a column name from an input data frame
   * to the corresponding column of the resulting dataset.
   */
  def convertColumn(datasetName: String)(columnName: String): Column = {
    val newName =
      if (commonColumns.contains(columnName)) s"$datasetName${columnName.capitalize}"
      else columnName
    col(s"$datasetName.$columnName") as newName
  }

  val columnsToSelect =
    df.columns.map(convertColumn("datasetA")) ++
    anotherDf.columns.map(convertColumn("datasetB"))

  val result = model
    .stages
    .last
    .asInstanceOf[MinHashLSHModel]
    .approxSimilarityJoin(storedHashed, landedHashed, threshold, "confidence")
    .select(columnsToSelect.toSeq: _*)

  result.show(false)
  result
}
{code}

Now consider this simple example:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example runs with no errors and outputs 3 empty DFs. Let's add the {{distinct}} method to one data frame:

{code:scala}
val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 2) as "df1"
val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"

similarityJoin(inputDF1, inputDF2, "name", 0.6)
{code}

This example outputs two empty DFs and then fails at {{result.show(false)}}.

Error:

{code:none}
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (LSHModel$$Lambda$3769/0x000101804840: (struct,values:array>) => array,values:array>>).
... many elided
Caused by: java.lang.IllegalArgumentException: requirement failed: Must have at least 1 non zero entry.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
... many more
{code}

Now let's look at an example that is close to our application code. Define some helper functions:

{code:scala}
import org.apache.spark.sql.functions.{transform, to_timestamp}

def process1(df: DataFrame): Unit = {
  val companies = df.select($"id", $"name")
  val directors = df
    .select(explode($"directors"))
    .select($"col.name", $"col.id")
    .dropDuplicates("id")

  val toBeMatched1 = companies
    .filter(length($"name") > 2)
    .select(
      $"name",
      $"id" as "sourceLegalEntityId",
    )

  val toB
[jira] [Updated] (SPARK-43513) withColumnRenamed duplicates columns if new column already exists
[ https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frederik Paradis updated SPARK-43513: - Summary: withColumnRenamed duplicates columns if new column already exists (was: withColumnRenamed duplicates columns if new column already exist) > withColumnRenamed duplicates columns if new column already exists > - > > Key: SPARK-43513 > URL: https://issues.apache.org/jira/browse/SPARK-43513 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Frederik Paradis >Priority: Major > > withColumnRenamed should either replace the column when the new column > already exists, or the documentation should specify this behavior. See the > code below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r)  # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could be: score, score.
> print(r.select("score"))
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
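The duplicate-column behavior above can be modeled without Spark by treating the schema as a flat list of column names; a drop-before-rename helper (hypothetical, plain Python — in PySpark the analogous workaround would be dropping the pre-existing {{score}} column before the rename) avoids the ambiguity:

```python
def with_column_renamed(columns, existing, new):
    # Models Spark's current behavior: rename in place and never
    # deduplicate, so a pre-existing `new` column becomes a duplicate.
    return [new if c == existing else c for c in columns]

def with_column_renamed_safe(columns, existing, new):
    # Workaround sketch: drop any pre-existing column named `new` first.
    kept = [c for c in columns if c != new or c == existing]
    return [new if c == existing else c for c in kept]

schema = ["id", "score", "test_score"]
# The first helper yields two `score` columns, which is exactly why
# select("score") raises "Reference 'score' is ambiguous".
```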
[jira] [Created] (SPARK-43513) withColumnRenamed duplicate columns if new column already exist
Frederik Paradis created SPARK-43513: Summary: withColumnRenamed duplicate columns if new column already exist Key: SPARK-43513 URL: https://issues.apache.org/jira/browse/SPARK-43513 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.0 Reporter: Frederik Paradis withColumnRenamed should either replace the column when the new column already exists, or the documentation should specify this behavior. See the code below as an example of the current state.
{code:python}
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", "test_score"])
r = df.withColumnRenamed("test_score", "score")
print(r)  # DataFrame[id: bigint, score: double, score: double]
# pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could be: score, score.
print(r.select("score"))
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43513) withColumnRenamed duplicates columns if new column already exist
[ https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frederik Paradis updated SPARK-43513: - Summary: withColumnRenamed duplicates columns if new column already exist (was: withColumnRenamed duplicate columns if new column already exist) > withColumnRenamed duplicates columns if new column already exist > > > Key: SPARK-43513 > URL: https://issues.apache.org/jira/browse/SPARK-43513 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Frederik Paradis >Priority: Major > > withColumnRenamed should either replace the column when the new column > already exists, or the documentation should specify this behavior. See the > code below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r)  # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could be: score, score.
> print(r.select("score"))
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
Anish Shrigondekar created SPARK-43512: -- Summary: Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade Key: SPARK-43512 URL: https://issues.apache.org/jira/browse/SPARK-43512 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Anish Shrigondekar Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43205) Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
[ https://issues.apache.org/jira/browse/SPARK-43205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722894#comment-17722894 ] Hudson commented on SPARK-43205: User 'srielau' has created a pull request for this issue: https://github.com/apache/spark/pull/41007 > Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier > --- > > Key: SPARK-43205 > URL: https://issues.apache.org/jira/browse/SPARK-43205 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Serge Rielau >Priority: Major > > There is a requirement for SQL templates, where the table and/or column names > are provided through substitution. This can be done today using variable > substitution: > SET hivevar:tabname = mytab; > SELECT * FROM ${ hivevar:tabname }; > A straight variable substitution is dangerous since it allows SQL > injection: > SET hivevar:tabname = mytab, someothertab; > SELECT * FROM ${ hivevar:tabname }; > A way to get around this problem is to wrap the variable substitution with a > clause that limits the scope to produce an identifier. > This approach is taken by Snowflake: > > [https://docs.snowflake.com/en/sql-reference/session-variables#using-variables-in-sql] > SET hivevar:tabname = 'tabname'; > SELECT * FROM IDENTIFIER(${ hivevar:tabname }) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
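The injection risk described above can be sketched in plain Python: raw substitution splices whatever the variable holds into the statement, while an IDENTIFIER-style wrapper that validates the value as a single (optionally qualified) identifier rejects it. This is an illustrative model, not Spark's actual parser:

```python
import re

# Accepts a single dotted identifier such as `mytab` or `db.schema.tab`.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*(\.[A-Za-z_][A-Za-z_0-9]*)*\Z")

def substitute_raw(template, value):
    # Plain variable substitution: injection is possible because the
    # value is spliced into the statement verbatim.
    return template.replace("${tabname}", value)

def identifier(value):
    # IDENTIFIER-style guard: only a single dotted identifier passes, so
    # "mytab, someothertab" cannot smuggle a second table into FROM.
    if _IDENT.match(value) is None:
        raise ValueError(f"not a valid identifier: {value!r}")
    return value
```

With this model, `substitute_raw("SELECT * FROM ${tabname}", "mytab, someothertab")` happily returns a query over two tables, while `identifier("mytab, someothertab")` raises instead.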
[jira] [Commented] (SPARK-43511) Implemented State APIs for Spark Connect Scala
[ https://issues.apache.org/jira/browse/SPARK-43511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722891#comment-17722891 ] Bo Gao commented on SPARK-43511: Created PR https://github.com/apache/spark/pull/40959 > Implemented State APIs for Spark Connect Scala > -- > > Key: SPARK-43511 > URL: https://issues.apache.org/jira/browse/SPARK-43511 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Bo Gao >Priority: Major > > Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark > Connect Scala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43511) Implemented State APIs for Spark Connect Scala
Bo Gao created SPARK-43511: -- Summary: Implemented State APIs for Spark Connect Scala Key: SPARK-43511 URL: https://issues.apache.org/jira/browse/SPARK-43511 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Bo Gao Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark Connect Scala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43442) Split test module `pyspark_pandas_connect`
[ https://issues.apache.org/jira/browse/SPARK-43442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43442. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41127 [https://github.com/apache/spark/pull/41127] > Split test module `pyspark_pandas_connect` > -- > > Key: SPARK-43442 > URL: https://issues.apache.org/jira/browse/SPARK-43442 > Project: Spark > Issue Type: Test > Components: Connect, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43442) Split test module `pyspark_pandas_connect`
[ https://issues.apache.org/jira/browse/SPARK-43442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43442: - Assignee: Ruifeng Zheng > Split test module `pyspark_pandas_connect` > -- > > Key: SPARK-43442 > URL: https://issues.apache.org/jira/browse/SPARK-43442 > Project: Spark > Issue Type: Test > Components: Connect, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adds running executors after processing completed containers
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-43510: --- Summary: Spark application hangs when YarnAllocator adds running executors after processing completed containers (was: Spark application hangs when YarnAllocator adding running executors after processing completed containers) > Spark application hangs when YarnAllocator adds running executors after > processing completed containers > --- > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Priority: Major > > I see the application hang when containers are preempted immediately after > allocation, as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.
> {code}
> Note the warning log where YarnAllocator cannot find the executorId for the > container when processing completed containers. The only plausible cause is > that YarnAllocator added the running executor after processing the completed > containers. The former happens in a separate thread after executor launch. > YarnAllocator believes there are still running executors, although they are > already lost due to preemption. Hence, the application hangs without any > running executors.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
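The race described above can be simulated without YARN: if the completed-container event is processed before the launcher thread registers the executor as running, the completion is dropped with only a warning, and the allocator keeps counting a dead executor. A minimal single-threaded model (class and method names are illustrative, not Spark's internals):

```python
class ToyYarnAllocator:
    def __init__(self):
        self.running = {}  # container id -> executor id

    def register_running(self, container_id, executor_id):
        # In Spark this happens on a separate thread after executor launch.
        self.running[container_id] = executor_id

    def process_completed(self, container_id):
        # Models YarnAllocator: an unknown container only logs a warning,
        # so no executor is ever marked as lost for it.
        if container_id not in self.running:
            return f"WARN Cannot find executorId for container: {container_id}"
        del self.running[container_id]
        return "executor marked as lost"

alloc = ToyYarnAllocator()
# Preemption completes the container before the launcher thread runs...
alloc.process_completed("container_000382")
alloc.register_running("container_000382", "277")
# ...so executor 277 stays in `running` forever: the application hangs
# waiting on an executor that was already preempted.
```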
[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-43510: --- Description: I see application hangs when containers are preempted immediately after allocation as follows. {code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.{code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator added the running executor after processing completed containers. The former happens in a separate thread after executor launch. YarnAllocator believes there are still running executors, although they are already lost due to preemption. Hence, the application hangs without any running executors. was: I see application hangs when containers are preempted immediately after allocation as follows. 
{code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.{code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator processing completed container before updating internal state and adding the executorId. The latter happens in a separate thread after executor launch. YarnAllocator thought > Spark application hangs when YarnAllocator adding running executors after > processing completed containers > - > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Priority: Major > > I see application hangs when containers are preempted immediately after > allocation as follows. 
> {code:java} > 23/05/14 09:11:33 INFO YarnAllocator: Launching container > container_e3812_1684033797982_57865_01_000382 on host > hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with > ID 277 for ResourceProfile Id 0 > 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: > container_e3812_1684033797982_57865_01_000382 > 23/05/14 09:11:33 INFO YarnAllocator: Completed container > container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: > -102) > 23/05/14 09:11:33 INFO YarnAllocator: Container > container_e3812_1684033797982_57865_01_000382 was preempted.{code} > Note the warning log where YarnAllocator cannot find executorId for the > container when processing completed containers. The only plausible cause is > YarnAllocator added the running executor after processing completed > containers. The former happens in a separate thread after executor launch. > YarnAllocator believes there are still running executors, although they are > already lost due to preemption. Hence, the application hangs without any > running executors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-43510: --- Summary: Spark application hangs when YarnAllocator adding running executors after processing completed containers (was: Spark application hangs when YarnAllocator processing completed containers before updating internal state) > Spark application hangs when YarnAllocator adding running executors after > processing completed containers > - > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Priority: Major > > I see application hangs when containers are preempted immediately after > allocation as follows. > {code:java} > 23/05/14 09:11:33 INFO YarnAllocator: Launching container > container_e3812_1684033797982_57865_01_000382 on host > hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with > ID 277 for ResourceProfile Id 0 > 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: > container_e3812_1684033797982_57865_01_000382 > 23/05/14 09:11:33 INFO YarnAllocator: Completed container > container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: > -102) > 23/05/14 09:11:33 INFO YarnAllocator: Container > container_e3812_1684033797982_57865_01_000382 was preempted.{code} > Note the warning log where YarnAllocator cannot find executorId for the > container when processing completed containers. The only plausible cause is > YarnAllocator processing completed container before updating internal state > and adding the executorId. The latter happens in a separate thread after > executor launch. > YarnAllocator thought -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator processing completed containers before updating internal state
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-43510: --- Description: I see application hangs when containers are preempted immediately after allocation as follows. {code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.{code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator processing completed container before updating internal state and adding the executorId. The latter happens in a separate thread after executor launch. YarnAllocator thought was: I see application hangs when containers are preempted immediately after allocation as follows. {code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted. 
{code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator processing completed container before updating internal state and adding the executorId. The latter happens in a separate thread after executor launch. > Spark application hangs when YarnAllocator processing completed containers > before updating internal state > - > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Priority: Major > > I see application hangs when containers are preempted immediately after > allocation as follows. > {code:java} > 23/05/14 09:11:33 INFO YarnAllocator: Launching container > container_e3812_1684033797982_57865_01_000382 on host > hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with > ID 277 for ResourceProfile Id 0 > 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: > container_e3812_1684033797982_57865_01_000382 > 23/05/14 09:11:33 INFO YarnAllocator: Completed container > container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: > -102) > 23/05/14 09:11:33 INFO YarnAllocator: Container > container_e3812_1684033797982_57865_01_000382 was preempted.{code} > Note the warning log where YarnAllocator cannot find executorId for the > container when processing completed containers. The only plausible cause is > YarnAllocator processing completed container before updating internal state > and adding the executorId. The latter happens in a separate thread after > executor launch. > YarnAllocator thought -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43494) Directly call `replicate()` instead of reflection in `SparkHadoopUtil#createFile`
[ https://issues.apache.org/jira/browse/SPARK-43494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43494: Assignee: Yang Jie > Directly call `replicate()` instead of reflection in > `SparkHadoopUtil#createFile` > - > > Key: SPARK-43494 > URL: https://issues.apache.org/jira/browse/SPARK-43494 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43494) Directly call `replicate()` instead of reflection in `SparkHadoopUtil#createFile`
[ https://issues.apache.org/jira/browse/SPARK-43494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43494. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41164 [https://github.com/apache/spark/pull/41164] > Directly call `replicate()` instead of reflection in > `SparkHadoopUtil#createFile` > - > > Key: SPARK-43494 > URL: https://issues.apache.org/jira/browse/SPARK-43494 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43510) Spark application hangs when YarnAllocator processing completed containers before updating internal state
Manu Zhang created SPARK-43510: -- Summary: Spark application hangs when YarnAllocator processing completed containers before updating internal state Key: SPARK-43510 URL: https://issues.apache.org/jira/browse/SPARK-43510 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 3.4.0 Reporter: Manu Zhang I see application hangs when containers are preempted immediately after allocation as follows. {code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted. {code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator processing completed container before updating internal state and adding the executorId. The latter happens in a separate thread after executor launch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
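The log sequence in this report suggests a classic check-then-act race: the completed-container handler looks up the executor ID before the launcher thread has registered it. A minimal Python sketch of that window (names are hypothetical stand-ins, not Spark's actual YarnAllocator internals):

```python
import threading
import time

# Hypothetical stand-ins for YarnAllocator's internal state.
container_to_executor = {}  # updated by the launcher thread
lock = threading.Lock()

def launch_executor(container_id, executor_id):
    # Simulate the launcher thread doing slow setup work before it
    # registers the container -> executor mapping: this is the race window.
    time.sleep(0.2)
    with lock:
        container_to_executor[container_id] = executor_id

def process_completed_container(container_id):
    # The allocator thread can run this while launch_executor is still in
    # its setup phase, so the lookup misses even for a launched container.
    with lock:
        executor_id = container_to_executor.get(container_id)
    if executor_id is None:
        return "WARN: Cannot find executorId for container: " + container_id
    return "Released executor " + executor_id

launcher = threading.Thread(target=launch_executor,
                            args=("container_000382", "277"))
launcher.start()
# Preemption arrives immediately after allocation, before registration:
msg = process_completed_container("container_000382")
launcher.join()
print(msg)  # the lookup misses, matching the warning in the log above
```

The fix direction implied by the report is to make registration and completed-container processing observe the same ordering (or to re-check after registration), rather than relying on the launcher thread winning the race.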
[jira] [Resolved] (SPARK-43223) KeyValueGroupedDataset#agg
[ https://issues.apache.org/jira/browse/SPARK-43223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-43223. --- Fix Version/s: 3.5.0 Assignee: Zhen Li Resolution: Fixed > KeyValueGroupedDataset#agg > -- > > Key: SPARK-43223 > URL: https://issues.apache.org/jira/browse/SPARK-43223 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Major > Fix For: 3.5.0 > > > Adding missing agg functions in the KVGDS API -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43461) Skip compiling useless files when making distribution
[ https://issues.apache.org/jira/browse/SPARK-43461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43461: Summary: Skip compiling useless files when making distribution (was: Skip compiling javadoc.jar and sources.jar when making distribution) > Skip compiling useless files when making distribution > - > > Key: SPARK-43461 > URL: https://issues.apache.org/jira/browse/SPARK-43461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > -Dmaven.javadoc.skip=true to skip java doc > -Dskip=true to skip scala doc. Please see: > https://davidb.github.io/scala-maven-plugin/doc-jar-mojo.html#skip > -Dmaven.source.skip to skip build sources.jar > -Dmaven.test.skip to skip build test-jar > -Dcyclonedx.skip=true to skip making bom. Please see: > https://cyclonedx.github.io/cyclonedx-maven-plugin/makeBom-mojo.html#skip -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43508) Replace the link related to hadoop version 2 with hadoop version 3
[ https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-43508: Assignee: BingKun Pan > Replace the link related to hadoop version 2 with hadoop version 3 > -- > > Key: SPARK-43508 > URL: https://issues.apache.org/jira/browse/SPARK-43508 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43508) Replace the link related to hadoop version 2 with hadoop version 3
[ https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-43508: - Priority: Trivial (was: Minor) > Replace the link related to hadoop version 2 with hadoop version 3 > -- > > Key: SPARK-43508 > URL: https://issues.apache.org/jira/browse/SPARK-43508 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43508) Replace the link related to hadoop version 2 with hadoop version 3
[ https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-43508. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41171 [https://github.com/apache/spark/pull/41171] > Replace the link related to hadoop version 2 with hadoop version 3 > -- > > Key: SPARK-43508 > URL: https://issues.apache.org/jira/browse/SPARK-43508 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43509) Support creating multiple sessions for Spark Connect in PySpark
Martin Grund created SPARK-43509: Summary: Support creating multiple sessions for Spark Connect in PySpark Key: SPARK-43509 URL: https://issues.apache.org/jira/browse/SPARK-43509 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Martin Grund -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43495) Upgrade RoaringBitmap to 0.9.44
[ https://issues.apache.org/jira/browse/SPARK-43495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-43495: - Priority: Minor (was: Major) > Upgrade RoaringBitmap to 0.9.44 > --- > > Key: SPARK-43495 > URL: https://issues.apache.org/jira/browse/SPARK-43495 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.40] > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.41] > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.44] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43495) Upgrade RoaringBitmap to 0.9.44
[ https://issues.apache.org/jira/browse/SPARK-43495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-43495. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41165 [https://github.com/apache/spark/pull/41165] > Upgrade RoaringBitmap to 0.9.44 > --- > > Key: SPARK-43495 > URL: https://issues.apache.org/jira/browse/SPARK-43495 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.40] > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.41] > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.44] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43495) Upgrade RoaringBitmap to 0.9.44
[ https://issues.apache.org/jira/browse/SPARK-43495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-43495: Assignee: Yang Jie > Upgrade RoaringBitmap to 0.9.44 > --- > > Key: SPARK-43495 > URL: https://issues.apache.org/jira/browse/SPARK-43495 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.40] > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.41] > * [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.44] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43359) DELETE from Hive table result in INTERNAL error
[ https://issues.apache.org/jira/browse/SPARK-43359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722759#comment-17722759 ] BingKun Pan commented on SPARK-43359: - Let me fix it. > DELETE from Hive table result in INTERNAL error > --- > > Key: SPARK-43359 > URL: https://issues.apache.org/jira/browse/SPARK-43359 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Minor > > spark-sql (default)> CREATE TABLE T1(c1 INT); > spark-sql (default)> DELETE FROM T1 WHERE c1 = 1; > [INTERNAL_ERROR] Unexpected table relation: HiveTableRelation > [`spark_catalog`.`default`.`t1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], > Partition Cols: []] > org.apache.spark.SparkException: [INTERNAL_ERROR] Unexpected table relation: > HiveTableRelation [`spark_catalog`.`default`.`t1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], > Partition Cols: []] > at > org.apache.spark.SparkException$.internalError(SparkException.scala:77) > at > org.apache.spark.SparkException$.internalError(SparkException.scala:81) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy.apply(DataSourceV2Strategy.scala:310) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43508) Replace the link related to hadoop version 2 with hadoop version 3
[ https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-43508: Summary: Replace the link related to hadoop version 2 with hadoop version 3 (was: Replace the link related to hadoop version 2 with Hadoop version 3) > Replace the link related to hadoop version 2 with hadoop version 3 > -- > > Key: SPARK-43508 > URL: https://issues.apache.org/jira/browse/SPARK-43508 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43508) Replace the link related to hadoop version 2 with Hadoop version 3
[ https://issues.apache.org/jira/browse/SPARK-43508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-43508: Summary: Replace the link related to hadoop version 2 with Hadoop version 3 (was: Replace the link related to Hadoop version 2 with Hadoop version 3) > Replace the link related to hadoop version 2 with Hadoop version 3 > -- > > Key: SPARK-43508 > URL: https://issues.apache.org/jira/browse/SPARK-43508 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43508) Replace the link related to Hadoop version 2 with Hadoop version 3
BingKun Pan created SPARK-43508: --- Summary: Replace the link related to Hadoop version 2 with Hadoop version 3 Key: SPARK-43508 URL: https://issues.apache.org/jira/browse/SPARK-43508 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43485) Confused errors from the DATEADD function
[ https://issues.apache.org/jira/browse/SPARK-43485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722749#comment-17722749 ] Nikita Awasthi commented on SPARK-43485: User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/41143 > Confused errors from the DATEADD function > - > > Key: SPARK-43485 > URL: https://issues.apache.org/jira/browse/SPARK-43485 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.5.0 > > > The code example portraits the issue: > {code:sql} > spark-sql (default)> select dateadd('MONTH', 1, date'2023-05-11'); > [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `dateadd` requires 2 parameters but > the actual number is 3. Please, refer to > 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix.; > line 1 pos 7 > {code} > The error says about number of arguments passed to DATEADD but the issue is > about the type of the first argument. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43485) Confused errors from the DATEADD function
[ https://issues.apache.org/jira/browse/SPARK-43485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43485. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41143 [https://github.com/apache/spark/pull/41143] > Confused errors from the DATEADD function > - > > Key: SPARK-43485 > URL: https://issues.apache.org/jira/browse/SPARK-43485 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.5.0 > > > The code example portraits the issue: > {code:sql} > spark-sql (default)> select dateadd('MONTH', 1, date'2023-05-11'); > [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `dateadd` requires 2 parameters but > the actual number is 3. Please, refer to > 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix.; > line 1 pos 7 > {code} > The error says about number of arguments passed to DATEADD but the issue is > about the type of the first argument. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43493) Add a max distance argument to the levenshtein() function
[ https://issues.apache.org/jira/browse/SPARK-43493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722720#comment-17722720 ] BingKun Pan edited comment on SPARK-43493 at 5/15/23 9:46 AM: -- Let me implements it at `sql` first, I will implements it at `connect` later. was (Author: panbingkun): Let me implements it at `sql` first, I will implements it ad `connect` later. > Add a max distance argument to the levenshtein() function > - > > Key: SPARK-43493 > URL: https://issues.apache.org/jira/browse/SPARK-43493 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Currently, Spark's levenshtein(str1, str2) function can be very inefficient > for long strings. Many other databases which support this type of built-in > function also take a third argument which signifies a maximum distance after > which it is okay to terminate the algorithm. > For example something like > {code:sql} > levenshtein(str1, str2[, max_distance]) > {code} > the function stops computing the distant once the max values is reached. > See postgresql for an example of a 3 argument > [levenshtein|https://www.postgresql.org/docs/current/fuzzystrmatch.html#id-1.11.7.26.7]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43493) Add a max distance argument to the levenshtein() function
[ https://issues.apache.org/jira/browse/SPARK-43493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722720#comment-17722720 ] BingKun Pan commented on SPARK-43493: - Let me implements it at `sql` first, I will implements it ad `connect` later. > Add a max distance argument to the levenshtein() function > - > > Key: SPARK-43493 > URL: https://issues.apache.org/jira/browse/SPARK-43493 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Currently, Spark's levenshtein(str1, str2) function can be very inefficient > for long strings. Many other databases which support this type of built-in > function also take a third argument which signifies a maximum distance after > which it is okay to terminate the algorithm. > For example something like > {code:sql} > levenshtein(str1, str2[, max_distance]) > {code} > the function stops computing the distant once the max values is reached. > See postgresql for an example of a 3 argument > [levenshtein|https://www.postgresql.org/docs/current/fuzzystrmatch.html#id-1.11.7.26.7]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43493) Add a max distance argument to the levenshtein() function
[ https://issues.apache.org/jira/browse/SPARK-43493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722713#comment-17722713 ] ASF GitHub Bot commented on SPARK-43493: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41169 > Add a max distance argument to the levenshtein() function > - > > Key: SPARK-43493 > URL: https://issues.apache.org/jira/browse/SPARK-43493 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Currently, Spark's levenshtein(str1, str2) function can be very inefficient > for long strings. Many other databases which support this type of built-in > function also take a third argument which signifies a maximum distance after > which it is okay to terminate the algorithm. > For example something like > {code:sql} > levenshtein(str1, str2[, max_distance]) > {code} > the function stops computing the distant once the max values is reached. > See postgresql for an example of a 3 argument > [levenshtein|https://www.postgresql.org/docs/current/fuzzystrmatch.html#id-1.11.7.26.7]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
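The proposal above, an optional third argument capping the distance, rests on a standard dynamic-programming property: once every cell in the current row exceeds the bound, no later row can come back under it, so the computation can stop early. A hypothetical Python illustration of that early-termination trick (not the Spark implementation):

```python
def levenshtein(s1, s2, max_distance=None):
    """Edit distance; returns max_distance + 1 as soon as the bound
    is provably exceeded, instead of finishing the full computation."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # iterate over the longer string row by row
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        # Early termination: every cell already exceeds the bound, and
        # row minima never decrease, so the true distance must exceed it.
        if max_distance is not None and min(curr) > max_distance:
            return max_distance + 1
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))     # 3
print(levenshtein("kitten", "sitting", 1))  # 2 (bound 1 exceeded -> 1 + 1)
```

PostgreSQL exposes the same idea as `levenshtein_less_equal`, which is the precedent the issue cites.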
[jira] [Resolved] (SPARK-43500) Test `DataFrame.drop` with empty column list and names containing dot
[ https://issues.apache.org/jira/browse/SPARK-43500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43500. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41167 [https://github.com/apache/spark/pull/41167] > Test `DataFrame.drop` with empty column list and names containing dot > - > > Key: SPARK-43500 > URL: https://issues.apache.org/jira/browse/SPARK-43500 > Project: Spark > Issue Type: Test > Components: Connect, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43500) Test `DataFrame.drop` with empty column list and names containing dot
[ https://issues.apache.org/jira/browse/SPARK-43500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43500: - Assignee: Ruifeng Zheng > Test `DataFrame.drop` with empty column list and names containing dot > - > > Key: SPARK-43500 > URL: https://issues.apache.org/jira/browse/SPARK-43500 > Project: Spark > Issue Type: Test > Components: Connect, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43507) The status of free pages in allocatedPages mismatch with pageTable after invoking TaskMemoryManager::cleanUpAllAllocatedMemory
qian heng created SPARK-43507: - Summary: The status of free pages in allocatedPages mismatch with pageTable after invoking TaskMemoryManager::cleanUpAllAllocatedMemory Key: SPARK-43507 URL: https://issues.apache.org/jira/browse/SPARK-43507 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: qian heng In TaskMemoryManager, `allocatedPages` is a bitmap that tracks free pages in `pageTable`. But inside TaskMemoryManager::cleanUpAllAllocatedMemory, it fills `pageTable` with null without clearing `allocatedPages`, leaving the free-page status of the two structures out of sync. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43507) The status of free pages in allocatedPages mismatch with pageTable after invoking TaskMemoryManager::cleanUpAllAllocatedMemory
[ https://issues.apache.org/jira/browse/SPARK-43507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qian heng updated SPARK-43507: -- External issue URL: (was: https://github.com/apache/spark/pull/40974) > The status of free pages in allocatedPages mismatch with pageTable after > invoking TaskMemoryManager::cleanUpAllAllocatedMemory > -- > > Key: SPARK-43507 > URL: https://issues.apache.org/jira/browse/SPARK-43507 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: qian heng >Priority: Major > > For TaskMemoryManager, `allocatePags` is a Bitmap for tracking free pages > in `pageTables`. > But inner TaskMemoryManager::cleanUpAllAllocatedMemory, it fills up > `pageTables` with null but not clear up the `allocatePags`. It leads to the > status of free pages mismatch between them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
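The mismatch reported for SPARK-43507 can be reproduced in miniature: a page table paired with a free-page bitmap stays consistent only if cleanup touches both structures. A simplified Python stand-in (not the actual TaskMemoryManager code; the class and field names are illustrative):

```python
class TinyMemoryManager:
    """Toy model of TaskMemoryManager's pageTable plus allocatedPages bitmap."""
    PAGE_TABLE_SIZE = 8

    def __init__(self):
        self.page_table = [None] * self.PAGE_TABLE_SIZE
        self.allocated_pages = set()  # stand-in for the allocated-pages bitmap

    def allocate_page(self, data):
        # First free page number, mirroring bitmap.nextClearBit(0).
        page_number = next(i for i in range(self.PAGE_TABLE_SIZE)
                           if i not in self.allocated_pages)
        self.allocated_pages.add(page_number)
        self.page_table[page_number] = data
        return page_number

    def buggy_cleanup(self):
        # Clears the table but forgets the bitmap: the reported inconsistency.
        self.page_table = [None] * self.PAGE_TABLE_SIZE

    def fixed_cleanup(self):
        self.page_table = [None] * self.PAGE_TABLE_SIZE
        self.allocated_pages.clear()  # keep both structures in sync

mm = TinyMemoryManager()
mm.allocate_page(b"block-0")
mm.buggy_cleanup()
# The table says every page is free, but the bitmap still claims page 0,
# so the next allocation would skip a page that is actually available:
print(mm.page_table[0], mm.allocated_pages)
```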
[jira] [Created] (SPARK-43506) Enable ArrowTests.test_toPandas_empty_columns for pandas 2.0.0.
Haejoon Lee created SPARK-43506: --- Summary: Enable ArrowTests.test_toPandas_empty_columns for pandas 2.0.0. Key: SPARK-43506 URL: https://issues.apache.org/jira/browse/SPARK-43506 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee Enable ArrowTests.test_toPandas_empty_columns for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43505) spark on k8s supports spark.executor.extraLibraryPath
YE created SPARK-43505: -- Summary: spark on k8s supports spark.executor.extraLibraryPath Key: SPARK-43505 URL: https://issues.apache.org/jira/browse/SPARK-43505 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.4.0, 3.3.2, 3.2.3 Reporter: YE For extraLibraryPath to work as expected, the executor JVM should be started with a prefixed env (assuming Linux), such as: LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${spark.executor.extraLibraryPath} bin/java However, there are two problems when running on K8S: # spark on k8s doesn't handle `spark.executor.extraLibraryPath` at all # k8s currently doesn't have a way to expand environment variables correctly -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
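The expected launch behavior described in SPARK-43505, prepending the configured library path to LD_LIBRARY_PATH before starting the executor JVM, can be sketched as a small helper that builds the process environment. Names here are illustrative, not Spark's K8s launcher code:

```python
import os

def executor_env(extra_library_path, base_env=None):
    """Build the env for the executor JVM, prepending the configured
    spark.executor.extraLibraryPath to LD_LIBRARY_PATH (Linux semantics)."""
    env = dict(base_env if base_env is not None else os.environ)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the configured path wins over any inherited entries.
    env["LD_LIBRARY_PATH"] = (
        extra_library_path + ":" + existing if existing else extra_library_path
    )
    return env

env = executor_env("/opt/native/lib", base_env={"LD_LIBRARY_PATH": "/usr/lib"})
print(env["LD_LIBRARY_PATH"])  # /opt/native/lib:/usr/lib
```

The second problem the issue raises is precisely that K8s sets pod env vars literally, so this expansion has to happen on the Spark side (or via an entrypoint script) rather than in the pod spec.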
[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side
[ https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43504: - Description: Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map is not on the executor side. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot. So we still need to mount the hadoop config map on the executor side. was: Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map is not mounted on the executor side. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot. So we still need to mount the hadoop config map on the executor side. 
> [K8S] Mount hadoop config map in executor side > -- > > Key: SPARK-43504 > URL: https://issues.apache.org/jira/browse/SPARK-43504 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop > config map is not on the executor side. > Per the [https://github.com/apache/spark/pull/22911] description: > {code:java} > The main two things that don't need to happen in executors anymore are: > 1. adding the Hadoop config to the executor pods: this is not needed > since the Spark driver will serialize the Hadoop config and send > it to executors when running tasks. {code} > But in fact, the executor still needs the hadoop configuration. > > !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! > > As shown in the picture above, the driver can resolve `hdfs://zeus`, but the > executor cannot. > So we still need to mount the hadoop config map on the executor side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
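The symptom the issue describes (driver resolves `hdfs://zeus`, executor cannot) can be modeled in a few lines. This is a deliberately simplified toy, not Hadoop's actual client code: real HA nameservice resolution goes through `dfs.nameservices` / `dfs.ha.namenodes.*` in hdfs-site.xml, and the single property key below stands in for that whole chain:

```java
import java.util.Map;
import java.util.Optional;

// Toy model: a logical HDFS nameservice like "zeus" resolves only if
// the pod has the Hadoop config (hdfs-site.xml) available. The driver
// pod has the config map mounted; the executor pod does not.
class NameserviceResolution {
    // `mountedHadoopConf` stands in for the properties a pod would read
    // from its mounted hdfs-site.xml; the key is a simplified stand-in
    // for the real HA resolution settings.
    static Optional<String> resolve(String nameservice, Map<String, String> mountedHadoopConf) {
        return Optional.ofNullable(
                mountedHadoopConf.get("dfs.namenode.rpc-address." + nameservice));
    }
}
```

With the config map mounted the lookup succeeds; with an empty config (the executor pod after SPARK-25815) it fails, which is why the issue proposes mounting the config map on executors as well.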
[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side
[ https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43504: - Description: Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map is not mounted on the executor side. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot. So we still need to mount the hadoop config map on the executor side. was: Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map is not on the executor side. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot. So we still need to mount the hadoop config map on the executor side. 
> [K8S] Mount hadoop config map in executor side > -- > > Key: SPARK-43504 > URL: https://issues.apache.org/jira/browse/SPARK-43504 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop > config map is not mounted on the executor side. > Per the [https://github.com/apache/spark/pull/22911] description: > {code:java} > The main two things that don't need to happen in executors anymore are: > 1. adding the Hadoop config to the executor pods: this is not needed > since the Spark driver will serialize the Hadoop config and send > it to executors when running tasks. {code} > But in fact, the executor still needs the hadoop configuration. > > !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! > > As shown in the picture above, the driver can resolve `hdfs://zeus`, but the > executor cannot. > So we still need to mount the hadoop config map on the executor side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43504) [K8S] Mount hadoop config map in executor side
[ https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722674#comment-17722674 ] Fei Wang commented on SPARK-43504: -- gentle ping [~vanzin] [~dongjoon] [~ifilonenko] > [K8S] Mount hadoop config map in executor side > -- > > Key: SPARK-43504 > URL: https://issues.apache.org/jira/browse/SPARK-43504 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop > config map is not on the executor side. > Per the [https://github.com/apache/spark/pull/22911] description: > {code:java} > The main two things that don't need to happen in executors anymore are: > 1. adding the Hadoop config to the executor pods: this is not needed > since the Spark driver will serialize the Hadoop config and send > it to executors when running tasks. {code} > But in fact, the executor still needs the hadoop configuration. > > !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! > > As shown in the picture above, the driver can resolve `hdfs://zeus`, but the > executor cannot. > So we still need to mount the hadoop config map on the executor side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43504) [K8S] Mount hadoop config map in executor side
[ https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43504: - Description: Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map is not on the executor side. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot. So we still need to mount the hadoop config map on the executor side. was: Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map is not on the executor side. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot. 
> [K8S] Mount hadoop config map in executor side > -- > > Key: SPARK-43504 > URL: https://issues.apache.org/jira/browse/SPARK-43504 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop > config map is not on the executor side. > Per the [https://github.com/apache/spark/pull/22911] description: > {code:java} > The main two things that don't need to happen in executors anymore are: > 1. adding the Hadoop config to the executor pods: this is not needed > since the Spark driver will serialize the Hadoop config and send > it to executors when running tasks. {code} > But in fact, the executor still needs the hadoop configuration. > > !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! > > As shown in the picture above, the driver can resolve `hdfs://zeus`, but the > executor cannot. > So we still need to mount the hadoop config map on the executor side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43504) [K8S] Mount hadoop config map in executor side
Fei Wang created SPARK-43504: Summary: [K8S] Mount hadoop config map in executor side Key: SPARK-43504 URL: https://issues.apache.org/jira/browse/SPARK-43504 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.4.0 Reporter: Fei Wang Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop config map is not on the executor side. Per the [https://github.com/apache/spark/pull/22911] description: {code:java} The main two things that don't need to happen in executors anymore are: 1. adding the Hadoop config to the executor pods: this is not needed since the Spark driver will serialize the Hadoop config and send it to executors when running tasks. {code} But in fact, the executor still needs the hadoop configuration. !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org