[jira] [Updated] (SPARK-48595) Cleanup deprecated api usage related to commons-compress

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48595:
---
Labels: pull-request-available  (was: )

> Cleanup deprecated api usage related to commons-compress
> 
>
> Key: SPARK-48595
> URL: https://issues.apache.org/jira/browse/SPARK-48595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48595) Cleanup deprecated api usage related to commons-compress

2024-06-11 Thread Yang Jie (Jira)
Yang Jie created SPARK-48595:


 Summary: Cleanup deprecated api usage related to commons-compress
 Key: SPARK-48595
 URL: https://issues.apache.org/jira/browse/SPARK-48595
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yang Jie









[jira] [Resolved] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark

2024-06-11 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-48411.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46740
[https://github.com/apache/spark/pull/46740]

> Add E2E test for DropDuplicateWithinWatermark
> -
>
> Key: SPARK-48411
> URL: https://issues.apache.org/jira/browse/SPARK-48411
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Yuchen Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently we do not have an e2e test for DropDuplicateWithinWatermark, so we 
> should add one. We can simply take one of the tests written in Scala here (with 
> the testStream API) and replicate it in Python:
> [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103]
>  
> The change should happen in 
> [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29]
>  
> so we can test it in both connect and non-connect.
>  
> Test with:
> ```
> python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
> python/run-tests --testnames 
> pyspark.sql.tests.connect.streaming.test_parity_streaming
> ```
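
For reference, a rough Scala sketch of the kind of scenario to replicate in Python, assuming it 
runs inside a suite extending StreamTest (so MemoryStream, testStream, AddData, CheckNewAnswer and 
the implicit encoders are in scope); the data and column names are illustrative only:

{code:java}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{col, timestamp_seconds}

val input = MemoryStream[(String, Int)]
val deduped = input.toDS()
  .withColumn("eventTime", timestamp_seconds(col("_2")))
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicatesWithinWatermark("_1")
  .select("_1", "_2")

testStream(deduped)(
  // "a" arrives twice within the 10-second watermark delay, so only its first row survives
  AddData(input, ("a", 1), ("a", 5), ("b", 3)),
  CheckNewAnswer(("a", 1), ("b", 3))
)
{code}

The Python parity test would express the same flow against the PySpark API in test_streaming.py so 
it is exercised under both classic and Connect sessions.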






[jira] [Assigned] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark

2024-06-11 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-48411:


Assignee: Yuchen Liu

> Add E2E test for DropDuplicateWithinWatermark
> -
>
> Key: SPARK-48411
> URL: https://issues.apache.org/jira/browse/SPARK-48411
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Yuchen Liu
>Priority: Major
>  Labels: pull-request-available
>
> Currently we do not have an e2e test for DropDuplicateWithinWatermark, so we 
> should add one. We can simply take one of the tests written in Scala here (with 
> the testStream API) and replicate it in Python:
> [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103]
>  
> The change should happen in 
> [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29]
>  
> so we can test it in both connect and non-connect.
>  
> Test with:
> ```
> python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
> python/run-tests --testnames 
> pyspark.sql.tests.connect.streaming.test_parity_streaming
> ```






[jira] [Commented] (SPARK-48562) Writing to JDBC Temporary View Failed

2024-06-11 Thread Junqing Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854223#comment-17854223
 ] 

Junqing Li commented on SPARK-48562:


Thanks for your reply. Since we do not restrict writes to views, users have the 
flexibility to utilize them in their scenarios. Additionally, temporary tables 
serve various purposes, such as mitigating data conflicts and minimizing 
metadata impact. Therefore, I believe it is essential to maintain support for 
this behavior unless a specific rule is implemented to explicitly prohibit such 
operations.

> Writing to JDBC Temporary View Failed
> -
>
> Key: SPARK-48562
> URL: https://issues.apache.org/jira/browse/SPARK-48562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: Junqing Li
>Priority: Major
>
> When creating a JDBC temporary view, *ApplyCharTypePadding* would add a 
> Project before the LogicalRelation if a CHAR/VARCHAR column exists, and Spark would 
> save it as the view plan. Then, if we try to write to this view, Spark would put 
> this view plan into *InsertIntoStatement* in *ResolveRelations*, which would 
> fail {*}PreWriteCheck{*}.
> Adding the following code to *JDBCTableCatalogSuite* reproduces this problem.
> {code:java}
> test("test writing temporary jdbc view") {
>     withConnection { conn =>
>       conn.prepareStatement("""CREATE TABLE "test"."to_drop" (id 
> CHAR)""").executeUpdate()
>     }
>     sql(
>       s"""
>         CREATE TEMPORARY TABLE jdbcTable
>         USING jdbc
>         OPTIONS (
>           url='$url',
>           dbtable='"test"."to_drop"');""")
>     sql("INSERT INTO jdbcTable values(1),(2)")
>     sql("select * from test.to_drop").show()
>     withConnection { conn =>
>       conn.prepareStatement("""DROP TABLE "test"."to_drop).executeUpdate()
>     }
>   } {code}
>  
> Then we would get the following error.
> {code:java}
> [UNSUPPORTED_INSERT.RDD_BASED] Can't insert into the target. An RDD-based 
> table is not allowed. SQLSTATE: 42809;
> 'InsertIntoStatement Project [staticinvoke(class 
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, 
> readSidePadding, ID#0, 1, true, false, true) AS ID#1], false, false, false
> +- LocalRelation [col1#3] {code}






[jira] [Assigned] (SPARK-48582) Bump `braces` from 3.0.2 to 3.0.3 in /ui-test

2024-06-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48582:


Assignee: Yang Jie

> Bump `braces` from 3.0.2 to 3.0.3 in /ui-test
> -
>
> Key: SPARK-48582
> URL: https://issues.apache.org/jira/browse/SPARK-48582
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48582) Bump `braces` from 3.0.2 to 3.0.3 in /ui-test

2024-06-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48582.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46933
[https://github.com/apache/spark/pull/46933]

> Bump `braces` from 3.0.2 to 3.0.3 in /ui-test
> -
>
> Key: SPARK-48582
> URL: https://issues.apache.org/jira/browse/SPARK-48582
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48594) Rename `parent` field to `child` in `ColumnAlias`

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48594:
---
Labels: pull-request-available  (was: )

> Rename `parent` field to `child` in `ColumnAlias`
> -
>
> Key: SPARK-48594
> URL: https://issues.apache.org/jira/browse/SPARK-48594
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48593) Fix the string representation of lambda function

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48593:
---
Labels: pull-request-available  (was: )

> Fix the string representation of lambda function
> 
>
> Key: SPARK-48593
> URL: https://issues.apache.org/jira/browse/SPARK-48593
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48593) Fix the string representation of lambda function

2024-06-11 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48593:
-

 Summary: Fix the string representation of lambda function
 Key: SPARK-48593
 URL: https://issues.apache.org/jira/browse/SPARK-48593
 Project: Spark
  Issue Type: Bug
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Updated] (SPARK-48592) Add scala style check for logging message inline variables

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48592:
---
Labels: pull-request-available  (was: )

> Add scala style check for logging message inline variables
> --
>
> Key: SPARK-48592
> URL: https://issues.apache.org/jira/browse/SPARK-48592
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Amanda Liu
>Priority: Minor
>  Labels: pull-request-available
>
> Ban logging messages built with logInfo, logWarning, or logError that contain 
> inline variables not wrapped in {{MDC}}.






[jira] [Created] (SPARK-48592) Add scala style check for logging message inline variables

2024-06-11 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-48592:
--

 Summary: Add scala style check for logging message inline variables
 Key: SPARK-48592
 URL: https://issues.apache.org/jira/browse/SPARK-48592
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Amanda Liu


Ban logging messages built with logInfo, logWarning, or logError that contain 
inline variables not wrapped in {{MDC}}.
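
For context, a hedged sketch of the pattern such a rule would enforce, based on the 
structured-logging MDC helper used in Spark 4.0 (the exact import paths and log key names here are 
assumptions and may differ):

{code:java}
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKeys.EXECUTOR_ID  // assumed location of the log key constants

class ExecutorMonitor extends Logging {
  def onExecutorLost(execId: String): Unit = {
    // Would be flagged by the proposed scala style rule: an inline variable with no MDC.
    // logWarning(s"Lost executor $execId")

    // Passes the rule: the variable is wrapped in an MDC key, so it lands in structured logs.
    logWarning(log"Lost executor ${MDC(EXECUTOR_ID, execId)}")
  }
}
{code}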






[jira] [Commented] (SPARK-48463) MLLib function unable to handle nested data

2024-06-11 Thread Weichen Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854214#comment-17854214
 ] 

Weichen Xu commented on SPARK-48463:


Ah, got it. Then it is not supported :)

As a workaround, I think you can flatten the original dataframe and rename the 
new column to something like `location_longitude` (avoiding `.` in the column name); 
then it should work.
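
A minimal sketch of that workaround (illustrative only; `df` and the column names come from the 
example in the issue description quoted below):

{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.col

// Flatten the struct with "_" instead of "." so the ML stage sees plain top-level columns.
val flat = df.select(
  col("location.longitude").as("location_longitude"),
  col("location.latitude").as("location_latitude"),
  col("salary"))

val si = new StringIndexer()
  .setInputCol("location_longitude")
  .setOutputCol("longitude_idx")

new Pipeline().setStages(Array(si)).fit(flat).transform(flat).show()
{code}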

> MLLib function unable to handle nested data
> ---
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 3.5.1
>Reporter: Chhavi Bansal
>Priority: Major
>  Labels: ML, MLPipelines, mllib, nested
>
> I am trying to use a feature transformer on nested data after flattening, but 
> it fails.
>  
> {code:java}
> val structureData = Seq(
>   Row(Row(10, 12), 1000),
>   Row(Row(12, 14), 4300),
>   Row( Row(37, 891), 1400),
>   Row(Row(8902, 12), 4000),
>   Row(Row(12, 89), 1000)
> )
> val structureSchema = new StructType()
>   .add("location", new StructType()
> .add("longitude", IntegerType)
> .add("latitude", IntegerType))
>   .add("salary", IntegerType) 
> val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), 
> structureSchema) 
> def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: 
> String = null):
> Array[Column] = {
>   schema.fields.flatMap(f => {
> val colName = if (prefix == null) f.name else (prefix + "." + f.name)
> val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + 
> f.name)
> f.dataType match {
>   case st: StructType => flattenSchema(st, colName, colnameSelect)
>   case _ =>
> Array(col(colName).as(colnameSelect))
> }
>   })
> }
> val flattenColumns = flattenSchema(df.schema)
> val flattenedDf = df.select(flattenColumns: _*){code}
> Now using the string indexer on the DOT notation.
>  
> {code:java}
> val si = new 
> StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show() {code}
> The above code fails 
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> resolve column name "location.longitude" among (location.longitude, 
> location.latitude, salary); did you mean to quote the `location.longitude` 
> column?
>     at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
>     at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
>     at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
> . {code}
> This points to the same failure as when we try to select dot notation columns 
> in a spark dataframe, which is solved using BACKTICKS *`column.name`.* 
> [https://stackoverflow.com/a/51430335/11688337]
>  
> *so next*
> I use the back ticks while defining stringIndexer
> {code:java}
> val si = new 
> StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee")
>  {code}
> In this case *it again fails* (with a different reason) in the StringIndexer code 
> itself
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column 
> `location.longitude` does not exist.
>     at 
> org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> {code}
>  
> This blocks me from using feature transformation functions on nested columns. 
> Any help in solving this problem will be highly appreciated.






[jira] [Updated] (SPARK-48591) Simplify the if-else branches with `F.lit`

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48591:
---
Labels: pull-request-available  (was: )

> Simplify the if-else branches with `F.lit`
> --
>
> Key: SPARK-48591
> URL: https://issues.apache.org/jira/browse/SPARK-48591
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48591) Simplify the if-else branches with `F.lit`

2024-06-11 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48591:
-

 Summary: Simplify the if-else branches with `F.lit`
 Key: SPARK-48591
 URL: https://issues.apache.org/jira/browse/SPARK-48591
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-06-11 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854211#comment-17854211
 ] 

Bruce Robbins commented on SPARK-47193:
---

I took a look at this today. This issue happens even with 
{{{}Dataset.toLocalIterator{}}}.

Assume {{/tmp/test.csv}} contains:
{noformat}
1,2021-11-22 11:27:01
2,2021-11-22 11:27:02
3,2021-11-22 11:27:03
{noformat}
Then the following produces incorrect results:
{noformat}
sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

val test = {
  spark
  .read
  .option("header", "false")
  .schema("id int, ts timestamp")
  .csv("/tmp/test.csv")
}

import scala.collection.JavaConverters._
test.toLocalIterator.asScala.toSeq
{noformat}
The incorrect results are:
{noformat}
val res1: Seq[org.apache.spark.sql.Row] = List([1,null], [2,null], [3,null])
{noformat}
However, {{Dataset.collect}} works as expected:
{noformat}
scala> test.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res2: Array[org.apache.spark.sql.Row] = Array([1,2021-11-22 11:27:01.0], 
[2,2021-11-22 11:27:02.0], [3,2021-11-22 11:27:03.0])

scala> 
{noformat}
The problem has to do with the lazy nature of the rdd (in the case of 
{{Dataset.rdd}}) or iterator (in the case of {{Dataset.toLocalIterator}}).

{{Dataset}} actions like {{count}} and {{collect}} are wrapped with the 
function {{withSQLConfPropagated}}, which ensures that the user-specified SQL 
config is propagated to the executors while the jobs associated with the query 
run. Actions like {{count}} and {{collect}} don't return until those jobs 
complete, so the SQL config is propagated during the entire execution of the 
query.

{{Dataset.toLocalIterator}} is also wrapped by {{withSQLConfPropagated}}, but 
due to the lazy nature of iterators, the method returns before the jobs 
associated with the query actually run. Those jobs don't run until someone 
calls {{next}} on the returned iterator, at which point the SQL conf is no 
longer propagated to the executors. So the jobs get run without the 
user-specified config and just assume default settings.

In the reporter's CSV case, the user's setting of 
{{spark.sql.legacy.timeParserPolicy}} is respected during planning on the 
driver, but not respected on the executors. This mix of settings results in 
null timestamps in the resulting rows.

I'll take a look at a possible fix.
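
To make the timing difference concrete outside Spark, here is a minimal Scala sketch where a 
hypothetical {{withConf}} helper stands in for the SQL conf propagation (this is not Spark's 
actual API):

{code:java}
// Stand-in for the SQL conf that should be visible while the jobs run.
var timeParserPolicy = "CORRECTED"

def withConf[T](value: String)(body: => T): T = {
  val saved = timeParserPolicy
  timeParserPolicy = value
  try body finally timeParserPolicy = saved
}

// Eager action: the work happens while the override is still in place.
val eager = withConf("LEGACY") {
  Seq("2021-11-22 11:27:01").map(s => (s, timeParserPolicy))
}
// eager == List(("2021-11-22 11:27:01", "LEGACY"))

// Lazy action: the iterator is built inside the scope but consumed after it returns,
// so each element sees the restored default conf, the analogue of the null timestamps.
val deferred = withConf("LEGACY") {
  Seq("2021-11-22 11:27:01").iterator.map(s => (s, timeParserPolicy))
}
val consumed = deferred.toList
// consumed == List(("2021-11-22 11:27:01", "CORRECTED"))
{code}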

> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create a mapping from them. After all of the 
> joins, the dataframe contains all expected rows, but the rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> 

[jira] [Updated] (SPARK-48590) Upgrade netty to `4.1.111.Final`

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48590:
---
Labels: pull-request-available  (was: )

> Upgrade netty to `4.1.111.Final`
> 
>
> Key: SPARK-48590
> URL: https://issues.apache.org/jira/browse/SPARK-48590
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48590) Upgrade netty to `4.1.111.Final`

2024-06-11 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48590:
---

 Summary: Upgrade netty to `4.1.111.Final`
 Key: SPARK-48590
 URL: https://issues.apache.org/jira/browse/SPARK-48590
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-48588) Fine-grained State Data Source

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48588:
---
Labels: pull-request-available  (was: )

> Fine-grained State Data Source
> --
>
> Key: SPARK-48588
> URL: https://issues.apache.org/jira/browse/SPARK-48588
> Project: Spark
>  Issue Type: Epic
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Yuchen Liu
>Priority: Major
>  Labels: pull-request-available
>
> The current state reader API replays the state store rows from the latest 
> snapshot and newer delta files, if any. The issue with this mechanism is that 
> sometimes the snapshot files could be wrongly constructed, or users want to 
> know how the state changes across batches. We need to improve the State Reader 
> so that it can handle a variety of fine-grained requirements, for example 
> reconstructing state based on an arbitrary snapshot, or supporting a CDC mode for 
> state evolution.






[jira] [Created] (SPARK-48589) Add option snapshotStartBatchId and snapshotPartitionId to state data source

2024-06-11 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48589:
--

 Summary: Add option snapshotStartBatchId and snapshotPartitionId 
to state data source
 Key: SPARK-48589
 URL: https://issues.apache.org/jira/browse/SPARK-48589
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


Define two new options, _snapshotStartBatchId_ and _snapshotPartitionId_, for 
the existing state reader. Both of them should be provided at the same time.
 # When there is no snapshot file at that batch (note there is an off-by-one 
issue between version and batch Id), throw an exception.
 # Otherwise, the reader should continue to rebuild the state by reading delta 
files only, and ignore all snapshot files afterwards.
 # Note that if a batchId option is already specified, that batchId is the 
ending batchId; the reader should then end at that batchId.
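
A hedged usage sketch of how the proposed options could look on the existing state data source 
(the option names are the ones proposed here and do not exist yet; the checkpoint path is a 
placeholder):

{code:java}
val stateDf = spark.read
  .format("statestore")
  .option("snapshotStartBatchId", "10")  // proposed: batch whose snapshot seeds the rebuild
  .option("snapshotPartitionId", "0")    // proposed: rebuild only this shuffle partition
  .option("batchId", "12")               // existing ending batch; replay delta files up to here
  .load("/tmp/checkpoint")               // placeholder checkpoint location

stateDf.show()
{code}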






[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query

2024-06-11 Thread Jonathan Boarman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854182#comment-17854182
 ] 

Jonathan Boarman commented on SPARK-24497:
--

There are a lot of folks wondering when this will get merged. I see from Peter 
that the issue relates to getting reviewers? How do we get reviewers to review that 
PR?

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |    |
> --      |    +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  






[jira] [Commented] (SPARK-31561) Add QUALIFY Clause

2024-06-11 Thread Jonathan Boarman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854181#comment-17854181
 ] 

Jonathan Boarman commented on SPARK-31561:
--

How do we revive this PR and add reviewers to get this promoted?

> Add QUALIFY Clause
> --
>
> Key: SPARK-31561
> URL: https://issues.apache.org/jira/browse/SPARK-31561
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> In a SELECT statement, the QUALIFY clause filters the results of window 
> functions.
> QUALIFY does with window functions what HAVING does with aggregate functions 
> and GROUP BY clauses.
> In the execution order of a query, QUALIFY is therefore evaluated after 
> window functions are computed.
> Examples:
> https://docs.snowflake.com/en/sql-reference/constructs/qualify.html#examples
> More details:
> https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
> https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/19NnI91neorAi7LX6SJXBw
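
For a concrete picture, a sketch of the proposed syntax (QUALIFY is not yet supported by Spark; 
the `sales` table and columns are made up, and the form follows the Snowflake/Teradata docs linked 
above):

{code:java}
// Keep only the top-selling row per category; today this needs a subquery or CTE
// with a WHERE on the ranked column, while QUALIFY filters the window result directly.
spark.sql("""
  SELECT category, product, amount,
         ROW_NUMBER() OVER (PARTITION BY category ORDER BY amount DESC) AS rn
  FROM sales
  QUALIFY rn = 1
""").show()
{code}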






[jira] [Created] (SPARK-48588) Fine-grained State Data Source

2024-06-11 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48588:
--

 Summary: Fine-grained State Data Source
 Key: SPARK-48588
 URL: https://issues.apache.org/jira/browse/SPARK-48588
 Project: Spark
  Issue Type: Epic
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Yuchen Liu


The current state reader API replays the state store rows from the latest 
snapshot and newer delta files, if any. The issue with this mechanism is that 
sometimes the snapshot files could be wrongly constructed, or users want to 
know how the state changes across batches. We need to improve the State Reader so 
that it can handle a variety of fine-grained requirements, for example 
reconstructing state based on an arbitrary snapshot, or supporting a CDC mode for 
state evolution.






[jira] [Updated] (SPARK-48572) Fix DateSub, DateAdd, WindowTime, TimeWindow and SessionWindow expressions

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48572:
---
Labels: pull-request-available  (was: )

> Fix DateSub, DateAdd, WindowTime, TimeWindow and SessionWindow expressions
> --
>
> Key: SPARK-48572
> URL: https://issues.apache.org/jira/browse/SPARK-48572
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> While adding Expression Walker testing, these expressions were found to be 
> faulty. They need to be fixed to work with collated strings.






[jira] [Comment Edited] (SPARK-48463) MLLib function unable to handle nested data

2024-06-11 Thread Chhavi Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854153#comment-17854153
 ] 

Chhavi Bansal edited comment on SPARK-48463 at 6/11/24 6:32 PM:


[~weichenxu123] I tried using 
{code:java}
new 
StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee") 
{code}
without flattening the dataset, but it fails before reaching the *getSelectedCols()* 
function, inside 
{code:java}
validateAndTransformSchema$2(StringIndexer.scala:128) {code}
itself. Did it work for you?


was (Author: JIRAUSER304338):
[~weichenxu123] I tried using 
{code:java}
new 
StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee") 
{code}
without flattening the dataset, but it fails before going to *getSelectedCols() 
function,* inside the 
{code:java}
validateAndTransformSchema$2(StringIndexer.scala:128) {code}
code itself.

> MLLib function unable to handle nested data
> ---
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 3.5.1
>Reporter: Chhavi Bansal
>Priority: Major
>  Labels: ML, MLPipelines, mllib, nested
>
> I am trying to use a feature transformer on nested data after flattening, but 
> it fails.
>  
> {code:java}
> val structureData = Seq(
>   Row(Row(10, 12), 1000),
>   Row(Row(12, 14), 4300),
>   Row( Row(37, 891), 1400),
>   Row(Row(8902, 12), 4000),
>   Row(Row(12, 89), 1000)
> )
> val structureSchema = new StructType()
>   .add("location", new StructType()
> .add("longitude", IntegerType)
> .add("latitude", IntegerType))
>   .add("salary", IntegerType) 
> val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), 
> structureSchema) 
> def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: 
> String = null):
> Array[Column] = {
>   schema.fields.flatMap(f => {
> val colName = if (prefix == null) f.name else (prefix + "." + f.name)
> val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + 
> f.name)
> f.dataType match {
>   case st: StructType => flattenSchema(st, colName, colnameSelect)
>   case _ =>
> Array(col(colName).as(colnameSelect))
> }
>   })
> }
> val flattenColumns = flattenSchema(df.schema)
> val flattenedDf = df.select(flattenColumns: _*){code}
> Now using the string indexer on the DOT notation.
>  
> {code:java}
> val si = new 
> StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show() {code}
> The above code fails 
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> resolve column name "location.longitude" among (location.longitude, 
> location.latitude, salary); did you mean to quote the `location.longitude` 
> column?
>     at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
>     at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
>     at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
> . {code}
> This points to the same failure as when we try to select dot notation columns 
> in a spark dataframe, which is solved using BACKTICKS *`column.name`.* 
> [https://stackoverflow.com/a/51430335/11688337]
>  
> *so next*
> I use the back ticks while defining stringIndexer
> {code:java}
> val si = new 
> StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee")
>  {code}
> In this case *it again fails* (with a different reason) in the StringIndexer code 
> itself
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column 
> `location.longitude` does not exist.
>     at 
> org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> {code}
>  
> This blocks me from using feature transformation functions on nested columns. 
> Any help in solving this problem will be highly appreciated.






[jira] [Commented] (SPARK-48463) MLLib function unable to handle nested data

2024-06-11 Thread Chhavi Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854153#comment-17854153
 ] 

Chhavi Bansal commented on SPARK-48463:
---

[~weichenxu123] I tried using 
{code:java}
new 
StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee") 
{code}
without flattening the dataset, but it fails before reaching the *getSelectedCols()* 
function, inside 
{code:java}
validateAndTransformSchema$2(StringIndexer.scala:128) {code}
itself.

> MLLib function unable to handle nested data
> ---
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 3.5.1
>Reporter: Chhavi Bansal
>Priority: Major
>  Labels: ML, MLPipelines, mllib, nested
>
> I am trying to use a feature transformer on nested data after flattening, but 
> it fails.
>  
> {code:java}
> val structureData = Seq(
>   Row(Row(10, 12), 1000),
>   Row(Row(12, 14), 4300),
>   Row( Row(37, 891), 1400),
>   Row(Row(8902, 12), 4000),
>   Row(Row(12, 89), 1000)
> )
> val structureSchema = new StructType()
>   .add("location", new StructType()
> .add("longitude", IntegerType)
> .add("latitude", IntegerType))
>   .add("salary", IntegerType) 
> val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), 
> structureSchema) 
> def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: 
> String = null):
> Array[Column] = {
>   schema.fields.flatMap(f => {
> val colName = if (prefix == null) f.name else (prefix + "." + f.name)
> val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + 
> f.name)
> f.dataType match {
>   case st: StructType => flattenSchema(st, colName, colnameSelect)
>   case _ =>
> Array(col(colName).as(colnameSelect))
> }
>   })
> }
> val flattenColumns = flattenSchema(df.schema)
> val flattenedDf = df.select(flattenColumns: _*){code}
> Now using the string indexer on the DOT notation.
>  
> {code:java}
> val si = new 
> StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show() {code}
> The above code fails 
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> resolve column name "location.longitude" among (location.longitude, 
> location.latitude, salary); did you mean to quote the `location.longitude` 
> column?
>     at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
>     at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
>     at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
> . {code}
> This points to the same failure as when we try to select dot notation columns 
> in a spark dataframe, which is solved using BACKTICKS *`column.name`.* 
> [https://stackoverflow.com/a/51430335/11688337]
>  
> *so next*
> I use the back ticks while defining stringIndexer
> {code:java}
> val si = new 
> StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee")
>  {code}
> In this case *it again fails* (with a different reason) in the StringIndexer code 
> itself
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column 
> `location.longitude` does not exist.
>     at 
> org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> {code}
>  
> This blocks me from using feature transformation functions on nested columns. 
> Any help in solving this problem will be highly appreciated.






[jira] [Updated] (SPARK-48586) Remove lock contention between maintenance and task threads

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48586:
---
Labels: pull-request-available  (was: )

> Remove lock contention between maintenance and task threads
> ---
>
> Key: SPARK-48586
> URL: https://issues.apache.org/jira/browse/SPARK-48586
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.3
>Reporter: Riya Verma
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when changelog checkpointing is enabled, the lock of the *RocksDB* 
> state store is acquired while uploading the snapshot inside maintenance tasks, 
> which causes lock contention between query processing tasks and the 
> state maintenance thread. To eliminate the lock contention, lock acquisition 
> inside maintenance tasks should be avoided. To prevent race conditions 
> between task and maintenance threads, we can ensure that *RocksDBFileManager* 
> has a linear history by making a deep copy of *RocksDBFileManager* every 
> time a previous version is loaded. The original file manager is not affected 
> by future state updates, and the new file manager is not affected by background 
> snapshot uploading tasks that attempt to upload a snapshot.






[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2024-06-11 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-33152:
-
Labels: SPIP pull-request-available  (was: SPIP)

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: SPIP, pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing a new algorithm to create, store, and use constraints for removing 
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases (with certain use cases the compilation time can go into 
> hours), has the potential to cause OOM, may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, and does not push compound predicates in Join.
> # This issue, if not fixed, can cause OutOfMemory errors or unacceptable query 
> compilation times. A test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" has been added in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change.*
> # It is more effective in filter pruning, as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, 
> where the current code is not able to identify the redundant filter in some cases.
> # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
> # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account whether the same attribute or its 
> aliases are repeated multiple times in a complex expression.
> # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove them on the basis of constraint data. In some 
> cases the rule works only by virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule & its function of 
> removing redundant filters & adding newly inferred filters depends on the 
> behavior of other, unrelated, preceding optimizer rules is itself indicative of issues.
> # It does away with all the EqualNullSafe constraints, as this logic does not 
> need those constraints to be created.
> # There is at least one test in the existing ConstraintPropagationSuite which is 
> missing an IsNotNull constraint because the code incorrectly generated an 
> EqualNullSafe constraint instead of an EqualTo constraint when using the 
> existing Constraints code. With these changes, the test correctly creates an 
> EqualTo constraint, resulting in an inferred IsNotNull constraint.
> # It does away with the current combinatorial logic of evaluating all the 
> constraints, which can cause compilation to run into hours or cause OOM. The number 
> of constraints stored is exactly the same as the number of filters encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile-time performance, but in some cases it can benefit 
> run-time characteristics too, like inferring an IsNotNull filter or pushing down 
> compound predicates on the join, which the present code currently misses or does 
> not do, respectively.
> h2. Q3. How is it done today, and what are the limits of current practice?
> The current ConstraintsPropagation code pessimistically tries to generate all 
> the possible combinations of constraints based on the aliases (even then 
> it may miss a lot of combinations if the expression is a complex expression 
> involving the same attribute repeated multiple times within the expression and 
> there are many aliases to that column). There are query plans in our 
> production env which can result in the intermediate number of constraints going 
> into hundreds of thousands, causing OOM or taking time running into hours. 
> Also, there are cases where it incorrectly generates an EqualNullSafe 
> constraint instead of an EqualTo constraint, thus missing a possible IsNotNull 
> constraint on the column. 
> Also, it only pushes single-column predicates to the other side of the join.
> The constraints generated, in some cases, are missing the required ones, and 
> the plan apparently is behaving correctly only due to the preceding unrelated 
> 

[jira] [Updated] (SPARK-48587) Avoid storage amplification when accessing sub-Variant

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48587:
---
Labels: pull-request-available  (was: )

> Avoid storage amplification when accessing sub-Variant
> --
>
> Key: SPARK-48587
> URL: https://issues.apache.org/jira/browse/SPARK-48587
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: David Cashman
>Priority: Major
>  Labels: pull-request-available
>
> When a variant_get expression returns a Variant, or a nested type containing 
> Variant, we just return the sub-slice of the Variant value along with the 
> full metadata, even though most of the metadata is probably unnecessary to 
> represent the value. This may be very inefficient if the value is then 
> written to disk (e.g. shuffle file or parquet). We should instead rebuild the 
> value with minimal metadata.






[jira] [Created] (SPARK-48587) Avoid storage amplification when accessing sub-Variant

2024-06-11 Thread David Cashman (Jira)
David Cashman created SPARK-48587:
-

 Summary: Avoid storage amplification when accessing sub-Variant
 Key: SPARK-48587
 URL: https://issues.apache.org/jira/browse/SPARK-48587
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: David Cashman


When a variant_get expression returns a Variant, or a nested type containing 
Variant, we just return the sub-slice of the Variant value along with the full 
metadata, even though most of the metadata is probably unnecessary to represent 
the value. We should instead rebuild the value with minimal metadata.
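
A conceptual sketch of the intended rebuild step, using a deliberately simplified stand-in for the 
variant encoding (the real binary format and helper names differ):

{code:java}
// Simplified model: a variant "object" maps dictionary ids to scalar values, and the
// metadata is the dictionary of field names those ids index into.
case class MiniVariant(value: Map[Int, String], metadata: IndexedSeq[String])

// Rebuild an extracted sub-variant so its metadata only carries the names it references,
// instead of shipping the parent's full dictionary next to a small value slice.
def pruneMetadata(v: MiniVariant): MiniVariant = {
  val usedIds = v.value.keys.toSeq.sorted
  val remap = usedIds.zipWithIndex.toMap               // old dictionary id -> compact id
  MiniVariant(
    v.value.map { case (id, s) => remap(id) -> s },    // rewrite ids inside the value
    usedIds.map(v.metadata).toIndexedSeq)              // keep only the referenced names
}

// Example: a one-field slice of an eight-field parent keeps only the single name it uses.
// pruneMetadata(MiniVariant(Map(7 -> "x"), Vector("a","b","c","d","e","f","g","h")))
{code}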






[jira] [Updated] (SPARK-48587) Avoid storage amplification when accessing sub-Variant

2024-06-11 Thread David Cashman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Cashman updated SPARK-48587:
--
Description: When a variant_get expression returns a Variant, or a nested 
type containing Variant, we just return the sub-slice of the Variant value 
along with the full metadata, even though most of the metadata is probably 
unnecessary to represent the value. This may be very inefficient if the value 
is then written to disk (e.g. shuffle file or parquet). We should instead 
rebuild the value with minimal metadata.  (was: When a variant_get expression 
returns a Variant, or a nested type containing Variant, we just return the 
sub-slice of the Variant value along with the full metadata, even though most 
of the metadata is probably unnecessary to represent the value. We should 
instead rebuild the value with minimal metadata.)

> Avoid storage amplification when accessing sub-Variant
> --
>
> Key: SPARK-48587
> URL: https://issues.apache.org/jira/browse/SPARK-48587
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: David Cashman
>Priority: Major
>
> When a variant_get expression returns a Variant, or a nested type containing 
> Variant, we just return the sub-slice of the Variant value along with the 
> full metadata, even though most of the metadata is probably unnecessary to 
> represent the value. This may be very inefficient if the value is then 
> written to disk (e.g. shuffle file or parquet). We should instead rebuild the 
> value with minimal metadata.






[jira] [Updated] (SPARK-48586) Remove lock contention between maintenance and task threads

2024-06-11 Thread Riya Verma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riya Verma updated SPARK-48586:
---
Description: Currently the lock of the *RocksDB* state store is acquired 
when uploading the snapshot inside maintenance tasks when change log 
checkpointing is enabled, which causes lock contention between query processing 
tasks and state maintenance thread. To eliminate the lock contention, lock 
acquisition inside maintenance tasks should be avoided. To prevent race 
conditions between task and maintenance threads, we can ensure that 
*RocksDBFileManager* has a linear history by ensuring a deep copy of 
*RocksDBFileManager* every time a previous version is loaded. The original file 
manager is not affected by future state update. The new file manager is not 
affected by background snapshot uploading tasks that attempt to upload a 
snapshot.  (was: Currently the lock of the RocksDB state store is acquired when 
uploading the snapshot inside maintenance tasks when change log checkpointing 
is enabled, which causes lock contention between query processing tasks and 
state maintenance thread. To eliminate the lock contention, lock acquisition 
inside maintenance tasks should be avoided. To prevent race conditions between 
task and maintenance threads, we can ensure that RocksDBFileManager has a 
linear history by ensuring a deep copy of RocksDBFileManager every time a 
previous version is loaded. The original file manager is not affected by future 
state update. The new file manager is not affected by background snapshot 
uploading tasks that attempt to upload a snapshot.)

> Remove lock contention between maintenance and task threads
> ---
>
> Key: SPARK-48586
> URL: https://issues.apache.org/jira/browse/SPARK-48586
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.3
>Reporter: Riya Verma
>Priority: Major
>
> Currently, when changelog checkpointing is enabled, the lock of the *RocksDB* 
> state store is acquired while uploading the snapshot inside maintenance tasks, 
> which causes lock contention between query processing tasks and the 
> state maintenance thread. To eliminate the lock contention, lock acquisition 
> inside maintenance tasks should be avoided. To prevent race conditions 
> between task and maintenance threads, we can ensure that *RocksDBFileManager* 
> has a linear history by making a deep copy of *RocksDBFileManager* every 
> time a previous version is loaded. The original file manager is not affected 
> by future state updates, and the new file manager is not affected by background 
> snapshot uploading tasks that attempt to upload a snapshot.






[jira] [Created] (SPARK-48586) Remove lock contention between maintenance and task threads

2024-06-11 Thread Riya Verma (Jira)
Riya Verma created SPARK-48586:
--

 Summary: Remove lock contention between maintenance and task 
threads
 Key: SPARK-48586
 URL: https://issues.apache.org/jira/browse/SPARK-48586
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.4.3
Reporter: Riya Verma


Currently, when changelog checkpointing is enabled, the lock of the RocksDB state 
store is acquired while uploading the snapshot inside maintenance tasks, which 
causes lock contention between query processing tasks and the state maintenance 
thread. To eliminate the lock contention, lock acquisition inside maintenance 
tasks should be avoided. To prevent race conditions between task and maintenance 
threads, we can ensure that RocksDBFileManager has a linear history by making a 
deep copy of RocksDBFileManager every time a previous version is loaded. The 
original file manager is not affected by future state updates, and the new file 
manager is not affected by background snapshot uploading tasks that attempt to 
upload a snapshot.
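
A conceptual sketch of the copy-on-load idea (hypothetical, heavily simplified types; not the 
actual RocksDBFileManager API):

{code:java}
import scala.collection.mutable

// Hypothetical, heavily simplified stand-in for the file manager's mutable state.
final class FileManagerState(val files: mutable.Map[String, Long]) {
  // Deep copy: later mutations by task threads never leak into an older instance
  // that the maintenance thread may still be uploading from.
  def deepCopy(): FileManagerState = new FileManagerState(files.clone())
}

final class Store {
  @volatile private var active = new FileManagerState(mutable.Map.empty)

  // Task thread: loading a version hands out a fresh copy, keeping a linear history.
  def load(): FileManagerState = {
    active = active.deepCopy()
    active
  }

  // Maintenance thread: uploads from the frozen copy it captured, so it does not need
  // the lock that query tasks hold on the live store.
  def uploadSnapshot(captured: FileManagerState): Unit = {
    // write captured.files to the checkpoint location ...
  }
}
{code}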






[jira] [Resolved] (SPARK-48576) Rename UTF8_BINARY_LCASE to UTF8_LCASE

2024-06-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48576.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46924
[https://github.com/apache/spark/pull/46924]

> Rename UTF8_BINARY_LCASE to UTF8_LCASE
> --
>
> Key: SPARK-48576
> URL: https://issues.apache.org/jira/browse/SPARK-48576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48556) Incorrect error message pointing to UNSUPPORTED_GROUPING_EXPRESSION

2024-06-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48556.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46900
[https://github.com/apache/spark/pull/46900]

> Incorrect error message pointing to UNSUPPORTED_GROUPING_EXPRESSION
> ---
>
> Key: SPARK-48556
> URL: https://issues.apache.org/jira/browse/SPARK-48556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The following sequence of queries produces an UNSUPPORTED_GROUPING_EXPRESSION error:
> {code:java}
> create table t1(a int, b int) using parquet;
> select grouping(a), dummy from t1 group by a with rollup; {code}
> However, the appropriate error should point the user to the invalid dummy 
> column name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48556) Incorrect error message pointing to UNSUPPORTED_GROUPING_EXPRESSION

2024-06-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48556:
---

Assignee: Nikola Mandic

> Incorrect error message pointing to UNSUPPORTED_GROUPING_EXPRESSION
> ---
>
> Key: SPARK-48556
> URL: https://issues.apache.org/jira/browse/SPARK-48556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
>
> The following sequence of queries produces an UNSUPPORTED_GROUPING_EXPRESSION error:
> {code:java}
> create table t1(a int, b int) using parquet;
> select grouping(a), dummy from t1 group by a with rollup; {code}
> However, the appropriate error should point the user to the invalid dummy 
> column name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47415) Levenshtein (all collations)

2024-06-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47415.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46788
[https://github.com/apache/spark/pull/46788]

> Levenshtein (all collations)
> 
>
> Key: SPARK-47415
> URL: https://issues.apache.org/jira/browse/SPARK-47415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for the *Levenshtein* built-in string function in 
> Spark. First confirm what is the expected behaviour for this function when 
> given collated strings, and then move on to implementation and testing. 
> Implement the corresponding unit tests and E2E sql tests to reflect how this 
> function should be used with collation in SparkSQL, and feel free to use your 
> chosen Spark SQL Editor to experiment with the existing functions to learn 
> more about how they work. In addition, look into the possible use-cases and 
> implementation of similar functions within other open-source DBMS, such 
> as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Levenshtein* function so 
> it supports all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
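
As a quick, non-authoritative illustration (not taken from the ticket), the snippet 
below sketches how the function might be exercised once collation support lands, 
assuming a SparkSession named {{spark}} and the COLLATE syntax used by the other 
collation sub-tasks; the exact result under a case-insensitive collation is 
precisely what this ticket should first confirm.

{code:scala}
// Hedged sketch: levenshtein over collated string literals via Spark SQL.
val df = spark.sql(
  """SELECT levenshtein(
    |  'kitten' COLLATE UTF8_LCASE,
    |  'KITTEN' COLLATE UTF8_LCASE) AS dist""".stripMargin)
df.show()  // presumably 0 under a case-insensitive collation, to be confirmed
{code}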



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47415) Levenshtein (all collations)

2024-06-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47415:
---

Assignee: Uroš Bojanić

> Levenshtein (all collations)
> 
>
> Key: SPARK-47415
> URL: https://issues.apache.org/jira/browse/SPARK-47415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Levenshtein* built-in string function in 
> Spark. First confirm what is the expected behaviour for this function when 
> given collated strings, and then move on to implementation and testing. 
> Implement the corresponding unit tests and E2E sql tests to reflect how this 
> function should be used with collation in SparkSQL, and feel free to use your 
> chosen Spark SQL Editor to experiment with the existing functions to learn 
> more about how they work. In addition, look into the possible use-cases and 
> implementation of similar functions within other open-source DBMS, such 
> as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Levenshtein* function so 
> it supports all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48286) Analyze 'exists' default expression instead of 'current' default expression in structField to v2 column conversion

2024-06-11 Thread Uros Stankovic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854100#comment-17854100
 ] 

Uros Stankovic commented on SPARK-48286:


[~melin] It should be defined here 
[https://github.com/apache/spark/blob/df4156aa3217cf0f58b4c6cbf33c967bb43f7155/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala#L3603]

Compilation would not pass if the method did not exist.

> Analyze 'exists' default expression instead of 'current' default expression 
> in structField to v2 column conversion
> --
>
> Key: SPARK-48286
> URL: https://issues.apache.org/jira/browse/SPARK-48286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> The org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze method 
> accepts 3 parameters:
> 1) Field to analyze
> 2) Statement type - String
> 3) Metadata key - CURRENT_DEFAULT or EXISTS_DEFAULT
> The method 
> org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column
> passes fieldToAnalyze and EXISTS_DEFAULT as the second parameter, so it is 
> treated as the statement type rather than the metadata key, and the wrong 
> expression is analyzed.
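
To make the parameter mix-up concrete, here is a small, self-contained stand-in 
(not the real Spark API; the names and default value are illustrative only) that 
mirrors the three-parameter shape described above.

{code:scala}
// Hedged sketch: a stand-in for analyze(field, statementType, metadataKey).
object AnalyzeSketch {
  val CURRENT_DEFAULT = "CURRENT_DEFAULT"
  val EXISTS_DEFAULT = "EXISTS_DEFAULT"

  // The third parameter defaults to CURRENT_DEFAULT, mimicking the described behavior.
  def analyze(fieldName: String, statementType: String,
              metadataKey: String = CURRENT_DEFAULT): String =
    s"analyzing the $metadataKey default of $fieldName (statement type: '$statementType')"

  def main(args: Array[String]): Unit = {
    // Bug shape: EXISTS_DEFAULT lands in the statementType slot, so the
    // metadata key silently stays CURRENT_DEFAULT and the wrong default is analyzed.
    println(analyze("col_a", EXISTS_DEFAULT))
    // Intended shape: the metadata key is passed explicitly.
    println(analyze("col_a", statementType = "", metadataKey = EXISTS_DEFAULT))
  }
}
{code}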



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48562) Writing to JDBC Temporary View Failed

2024-06-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854095#comment-17854095
 ] 

Wenchen Fan commented on SPARK-48562:
-

writing to a temp view is an anti-pattern IMO...

> Writing to JDBC Temporary View Failed
> -
>
> Key: SPARK-48562
> URL: https://issues.apache.org/jira/browse/SPARK-48562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: Junqing Li
>Priority: Major
>
> When creating a JDBC temporary view, *ApplyCharTypePadding* would add a 
> Project before the LogicalRelation if a CHAR/VARCHAR column exists, and Spark 
> would save it as the view plan. Then, if we try to write to this view, Spark 
> would put this view plan into *InsertIntoStatement* in *ResolveRelations*, 
> which would fail {*}PreWriteCheck{*}.
> Adding the following code to *JDBCTableCatalogSuite* reproduces this problem.
> {code:java}
> test("test writing temporary jdbc view") {
>     withConnection { conn =>
>       conn.prepareStatement("""CREATE TABLE "test"."to_drop" (id 
> CHAR)""").executeUpdate()
>     }
>     sql(
>       s"""
>         CREATE TEMPORARY TABLE jdbcTable
>         USING jdbc
>         OPTIONS (
>           url='$url',
>           dbtable='"test"."to_drop"');""")
>     sql("INSERT INTO jdbcTable values(1),(2)")
>     sql("select * from test.to_drop").show()
>     withConnection { conn =>
>       conn.prepareStatement("""DROP TABLE "test"."to_drop).executeUpdate()
>     }
>   } {code}
>  
> Then we would get the following error.
> {code:java}
> [UNSUPPORTED_INSERT.RDD_BASED] Can't insert into the target. An RDD-based 
> table is not allowed. SQLSTATE: 42809;
> 'InsertIntoStatement Project [staticinvoke(class 
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, 
> readSidePadding, ID#0, 1, true, false, true) AS ID#1], false, false, false
> +- LocalRelation [col1#3] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48286) Analyze 'exists' default expression instead of 'current' default expression in structField to v2 column conversion

2024-06-11 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852984#comment-17852984
 ] 

melin edited comment on SPARK-48286 at 6/11/24 2:51 PM:


The defaultValueNotConstantError method does not exist.

[~cloud_fan] 


was (Author: melin):
The defaultValueNotConstantError method does not exist.

> Analyze 'exists' default expression instead of 'current' default expression 
> in structField to v2 column conversion
> --
>
> Key: SPARK-48286
> URL: https://issues.apache.org/jira/browse/SPARK-48286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> The org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze method 
> accepts 3 parameters:
> 1) Field to analyze
> 2) Statement type - String
> 3) Metadata key - CURRENT_DEFAULT or EXISTS_DEFAULT
> The method 
> org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column
> passes fieldToAnalyze and EXISTS_DEFAULT as the second parameter, so it is 
> treated as the statement type rather than the metadata key, and the wrong 
> expression is analyzed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48286) Analyze 'exists' default expression instead of 'current' default expression in structField to v2 column conversion

2024-06-11 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852984#comment-17852984
 ] 

melin edited comment on SPARK-48286 at 6/11/24 2:51 PM:


The defaultValueNotConstantError method does not exist.

[~cloud_fan]  [~uros.stankovic] 


was (Author: melin):
The defaultValueNotConstantError method does not exist.

[~cloud_fan] 

> Analyze 'exists' default expression instead of 'current' default expression 
> in structField to v2 column conversion
> --
>
> Key: SPARK-48286
> URL: https://issues.apache.org/jira/browse/SPARK-48286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> The org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze method 
> accepts 3 parameters:
> 1) Field to analyze
> 2) Statement type - String
> 3) Metadata key - CURRENT_DEFAULT or EXISTS_DEFAULT
> The method 
> org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column
> passes fieldToAnalyze and EXISTS_DEFAULT as the second parameter, so it is 
> treated as the statement type rather than the metadata key, and the wrong 
> expression is analyzed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48463) MLLib function unable to handle nested data

2024-06-11 Thread Weichen Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854053#comment-17854053
 ] 

Weichen Xu commented on SPARK-48463:


I think you don't need to flatten the original dataframe.

According to the StringIndexer code:

```
private def getSelectedCols(dataset: Dataset[_], inputCols: Seq[String]): Seq[Column] = {
  inputCols.map { colName =>
    val col = dataset.col(colName)
    ...
  }
}
```

.setInputCol("location.longitude") should be able to work on the original 
dataframe with the nested column. But if you flatten it, the code is broken.
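
A minimal sketch of that suggestion, reusing the {{df}} built in the issue 
description before flattening (whether {{dataset.col()}} resolves the dotted path 
this way is exactly what needs verifying):

{code:scala}
// Hedged sketch: run StringIndexer directly on the nested field of the original,
// un-flattened dataframe instead of flattening it first.
import org.apache.spark.ml.feature.StringIndexer

val si = new StringIndexer()
  .setInputCol("location.longitude")   // nested field path, no flattening
  .setOutputCol("longitudeIdx")

si.fit(df).transform(df).show()
{code}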

> MLLib function unable to handle nested data
> ---
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 3.5.1
>Reporter: Chhavi Bansal
>Priority: Major
>  Labels: ML, MLPipelines, mllib, nested
>
> I am trying to use a feature transformer on nested data after flattening it, 
> but it fails.
>  
> {code:java}
> val structureData = Seq(
>   Row(Row(10, 12), 1000),
>   Row(Row(12, 14), 4300),
>   Row( Row(37, 891), 1400),
>   Row(Row(8902, 12), 4000),
>   Row(Row(12, 89), 1000)
> )
> val structureSchema = new StructType()
>   .add("location", new StructType()
> .add("longitude", IntegerType)
> .add("latitude", IntegerType))
>   .add("salary", IntegerType) 
> val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), 
> structureSchema) 
> def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: 
> String = null):
> Array[Column] = {
>   schema.fields.flatMap(f => {
> val colName = if (prefix == null) f.name else (prefix + "." + f.name)
> val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + 
> f.name)
> f.dataType match {
>   case st: StructType => flattenSchema(st, colName, colnameSelect)
>   case _ =>
> Array(col(colName).as(colnameSelect))
> }
>   })
> }
> val flattenColumns = flattenSchema(df.schema)
> val flattenedDf = df.select(flattenColumns: _*){code}
> Now using the string indexer on the DOT notation.
>  
> {code:java}
> val si = new 
> StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show() {code}
> The above code fails 
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> resolve column name "location.longitude" among (location.longitude, 
> location.latitude, salary); did you mean to quote the `location.longitude` 
> column?
>     at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
>     at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
>     at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
> . {code}
> This points to the same failure as when we try to select dot notation columns 
> in a spark dataframe, which is solved using BACKTICKS *`column.name`.* 
> [https://stackoverflow.com/a/51430335/11688337]
>  
> *So next,*
> I use backticks while defining the StringIndexer:
> {code:java}
> val si = new 
> StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee")
>  {code}
> In this case *it again fails* (with a different reason) in the StringIndexer 
> code itself:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column 
> `location.longitude` does not exist.
>     at 
> org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> {code}
>  
> This blocks me from using feature transformation functions on nested columns. 
> Any help in solving this problem will be highly appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48571) Reduce the number of accesses to S3 object storage

2024-06-11 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854030#comment-17854030
 ] 

Steve Loughran commented on SPARK-48571:


The Hadoop openFile() code came with HADOOP-15229; Spark master can depend on 
it. I've pretty much given up trying to get patches into Spark myself; maybe 
you can have more luck.
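
For reference, a hedged sketch of that openFile() builder (Hadoop 3.3+), passing an 
already-known FileStatus so the reader can skip an extra HEAD request. The s3a path 
is a placeholder, and the read-policy option key comes from the newer openFile() 
specification, so it may need adjusting per Hadoop version.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("s3a://bucket/business/t_filter/country=ES/part-0.c000")
val fs = path.getFileSystem(new Configuration())
// In the real flow this status would come from the earlier LIST, not a fresh HEAD.
val status = fs.getFileStatus(path)

val in = fs.openFile(path)
  .withFileStatus(status)                             // reuse the known FileStatus
  .opt("fs.option.openfile.read.policy", "random")    // assumed option key (Hadoop 3.4+)
  .build()
  .get()
// ... read the parquet footer, etc. ...
in.close()
{code}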

> Reduce the number of accesses to S3 object storage
> --
>
> Key: SPARK-48571
> URL: https://issues.apache.org/jira/browse/SPARK-48571
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Oliver Caballero Alvarez
>Priority: Major
> Attachments: Spark 3.2 Hadoop-aws 3.1.PNG, Spark 3.2 Hadoop-aws 
> 3.4.PNG, Spark 3.5 Hadoop-aws 3.1.PNG
>
>
> If we access a Spark table backed by parquet files on an object storage file 
> system, the object storage receives many requests that seem to be 
> unnecessary. To explain this, I will use an example. I have created a simple 
> table with 3 files:
> *business/t_filter/country=ES/data_date_part=2023-09-27/part-0-0f52aae9-2db8-415e-93f3-8331539c0ead.c000*
> *business/t_filter/country=ES/data_date_part=2023-06-01/part-0-0f52aae9-2db8-415e-93f3-8331539c0ead.c000*
>     
> *business/t_filter/country=ES/data_date_part=2023-09-27/part-0-f10096c1-53bc-4e2f-bc56-eba65acfa44a.c000*
>     
> and I have defined a table that represents business/t_filter with country and 
> data_date_part partitions.
> If you use versions prior to Spark 3.5 or Hadoop 3.4 (in my case exactly 
> Spark 3.2 and Hadoop 3.1), the requests made are the following -> IMAGE Spark 
> 3.2 Hadoop 3.1
> In this image we can see all the requests, among which we can find the 
> following issues:
>  * Two HEAD and two LIST requests are made by the S3 implementation against 
> the folders where the files are located, which could be resolved with a 
> single LIST. This bug has already been resolved in -> 
> https://issues.apache.org/jira/browse/HADOOP-18073 -> Result : IMAGE 2 Spark 
> 3.2 Hadoop 3.4
>  * For each file, the parquet footer is read twice. This bug is resolved 
> in -> https://issues.apache.org/jira/browse/SPARK-42388 -> Result : IMAGE 
> Spark 3.5 Hadoop 3.1
>  * A HEAD Object is issued twice each time a file is read; this could be 
> reduced by extending the FileSystem interface so that it can receive the 
> FileStatus that has already been computed above.
>  ** https://issues.apache.org/jira/browse/HADOOP-19199
>  ** https://issues.apache.org/jira/browse/PARQUET-2493
>  ** https://issues.apache.org/jira/browse/HADOOP-19200
>  * The requests could also be reduced when reading the parquet footer, since 
> you first have to read the size of the schema and then the schema itself, 
> which implies two HTTP/HTTPS requests to S3. It would be nice if there were a 
> minimum threshold, for example 100 KB: if the file is smaller than that, the 
> entire file would be fetched in a single request instead of two, since 
> bringing 100 KB in one request takes less time than bringing 8 B in one 
> request and then another x KB in a second request. Even so, I don't know if 
> this task makes sense.
>  ** This would mean changing the implementation with an environment variable: 
> if it is set to -1 the behavior stays the same, but if a threshold is set, 
> files up to that threshold do not have to call the seek function twice, which 
> repeats a GET Object.  
> [https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java]
>  
> With all these improvements, updating to the latest version of Spark and 
> Hadoop would go from more than 30 requests to 11 in the proposed example.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48585) Make `JdbcDialect.classifyException` throw out the original exception

2024-06-11 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48585:
---

 Summary: Make `JdbcDialect.classifyException` throw out the 
original exception
 Key: SPARK-48585
 URL: https://issues.apache.org/jira/browse/SPARK-48585
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48584) Perf improvement for unescapePathName

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48584:
---
Labels: pull-request-available  (was: )

> Perf improvement for unescapePathName
> -
>
> Key: SPARK-48584
> URL: https://issues.apache.org/jira/browse/SPARK-48584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48584) Perf improvement for unescapePathName

2024-06-11 Thread Kent Yao (Jira)
Kent Yao created SPARK-48584:


 Summary: Perf improvement for unescapePathName
 Key: SPARK-48584
 URL: https://issues.apache.org/jira/browse/SPARK-48584
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48582) Bump `braces` from 3.0.2 to 3.0.3 in /ui-test

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48582:
--

Assignee: (was: Apache Spark)

> Bump `braces` from 3.0.2 to 3.0.3 in /ui-test
> -
>
> Key: SPARK-48582
> URL: https://issues.apache.org/jira/browse/SPARK-48582
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48580) Add consistency check and fallback for mapIds in push-merged block meta

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48580:
--

Assignee: (was: Apache Spark)

> Add consistency check and fallback for mapIds in push-merged block meta
> ---
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics of the Exchange are as 
> follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} in the server's memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{{}metaFile{}}}. There is no 
> consistency check between the two.
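
A rough sketch of what such a check and fallback could look like is below; every 
name here is a placeholder for illustration, not an existing Spark API.

{code:scala}
import org.roaringbitmap.RoaringBitmap

// Placeholder fallback: in the real change this would trigger fetching the
// original (unmerged) shuffle blocks instead of the push-merged block.
def fallbackToOriginalBlocks(reduceId: Int): Unit =
  println(s"push-merged meta mismatch for reduce $reduceId, falling back to original blocks")

// Compare the mapIds recorded in the shuffle server's meta file with the mapIds
// reported via the driver-side merge status, and fall back on any mismatch.
def validateMergedMeta(reduceId: Int,
                       metaFileMapIds: RoaringBitmap,
                       mergeStatusMapIds: RoaringBitmap): Unit = {
  if (!metaFileMapIds.equals(mergeStatusMapIds)) {
    fallbackToOriginalBlocks(reduceId)
  }
}
{code}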



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48580) Add consistency check and fallback for mapIds in push-merged block meta

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48580:
--

Assignee: Apache Spark

> Add consistency check and fallback for mapIds in push-merged block meta
> ---
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics of the Exchange are as 
> follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} in the server's memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{{}metaFile{}}}. There is no 
> consistency check between the two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48582) Bump `braces` from 3.0.2 to 3.0.3 in /ui-test

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48582:
--

Assignee: Apache Spark

> Bump `braces` from 3.0.2 to 3.0.3 in /ui-test
> -
>
> Key: SPARK-48582
> URL: https://issues.apache.org/jira/browse/SPARK-48582
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) Add consistency check and fallback for mapIds in push-merged block meta

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48580:
---
Labels: pull-request-available  (was: )

> Add consistency check and fallback for mapIds in push-merged block meta
> ---
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics of the Exchange are as 
> follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} in the server's memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{{}metaFile{}}}. There is no 
> consistency check between the two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) Add consistency check and fallback for mapIds in push-merged block meta

2024-06-11 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Parent: SPARK-33235
Issue Type: Sub-task  (was: Bug)

> Add consistency check and fallback for mapIds in push-merged block meta
> ---
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics of the Exchange are as 
> follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} in the server's memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{{}metaFile{}}}. There is no 
> consistency check between the two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48580) Add consistency check and fallback for mapIds in push-merged block meta

2024-06-11 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-48580:
---
Summary: Add consistency check and fallback for mapIds in push-merged block 
meta  (was: MergedBlock read by reduce have missing chunks, leading to 
inconsistent shuffle data)

> Add consistency check and fallback for mapIds in push-merged block meta
> ---
>
> Key: SPARK-48580
> URL: https://issues.apache.org/jira/browse/SPARK-48580
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0
>Reporter: gaoyajun02
>Priority: Major
> Attachments: image-2024-06-11-10-19-57-227.png
>
>
> When push-based shuffle is enabled, 0.03% of the Spark applications in our 
> cluster experienced shuffle data loss. The metrics of the Exchange are as 
> follows:
> !image-2024-06-11-10-19-57-227.png|width=405,height=170!
> We eventually found some WARN logs on the shuffle server:
>  
> {code:java}
> WARN shuffle-server-8-216 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application 
> application_ shuffleId 0 shuffleMergeId 0 reduceId 133 update to 
> index/meta failed{code}
>  
> We analyzed the cause from the code:
> The merge metadata obtained by the reduce side from the driver comes from the 
> {{mapTracker}} in the server's memory, while the actual reading of chunk data 
> is based on the records in the shuffle server's {{{}metaFile{}}}. There is no 
> consistency check between the two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48551) Perf improvement for escapePathName

2024-06-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48551.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46894
[https://github.com/apache/spark/pull/46894]

> Perf improvement for escapePathName
> ---
>
> Key: SPARK-48551
> URL: https://issues.apache.org/jira/browse/SPARK-48551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated `FileUtils#writeStringToFile`

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48583:
---
Labels: pull-request-available  (was: )

> Replace deprecated `FileUtils#writeStringToFile` 
> -
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> Method `writeStringToFile(final File file, final String data)` in class 
> `FileUtils` is deprecated; use `writeStringToFile(final File file, final 
> String data, final Charset charset)` instead in UDFXPathUtilSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48583) Replace deprecated `FileUtils#writeStringToFile`

2024-06-11 Thread Wei Guo (Jira)
Wei Guo created SPARK-48583:
---

 Summary: Replace deprecated `FileUtils#writeStringToFile` 
 Key: SPARK-48583
 URL: https://issues.apache.org/jira/browse/SPARK-48583
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo


Method `writeStringToFile(final File file, final String data)` in class 
`FileUtils` is deprecated; use `writeStringToFile(final File file, final String 
data, final Charset charset)` instead in UDFXPathUtilSuite.
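
A minimal sketch of the replacement with an explicit charset (the file and contents 
here are only an example):

{code:scala}
import java.io.File
import java.nio.charset.StandardCharsets
import org.apache.commons.io.FileUtils

// The deprecated two-argument overload is replaced by the charset-aware one.
val file = new File("/tmp/example.xml")
FileUtils.writeStringToFile(file, "<a><b>text</b></a>", StandardCharsets.UTF_8)
{code}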



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48562) Writing to JDBC Temporary View Failed

2024-06-11 Thread Junqing Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853926#comment-17853926
 ] 

Junqing Li edited comment on SPARK-48562 at 6/11/24 7:31 AM:
-

[~cloud_fan] Maybe we need to move *ApplyCharTypePadding* from the *Analyzer* 
to the *Planner* to solve this bug.

*ApplyCharTypePadding* is currently defined in the *Analyzer* layer, and it 
conflicts with the other resolution rules in the {*}Analyzer{*}: the other 
resolution rules are one-to-one modification rules, while 
*ApplyCharTypePadding* exhibits different behavior.

Therefore, we need to consider refactoring the *ApplyCharTypePadding* rule and 
moving it to the *Planner* layer. This can avoid inconsistent behavior in the 
Analyzer layer without affecting other logic.

Correct me if I'm wrong. Or is there a better idea to solve this problem?

also cc [~yao] 


was (Author: JIRAUSER304040):
[~cloud_fan] Maybe we need to move *ApplyCharTypePadding* from the *Analyzer* 
to the *Planner* to solve this bug.

*ApplyCharTypePadding* is currently defined in the *Analyzer* layer, and it 
conflicts with the other resolution rules in the {*}Analyzer{*}: the other 
resolution rules are one-to-one modification rules, while 
*ApplyCharTypePadding* exhibits different behavior.

Therefore, we need to consider refactoring the *ApplyCharTypePadding* rule and 
moving it to the *Planner* layer. This can avoid inconsistent behavior in the 
Analyzer layer without affecting other logic.

Correct me if I'm wrong. Or is there a better idea to solve this problem?

> Writing to JDBC Temporary View Failed
> -
>
> Key: SPARK-48562
> URL: https://issues.apache.org/jira/browse/SPARK-48562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: Junqing Li
>Priority: Major
>
> When creating a JDBC temporary view, *ApplyCharTypePadding* would add a 
> Project before the LogicalRelation if a CHAR/VARCHAR column exists, and Spark 
> would save it as the view plan. Then, if we try to write to this view, Spark 
> would put this view plan into *InsertIntoStatement* in *ResolveRelations*, 
> which would fail {*}PreWriteCheck{*}.
> Adding the following code to *JDBCTableCatalogSuite* reproduces this problem.
> {code:java}
> test("test writing temporary jdbc view") {
>     withConnection { conn =>
>       conn.prepareStatement("""CREATE TABLE "test"."to_drop" (id 
> CHAR)""").executeUpdate()
>     }
>     sql(
>       s"""
>         CREATE TEMPORARY TABLE jdbcTable
>         USING jdbc
>         OPTIONS (
>           url='$url',
>           dbtable='"test"."to_drop"');""")
>     sql("INSERT INTO jdbcTable values(1),(2)")
>     sql("select * from test.to_drop").show()
>     withConnection { conn =>
>       conn.prepareStatement("""DROP TABLE "test"."to_drop).executeUpdate()
>     }
>   } {code}
>  
> Then we would get the following error.
> {code:java}
> [UNSUPPORTED_INSERT.RDD_BASED] Can't insert into the target. An RDD-based 
> table is not allowed. SQLSTATE: 42809;
> 'InsertIntoStatement Project [staticinvoke(class 
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, 
> readSidePadding, ID#0, 1, true, false, true) AS ID#1], false, false, false
> +- LocalRelation [col1#3] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48562) Writing to JDBC Temporary View Failed

2024-06-11 Thread Junqing Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853926#comment-17853926
 ] 

Junqing Li edited comment on SPARK-48562 at 6/11/24 7:10 AM:
-

[~cloud_fan] Maybe we need to move *ApplyCharTypePadding* from the *Analyzer* 
to the *Planner* to solve this bug.

*ApplyCharTypePadding* is currently defined in the *Analyzer* layer, and it 
conflicts with the other resolution rules in the {*}Analyzer{*}: the other 
resolution rules are one-to-one modification rules, while 
*ApplyCharTypePadding* exhibits different behavior.

Therefore, we need to consider refactoring the *ApplyCharTypePadding* rule and 
moving it to the *Planner* layer. This can avoid inconsistent behavior in the 
Analyzer layer without affecting other logic.

Correct me if I'm wrong. Or is there a better idea to solve this problem?


was (Author: JIRAUSER304040):
[~cloud_fan] Maybe we need to move *ApplyCharTypePadding* from the *Analyzer* 
to the {*}Planner{*}.

*ApplyCharTypePadding* is currently defined in the *Analyzer* layer, and it 
conflicts with the other resolution rules in the {*}Analyzer{*}: the other 
resolution rules are one-to-one modification rules, while 
*ApplyCharTypePadding* exhibits different behavior.

Therefore, we need to consider refactoring the *ApplyCharTypePadding* rule and 
moving it to the *Planner* layer. This can avoid inconsistent behavior in the 
Analyzer layer without affecting other logic.

Correct me if I'm wrong. Or is there a better idea to solve this problem?

> Writing to JDBC Temporary View Failed
> -
>
> Key: SPARK-48562
> URL: https://issues.apache.org/jira/browse/SPARK-48562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: Junqing Li
>Priority: Major
>
> When creating a JDBC temporary view, *ApplyCharTypePadding* would add a 
> Project before the LogicalRelation if a CHAR/VARCHAR column exists, and Spark 
> would save it as the view plan. Then, if we try to write to this view, Spark 
> would put this view plan into *InsertIntoStatement* in *ResolveRelations*, 
> which would fail {*}PreWriteCheck{*}.
> Adding the following code to *JDBCTableCatalogSuite* reproduces this problem.
> {code:java}
> test("test writing temporary jdbc view") {
>     withConnection { conn =>
>       conn.prepareStatement("""CREATE TABLE "test"."to_drop" (id 
> CHAR)""").executeUpdate()
>     }
>     sql(
>       s"""
>         CREATE TEMPORARY TABLE jdbcTable
>         USING jdbc
>         OPTIONS (
>           url='$url',
>           dbtable='"test"."to_drop"');""")
>     sql("INSERT INTO jdbcTable values(1),(2)")
>     sql("select * from test.to_drop").show()
>     withConnection { conn =>
>       conn.prepareStatement("""DROP TABLE "test"."to_drop).executeUpdate()
>     }
>   } {code}
>  
> Then we would get the following error.
> {code:java}
> [UNSUPPORTED_INSERT.RDD_BASED] Can't insert into the target. An RDD-based 
> table is not allowed. SQLSTATE: 42809;
> 'InsertIntoStatement Project [staticinvoke(class 
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, 
> readSidePadding, ID#0, 1, true, false, true) AS ID#1], false, false, false
> +- LocalRelation [col1#3] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48562) Writing to JDBC Temporary View Failed

2024-06-11 Thread Junqing Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853926#comment-17853926
 ] 

Junqing Li commented on SPARK-48562:


[~cloud_fan] Maybe we need to move *ApplyCharTypePadding* from the *Analyzer* 
to the {*}Planner{*}.

*ApplyCharTypePadding* is currently defined in the *Analyzer* layer, and it 
conflicts with the other resolution rules in the {*}Analyzer{*}: the other 
resolution rules are one-to-one modification rules, while 
*ApplyCharTypePadding* exhibits different behavior.

Therefore, we need to consider refactoring the *ApplyCharTypePadding* rule and 
moving it to the *Planner* layer. This can avoid inconsistent behavior in the 
Analyzer layer without affecting other logic.

Correct me if I'm wrong. Or is there a better idea to solve this problem?

> Writing to JDBC Temporary View Failed
> -
>
> Key: SPARK-48562
> URL: https://issues.apache.org/jira/browse/SPARK-48562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: Junqing Li
>Priority: Major
>
> When creating a JDBC temporary view, *ApplyCharTypePadding* would add a 
> Project before the LogicalRelation if a CHAR/VARCHAR column exists, and Spark 
> would save it as the view plan. Then, if we try to write to this view, Spark 
> would put this view plan into *InsertIntoStatement* in *ResolveRelations*, 
> which would fail {*}PreWriteCheck{*}.
> Adding the following code to *JDBCTableCatalogSuite* reproduces this problem.
> {code:java}
> test("test writing temporary jdbc view") {
>     withConnection { conn =>
>       conn.prepareStatement("""CREATE TABLE "test"."to_drop" (id 
> CHAR)""").executeUpdate()
>     }
>     sql(
>       s"""
>         CREATE TEMPORARY TABLE jdbcTable
>         USING jdbc
>         OPTIONS (
>           url='$url',
>           dbtable='"test"."to_drop"');""")
>     sql("INSERT INTO jdbcTable values(1),(2)")
>     sql("select * from test.to_drop").show()
>     withConnection { conn =>
>       conn.prepareStatement("""DROP TABLE "test"."to_drop).executeUpdate()
>     }
>   } {code}
>  
> Then we would get the following error.
> {code:java}
> [UNSUPPORTED_INSERT.RDD_BASED] Can't insert into the target. An RDD-based 
> table is not allowed. SQLSTATE: 42809;
> 'InsertIntoStatement Project [staticinvoke(class 
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, 
> readSidePadding, ID#0, 1, true, false, true) AS ID#1], false, false, false
> +- LocalRelation [col1#3] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48562) Writing to JDBC Temporary View Failed

2024-06-11 Thread Junqing Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junqing Li updated SPARK-48562:
---
Description: 
When creating a JDBC temporary view, *ApplyCharTypePadding* would add a Project 
before the LogicalRelation if a CHAR/VARCHAR column exists, and Spark would save 
it as the view plan. Then, if we try to write to this view, Spark would put this 
view plan into *InsertIntoStatement* in *ResolveRelations*, which would fail 
{*}PreWriteCheck{*}.

Adding the following code to *JDBCTableCatalogSuite* reproduces this problem.
{code:java}
test("test writing temporary jdbc view") {
    withConnection { conn =>
      conn.prepareStatement("""CREATE TABLE "test"."to_drop" (id 
CHAR)""").executeUpdate()
    }
    sql(
      s"""
        CREATE TEMPORARY TABLE jdbcTable
        USING jdbc
        OPTIONS (
          url='$url',
          dbtable='"test"."to_drop"');""")
    sql("INSERT INTO jdbcTable values(1),(2)")
    sql("select * from test.to_drop").show()
    withConnection { conn =>
      conn.prepareStatement("""DROP TABLE "test"."to_drop).executeUpdate()
    }
  } {code}
 

Then we would get the following error.
{code:java}
[UNSUPPORTED_INSERT.RDD_BASED] Can't insert into the target. An RDD-based table 
is not allowed. SQLSTATE: 42809;
'InsertIntoStatement Project [staticinvoke(class 
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, 
readSidePadding, ID#0, 1, true, false, true) AS ID#1], false, false, false
+- LocalRelation [col1#3] {code}

  was:
When creating a JDBC temporary view, `ApplyCharTypePadding` would add a Project 
before LogicalRelation if CHAR/VARCHAR column exists and Spark would save it as 
a view plan. Then if we try to write this view, Spark would put this view plan 
to `InsertintoStatement` in `ResolveRelations` which would fall `PrewriteCheck`.

Adding the following code to `JDBCTableCatalogSuite` would meet this problem.
{code:java}
test("test writing temporary jdbc view") {
    withConnection { conn =>
      conn.prepareStatement("""CREATE TABLE "test"."to_drop" (id 
CHAR)""").executeUpdate()
    }
    sql(
      s"""
        CREATE TEMPORARY TABLE jdbcTable
        USING jdbc
        OPTIONS (
          url='$url',
          dbtable='"test"."to_drop"');""")
    sql("INSERT INTO jdbcTable values(1),(2)")
    sql("select * from test.to_drop").show()
    withConnection { conn =>
      conn.prepareStatement("""DROP TABLE "test"."to_drop).executeUpdate()
    }
  } {code}


> Writing to JDBC Temporary View Failed
> -
>
> Key: SPARK-48562
> URL: https://issues.apache.org/jira/browse/SPARK-48562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: Junqing Li
>Priority: Major
>
> When creating a JDBC temporary view, *ApplyCharTypePadding* would add a 
> Project before the LogicalRelation if a CHAR/VARCHAR column exists, and Spark 
> would save it as the view plan. Then, if we try to write to this view, Spark 
> would put this view plan into *InsertIntoStatement* in *ResolveRelations*, 
> which would fail {*}PreWriteCheck{*}.
> Adding the following code to *JDBCTableCatalogSuite* reproduces this problem.
> {code:java}
> test("test writing temporary jdbc view") {
>     withConnection { conn =>
>       conn.prepareStatement("""CREATE TABLE "test"."to_drop" (id 
> CHAR)""").executeUpdate()
>     }
>     sql(
>       s"""
>         CREATE TEMPORARY TABLE jdbcTable
>         USING jdbc
>         OPTIONS (
>           url='$url',
>           dbtable='"test"."to_drop"');""")
>     sql("INSERT INTO jdbcTable values(1),(2)")
>     sql("select * from test.to_drop").show()
>     withConnection { conn =>
>       conn.prepareStatement("""DROP TABLE "test"."to_drop).executeUpdate()
>     }
>   } {code}
>  
> Then we would get the following error.
> {code:java}
> [UNSUPPORTED_INSERT.RDD_BASED] Can't insert into the target. An RDD-based 
> table is not allowed. SQLSTATE: 42809;
> 'InsertIntoStatement Project [staticinvoke(class 
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, 
> readSidePadding, ID#0, 1, true, false, true) AS ID#1], false, false, false
> +- LocalRelation [col1#3] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org