[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-10-02 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886449#comment-17886449
 ] 

Bruce Robbins commented on SPARK-47193:
---

PR https://github.com/apache/spark/pull/48325


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create mapping from them. After all of the 
> joins dataframe contains all expected rows but rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
> val timeZoneLookup = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage]
> val userLocation = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage]
> val user = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserMessage].schema).csv("user.csv").as[MyUserMessage]
> val location = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocationMessage].schema).csv("location.csv").as[MyLocationMessage]
> val result = user
>   .join(userProfile, user("UserId") === userProfile("UserId"), "inner")
>   .join(language, userProfile("LanguageId") === language("LanguageId"), 
> "left")
>   .join(userLocation, user("UserId") === userLocation("UserId"), 

[jira] [Comment Edited] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-09-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879050#comment-17879050
 ] 

Bruce Robbins edited comment on SPARK-47193 at 9/13/24 5:13 PM:


[~dongjoon] 

I started to work on it, but it's not done.

By the way, as far as I know, this is not a regression. I am pretty sure it's 
always been this way.

If someone else wants to work on it to get a fix into 3.5.3, that's fine with 
me. Otherwise, I will keep banging at it.

Here's what I've done so far:

I came up with an approach, which is mostly coded but still needs some 
finishing touches, to ensure that the SQL config is propagated to the executor. 
This approach also handles the case where the config changes after the RDD was 
obtained from Dataset#rdd. However, it includes changes to core, which I would 
prefer not to touch:

[bigger 
change|https://github.com/apache/spark/compare/master...bersprockets:spark:action_wrapper?expand=1]

Then I came up with a smaller solution, which also ensures that the SQL config 
is propagated to the executor, but it doesn't handle the case where the config 
changes after the RDD was obtained from Dataset#rdd (i.e., it uses a snapshot 
of the config at the time Dataset#rdd is called).

[smaller 
change|https://github.com/apache/spark/compare/master...bersprockets:spark:iterator_wrapper?expand=1]

Edit: Also, as it turns out, neither of my solutions seems to solve the 
reporter's specific example, only my "distilled" example. I need to track down 
why that is.



was (Author: bersprockets):
[~dongjoon] 

I started to work on it, but it's not done.

By the way, as far as I know, this is not a regression. I am pretty sure it's 
always been this way.

If someone else wants to work on it to get a fix into 3.5.3, that's fine with 
me. Otherwise, I will keep banging at it.

Here's what I've done so far:

I came up with an approach, which is mostly coded but still needs some 
finishing touches, to ensure that the SQL config is propagated to the executor. 
This approach also handles the case where the config changes after the RDD was 
obtained from Dataset#rdd. However, it includes changes to core, which I would 
prefer not to touch:

[bigger 
change|https://github.com/apache/spark/compare/master...bersprockets:spark:action_wrapper?expand=1]

Then I came up with a smaller solution, which also ensures that the SQL config 
is propagated to the executor, but it doesn't handle the case where the config 
changes after the RDD was obtained from Dataset#rdd (i.e., it uses a snapshot 
of the config at the time Dataset#rdd is called).

[smaller 
change|https://github.com/apache/spark/compare/master...bersprockets:spark:iterator_wrapper?expand=1]


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create mapping from them. After all of the 
> joins dataframe contains all expected rows but rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus:

[jira] [Commented] (SPARK-49529) Incorrect results from from_utc_timestamp function

2024-09-06 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879956#comment-17879956
 ] 

Bruce Robbins commented on SPARK-49529:
---

This actually matches Java 17's behavior.

Try this code in the REPL:
{noformat}
import java.time._
import java.time.temporal.ChronoUnit._

val utcZid = ZoneId.of("UTC")
val istZid = ZoneId.of("Asia/Calcutta")

val utcLdt = ZonedDateTime.of(LocalDateTime.of(1, 1, 1, 0, 0, 0, 0), utcZid)

val istLdt = utcLdt.withZoneSameInstant(istZid)
println(istLdt)
{noformat}
The code prints the following:
{noformat}
0001-01-01T05:53:28+05:53:28[Asia/Calcutta]
{noformat}
Note that Java thinks the timezone offset is +05:53:28 for that period of time 
(maybe because 0001-01-01 was before some official codification of the 
timezones in India?).
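
For comparison, here is a minimal Spark-side sketch (my own illustration, not taken 
from the report; it assumes a spark-shell with spark.sql.session.timeZone=UTC, as in 
the reporter's setup) that should surface the same pre-standardization offset:
{code:scala}
// from_utc_timestamp follows the same java.time zone rules, so for a date this old
// Asia/Calcutta is expected to resolve to its LMT offset (+05:53:28) rather than +05:30.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql(
  "SELECT from_utc_timestamp(timestamp'0001-01-01 00:00:00', 'Asia/Calcutta') AS ist_ts"
).show(false)
{code}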


> Incorrect results from from_utc_timestamp function
> --
>
> Key: SPARK-49529
> URL: https://issues.apache.org/jira/browse/SPARK-49529
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1, 4.0.0, 3.5.2, 3.4.4, 3.5.3
>Reporter: Ankit Prakash Gupta
>Priority: Major
>
> The values returned by from_utc_timestamp are erratic and are not 
> consistent for values before the year 1850
>  
> {code:java}
> ❯ JAVA_HOME=/Library/Java/JavaVirtualMachines/openjdk-17.jdk/Contents/Home/ 
> bin/spark-shell --master local --conf spark.sql.session.timeZone=UTC
> WARNING: Using incubator modules: jdk.incubator.vector
> Using Spark's default log4j profile: 
> org/apache/spark/log4j2-defaults.properties
> {"ts":"2024-09-06T02:42:45.333Z","level":"WARN","msg":"Your hostname, 
> RINMAC2772, resolves to a loopback address: 127.0.0.1; using 192.168.28.3 
> instead (on interface 
> en0)","context":{"host":"RINMAC2772","host_port":"127.0.0.1","host_port2":"192.168.28.3","network_if":"en0"},"logger":"Utils"}
> {"ts":"2024-09-06T02:42:45.336Z","level":"WARN","msg":"Set SPARK_LOCAL_IP if 
> you need to bind to another address","logger":"Utils"}
> {"ts":"2024-09-06T02:42:45.536Z","level":"INFO","msg":"Registering signal 
> handler for INT","logger":"SignalUtils"}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-preview1
>       /_/
> Using Scala version 2.13.14 (OpenJDK 64-Bit Server VM, Java 17.0.11)
> Type in expressions to have them evaluated.
> Type :help for more information.
> {"ts":"2024-09-06T02:42:48.551Z","level":"INFO","msg":"Found configuration 
> file null","logger":"HiveConf"}
> {"ts":"2024-09-06T02:42:48.597Z","level":"INFO","msg":"Running Spark version 
> 4.0.0-preview1","logger":"SparkContext"}
> {"ts":"2024-09-06T02:42:48.598Z","level":"INFO","msg":"OS info Mac OS X, 
> 14.5, aarch64","logger":"SparkContext"}
> {"ts":"2024-09-06T02:42:48.598Z","level":"INFO","msg":"Java version 
> 17.0.11","logger":"SparkContext"}
> {"ts":"2024-09-06T02:42:48.647Z","level":"WARN","msg":"Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable","logger":"NativeCodeLoader"}
> {"ts":"2024-09-06T02:42:48.683Z","level":"INFO","msg":"==","logger":"ResourceUtils"}
> {"ts":"2024-09-06T02:42:48.683Z","level":"INFO","msg":"No custom resources 
> configured for spark.driver.","logger":"ResourceUtils"}
> {"ts":"2024-09-06T02:42:48.683Z","level":"INFO","msg":"==","logger":"ResourceUtils"}
> {"ts":"2024-09-06T02:42:48.684Z","level":"INFO","msg":"Submitted application: 
> Spark shell","logger":"SparkContext"}
> {"ts":"2024-09-06T02:42:48.697Z","level":"INFO","msg":"Default 
> ResourceProfile created, executor resources: Map(cores -> name: cores, 
> amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: 
> , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task 
> resources: Map(cpus -> name: cpus, amount: 1.0)","logger":"ResourceProfile"}
> {"ts":"2024-09-06T02:42:48.698Z","level":"INFO","msg":"Limiting resource is 
> cpu","logger":"ResourceProfile"}
> {"ts":"2024-09-06T02:42:48.698Z","level":"INFO","msg":"Added ResourceProfile 
> id: 0","logger":"ResourceProfileManager"}
> {"ts":"2024-09-06T02:42:48.724Z","level":"INFO","msg":"Changing view acls to: 
> ankit.gupta","logger":"SecurityManager"}
> {"ts":"2024-09-06T02:42:48.724Z","level":"INFO","msg":"Changing modify acls 
> to: ankit.gupta","logger":"SecurityManager"}
> {"ts":"2024-09-06T02:42:48.725Z","level":"INFO","msg":"Changing view acls 
> groups to: ","logger":"SecurityManager"}
> {"ts":"2024-09-06T02:42:48.725Z","level":"INFO","msg":"Changing modify acls 
> groups to: ","logger":"SecurityManager"}
> {"ts":"2024-09-06T02:42:48.727Z","le

[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans

2024-09-04 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879306#comment-17879306
 ] 

Bruce Robbins commented on SPARK-48950:
---

By the way, there was a vectorization-related correctness issue in 3.5.0 and 
3.5.1:
{noformat}
spark-sql (default)> select version();
3.5.1 fd86f85e181fc2dc0f50a096855acf83a6cc5d9c
Time taken: 0.043 seconds, Fetched 1 row(s)
spark-sql (default)> drop table if exists t1;
Time taken: 0.127 seconds
spark-sql (default)> create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, null, 2)))
as (value);
Time taken: 0.197 seconds
spark-sql (default)> select cast(value as 
struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[1,0,2]}
Time taken: 0.112 seconds, Fetched 1 row(s)
spark-sql (default)> 
{noformat}
The 0 in the 2nd slot of field f2 is wrong (should be null).

I believe this was fixed by SPARK-48019 for 3.5.2.

As far as I know, this correctness bug affected only nested values.

I see that the reporter reverted SPARK-42388 and the problem seemed to 
disappear, but maybe that just changed timing or memory layout such that the 
bug was less noticeable? Either that or the reporter's issue really is tied to 
SPARK-42388.

Hopefully I have not muddied the waters.

> Corrupt data from parquet scans
> ---
>
> Key: SPARK-48950
> URL: https://issues.apache.org/jira/browse/SPARK-48950
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Environment: Spark 3.5.0
> Running on kubernetes
> Using Azure Blob storage with hierarchical namespace enabled 
>Reporter: Thomas Newton
>Priority: Major
>  Labels: correctness
> Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png
>
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started 
> seeing a correctness bug in parquet scans when using the vectorized reader. 
> We've noticed this on double type columns where occasionally small groups 
> (typically 10s to 100s) of rows are replaced with crazy values like 
> `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, 
> -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the 
> result of interpreting uniform random bits as a double type. Most of my 
> testing has been on an array of double type column but we have also seen it 
> on un-nested plain double type columns. 
> I've been testing this by adding a filter that should return zero results but 
> will return non-zero if the parquet scan has problems. I've attached 
> screenshots of this from the Spark UI. 
> I did a `git bisect` and found that the problem starts with 
> [https://github.com/apache/spark/pull/39950], but I haven't yet understood 
> why. It's possible that this change is fine but it reveals a problem 
> elsewhere? I did also notice [https://github.com/apache/spark/pull/44853] 
> which appears to be a different implementation of the same thing so maybe 
> that could help. 
> It's not a major problem by itself, but another symptom appears to be that 
> Parquet scan tasks fail at a rate of approximately 0.03% with errors like 
> those in the attached `example_task_errors.txt`. If I revert 
> [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on 
> the same test. 
>  
> The problem seems to be a bit dependent on how the parquet files happen to be 
> organised on blob storage, so I don't yet have a reproduction that I can share 
> that doesn't depend on private data. 
> I tested on a pre-release 4.0.0 and the problem was still present. 






[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-09-03 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879050#comment-17879050
 ] 

Bruce Robbins commented on SPARK-47193:
---

[~dongjoon] 

I started to work on it, but it's not done.

By the way, as far as I know, this is not a regression. I am pretty sure it's 
always been this way.

If someone else wants to work on it to get a fix into 3.5.3, that's fine with 
me. Otherwise, I will keep banging at it.

Here's what I've done so far:

I came up with an approach, which is mostly coded but still needs some 
finishing touches, to ensure that the SQL config is propagated to the executor. 
This approach also handles the case where the config changes after the RDD was 
obtained from Dataset#rdd. However, it includes changes to core, which I would 
prefer not to touch:

[bigger 
change|https://github.com/apache/spark/compare/master...bersprockets:spark:action_wrapper?expand=1]

Then I came up with a smaller solution, which also ensures that the SQL config 
is propagated to the executor, but it doesn't handle the case where the config 
changes after the RDD was obtained from Dataset#rdd (i.e., it uses a snapshot 
of the config at the time Dataset#rdd is called).

[smaller 
change|https://github.com/apache/spark/compare/master...bersprockets:spark:iterator_wrapper?expand=1]


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create mapping from them. After all of the 
> joins dataframe contains all expected rows but rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv(

[jira] [Commented] (SPARK-48965) toJSON produces wrong values if DecimalType information is lost in as[Product]

2024-09-02 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878658#comment-17878658
 ] 

Bruce Robbins commented on SPARK-48965:
---

[~LDVSoft] are you looking to fix this? If not, I could submit a fix.

> toJSON produces wrong values if DecimalType information is lost in as[Product]
> --
>
> Key: SPARK-48965
> URL: https://issues.apache.org/jira/browse/SPARK-48965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.5.1
>Reporter: Dmitry Lapshin
>Priority: Major
>  Labels: correctness
>
> Consider this example:
> {code:scala}
> package com.jetbrains.jetstat.etl
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.DecimalType
> object A {
>   case class Example(x: BigDecimal)
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder()
>   .master("local[1]")
>   .getOrCreate()
> import spark.implicits._
> val originalRaw = BigDecimal("123.456")
> val original = Example(originalRaw)
> val ds1 = spark.createDataset(Seq(original))
> val ds2 = ds1
>   .withColumn("x", $"x" cast DecimalType(12, 6))
> val ds3 = ds2
>   .as[Example]
> println(s"DS1: schema=${ds1.schema}, 
> encoder.schema=${ds1.encoder.schema}")
> println(s"DS2: schema=${ds1.schema}, 
> encoder.schema=${ds2.encoder.schema}")
> println(s"DS3: schema=${ds1.schema}, 
> encoder.schema=${ds3.encoder.schema}")
> val json1 = ds1.toJSON.collect().head
> val json2 = ds2.toJSON.collect().head
> val json3 = ds3.toJSON.collect().head
> val collect1 = ds1.collect().head
> val collect2_ = ds2.collect().head
> val collect2 = collect2_.getDecimal(collect2_.fieldIndex("x"))
> val collect3 = ds3.collect().head
> println(s"Original: $original (scale = ${original.x.scale}, precision = 
> ${original.x.precision})")
> println(s"Collect1: $collect1 (scale = ${collect1.x.scale}, precision = 
> ${collect1.x.precision})")
> println(s"Collect2: $collect2 (scale = ${collect2.scale}, precision = 
> ${collect2.precision})")
> println(s"Collect3: $collect3 (scale = ${collect3.x.scale}, precision = 
> ${collect3.x.precision})")
> println(s"json1: $json1")
> println(s"json2: $json2")
> println(s"json3: $json3")
>   }
> }
> {code}
> Running it, you'd see that json3 contains very wrong data. After a bit of 
> debugging, and sorry since I'm bad with Spark internals, I've found that:
>  * In-memory representation of the data in this example used {{UnsafeRow}}, 
> whose {{.getDecimal}} uses compression to store small Decimal values as 
> longs, but doesn't remember decimal sizing parameters,
>  * However, there are at least two sources for precision & scale to pass to 
> that method: {{Dataset.schema}} (which is based on query execution, always 
> contains 38,18 for me) and {{Dataset.encoder.schema}} (that gets updated in 
> `ds2` to 12,6 but then is reset in `ds3`). Also, there is a 
> {{Dataset.deserializer}} that seems to be combining those two non-trivially.
>  * This doesn't seem to affect {{Dataset.collect()}} methods since they use 
> {{deserializer}}, but {{Dataset.toJSON}} only uses the first schema.
> Seems to me that either {{.toJSON}} should be more aware of what's going on 
> or {{.as[]}} should be doing something else.






[jira] [Resolved] (SPARK-45745) Extremely slow execution of sum of columns in Spark 3.4.1

2024-08-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-45745.
---
Resolution: Duplicate

> Extremely slow execution of sum of columns in Spark 3.4.1
> -
>
> Key: SPARK-45745
> URL: https://issues.apache.org/jira/browse/SPARK-45745
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.1
>Reporter: Javier
>Priority: Major
>
> We are in the process of upgrading some pySpark jobs from Spark 3.1.2 to 
> Spark 3.4.1 and some code that was running fine is now basically never ending 
> even for small dataframes.
> We have simplified the problematic piece of code and the minimum pySpark 
> example below shows the issue:
> {code:java}
> n_cols = 50
> data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
> df_data = sql_context.createDataFrame(data)
> df_data = df_data.withColumn(
> "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
> )
> df_data.show(10, False) {code}
> Basically, this code with Spark 3.1.2 runs fine but with 3.4.1 the 
> computation time seems to explode when the value of `n_cols` is bigger than 
> about 25 columns. A colleague suggested that it could be related to the limit 
> of 22 elements in a tuple in Scala 2.13 
> (https://www.scala-lang.org/api/current/scala/Tuple22.html), since the 25 
> columns are suspiciously close to this. Is there any known defect in the 
> logical plan optimization in 3.4.1? Or is this kind of operation (sum of 
> multiple columns) supposed to be implemented differently?






[jira] [Resolved] (SPARK-40706) IllegalStateException when querying array values inside a nested struct

2024-08-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-40706.
---
Resolution: Duplicate

> IllegalStateException when querying array values inside a nested struct
> ---
>
> Key: SPARK-40706
> URL: https://issues.apache.org/jira/browse/SPARK-40706
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Rohan Barman
>Priority: Major
>
> We are in the process of migrating our PySpark applications from Spark 
> version 3.1.2 to Spark version 3.2.0. 
> This bug is present in version 3.2.0. We do not see this issue in version 
> 3.1.2.
>  
> *Minimal example to reproduce bug*
> Below is a minimal example that generates hardcoded data and queries. The 
> data has several nested structs and arrays.
> Our real use case reads data from avro files and has more complex queries, 
> but this is sufficient to reproduce the error.
>  
> {code:java}
> # Generate data
> data = [
>   ('1',{
>   'timestamp': '09/07/2022',
>   'message': 'm1',
>   'data':{
> 'items': {
>   'id':1,
>   'attempt':[
> {'risk':[
>   {'score':[1,2,3]},
>   {'indicator':[
> {'code':'c1','value':'abc'},
> {'code':'c2','value':'def'}
>   ]}
> ]}
>   ]
> }
>   }
>   })
> ]
> from pyspark.sql.types import *
> schema = StructType([
> StructField('id', StringType(), True),
> StructField('response', StructType([
>   StructField('timestamp', StringType(), True),
>   StructField('message',StringType(), True),
>   StructField('data', StructType([
> StructField('items', StructType([
>   StructField('id', StringType(), True),
>   StructField("attempt", ArrayType(StructType([
> StructField("risk", ArrayType(StructType([
>   StructField('score', ArrayType(StringType()), True),
>   StructField('indicator', ArrayType(StructType([
> StructField('code', StringType(), True),
> StructField('value', StringType(), True),
>   ])))
>  ])))
>])))
> ]))
>   ]))
> ])),
>  ])
> df = spark.createDataFrame(data=data, schema=schema)
> df.printSchema()
> df.createOrReplaceTempView("tbl")
> # Execute query
> query = """
> SELECT 
> response.message as message,
> response.timestamp as timestamp,
> score as risk_score,
> model.value as model_type
> FROM tbl
>   LATERAL VIEW OUTER explode(response.data.items.attempt) 
> AS Attempt
>   LATERAL VIEW OUTER explode(response.data.items.attempt.risk)
> AS RiskModels
>   LATERAL VIEW OUTER explode(RiskModels)  
> AS RiskModel
>   LATERAL VIEW OUTER explode(RiskModel.indicator) 
> AS Model
>   LATERAL VIEW OUTER explode(RiskModel.Score) 
> AS Score
> """
> result = spark.sql(query)
> print(result.count())
> print(result.head(10)) {code}
>  
> *Post execution*
> The above code throws an IllegalStateException. The entire error log is posted 
> at the end of this ticket.
> {code:java}
> java.lang.IllegalStateException: Couldn't find _extract_timestamp#44 in 
> [_extract_message#50,RiskModel#12]{code}
>  
> The error seems to indicate that the _timestamp_ column is not available. 
> However we see _timestamp_ if we print the schema of the source dataframe.
> {code:java}
> # df.printSchema()
> root
>  |-- id: string (nullable = true)
>  |-- response: struct (nullable = true)
>  |    |-- timestamp: string (nullable = true)
>  |    |-- message: string (nullable = true)
>  |    |-- data: struct (nullable = true)
>  |    |    |-- items: struct (nullable = true)
>  |    |    |    |-- id: string (nullable = true)
>  |    |    |    |-- attempt: array (nullable = true)
>  |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |-- risk: array (nullable = true)
>  |    |    |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |    |    |-- score: array (nullable = true)
>  |    |    |    |    |    |    |    |    |-- element: string (containsNull = 
> true)
>  |    |    |    |    |    |    |    |-- indicator: array (nullable = true)
>  |    |    |    |    |    |    |    |    |-- element: struct (containsNull = 
> true)
>  |    |    |    |    |    |    |    |    |    |-- code: string (nullable = 
> true)
>  |    |    |    |    |    |    |    |    |    |-- value: string (nullable = 
> true) {code}
> *Extra observ

[jira] [Commented] (SPARK-45745) Extremely slow execution of sum of columns in Spark 3.4.1

2024-08-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877165#comment-17877165
 ] 

Bruce Robbins commented on SPARK-45745:
---

I will close as a duplicate of SPARK-45071 later today. You can reopen it if 
you still see the issue.

> Extremely slow execution of sum of columns in Spark 3.4.1
> -
>
> Key: SPARK-45745
> URL: https://issues.apache.org/jira/browse/SPARK-45745
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.1
>Reporter: Javier
>Priority: Major
>
> We are in the process of upgrading some pySpark jobs from Spark 3.1.2 to 
> Spark 3.4.1 and some code that was running fine is now basically never ending 
> even for small dataframes.
> We have simplified the problematic piece of code and the minimum pySpark 
> example below shows the issue:
> {code:java}
> n_cols = 50
> data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
> df_data = sql_context.createDataFrame(data)
> df_data = df_data.withColumn(
> "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
> )
> df_data.show(10, False) {code}
> Basically, this code with Spark 3.1.2 runs fine but with 3.4.1 the 
> computation time seems to explode when the value of `n_cols` is bigger than 
> about 25 columns. A colleague suggested that it could be related to the limit 
> of 22 elements in a tuple in Scala 2.13 
> (https://www.scala-lang.org/api/current/scala/Tuple22.html), since the 25 
> columns are suspiciously close to this. Is there any known defect in the 
> logical plan optimization in 3.4.1? Or is this kind of operation (sum of 
> multiple columns) supposed to be implemented differently?






[jira] [Updated] (SPARK-49261) Correlation between lit and round during grouping

2024-08-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-49261:
--
Target Version/s:   (was: 3.5.0)

> Correlation between lit and round during grouping
> -
>
> Key: SPARK-49261
> URL: https://issues.apache.org/jira/browse/SPARK-49261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.0
> Environment: Databricks DBR 14.3
> Spark 3.5.0
> Scala 2.12
>Reporter: Krystian Kulig
>Priority: Major
>  Labels: pull-request-available
>
> Running following code:
>  
> {code:java}
> import pyspark.sql.functions as F
> from decimal import Decimal
> data = [
>   (1, 100, Decimal("1.1"),  "L", True),
>   (2, 200, Decimal("1.2"),  "H", False),
>   (2, 300, Decimal("2.345"), "E", False),
> ]
> columns = ["group_a", "id", "amount", "selector_a", "selector_b"]
> df = spark.createDataFrame(data, schema=columns)
> df_final = (
>   df.select(
>     F.lit(6).alias("run_number"),
>     F.lit("AA").alias("run_type"),
>     F.col("group_a"),
>     F.col("id"),
>     F.col("amount"),
>     F.col("selector_a"),
>     F.col("selector_b"),
>   )
>   .withColumn(
>     "amount_c",
>     F.when(
>       (F.col("selector_b") == False)
>       & (F.col("selector_a").isin(["L", "H", "E"])),
>       F.col("amount"),
>     ).otherwise(F.lit(None))
>   )
>   .withColumn(
>     "count_of_amount_c",
>     F.when(
>       (F.col("selector_b") == False)
>       & (F.col("selector_a").isin(["L", "H", "E"])),
>       F.col("id")
>     ).otherwise(F.lit(None))
>   )
> )
> group_by_cols = [
>   "run_number",
>   "group_a",
>   "run_type"
> ]
> df_final = df_final.groupBy(group_by_cols).agg(
>   F.countDistinct("id").alias("count_of_amount"),
>   F.round(F.sum("amount")/ 1000, 1).alias("total_amount"),
>   F.sum("amount_c").alias("amount_c"),
>   F.countDistinct("count_of_amount_c").alias(
>     "count_of_amount_c"
>   ),
> )
> df_final = (
>   df_final
>   .withColumn(
>     "total_amount",
>     F.round(F.col("total_amount") / 1000, 6),
>   )
>   .withColumn(
>     "count_of_amount", F.col("count_of_amount").cast("int")
>   )
>   .withColumn(
>     "count_of_amount_c",
>     F.when(
>       F.col("amount_c").isNull(), F.lit(None).cast("int")
>     ).otherwise(F.col("count_of_amount_c").cast("int")),
>   )
> )
> df_final = df_final.select(
>   F.col("total_amount"),
>   "run_number",
>   "group_a",
>   "run_type",
>   "count_of_amount",
>   "amount_c",
>   "count_of_amount_c",
> )
> df_final.show() {code}
> Produces error:
> {code:java}
> [[INTERNAL_ERROR](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#internal_error)]
>  Couldn't find total_amount#1046 in 
> [group_a#984L,count_of_amount#1054,amount_c#1033,count_of_amount_c#1034L] 
> SQLSTATE: XX000 {code}
> With stack trace:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find 
> total_amount#1046 in 
> [group_a#984L,count_of_amount#1054,amount_c#1033,count_of_amount_c#1034L] 
> SQLSTATE: XX000 at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:97) at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:101) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:505)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:83)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:505)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:481)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:449) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:97)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:286) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:279) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:97)
>  at 
> org.apache.spark.sql.execution

[jira] [Updated] (SPARK-49261) Correlation between lit and round during grouping

2024-08-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-49261:
--
Fix Version/s: (was: 3.5.0)

> Correlation between lit and round during grouping
> -
>
> Key: SPARK-49261
> URL: https://issues.apache.org/jira/browse/SPARK-49261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.0
> Environment: Databricks DBR 14.3
> Spark 3.5.0
> Scala 2.12
>Reporter: Krystian Kulig
>Priority: Major
>  Labels: pull-request-available
>
> Running following code:
>  
> {code:java}
> import pyspark.sql.functions as F
> from decimal import Decimal
> data = [
>   (1, 100, Decimal("1.1"),  "L", True),
>   (2, 200, Decimal("1.2"),  "H", False),
>   (2, 300, Decimal("2.345"), "E", False),
> ]
> columns = ["group_a", "id", "amount", "selector_a", "selector_b"]
> df = spark.createDataFrame(data, schema=columns)
> df_final = (
>   df.select(
>     F.lit(6).alias("run_number"),
>     F.lit("AA").alias("run_type"),
>     F.col("group_a"),
>     F.col("id"),
>     F.col("amount"),
>     F.col("selector_a"),
>     F.col("selector_b"),
>   )
>   .withColumn(
>     "amount_c",
>     F.when(
>       (F.col("selector_b") == False)
>       & (F.col("selector_a").isin(["L", "H", "E"])),
>       F.col("amount"),
>     ).otherwise(F.lit(None))
>   )
>   .withColumn(
>     "count_of_amount_c",
>     F.when(
>       (F.col("selector_b") == False)
>       & (F.col("selector_a").isin(["L", "H", "E"])),
>       F.col("id")
>     ).otherwise(F.lit(None))
>   )
> )
> group_by_cols = [
>   "run_number",
>   "group_a",
>   "run_type"
> ]
> df_final = df_final.groupBy(group_by_cols).agg(
>   F.countDistinct("id").alias("count_of_amount"),
>   F.round(F.sum("amount")/ 1000, 1).alias("total_amount"),
>   F.sum("amount_c").alias("amount_c"),
>   F.countDistinct("count_of_amount_c").alias(
>     "count_of_amount_c"
>   ),
> )
> df_final = (
>   df_final
>   .withColumn(
>     "total_amount",
>     F.round(F.col("total_amount") / 1000, 6),
>   )
>   .withColumn(
>     "count_of_amount", F.col("count_of_amount").cast("int")
>   )
>   .withColumn(
>     "count_of_amount_c",
>     F.when(
>       F.col("amount_c").isNull(), F.lit(None).cast("int")
>     ).otherwise(F.col("count_of_amount_c").cast("int")),
>   )
> )
> df_final = df_final.select(
>   F.col("total_amount"),
>   "run_number",
>   "group_a",
>   "run_type",
>   "count_of_amount",
>   "amount_c",
>   "count_of_amount_c",
> )
> df_final.show() {code}
> Produces error:
> {code:java}
> [[INTERNAL_ERROR](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#internal_error)]
>  Couldn't find total_amount#1046 in 
> [group_a#984L,count_of_amount#1054,amount_c#1033,count_of_amount_c#1034L] 
> SQLSTATE: XX000 {code}
> With stack trace:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find 
> total_amount#1046 in 
> [group_a#984L,count_of_amount#1054,amount_c#1033,count_of_amount_c#1034L] 
> SQLSTATE: XX000 at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:97) at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:101) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:505)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:83)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:505)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:481)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:449) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:97)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:286) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:279) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:97)
>  at 
> org.apache.spark.sql.execution.

[jira] [Resolved] (SPARK-49350) FoldablePropagation rule and ConstantFolding rule leads to wrong aggregated result

2024-08-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-49350.
---
Resolution: Duplicate

> FoldablePropagation rule and ConstantFolding rule leads to wrong aggregated 
> result
> --
>
> Key: SPARK-49350
> URL: https://issues.apache.org/jira/browse/SPARK-49350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: GANHONGNAN
>Priority: Blocker
>
> {code:java}
> SELECT  cast(-1 AS BIGINT) AS ele1
> FROM(
>SELECT  array(1, 5, 3, 123, 255, 546, 64, 23) AS t
>) LATERAL VIEW explode(t) tmp AS ele
> WHERE   ele=-1 {code}
> This query returns an empty result. However, the following query returns 1.  
> This result seems wrong.
> {code:java}
> SELECT  count(DISTINCT ele1)
> FROM(
> SELECT  cast(-1 as bigint) as ele1
> FROM(
> SELECT  array(1, 5, 3, 123, 255, 546, 64, 23) AS t
>
> ) LATERAL VIEW explode(t) tmp AS ele
> WHERE   ele = -1
> ) {code}
> From the plan change log, I find that it is the FoldablePropagation rule and 
> ConstantFolding rule that optimize the Aggregate expression to `Aggregat 
> [[cast(count(distinct -1) as string) AS count(DISTINCT ele)#7|#7]] ]`.
>  
> Is this result right?  Does it need to be fixed? 
>  






[jira] [Commented] (SPARK-49350) FoldablePropagation rule and ConstantFolding rule leads to wrong aggregated result

2024-08-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877052#comment-17877052
 ] 

Bruce Robbins commented on SPARK-49350:
---

[~Wayne Guo] Thanks for the update. Closing as a dup then.

> FoldablePropagation rule and ConstantFolding rule leads to wrong aggregated 
> result
> --
>
> Key: SPARK-49350
> URL: https://issues.apache.org/jira/browse/SPARK-49350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: GANHONGNAN
>Priority: Blocker
>
> {code:java}
> SELECT  cast(-1 AS BIGINT) AS ele1
> FROM(
>SELECT  array(1, 5, 3, 123, 255, 546, 64, 23) AS t
>) LATERAL VIEW explode(t) tmp AS ele
> WHERE   ele=-1 {code}
> This query returns an empty result. However, the following query returns 1.  
> This result seems wrong.
> {code:java}
> SELECT  count(DISTINCT ele1)
> FROM(
> SELECT  cast(-1 as bigint) as ele1
> FROM(
> SELECT  array(1, 5, 3, 123, 255, 546, 64, 23) AS t
>
> ) LATERAL VIEW explode(t) tmp AS ele
> WHERE   ele = -1
> ) {code}
> From the plan change log, I find that it is the FoldablePropagation rule and 
> ConstantFolding rule that optimize the Aggregate expression to `Aggregat 
> [[cast(count(distinct -1) as string) AS count(DISTINCT ele)#7|#7]] ]`.
>  
> Is this result right?  Does it need to be fixed? 
>  






[jira] [Commented] (SPARK-49350) FoldablePropagation rule and ConstantFolding rule leads to wrong aggregated result

2024-08-23 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876366#comment-17876366
 ] 

Bruce Robbins commented on SPARK-49350:
---

Possibly the same as SPARK-49000?

> FoldablePropagation rule and ConstantFolding rule leads to wrong aggregated 
> result
> --
>
> Key: SPARK-49350
> URL: https://issues.apache.org/jira/browse/SPARK-49350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: GANHONGNAN
>Priority: Blocker
>
> {code:java}
> SELECT  cast(-1 AS BIGINT) AS ele1
> FROM(
>SELECT  array(1, 5, 3, 123, 255, 546, 64, 23) AS t
>) LATERAL VIEW explode(t) tmp AS ele
> WHERE   ele=-1 {code}
> This query returns an empty result. However, the following query returns 1.  
> This result seems wrong.
> {code:java}
> SELECT  count(DISTINCT ele1)
> FROM(
> SELECT  cast(-1 as bigint) as ele1
> FROM(
> SELECT  array(1, 5, 3, 123, 255, 546, 64, 23) AS t
>
> ) LATERAL VIEW explode(t) tmp AS ele
> WHERE   ele = -1
> ) {code}
> From the plan change log, I find that it is the FoldablePropagation rule and 
> ConstantFolding rule that optimize the Aggregate expression to `Aggregat 
> [[cast(count(distinct -1) as string) AS count(DISTINCT ele)#7|#7]] ]`.
>  
> Is this result right?  Does it need to be fixed? 
>  






[jira] [Commented] (SPARK-49261) Correlation between lit and round during grouping

2024-08-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875657#comment-17875657
 ] 

Bruce Robbins commented on SPARK-49261:
---

{quote}It seems to be a correlation between F.lit(6).alias("run_number") and 
F.round(F.col("total_amount") / 1000, 6). If both lit and scale in round are 
set to the same number i.e. 6 code fails.
{quote}
That's a good summary of the issue. The bug seems to be 
[here|https://github.com/apache/spark/blob/a885365897acefcf353206aaabd0048e088cc9a7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala#L409].
 That code will replace foldable and non-foldable expressions with expressions 
from the group by attributes, but I think it should only replace non-foldable 
expressions.

In the case of the round function, that code is patching the second parameter, 
which requires a foldable expression, with a non-foldable expression. As a 
result, {{RoundBase#checkInputDataTypes}} fails.
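
To see the constraint that trips, here is a minimal sketch (my own illustration, 
assuming a spark-shell session named {{spark}}, not code from the ticket): round's 
scale must stay foldable, so handing it a column reference fails analysis.
{code:scala}
// round()'s second argument must be a foldable expression; a plain column reference
// there (which is effectively what the bad rewrite produces) fails checkInputDataTypes.
val df = spark.range(5).toDF("x")
df.selectExpr("round(x * 1.5, 2)").show()  // fine: the scale is a literal
df.selectExpr("round(x * 1.5, x)").show()  // fails analysis: the scale is not foldable
{code}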

> Correlation between lit and round during grouping
> -
>
> Key: SPARK-49261
> URL: https://issues.apache.org/jira/browse/SPARK-49261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.0
> Environment: Databricks DBR 14.3
> Spark 3.5.0
> Scala 2.12
>Reporter: Krystian Kulig
>Priority: Major
> Fix For: 3.5.0
>
>
> Running following code:
>  
> {code:java}
> import pyspark.sql.functions as F
> from decimal import Decimal
> data = [
>   (1, 100, Decimal("1.1"),  "L", True),
>   (2, 200, Decimal("1.2"),  "H", False),
>   (2, 300, Decimal("2.345"), "E", False),
> ]
> columns = ["group_a", "id", "amount", "selector_a", "selector_b"]
> df = spark.createDataFrame(data, schema=columns)
> df_final = (
>   df.select(
>     F.lit(6).alias("run_number"),
>     F.lit("AA").alias("run_type"),
>     F.col("group_a"),
>     F.col("id"),
>     F.col("amount"),
>     F.col("selector_a"),
>     F.col("selector_b"),
>   )
>   .withColumn(
>     "amount_c",
>     F.when(
>       (F.col("selector_b") == False)
>       & (F.col("selector_a").isin(["L", "H", "E"])),
>       F.col("amount"),
>     ).otherwise(F.lit(None))
>   )
>   .withColumn(
>     "count_of_amount_c",
>     F.when(
>       (F.col("selector_b") == False)
>       & (F.col("selector_a").isin(["L", "H", "E"])),
>       F.col("id")
>     ).otherwise(F.lit(None))
>   )
> )
> group_by_cols = [
>   "run_number",
>   "group_a",
>   "run_type"
> ]
> df_final = df_final.groupBy(group_by_cols).agg(
>   F.countDistinct("id").alias("count_of_amount"),
>   F.round(F.sum("amount")/ 1000, 1).alias("total_amount"),
>   F.sum("amount_c").alias("amount_c"),
>   F.countDistinct("count_of_amount_c").alias(
>     "count_of_amount_c"
>   ),
> )
> df_final = (
>   df_final
>   .withColumn(
>     "total_amount",
>     F.round(F.col("total_amount") / 1000, 6),
>   )
>   .withColumn(
>     "count_of_amount", F.col("count_of_amount").cast("int")
>   )
>   .withColumn(
>     "count_of_amount_c",
>     F.when(
>       F.col("amount_c").isNull(), F.lit(None).cast("int")
>     ).otherwise(F.col("count_of_amount_c").cast("int")),
>   )
> )
> df_final = df_final.select(
>   F.col("total_amount"),
>   "run_number",
>   "group_a",
>   "run_type",
>   "count_of_amount",
>   "amount_c",
>   "count_of_amount_c",
> )
> df_final.show() {code}
> Produces error:
> {code:java}
> [[INTERNAL_ERROR](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#internal_error)]
>  Couldn't find total_amount#1046 in 
> [group_a#984L,count_of_amount#1054,amount_c#1033,count_of_amount_c#1034L] 
> SQLSTATE: XX000 {code}
> With stack trace:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find 
> total_amount#1046 in 
> [group_a#984L,count_of_amount#1054,amount_c#1033,count_of_amount_c#1034L] 
> SQLSTATE: XX000 at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:97) at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:101) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:505)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:83)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:505)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:481)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:449) at 
> org.apache.spark.sql.catalyst.expressions.BindR

[jira] [Updated] (SPARK-48965) toJSON produces wrong values if DecimalType information is lost in as[Product]

2024-08-16 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-48965:
--
Labels: correctness  (was: )

> toJSON produces wrong values if DecimalType information is lost in as[Product]
> --
>
> Key: SPARK-48965
> URL: https://issues.apache.org/jira/browse/SPARK-48965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.5.1
>Reporter: Dmitry Lapshin
>Priority: Major
>  Labels: correctness
>
> Consider this example:
> {code:scala}
> package com.jetbrains.jetstat.etl
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.DecimalType
> object A {
>   case class Example(x: BigDecimal)
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder()
>   .master("local[1]")
>   .getOrCreate()
> import spark.implicits._
> val originalRaw = BigDecimal("123.456")
> val original = Example(originalRaw)
> val ds1 = spark.createDataset(Seq(original))
> val ds2 = ds1
>   .withColumn("x", $"x" cast DecimalType(12, 6))
> val ds3 = ds2
>   .as[Example]
> println(s"DS1: schema=${ds1.schema}, 
> encoder.schema=${ds1.encoder.schema}")
> println(s"DS2: schema=${ds1.schema}, 
> encoder.schema=${ds2.encoder.schema}")
> println(s"DS3: schema=${ds1.schema}, 
> encoder.schema=${ds3.encoder.schema}")
> val json1 = ds1.toJSON.collect().head
> val json2 = ds2.toJSON.collect().head
> val json3 = ds3.toJSON.collect().head
> val collect1 = ds1.collect().head
> val collect2_ = ds2.collect().head
> val collect2 = collect2_.getDecimal(collect2_.fieldIndex("x"))
> val collect3 = ds3.collect().head
> println(s"Original: $original (scale = ${original.x.scale}, precision = 
> ${original.x.precision})")
> println(s"Collect1: $collect1 (scale = ${collect1.x.scale}, precision = 
> ${collect1.x.precision})")
> println(s"Collect2: $collect2 (scale = ${collect2.scale}, precision = 
> ${collect2.precision})")
> println(s"Collect3: $collect3 (scale = ${collect3.x.scale}, precision = 
> ${collect3.x.precision})")
> println(s"json1: $json1")
> println(s"json2: $json2")
> println(s"json3: $json3")
>   }
> }
> {code}
> Running it, you'll see that json3 contains clearly wrong data. After a bit of 
> debugging (apologies, I'm not well versed in Spark internals), I've found that:
>  * The in-memory representation of the data in this example uses {{UnsafeRow}}, 
> whose {{.getDecimal}} compresses small Decimal values into longs but doesn't 
> record the decimal's precision and scale,
>  * However, there are at least two sources for precision & scale to pass to 
> that method: {{Dataset.schema}} (which is based on query execution, always 
> contains 38,18 for me) and {{Dataset.encoder.schema}} (that gets updated in 
> `ds2` to 12,6 but then is reset in `ds3`). Also, there is a 
> {{Dataset.deserializer}} that seems to be combining those two non-trivially.
>  * This doesn't seem to affect {{Dataset.collect()}} methods since they use 
> {{deserializer}}, but {{Dataset.toJSON}} only uses the first schema.
> Seems to me that either {{.toJSON}} should be more aware of what's going on 
> or {{.as[]}} should be doing something else.
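To make the precision/scale mismatch described above concrete, here is a minimal
spark-shell sketch using Spark's {{org.apache.spark.sql.types.Decimal}} helper
(illustration only; the bug itself is in which schema {{toJSON}} picks, not in
{{Decimal}}). The same compact unscaled long yields the right value when read
with the schema it was written with (12,6) and a wildly wrong one when read
with the default (38,18):
{code:scala}
import org.apache.spark.sql.types.Decimal

// 123.456000 stored with scale 6 has the unscaled (compact long) value 123456000.
val unscaled = 123456000L

// Read back with the precision/scale it was written with: correct.
val right = Decimal(unscaled, 12, 6)
println(right.toBigDecimal)   // 123.456000

// Read back with the default DecimalType(38, 18): the digits are misplaced.
val wrong = Decimal(unscaled, 38, 18)
println(wrong.toBigDecimal)   // roughly 1.23456E-10, not 123.456
{code}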



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48965) toJSON produces wrong values if DecimalType information is lost in as[Product]

2024-08-16 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874352#comment-17874352
 ] 

Bruce Robbins edited comment on SPARK-48965 at 8/16/24 6:52 PM:


It's not just decimals. {{toJSON}} is simply using the wrong schema. For 
example:

{noformat}
scala> case class Data(x: Int, y: String)
class Data

scala> sql("select 'Hey there' as y, 22 as x").as[Data].collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res0: Array[Data] = Array(Data(22,Hey there))

scala> sql("select 'Hey there' as y, 22 as x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res1: Array[String] = 
Array({"y":"\u\u\u\u\u\u\u\u\u0016\u\u\u\u\u\u\u\t\u\u\u\u0018\u","x":9})
scala> 
{noformat}
Edit: Even more interesting, you can crash the JVM:
{noformat}
scala> case class Data(x: Array[Int], y: String)
class Data

scala> sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as 
x").as[Data].collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res0: Array[Data] = Array(Data([I@3ca370ae,Hey thereHey thereHey thereHey 
thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey 
thereHey thereHey thereHey thereHey thereHey there))

scala> sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as 
x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.InternalError: a fault occurred in a recent unsafe memory access 
operation in compiled Java code
at 
org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5(JacksonGenerator.scala:129)
 ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5$adapted(JacksonGenerator.scala:128)
 ...
{noformat}


was (Author: bersprockets):
It's not just decimals. {{toJSON}} is simply using the wrong schema. For 
example:

{noformat}
scala> case class Data(x: Int, y: String)
class Data

scala> sql("select 'Hey there' as y, 22 as x").as[Data].collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res0: Array[Data] = Array(Data(22,Hey there))

scala> sql("select 'Hey there' as y, 22 as x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res1: Array[String] = 
Array({"y":"\u\u\u\u\u\u\u\u\u0016\u\u\u\u\u\u\u\t\u\u\u\u0018\u","x":9})
scala> 
{noformat}

> toJSON produces wrong values if DecimalType information is lost in as[Product]
> --
>
> Key: SPARK-48965
> URL: https://issues.apache.org/jira/browse/SPARK-48965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.5.1
>Reporter: Dmitry Lapshin
>Priority: Major
>
> Consider this example:
> {code:scala}
> package com.jetbrains.jetstat.etl
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.DecimalType
> object A {
>   case class Example(x: BigDecimal)
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder()
>   .master("local[1]")
>   .getOrCreate()
> import spark.implicits._
> val originalRaw = BigDecimal("123.456")
> val original = Example(originalRaw)
> val ds1 = spark.createDataset(Seq(original))
> val ds2 = ds1
>   .withColumn("x", $"x" cast DecimalType(12, 6))
> val ds3 = ds2
>   .as[Example]
> println(s"DS1: schema=${ds1.schema}, 
> encoder.schema=${ds1.encoder.schema}")
> println(s"DS2: schema=${ds1.schema}, 
> encoder.schema=${ds2.encoder.schema}")
> println(s"DS3: schema=${ds1.schema}, 
> encoder.schema=${ds3.encoder.schema}")
> val json1 = ds1.toJSON.collect().head
> val json2 = ds2.toJSON.collect().head
> val json3 = ds3.toJSON.collect().head
> val collect1 = ds1.collect().head
> val collect2_ = ds2.collect().head
> val collect2 = collect2_.getDecimal(collect2_.fieldIndex("x"))
> val collect3 = ds3.collect().head
> println(s"Original: $original (scale = ${original.x.scale}, precision = 
> ${original.x.precision})")
> println(s"Collect1: $collect1 (scale = ${collect1.x.scale}, precision = 
> ${collect1.x.precision})")
> println(s"Collect2: $collect2 (scale = ${collect2.sc

[jira] [Commented] (SPARK-48965) toJSON produces wrong values if DecimalType information is lost in as[Product]

2024-08-16 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874352#comment-17874352
 ] 

Bruce Robbins commented on SPARK-48965:
---

It's not just decimals. {{toJSON}} is simply using the wrong schema. For 
example:

{noformat}
scala> case class Data(x: Int, y: String)
class Data

scala> sql("select 'Hey there' as y, 22 as x").as[Data].collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res0: Array[Data] = Array(Data(22,Hey there))

scala> sql("select 'Hey there' as y, 22 as x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res1: Array[String] = 
Array({"y":"\u\u\u\u\u\u\u\u\u0016\u\u\u\u\u\u\u\t\u\u\u\u0018\u","x":9})
scala> 
{noformat}

> toJSON produces wrong values if DecimalType information is lost in as[Product]
> --
>
> Key: SPARK-48965
> URL: https://issues.apache.org/jira/browse/SPARK-48965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.5.1
>Reporter: Dmitry Lapshin
>Priority: Major
>
> Consider this example:
> {code:scala}
> package com.jetbrains.jetstat.etl
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.DecimalType
> object A {
>   case class Example(x: BigDecimal)
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder()
>   .master("local[1]")
>   .getOrCreate()
> import spark.implicits._
> val originalRaw = BigDecimal("123.456")
> val original = Example(originalRaw)
> val ds1 = spark.createDataset(Seq(original))
> val ds2 = ds1
>   .withColumn("x", $"x" cast DecimalType(12, 6))
> val ds3 = ds2
>   .as[Example]
> println(s"DS1: schema=${ds1.schema}, 
> encoder.schema=${ds1.encoder.schema}")
> println(s"DS2: schema=${ds1.schema}, 
> encoder.schema=${ds2.encoder.schema}")
> println(s"DS3: schema=${ds1.schema}, 
> encoder.schema=${ds3.encoder.schema}")
> val json1 = ds1.toJSON.collect().head
> val json2 = ds2.toJSON.collect().head
> val json3 = ds3.toJSON.collect().head
> val collect1 = ds1.collect().head
> val collect2_ = ds2.collect().head
> val collect2 = collect2_.getDecimal(collect2_.fieldIndex("x"))
> val collect3 = ds3.collect().head
> println(s"Original: $original (scale = ${original.x.scale}, precision = 
> ${original.x.precision})")
> println(s"Collect1: $collect1 (scale = ${collect1.x.scale}, precision = 
> ${collect1.x.precision})")
> println(s"Collect2: $collect2 (scale = ${collect2.scale}, precision = 
> ${collect2.precision})")
> println(s"Collect3: $collect3 (scale = ${collect3.x.scale}, precision = 
> ${collect3.x.precision})")
> println(s"json1: $json1")
> println(s"json2: $json2")
> println(s"json3: $json3")
>   }
> }
> {code}
> Running it, you'll see that json3 contains clearly wrong data. After a bit of 
> debugging (apologies, I'm not well versed in Spark internals), I've found that:
>  * The in-memory representation of the data in this example uses {{UnsafeRow}}, 
> whose {{.getDecimal}} compresses small Decimal values into longs but doesn't 
> record the decimal's precision and scale,
>  * However, there are at least two sources for precision & scale to pass to 
> that method: {{Dataset.schema}} (which is based on query execution, always 
> contains 38,18 for me) and {{Dataset.encoder.schema}} (that gets updated in 
> `ds2` to 12,6 but then is reset in `ds3`). Also, there is a 
> {{Dataset.deserializer}} that seems to be combining those two non-trivially.
>  * This doesn't seem to affect {{Dataset.collect()}} methods since they use 
> {{deserializer}}, but {{Dataset.toJSON}} only uses the first schema.
> Seems to me that either {{.toJSON}} should be more aware of what's going on 
> or {{.as[]}} should be doing something else.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-06-11 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854211#comment-17854211
 ] 

Bruce Robbins commented on SPARK-47193:
---

I took a look at this today. This issue happens even with 
{{Dataset.toLocalIterator}}.

Assume {{/tmp/test.csv}} contains:
{noformat}
1,2021-11-22 11:27:01
2,2021-11-22 11:27:02
3,2021-11-22 11:27:03
{noformat}
Then the following produces incorrect results:
{noformat}
sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

val test = {
  spark
  .read
  .option("header", "false")
  .schema("id int, ts timestamp")
  .csv("/tmp/test.csv")
}

import scala.collection.JavaConverters._
test.toLocalIterator.asScala.toSeq
{noformat}
The incorrect results are:
{noformat}
val res1: Seq[org.apache.spark.sql.Row] = List([1,null], [2,null], [3,null])
{noformat}
However, {{Dataset.collect}} works as expected:
{noformat}
scala> test.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res2: Array[org.apache.spark.sql.Row] = Array([1,2021-11-22 11:27:01.0], 
[2,2021-11-22 11:27:02.0], [3,2021-11-22 11:27:03.0])

scala> 
{noformat}
The problem has to do with the lazy nature of the rdd (in the case of 
{{Dataset.rdd}}) or iterator (in the case of {{Dataset.toLocalIterator}}).

{{Dataset}} actions like {{count}} and {{collect}} are wrapped with the 
function {{withSQLConfPropagated}}, which ensures that the user-specified SQL 
config is propagated to the executors while the jobs associated with the query 
run. Actions like {{count}} and {{collect}} don't return until those jobs 
complete, so the SQL config is propagated during the entire execution of the 
query.

{{Dataset.toLocalIterator}} is also wrapped by {{withSQLConfPropagated}}, but 
due to the lazy nature of iterators, the method returns before the jobs 
associated with the query actually run. Those jobs don't run until someone 
calls {{next}} on the returned iterator, at which point the SQL conf is no 
longer propagated to the executors. So the jobs get run without the 
user-specified config and just assume default settings.

In the reporter's CSV case, the user's setting of 
{{spark.sql.legacy.timeParserPolicy}} is respected during planning on the 
driver, but not respected on the executors. This mix of settings results in 
null timestamps in the resulting rows.

I'll take a look at a possible fix.
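To illustrate the timing issue described above in plain Scala (a conceptual
sketch with hypothetical names, not Spark internals): a setting that is only
active inside a scope is visible to an eager computation, but a lazy iterator
produced inside that scope is consumed after the scope ends and sees the
default instead.
{code:scala}
// Hypothetical stand-in for SQL conf propagation: a thread-local setting plus
// a helper that scopes it, analogous in spirit to withSQLConfPropagated.
val setting = new ThreadLocal[String] { override def initialValue(): String = "DEFAULT" }

def withSetting[T](value: String)(body: => T): T = {
  val saved = setting.get()
  setting.set(value)
  try body finally setting.set(saved)
}

// Eager: the elements are computed while the setting is active.
val eager = withSetting("LEGACY") { (1 to 3).toList.map(i => s"$i uses ${setting.get()}") }

// Lazy: the iterator escapes the scope; its elements are computed afterwards.
val lazily = withSetting("LEGACY") { (1 to 3).iterator.map(i => s"$i uses ${setting.get()}") }

println(eager)          // List(1 uses LEGACY, 2 uses LEGACY, 3 uses LEGACY)
println(lazily.toList)  // List(1 uses DEFAULT, 2 uses DEFAULT, 3 uses DEFAULT)
{code}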

> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create mapping from them. After all of the 
> joins dataframe contains all expected rows but rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null")

[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-05-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849785#comment-17849785
 ] 

Bruce Robbins commented on SPARK-47193:
---

Thanks for the update.

This issue is seemingly reproducible without a join, running {{spark-shell}} 
with {{spark.sql.legacy.timeParserPolicy=LEGACY}}:
{noformat}
val user = {
  spark
  .read
  .option("header", "true")
  .option("comment", "#")
  .option("nullValue", "null")
  .schema("UserId int, Created timestamp, deleted timestamp, Active boolean, 
ActivatedDate timestamp")
  .csv("/tmp/user.csv")
}
user.createOrReplaceTempView("user")
sql("select * from user where ActivatedDate is not null").count
sql("select * from user where ActivatedDate is not null").rdd.count
{noformat}
The first count operation returns 4. The second returns 0.

A collect action shows why: all the timestamps are null when {{rdd}} is 
specified:
{noformat}
scala> sql("select * from user").collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res10: Array[org.apache.spark.sql.Row] = Array(
[1,2021-11-22 11:27:27.0,2021-11-25 11:27:27.0,false,2021-11-22 11:27:27.0], 
[2,2021-11-22 11:27:27.0,null,true,2021-11-22 11:27:27.0],
[3,2021-11-22 11:27:27.0,null,true,2021-11-22 11:27:27.0],
[4,2021-11-22 11:27:27.0,null,null,2021-11-22 11:27:27.0],
[5,2021-11-22 11:27:27.0,null,false,null])

scala> sql("select * from user").rdd.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
val res11: Array[org.apache.spark.sql.Row] = Array(
[1,null,null,false,null],
[2,null,null,true,null],
[3,null,null,true,null],
[4,null,null,null,null],
[5,null,null,false,null])
scala> 
{noformat}

> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create mapping from them. After all of the 
> joins dataframe contains all expected rows but rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClass

[jira] [Comment Edited] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849747#comment-17849747
 ] 

Bruce Robbins edited comment on SPARK-48361 at 5/27/24 2:29 PM:


After looking at this, I see that this is arguably documented behavior 
(although still somewhat surprising).

The documentation for the {{mode}} option says the following:
{quote}Note that Spark tries to parse only required columns in CSV under column 
pruning. Therefore, corrupt records can be different based on required set of 
fields. This behavior can be controlled by 
spark.sql.csv.parser.columnPruning.enabled (enabled by default).
{quote}
And, indeed, if you turn off CSV column pruning, your issue goes away:
{noformat}
scala> groupedSum.show()
+---+---+
|column1|sum_column2|
+---+---+
|  8|9.0|
|   four|5.0|
|ten|   11.0|
+---+---+


scala> sql("set spark.sql.csv.parser.columnPruning.enabled=false")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> groupedSum.show()
+---+---+
|column1|sum_column2|
+---+---+
|   four|5.0|
|ten|   11.0|
+---+---+


scala> 
{noformat}
The grouping operation only needs a subset of the columns (column1, column2, 
and _corrupt_record for the filter), so the rest of the columns are pruned. 
Because only a part of the input record is parsed, the parser never discovers 
that the record is corrupted, so {{_corrupt_record}} is null.

It's still a little weird though, because if you include, say, {{column4}} as a 
grouping column, {{_corrupt_record}} still remains null. For the case where the 
record is truncated, it seems the code wants to set {{_corrupt_record}} only if 
it's parsing the entire input record.
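For completeness, the same knob can be flipped programmatically via the session
conf (a sketch; {{spark.sql.csv.parser.columnPruning.enabled}} is the config
quoted from the docs above):
{code:scala}
// Same workaround as the SET statement above, applied through the session conf,
// so the parser sees the whole record for every row.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
groupedSum.show()  // now excludes the corrupt "8,9" row, matching the output above
{code}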


was (Author: bersprockets):
After looking at this, I see that this is arguably documented behavior 
(although still somewhat surprising).

The documentation for the {{mode}} option says the following:
{quote}
Note that Spark tries to parse only required columns in CSV under column 
pruning. Therefore, corrupt records can be different based on required set of 
fields. This behavior can be controlled by 
spark.sql.csv.parser.columnPruning.enabled (enabled by default).
{quote}
And, indeed, if you turn off CSV column pruning, your issue goes away:
{noformat}
scala> groupedSum.show()
+---+---+
|column1|sum_column2|
+---+---+
|  8|9.0|
|   four|5.0|
|ten|   11.0|
+---+---+


scala> sql("set spark.sql.csv.parser.columnPruning.enabled=false")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> groupedSum.show()
+---+---+
|column1|sum_column2|
+---+---+
|   four|5.0|
|ten|   11.0|
+---+---+


scala> 
{noformat}
The grouping operation only needs a subset of the columns (column1, column2, 
and _corrupt_record for the filter), so the rest of the columns are pruned. 
Because only a part of the input record is parsed, the parser never discovers 
that the record is corrupted, so {{_corrupt_record}} is null.

It's still a little weird though, because if you include, say, {{column4}} as a 
grouping column, {{_corrupt_record}} still remains null. It seems the code 
wants to set {{_corrupt_record}} only if it's parsing the entire input record.

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option(

[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849747#comment-17849747
 ] 

Bruce Robbins commented on SPARK-48361:
---

After looking at this, I see that this is arguably documented behavior 
(although still somewhat surprising).

The documentation for the {{mode}} option says the following:
{quote}
Note that Spark tries to parse only required columns in CSV under column 
pruning. Therefore, corrupt records can be different based on required set of 
fields. This behavior can be controlled by 
spark.sql.csv.parser.columnPruning.enabled (enabled by default).
{quote}
And, indeed, if you turn off CSV column pruning, your issue goes away:
{noformat}
scala> groupedSum.show()
+---+---+
|column1|sum_column2|
+---+---+
|  8|9.0|
|   four|5.0|
|ten|   11.0|
+---+---+


scala> sql("set spark.sql.csv.parser.columnPruning.enabled=false")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> groupedSum.show()
+---+---+
|column1|sum_column2|
+---+---+
|   four|5.0|
|ten|   11.0|
+---+---+


scala> 
{noformat}
The grouping operation only needs a subset of the columns (column1, column2, 
and _corrupt_record for the filter), so the rest of the columns are pruned. 
Because only a part of the input record is parsed, the parser never discovers 
that the record is corrupted, so {{_corrupt_record}} is null.

It's still a little weird though, because if you include, say, {{column4}} as a 
grouping column, {{_corrupt_record}} still remains null. It seems the code 
wants to set {{_corrupt_record}} only if it's parsing the entire input record.

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.cs

[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-23 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849095#comment-17849095
 ] 

Bruce Robbins commented on SPARK-48361:
---

I can take a look at the root cause, unless you are already looking at that, in 
which case I will hold off.

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true)
> val groupedSum = 
> dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
> groupedSum.show(){code}
> *We get:*
> {code:java}
> +---+---+
> |column1|sum_column2|
> +---+---+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +---+---+ {code}
>  
> *Which is not correct*
>  
> With the addition of the aggregate, the filter down to rows with 3 commas in 
> the corrupt record column is ignored. This does not happen with any other 
> operators I have tried - just aggregates so far.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848787#comment-17848787
 ] 

Bruce Robbins commented on SPARK-48361:
---

Did you mean the following?
{noformat}
val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true)
{noformat}
Either way (with `=== true` or `=!= true`), a bug of some sort is revealed.

With `=== true`, the grouping produces an empty result (it shouldn't).

With `=!= true`, the grouping includes `8, 9` (it shouldn't, as you mentioned).

In fact, for both cases, if you persist {{dfWithJagged}}, you get the right 
answer.
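A sketch of that persist-based workaround, reusing the names from the repro
below (illustrative only; it simply materializes the parsed rows, including
{{_corrupt_record}}, before the aggregate runs):
{code:scala}
// Materialize the parsed rows (including _corrupt_record) once, so the
// aggregate operates on the same rows the filter saw.
dfWithJagged.persist()

val dfDropped  = dfWithJagged.filter(col("__is_jagged") =!= true)
val groupedSum = dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
groupedSum.show()  // the corrupt "8,9" row is now excluded
{code}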

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> val dfDropped = dfWithJagged.filter(col("__is_jagged") === true)
> val groupedSum = 
> dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
> groupedSum.show(){code}
> *We get:*
> {code:java}
> +---+---+
> |column1|sum_column2|
> +---+---+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +---+---+ {code}
>  
> *Which is not correct*
>  
> With the addition of the aggregate, the filter down to rows with 3 commas in 
> the corrupt record column is ignored. This does not happen with any other 
> operators I have tried - just aggregates so far.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apa

[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848696#comment-17848696
 ] 

Bruce Robbins commented on SPARK-48361:
---

`8,9` is still present before the aggregate:
{noformat}
scala> dfWithJagged.show(false)
24/05/22 10:33:24 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: test, 1, 2, three
 Schema: column1, column2, column3, column4
Expected: column1 but found: test
CSV file: file:///Users/bruce/github/spark_up3.5.1/test.csv
+---+---+---++---+---+
|column1|column2|column3|column4 |_corrupt_record|__is_jagged|
+---+---+---++---+---+
|four   |5.0|6.0|seven   |NULL   |false  |
|8  |9.0|NULL   |NULL|8,9|true   |
|ten|11.0   |12.0   |thirteen|NULL   |false  |
+---+---+---++---+---+


scala> sql("select version()").collect
res6: Array[org.apache.spark.sql.Row] = Array([3.5.1 
fd86f85e181fc2dc0f50a096855acf83a6cc5d9c])

scala> 
{noformat}
Which piece of code filters out `8,9`? I couldn't find the filter in your 
example, but again I may be missing something. 

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged",

[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848692#comment-17848692
 ] 

Bruce Robbins commented on SPARK-48361:
---

Sorry for being dense. What would the correct answer be?

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
>   
> // sum up column1
> val groupedSum = 
> dfWithJagged.groupBy("column1").agg(sum("column2").alias("sum_column2"))
> groupedSum.show(){code}
> *We get:*
> {code:java}
> +---+---+
> |column1|sum_column2|
> +---+---+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +---+---+ {code}
>  
> *Which is not correct*
>  
> With the addition of the aggregate, the filter down to rows with 3 commas in 
> the corrupt record column is ignored. This does not happen with any other 
> operators I have tried - just aggregates so far.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47134) Unexpected nulls when casting decimal values in specific cases

2024-05-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-47134.
---
Resolution: Invalid

> Unexpected nulls when casting decimal values in specific cases
> --
>
> Key: SPARK-47134
> URL: https://issues.apache.org/jira/browse/SPARK-47134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Dylan Walker
>Priority: Major
> Attachments: 321queryplan.txt, 341queryplan.txt
>
>
> In specific cases, casting decimal values can result in `null` values where 
> no overflow exists.
> The cases appear very specific, and I don't have the depth of knowledge to 
> generalize this issue, so here is a simple spark-shell reproduction:
> *Setup:*
> {code:scala}
> scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", 
> x)).toDS
> ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
> scala> ds.createOrReplaceTempView("t")
> {code}
>  
> *Spark 3.2.1 behaviour (correct):*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
> FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> ++
> |  ct|
> ++
> | 9508.00|
> |13879.00|
> ++
> {code}
> *Spark 3.4.1 / Spark 3.5.0 behaviour:*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
> FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +---+
> | ct|
> +---+
> |   null|
> |9508.00|
> +---+
> {code}
> This is fairly delicate:
>  - removing the {{ORDER BY}} clause produces the correct result
>  - removing the {{CAST}} produces the correct result
>  - changing the number of 0s in the argument to {{SUM}} produces the correct 
> result
>  - setting {{spark.ansi.enabled}} to {{true}} produces the correct result 
> (and does not throw an error)
> Also, removing the {{ORDER BY}}, but writing {{ds}} to a parquet will also 
> result in the unexpected nulls.
> Please let me know if you need additional information.
> We are also interested in understanding whether setting 
> {{spark.ansi.enabled}} can be considered a reliable workaround to this issue 
> prior to a fix being released, if possible.
> Text files that include {{explain()}} output attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition

2024-03-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47633:
--
Affects Version/s: 3.4.2

> Cache miss for queries using JOIN LATERAL with join condition
> -
>
> Key: SPARK-47633
> URL: https://issues.apache.org/jira/browse/SPARK-47633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
> create or replace temp view v1 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2)
> on c1 = a;
> cache table v1;
> explain select * from v1;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false
>:- LocalTableScan [c1#180, c2#181]
>+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)),false), [plan_id=113]
>   +- LocalTableScan [a#173, b#174]
> {noformat}
> Note that there is no {{InMemoryRelation}}.
> However, if you move the join condition into the subquery, the cached plan is 
> used:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
> create or replace temp view v2 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2
>   where t1.c1 = t2.c1);
> cache table v2;
> explain select * from v2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179]
>   +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- AdaptiveSparkPlan isFinalPlan=true
>+- == Final Plan ==
>   *(1) Project [c1#26, c2#27, a#19, b#20]
>   +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, 
> BuildLeft, false
>  :- BroadcastQueryStage 0
>  :  +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as 
> bigint)),false), [plan_id=37]
>  : +- LocalTableScan [c1#26, c2#27]
>  +- *(1) LocalTableScan [a#19, b#20, c1#30]
>+- == Initial Plan ==
>   Project [c1#26, c2#27, a#19, b#20]
>   +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, 
> false
>  :- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as 
> bigint)),false), [plan_id=37]
>  :  +- LocalTableScan [c1#26, c2#27]
>  +- LocalTableScan [a#19, b#20, c1#30]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition

2024-03-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47633:
--
Affects Version/s: 3.5.1

> Cache miss for queries using JOIN LATERAL with join condition
> -
>
> Key: SPARK-47633
> URL: https://issues.apache.org/jira/browse/SPARK-47633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
> create or replace temp view v1 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2)
> on c1 = a;
> cache table v1;
> explain select * from v1;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false
>:- LocalTableScan [c1#180, c2#181]
>+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)),false), [plan_id=113]
>   +- LocalTableScan [a#173, b#174]
> {noformat}
> Note that there is no {{InMemoryRelation}}.
> However, if you move the join condition into the subquery, the cached plan is 
> used:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
> create or replace temp view v2 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2
>   where t1.c1 = t2.c1);
> cache table v2;
> explain select * from v2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179]
>   +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- AdaptiveSparkPlan isFinalPlan=true
>+- == Final Plan ==
>   *(1) Project [c1#26, c2#27, a#19, b#20]
>   +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, 
> BuildLeft, false
>  :- BroadcastQueryStage 0
>  :  +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as 
> bigint)),false), [plan_id=37]
>  : +- LocalTableScan [c1#26, c2#27]
>  +- *(1) LocalTableScan [a#19, b#20, c1#30]
>+- == Initial Plan ==
>   Project [c1#26, c2#27, a#19, b#20]
>   +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, 
> false
>  :- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as 
> bigint)),false), [plan_id=37]
>  :  +- LocalTableScan [c1#26, c2#27]
>  +- LocalTableScan [a#19, b#20, c1#30]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition

2024-03-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-47633:
-

 Summary: Cache miss for queries using JOIN LATERAL with join 
condition
 Key: SPARK-47633
 URL: https://issues.apache.org/jira/browse/SPARK-47633
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Bruce Robbins


For example:
{noformat}
CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);

create or replace temp view v1 as
select *
from t1
join lateral (
  select c1 as a, c2 as b
  from t2)
on c1 = a;

cache table v1;

explain select * from v1;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false
   :- LocalTableScan [c1#180, c2#181]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
false] as bigint)),false), [plan_id=113]
  +- LocalTableScan [a#173, b#174]
{noformat}
Note that there is no {{InMemoryRelation}}.

However, if you move the join condition into the subquery, the cached plan is 
used:
{noformat}
CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);

create or replace temp view v2 as
select *
from t1
join lateral (
  select c1 as a, c2 as b
  from t2
  where t1.c1 = t2.c1);

cache table v2;

explain select * from v2;
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179]
  +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, 
memory, deserialized, 1 replicas)
+- AdaptiveSparkPlan isFinalPlan=true
   +- == Final Plan ==
  *(1) Project [c1#26, c2#27, a#19, b#20]
  +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, 
false
 :- BroadcastQueryStage 0
 :  +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), 
[plan_id=37]
 : +- LocalTableScan [c1#26, c2#27]
 +- *(1) LocalTableScan [a#19, b#20, c1#30]
   +- == Initial Plan ==
  Project [c1#26, c2#27, a#19, b#20]
  +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, false
 :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), 
[plan_id=37]
 :  +- LocalTableScan [c1#26, c2#27]
 +- LocalTableScan [a#19, b#20, c1#30]
{noformat}
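A quick way to confirm whether a query is served from the cache (a minimal sketch, 
not part of the original report, assuming a spark-shell session with the t1/t2/v1 
setup above) is to look for an {{InMemoryRelation}} node in the optimized plan:
{code:java}
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// The optimized plan contains an InMemoryRelation only when the cached plan is used.
val plan = spark.sql("select * from v1").queryExecution.optimizedPlan
val usesCache = plan.collect { case r: InMemoryRelation => r }.nonEmpty
println(s"uses cache: $usesCache")  // prints false here, matching the explain output above
{code}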




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47527) Cache miss for queries using With expressions

2024-03-24 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-47527.
---
Resolution: Duplicate

> Cache miss for queries using With expressions
> -
>
> Key: SPARK-47527
> URL: https://issues.apache.org/jira/browse/SPARK-47527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> For example:
> {noformat}
> create or replace temp view v1 as
> select id from range(10);
> create or replace temp view q1 as
> select * from v1
> where id between 2 and 4;
> cache table q1;
> explain select * from q1;
> == Physical Plan ==
> *(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
> +- *(1) Range (0, 10, step=1, splits=8)
> {noformat}
> Similarly:
> {noformat}
> create or replace temp view q2 as
> select count_if(id > 3) as cnt
> from v1;
> cache table q2;
> explain select * from q2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null 
> else _common_expr_0#88)])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
>   +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
> _common_expr_0#88) null else _common_expr_0#88)])
>  +- Project [(id#86L > 3) AS _common_expr_0#88]
> +- Range (0, 10, step=1, splits=8)
> {noformat}
> In the output of the above explain commands, neither includes an 
> {{InMemoryRelation}} node.
> The culprit seems to be the common expression ids in the {{With}} expressions 
> used in runtime replacements for {{between}} and {{{}count_if{}}}, e.g. [this 
> code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47527) Cache misses for queries using With expressions

2024-03-23 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-47527:
-

 Summary: Cache misses for queries using With expressions
 Key: SPARK-47527
 URL: https://issues.apache.org/jira/browse/SPARK-47527
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Bruce Robbins


For example:
{noformat}
create or replace temp view v1 as
select id from range(10);

create or replace temp view q1 as
select * from v1
where id between 2 and 4;

cache table q1;

explain select * from q1;

== Physical Plan ==
*(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
+- *(1) Range (0, 10, step=1, splits=8)
{noformat}
Similarly:
{noformat}
create or replace temp view q2 as
select count_if(id > 3) as cnt
from v1;

cache table q2;

explain select * from q2;

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null else 
_common_expr_0#88)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
  +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
_common_expr_0#88) null else _common_expr_0#88)])
 +- Project [(id#86L > 3) AS _common_expr_0#88]
+- Range (0, 10, step=1, splits=8)

{noformat}
In the output of the above explain commands, neither lists an 
{{InMemoryRelation}} node.

The culprit seems to be the common expression ids in the {{With}} expressions 
used in runtime replacements for {{between}} and {{count_if}}, e.g. [this 
code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].
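A sketch of a possible workaround (not verified in this ticket; the view name q1b is 
made up for illustration): expressing the predicate without {{between}} avoids the 
With-based runtime rewrite, so the cached plan should canonicalize the same way and 
be reused.
{code:java}
// Same filter as q1 but without BETWEEN, so no With expression is introduced.
spark.sql("create or replace temp view q1b as select * from v1 where id >= 2 and id <= 4")
spark.sql("cache table q1b")
spark.sql("select * from q1b").explain()  // expected to show an in-memory table scan
{code}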



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47527) Cache miss for queries using With expressions

2024-03-23 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47527:
--
Description: 
For example:
{noformat}
create or replace temp view v1 as
select id from range(10);

create or replace temp view q1 as
select * from v1
where id between 2 and 4;

cache table q1;

explain select * from q1;

== Physical Plan ==
*(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
+- *(1) Range (0, 10, step=1, splits=8)
{noformat}
Similarly:
{noformat}
create or replace temp view q2 as
select count_if(id > 3) as cnt
from v1;

cache table q2;

explain select * from q2;

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null else 
_common_expr_0#88)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
  +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
_common_expr_0#88) null else _common_expr_0#88)])
 +- Project [(id#86L > 3) AS _common_expr_0#88]
+- Range (0, 10, step=1, splits=8)

{noformat}
In the output of the above explain commands, neither includes an 
{{InMemoryRelation}} node.

The culprit seems to be the common expression ids in the {{With}} expressions 
used in runtime replacements for {{between}} and {{{}count_if{}}}, e.g. [this 
code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].

  was:
For example:
{noformat}
create or replace temp view v1 as
select id from range(10);

create or replace temp view q1 as
select * from v1
where id between 2 and 4;

cache table q1;

explain select * from q1;

== Physical Plan ==
*(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
+- *(1) Range (0, 10, step=1, splits=8)
{noformat}
Similarly:
{noformat}
create or replace temp view q2 as
select count_if(id > 3) as cnt
from v1;

cache table q2;

explain select * from q2;

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null else 
_common_expr_0#88)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
  +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
_common_expr_0#88) null else _common_expr_0#88)])
 +- Project [(id#86L > 3) AS _common_expr_0#88]
+- Range (0, 10, step=1, splits=8)

{noformat}
In the output of the above explain commands, neither list an 
{{InMemoryRelation}} node.

The culprit seems to be the common expression ids in the {{With}} expressions 
used in runtime replacements for {{between}} and {{count_if}}, e.g. [this 
code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].


> Cache miss for queries using With expressions
> -
>
> Key: SPARK-47527
> URL: https://issues.apache.org/jira/browse/SPARK-47527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> create or replace temp view v1 as
> select id from range(10);
> create or replace temp view q1 as
> select * from v1
> where id between 2 and 4;
> cache table q1;
> explain select * from q1;
> == Physical Plan ==
> *(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
> +- *(1) Range (0, 10, step=1, splits=8)
> {noformat}
> Similarly:
> {noformat}
> create or replace temp view q2 as
> select count_if(id > 3) as cnt
> from v1;
> cache table q2;
> explain select * from q2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null 
> else _common_expr_0#88)])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
>   +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
> _common_expr_0#88) null else _common_expr_0#88)])
>  +- Project [(id#86L > 3) AS _common_expr_0#88]
> +- Range (0, 10, step=1, splits=8)
> {noformat}
> In the output of the above explain commands, neither includes an 
> {{InMemoryRelation}} node.
> The culprit seems to be the common expression ids in the {{With}} expressions 
> used in runtime replacements for {{between}} and {{{}count_if{}}}, e.g. [this 
> code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47527) Cache miss for queries using With expressions

2024-03-23 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47527:
--
Summary: Cache miss for queries using With expressions  (was: Cache misses 
for queries using With expressions)

> Cache miss for queries using With expressions
> -
>
> Key: SPARK-47527
> URL: https://issues.apache.org/jira/browse/SPARK-47527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> For example:
> {noformat}
> create or replace temp view v1 as
> select id from range(10);
> create or replace temp view q1 as
> select * from v1
> where id between 2 and 4;
> cache table q1;
> explain select * from q1;
> == Physical Plan ==
> *(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
> +- *(1) Range (0, 10, step=1, splits=8)
> {noformat}
> Similarly:
> {noformat}
> create or replace temp view q2 as
> select count_if(id > 3) as cnt
> from v1;
> cache table q2;
> explain select * from q2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null 
> else _common_expr_0#88)])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
>   +- HashAggregate(keys=[], functions=[partial_count(if (NOT 
> _common_expr_0#88) null else _common_expr_0#88)])
>  +- Project [(id#86L > 3) AS _common_expr_0#88]
> +- Range (0, 10, step=1, splits=8)
> {noformat}
> In the output of the above explain commands, neither lists an 
> {{InMemoryRelation}} node.
> The culprit seems to be the common expression ids in the {{With}} expressions 
> used in runtime replacements for {{between}} and {{count_if}}, e.g. [this 
> code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-02-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821393#comment-17821393
 ] 

Bruce Robbins edited comment on SPARK-47193 at 2/27/24 8:48 PM:


Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the "{{...}}" above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}



was (Author: bersprockets):
Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the {{...}} above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create mapping from them. After all of the 
> joins dataframe contains all expected rows but rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]

[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss

2024-02-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821393#comment-17821393
 ] 

Bruce Robbins commented on SPARK-47193:
---

Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the {{...}} above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}
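For reference, a minimal sketch of the comparison being discussed (the variable name 
df is an assumption, standing for the reporter's final joined dataframe, not code 
from the attachments):
{code:java}
// The dataframe count and the count of the RDD derived from it should match.
val dfCount = df.count()
val rddCount = df.rdd.count()
println(s"df count = $dfCount")
println(s"rdd count = $rddCount")
{code}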


> Converting dataframe to rdd results in data loss
> 
>
> Key: SPARK-47193
> URL: https://issues.apache.org/jira/browse/SPARK-47193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Ivan Bova
>Priority: Critical
>  Labels: correctness
> Attachments: device.csv, deviceClass.csv, deviceType.csv, 
> language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, 
> userLocation.csv, userProfile.csv
>
>
> I have 10 csv files and need to create mapping from them. After all of the 
> joins dataframe contains all expected rows but rdd from this dataframe 
> contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: 
> String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], 
> UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: 
> Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: 
> Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: 
> Option[Double], Longitude: Option[Double], Radius: Option[Double], 
> CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: 
> String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: 
> Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], 
> Address1: String, Address2: String, City: String, State: String, Country: 
> String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: 
> Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, 
> UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: 
> Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
> val userProfile = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
> val timeZoneLookup = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage]
> val userLocation = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage]
> val user = spark.read.option("header", "true").option("comment", 
> "#").option("nullValue", 
> "null").schema(Encoders.product[MyUserMessage].schema).csv("u

[jira] [Commented] (SPARK-47134) Unexpected nulls when casting decimal values in specific cases

2024-02-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819789#comment-17819789
 ] 

Bruce Robbins commented on SPARK-47134:
---

Oddly, I cannot reproduce on either 3.4.1 or 3.5.0.

Also, my 3.4.1 plan doesn't look like your 3.4.1 plan: My plan uses {{sum}}, 
your plan uses {{decimalsum}}. I can't find where {{decimalsum}} comes from in 
the code base, but maybe I am not looking hard enough.
{noformat}
scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> ds.createOrReplaceTempView("t")

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
FROM t GROUP BY `_1` ORDER BY ct ASC").show()
++
|  ct|
++
| 9508.00|
|13879.00|
++

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
FROM t GROUP BY `_1` ORDER BY ct ASC").explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [ct#19 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(ct#19 ASC NULLS FIRST, 200), 
ENSURE_REQUIREMENTS, [plan_id=68]
  +- HashAggregate(keys=[_1#2], functions=[sum(1.00)])
 +- Exchange hashpartitioning(_1#2, 200), ENSURE_REQUIREMENTS, 
[plan_id=65]
+- HashAggregate(keys=[_1#2], 
functions=[partial_sum(1.00)])
   +- LocalTableScan [_1#2]

scala> sql("select version()").show(false)
+--+
|version() |
+--+
|3.4.1 6b1ff22dde1ead51cbf370be6e48a802daae58b6|
+--+

scala> 
{noformat}

> Unexpected nulls when casting decimal values in specific cases
> --
>
> Key: SPARK-47134
> URL: https://issues.apache.org/jira/browse/SPARK-47134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Dylan Walker
>Priority: Major
> Attachments: 321queryplan.txt, 341queryplan.txt
>
>
> In certain cases, casting decimal values can result in `null` values where 
> no overflow exists.
> The conditions appear very specific, and I don't have the depth of knowledge to 
> generalize this issue, so here is a simple spark-shell reproduction:
> *Setup:*
> {code:scala}
> scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", 
> x)).toDS
> ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
> scala> ds.createOrReplaceTempView("t")
> {code}
>  
> *Spark 3.2.1 behaviour (correct):*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
> FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> ++
> |  ct|
> ++
> | 9508.00|
> |13879.00|
> ++
> {code}
> *Spark 3.4.1 / Spark 3.5.0 behaviour:*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct 
> FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +---+
> | ct|
> +---+
> |   null|
> |9508.00|
> +---+
> {code}
> This is fairly delicate:
>  - removing the {{ORDER BY}} clause produces the correct result
>  - removing the {{CAST}} produces the correct result
>  - changing the number of 0s in the argument to {{SUM}} produces the correct 
> result
>  - setting {{spark.sql.ansi.enabled}} to {{true}} produces the correct result 
> (and does not throw an error)
> Also, removing the {{ORDER BY}}, but writing {{ds}} to a parquet will also 
> result in the unexpected nulls.
> Please let me know if you need additional information.
> We are also interested in understanding whether setting 
> {{spark.sql.ansi.enabled}} can be considered a reliable workaround to this issue 
> prior to a fix being released, if possible.
> Text files that include {{explain()}} output attached.
>  
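> A minimal sketch (not from the original report) of the ANSI check mentioned above; 
> the full configuration key is {{spark.sql.ansi.enabled}}:
> {code:java}
> // Re-run the aggregation from the setup above with ANSI mode enabled; per the
> // report this returns the expected values instead of null.
> spark.conf.set("spark.sql.ansi.enabled", "true")
> // ... re-run the CAST(SUM(...)) query shown above ...
> {code}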



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-21 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-47104:
--
Affects Version/s: 3.5.0
   3.4.2

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Priority: Major
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause, and then using the UUID() function in the outermost select statement. 
> The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause then it produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I make 
> the dataframe using Seq().toDF
>  # The query fails if I use spark.sql("query").show() but succeeds when I 
> simply write it to a csv file
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens only when using `show()`, since 
> this is failing queries in production for me.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-20 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818934#comment-17818934
 ] 

Bruce Robbins commented on SPARK-47104:
---

It's not a CSV-specific issue. You can reproduce it with a cached view. The 
following fails on the master branch when using {{spark-sql}}:
{noformat}
create or replace temp view v1(id, name) as values
(1, "fred"),
(2, "bob");

cache table v1;

select name, uuid() as _iid from (
  select s.name
  from v1 s
  join v1 t
  on s.name = t.name
  order by name
)
limit 20;
{noformat}
The exception is:
{noformat}
java.lang.NullPointerException: Cannot invoke 
"org.apache.spark.sql.catalyst.util.RandomUUIDGenerator.getNextUUIDUTF8String()"
 because "this.randomGen_0" is null
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$6(limit.scala:297)
at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:934)
at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$1(limit.scala:297)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:286)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:390)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:418)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:390)
{noformat}
It seems that non-deterministic expressions are not getting initialized before 
being used in the unsafe projection. I can take a look.
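For illustration, a minimal sketch of the contract involved (it pokes at catalyst 
internals and is not the proposed fix): a non-deterministic expression such as 
{{Uuid}} must be initialized for a partition before it is evaluated, which is the 
step the generated unsafe projection is skipping here.
{code:java}
import org.apache.spark.sql.catalyst.expressions.Uuid

// The random seed is normally filled in by the analyzer; it is passed explicitly here.
val u = Uuid(Some(0L))
u.initialize(0)    // required once per partition before evaluation
println(u.eval())  // a UTF8String UUID; calling eval() without initialize(0) fails
{code}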

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Chhavi Bansal
>Priority: Major
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause, and then using the UUID() function in the outermost select statement. 
> The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> Dataset is a normal csv file with the following columns
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExe

[jira] [Commented] (SPARK-47034) join between cached temp tables result in missing entries

2024-02-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817123#comment-17817123
 ] 

Bruce Robbins commented on SPARK-47034:
---

I wonder if this is SPARK-45592 (and, relatedly, SPARK-45282), which existed as 
a bug in 3.5.0 but is fixed on master and branch-3.5.

> join between cached temp tables result in missing entries
> -
>
> Key: SPARK-47034
> URL: https://issues.apache.org/jira/browse/SPARK-47034
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 3.5.0
>Reporter: shurik mermelshtein
>Priority: Major
>
> We create several temp tables (views) by loading several Delta tables and 
> joining between them. 
> Those views are used for the calculation of different metrics. Each metric 
> requires different views to be used. Some of the more popular views are 
> cached for better performance. 
> We have noticed that once we upgraded from Spark 3.4.2 to Spark 3.5.0, some 
> of the joins started to fail.
> We can reproduce a case where we have 2 data frames (views) (these are not the 
> real names / values we use; this is just for the example):
>  # users with the column user_id, campaign_id, user_name.
> we make sure it has a single entry
> '11', '2', 'Jhon Doe'
>  # actions with the column user_id, campaign_id, action_id, action count
> we make sure it has a single entry
> '11', '2', 'clicks', 5
>  
>  # users view can be filtered for user_id = '11' or/and campaign_id = 
> '2' and it will find the existing single row
>  # actions view can be filtered for user_id = '11' or/and campaign_id = 
> '2' and it will find the existing single row
>  # users and actions can be inner joined by user_id *OR* campaign_id and the 
> join will be successful. 
>  # users and actions can *not* be inner joined by user_id *AND* campaign_id. 
> The join results in no entries.
>  # if we write both of the views to S3 and read them back to new data frames, 
> suddenly the join is working.
>  # if we disable AQE the join is working
>  # running checkpoint on the views does not make join #4 work
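> A minimal sketch of the scenario described above (column values taken from the 
> example; the variable names are assumptions, not the reporter's code):
> {code:java}
> import spark.implicits._
> 
> // Two cached single-row dataframes standing in for the users/actions views.
> val users = Seq(("11", "2", "Jhon Doe")).toDF("user_id", "campaign_id", "user_name").cache()
> val actions = Seq(("11", "2", "clicks", 5)).toDF("user_id", "campaign_id", "action_id", "action_count").cache()
> users.count(); actions.count()   // materialize the caches
> 
> // Joining on either key alone finds the row; per the report, joining on both
> // keys returns no rows on 3.5.0 with AQE enabled.
> users.join(actions, Seq("user_id")).show()
> users.join(actions, Seq("user_id", "campaign_id")).show()
> {code}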



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47019) AQE dynamic cache partitioning causes SortMergeJoin to result in data loss

2024-02-10 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816321#comment-17816321
 ] 

Bruce Robbins commented on SPARK-47019:
---

I can reproduce on my laptop using Spark 3.5.0 and {{--master 
"local-cluster[3,1,1024]"}}. However, I can not reproduce on the latest 
branch-3.5 or master.

So it seems to have been fixed, probably by SPARK-45592.


> AQE dynamic cache partitioning causes SortMergeJoin to result in data loss
> --
>
> Key: SPARK-47019
> URL: https://issues.apache.org/jira/browse/SPARK-47019
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 3.5.0
> Environment: Tested in 3.5.0
> Reproduced on, so far:
>  * kubernetes deployment
>  * docker cluster deployment
> Local Cluster:
>  * master
>  * worker1 (2/2G)
>  * worker2 (1/1G)
>Reporter: Ridvan Appa Bugis
>Priority: Blocker
>  Labels: DAG, caching, correctness, data-loss, 
> dynamic_allocation, inconsistency, partitioning
> Attachments: Screenshot 2024-02-07 at 20.09.44.png, Screenshot 
> 2024-02-07 at 20.10.07.png, eventLogs-app-20240207175940-0023.zip, 
> testdata.zip
>
>
> It seems like we have encountered an issue with Spark AQE's dynamic cache 
> partitioning which causes incorrect *count* output values and data loss.
> A similar issue could not be found, so I am creating this ticket to raise 
> awareness.
>  
> Preconditions:
>  - Setup a cluster as per environment specification
>  - Prepare test data (or a data large enough to trigger read by both 
> executors)
> Steps to reproduce:
>  - Read parent
>  - Self join parent
>  - cache + materialize parent
>  - Join parent with child
>  
> Performing a self-join over a parentDF, then caching + materialising the DF, 
> and then joining it with a childDF results in *incorrect* count value and 
> {*}missing data{*}.
>  
> Performing a *repartition* seems to fix the issue, most probably due to 
> rearrangement of the underlying partitions and statistic update.
>  
> This behaviour is observed over a multi-worker cluster with a job running 2 
> executors (1 per worker), when reading a large enough data file by both 
> executors.
> Not reproducible in local mode.
>  
> Circumvention:
> So far, by disabling 
> _spark.sql.optimizer.canChangeCachedPlanOutputPartitioning_ or performing 
> repartition this can be alleviated, but it is not the fix of the root cause.
>  
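> A sketch of that circumvention (configuration key as named above; not a fix for the 
> root cause):
> {code:java}
> // Either disable the AQE cached-plan output-partitioning change up front...
> spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "false")
> // ...or repartition the cached parent before the join, as in the minimal example below.
> {code}
>  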
> This issue is dangerous considering that data loss is occurring silently and 
> in the absence of proper checks can lead to wrong behaviour/results down the 
> line. So we have labeled it as a blocker.
>  
> There seems to be a file-size threshold after which data loss is observed 
> (possibly implying that it happens when both executors start reading the data 
> file)
>  
> Minimal example:
> {code:java}
> // Read parent
> val parentData = session.read.format("avro").load("/data/shared/test/parent")
> // Self join parent and cache + materialize
> val parent = parentData.join(parentData, Seq("PID")).cache()
> parent.count()
> // Read child
> val child = session.read.format("avro").load("/data/shared/test/child")
> // Basic join
> val resultBasic = child.join(
>   parent,
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 16479 (Wrong)
> println(s"Count no repartition: ${resultBasic.count()}")
> // Repartition parent join
> val resultRepartition = child.join(
>   parent.repartition(),
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 50094 (Correct)
> println(s"Count with repartition: ${resultRepartition.count()}") {code}
>  
> Invalid count-only DAG:
>   !Screenshot 2024-02-07 at 20.10.07.png|width=519,height=853!
> Valid repartition DAG:
> !Screenshot 2024-02-07 at 20.09.44.png|width=368,height=1219!  
>  
> Spark submit for this job:
> {code:java}
> spark-submit 
>   --class ExampleApp 
>   --packages org.apache.spark:spark-avro_2.12:3.5.0 
>   --deploy-mode cluster 
>   --master spark://spark-master:6066 
>   --conf spark.sql.autoBroadcastJoinThreshold=-1  
>   --conf spark.cores.max=3 
>   --driver-cores 1 
>   --driver-memory 1g 
>   --executor-cores 1 
>   --executor-memory 1g 
>   /path/to/test.jar
>  {code}
> The cluster should be setup to the following (worker1(m+e) worker2(e)) as to 
> split the executors onto two workers.
> I have prepared a simple github repository which contains the compilable 
> above example.
> [https://github.com/ridvanappabugis/spark-3.5-issue]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46779) Grouping by subquery with a cached relation can fail

2024-01-19 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46779:
--
Description: 
Example:
{noformat}
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
data d2 group by all;
{noformat}
It fails with the following error:
{noformat}
[INTERNAL_ERROR] Couldn't find count(1)#163L in 
[c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
{noformat}
If you don't cache the view, the query succeeds.

Note, in 3.4.2 and 3.5.0 the issue happens only with cached tables, not cached 
views. I think that's because cached views were not getting properly 
deduplicated in those versions.

  was:
Example:
{noformat}
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
data d2 group by all;
{noformat}
It fails with the following error:
{noformat}
[INTERNAL_ERROR] Couldn't find count(1)#163L in 
[c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
{noformat}
If you don't cache the view, the query succeeds.


> Grouping by subquery with a cached relation can fail
> 
>
> Key: SPARK-46779
> URL: https://issues.apache.org/jira/browse/SPARK-46779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> Example:
> {noformat}
> create or replace temp view data(c1, c2) as values
> (1, 2),
> (1, 3),
> (3, 7),
> (4, 5);
> cache table data;
> select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
> data d2 group by all;
> {noformat}
> It fails with the following error:
> {noformat}
> [INTERNAL_ERROR] Couldn't find count(1)#163L in 
> [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
> org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
> in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
> {noformat}
> If you don't cache the view, the query succeeds.
> Note, in 3.4.2 and 3.5.0 the issue happens only with cached tables, not 
> cached views. I think that's because cached views were not getting properly 
> deduplicated in those versions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46779) Grouping by subquery with a cached relation can fail

2024-01-19 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46779:
--
Affects Version/s: 3.5.0
   3.4.2

> Grouping by subquery with a cached relation can fail
> 
>
> Key: SPARK-46779
> URL: https://issues.apache.org/jira/browse/SPARK-46779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> Example:
> {noformat}
> create or replace temp view data(c1, c2) as values
> (1, 2),
> (1, 3),
> (3, 7),
> (4, 5);
> cache table data;
> select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
> data d2 group by all;
> {noformat}
> It fails with the following error:
> {noformat}
> [INTERNAL_ERROR] Couldn't find count(1)#163L in 
> [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
> org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
> in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
> {noformat}
> If you don't cache the view, the query succeeds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46779) Grouping by subquery with a cached relation can fail

2024-01-19 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-46779:
-

 Summary: Grouping by subquery with a cached relation can fail
 Key: SPARK-46779
 URL: https://issues.apache.org/jira/browse/SPARK-46779
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Bruce Robbins


Example:
{noformat}
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from 
data d2 group by all;
{noformat}
It fails with the following error:
{noformat}
[INTERNAL_ERROR] Couldn't find count(1)#163L in 
[c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L 
in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
{noformat}
If you don't cache the view, the query succeeds.
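A small sketch (not part of the original report) of toggling the cache to confirm the 
behavior described above:
{code:java}
// After uncaching the view, the same grouped query runs without the internal error.
spark.sql("uncache table data")
spark.sql(
  "select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) " +
  "from data d2 group by all").show()
{code}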



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46373) Create DataFrame Bug

2023-12-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796385#comment-17796385
 ] 

Bruce Robbins commented on SPARK-46373:
---

Maybe due to this (from [the docs|https://spark.apache.org/docs/3.5.0/]):

{quote}Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.8+, and R 
3.5+.{quote}

Scala 3 is not listed as a supported version.
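If staying on Scala 3, one possible workaround (an assumption, not something verified 
in this ticket) is to avoid the TypeTag-based overload and pass explicit rows and a 
schema:
{code:java}
import java.util.Arrays

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// createDataFrame(java.util.List[Row], StructType) does not require a TypeTag.
val schema = StructType(Seq(StructField("name", StringType)))
val rows = Arrays.asList(Row("A"), Row("B"), Row("C"))
val df = sparkSession.createDataFrame(rows, schema)
df.show()
{code}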

> Create DataFrame Bug
> 
>
> Key: SPARK-46373
> URL: https://issues.apache.org/jira/browse/SPARK-46373
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Bleibtreu
>Priority: Major
>
> Scala version is 3.3.1
> Spark version is 3.5.0
> I am using spark-core 3.5.1. I am trying to create a DataFrame through the 
> reflection API, but "No TypeTag available for Person" appears. I have 
> tried for a long time, but I still don't quite understand why TypeTag cannot 
> recognize my Person case class. 
> {code:java}
>     import sparkSession.implicits._
>     import scala.reflect.runtime.universe._
>     case class Person(name: String)
>     val a = List(Person("A"), Person("B"), Person("C"))
>     val df = sparkSession.createDataFrame(a)
>     df.show(){code}
> !https://media.discordapp.net/attachments/839723072239566878/1183747749204725821/image.png?ex=65897600&is=65770100&hm=4eeba8d8499499439590a34260f8b441c6594c572c545f5f61f8dc65beeb6a4b&=&format=webp&quality=lossless&width=1178&height=142!
> I tested it and it is indeed a problem unique to Scala 3.
> There is no problem on Scala 2.13.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46289) Exception when ordering by UDT in interpreted mode

2023-12-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46289:
--
Priority: Minor  (was: Major)

> Exception when ordering by UDT in interpreted mode
> --
>
> Key: SPARK-46289
> URL: https://issues.apache.org/jira/browse/SPARK-46289
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.2, 3.5.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> In interpreted mode, ordering by a UDT will result in an exception. For 
> example:
> {noformat}
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> val df = Seq.tabulate(30) { x =>
>   (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 
> 1)/100.0).toDouble, ((x + 3)/100.0).toDouble)))
> }.toDF("id", "c1", "c2", "c3")
> df.createOrReplaceTempView("df")
> // this works
> sql("select * from df order by c3").collect
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> // this gets an error
> sql("select * from df order by c3").collect
> {noformat}
> The second {{collect}} action results in the following exception:
> {noformat}
> org.apache.spark.SparkIllegalArgumentException: Type 
> UninitializedPhysicalType does not support ordered operations.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348)
>   at 
> org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332)
>   at 
> org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
> {noformat}
> Note: You don't get an error if you use {{show}} rather than {{collect}}. 
> This is because {{show}} will implicitly add a {{limit}}, in which case the 
> ordering is performed by {{TakeOrderedAndProject}} rather than 
> {{UnsafeExternalRowSorter}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46289) Exception when ordering by UDT in interpreted mode

2023-12-06 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46289:
--
Affects Version/s: 3.3.3

> Exception when ordering by UDT in interpreted mode
> --
>
> Key: SPARK-46289
> URL: https://issues.apache.org/jira/browse/SPARK-46289
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.2, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> In interpreted mode, ordering by a UDT will result in an exception. For 
> example:
> {noformat}
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> val df = Seq.tabulate(30) { x =>
>   (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 
> 1)/100.0).toDouble, ((x + 3)/100.0).toDouble)))
> }.toDF("id", "c1", "c2", "c3")
> df.createOrReplaceTempView("df")
> // this works
> sql("select * from df order by c3").collect
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> // this gets an error
> sql("select * from df order by c3").collect
> {noformat}
> The second {{collect}} action results in the following exception:
> {noformat}
> org.apache.spark.SparkIllegalArgumentException: Type 
> UninitializedPhysicalType does not support ordered operations.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348)
>   at 
> org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332)
>   at 
> org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
> {noformat}
> Note: You don't get an error if you use {{show}} rather than {{collect}}. 
> This is because {{show}} will implicitly add a {{limit}}, in which case the 
> ordering is performed by {{TakeOrderedAndProject}} rather than 
> {{UnsafeExternalRowSorter}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46289) Exception when ordering by UDT in interpreted mode

2023-12-06 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-46289:
-

 Summary: Exception when ordering by UDT in interpreted mode
 Key: SPARK-46289
 URL: https://issues.apache.org/jira/browse/SPARK-46289
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.2
Reporter: Bruce Robbins


In interpreted mode, ordering by a UDT will result in an exception. For example:
{noformat}
import org.apache.spark.ml.linalg.{DenseVector, Vector}

val df = Seq.tabulate(30) { x =>
  (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 
1)/100.0).toDouble, ((x + 3)/100.0).toDouble)))
}.toDF("id", "c1", "c2", "c3")

df.createOrReplaceTempView("df")

// this works
sql("select * from df order by c3").collect

sql("set spark.sql.codegen.wholeStage=false")
sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

// this gets an error
sql("select * from df order by c3").collect
{noformat}
The second {{collect}} action results in the following exception:
{noformat}
org.apache.spark.SparkIllegalArgumentException: Type UninitializedPhysicalType 
does not support ordered operations.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348)
at 
org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332)
at 
org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
{noformat}
Note: You don't get an error if you use {{show}} rather than {{collect}}. This 
is because {{show}} will implicitly add a {{limit}}, in which case the ordering 
is performed by {{TakeOrderedAndProject}} rather than 
{{UnsafeExternalRowSorter}}.
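
Building on the note above, the limit path can also be exercised directly. A 
minimal sketch, assuming the same session with the two codegen settings above 
still in effect:
{noformat}
// Per the note above, the ordering here is performed by TakeOrderedAndProject,
// so this collect is expected to succeed even in interpreted mode.
sql("select * from df order by c3 limit 5").collect

// The plan should show TakeOrderedAndProject rather than a Sort node.
sql("select * from df order by c3 limit 5").explain()
{noformat}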



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-12-04 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792942#comment-17792942
 ] 

Bruce Robbins commented on SPARK-45644:
---

Even though this is the original issue, I closed it as a duplicate because the 
fix was applied under SPARK-45896.

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x.
> But after upgrading to 3.4.1 (as well as to 3.5.0), running the same job
> with the same data now always produces the following:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.s

[jira] [Resolved] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-12-04 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-45644.
---
Resolution: Duplicate

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x.
> But after upgrading to 3.4.1 (as well as to 3.5.0), running the same job
> with the same data now always produces the following:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache

[jira] [Updated] (SPARK-46189) Various Pandas functions fail in interpreted mode

2023-11-30 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-46189:
--
Description: 
Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and {{stddev}}) 
fail with an unboxing-related exception when run in interpreted mode.

Here are some reproduction cases for pyspark interactive mode:
{noformat}
spark.sql("set spark.sql.codegen.wholeStage=false")
spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

import numpy as np
import pandas as pd

import pyspark.pandas as ps

pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
psser = ps.from_pandas(pser)

# each of the following actions gets an unboxing error
psser.kurt()
psser.var()
psser.skew()

# set up for covariance test
pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
psdf = ps.from_pandas(pdf)

# this gets an unboxing error
psdf.cov()

# set up for stddev test
from pyspark.pandas.spark import functions as SF
from pyspark.sql.functions import col
from pyspark.sql import Row
df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), 
Row(a=8)])

# this gets an unboxing error
df.select(SF.stddev(col("a"), 1)).collect()
{noformat}
Exception from the first case ({{psser.kurt()}}) is
{noformat}
java.lang.ClassCastException: class java.lang.Integer cannot be cast to class 
java.lang.Double (java.lang.Integer and java.lang.Double are in module 
java.base of loader 'bootstrap')
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184)
at scala.math.Ordering.lt(Ordering.scala:98)
at scala.math.Ordering.lt$(Ordering.scala:98)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196)
{noformat}

  was:
Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and {{stddev}}) 
fail with an unboxing-related exception when run in interpreted mode.

Here are some reproduction cases for pyspark interactive mode:
{noformat}
sql("set spark.sql.codegen.wholeStage=false")
spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

import numpy as np
import pandas as pd

import pyspark.pandas as ps

pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
psser = ps.from_pandas(pser)

# each of the following actions gets an unboxing error
psser.kurt()
psser.var()
psser.skew()

# set up for covariance test
pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
psdf = ps.from_pandas(pdf)

# this gets an unboxing error
psdf.cov()

# set up for stddev test
from pyspark.pandas.spark import functions as SF
from pyspark.sql.functions import col
from pyspark.sql import Row
df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), 
Row(a=8)])

# this gets an unboxing error
df.select(SF.stddev(col("a"), 1)).collect()
{noformat}
Exception from the first case ({{psser.kurt()}}) is
{noformat}
java.lang.ClassCastException: class java.lang.Integer cannot be cast to class 
java.lang.Double (java.lang.Integer and java.lang.Double are in module 
java.base of loader 'bootstrap')
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184)
at scala.math.Ordering.lt(Ordering.scala:98)
at scala.math.Ordering.lt$(Ordering.scala:98)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196)
{noformat}


> Various Pandas functions fail in interpreted mode
> -
>
> Key: SPARK-46189
> URL: https://issues.apache.org/jira/browse/SPARK-46189
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and 
> {{stddev}}) fail with an unboxing-related exception when run in interpreted 
> mode.
> Here are some reproduction cases for pyspark interactive mode:
> {noformat}
> spark.sql("set spark.sql.codegen.wholeStage=false")
> spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> import numpy as np
> import pandas as pd
> import pyspark.pandas as ps
> pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
> psser = ps.from_pandas(pser)
> # each of the following actions gets an unboxing error
> psser.kurt()
> psser.var()
> psser.skew()
> # set up for covaria

[jira] [Created] (SPARK-46189) Various Pandas functions fail in interpreted mode

2023-11-30 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-46189:
-

 Summary: Various Pandas functions fail in interpreted mode
 Key: SPARK-46189
 URL: https://issues.apache.org/jira/browse/SPARK-46189
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark, SQL
Affects Versions: 3.5.0, 3.4.1
Reporter: Bruce Robbins


Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and {{stddev}}) 
fail with an unboxing-related exception when run in interpreted mode.

Here are some reproduction cases for pyspark interactive mode:
{noformat}
sql("set spark.sql.codegen.wholeStage=false")
spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

import numpy as np
import pandas as pd

import pyspark.pandas as ps

pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
psser = ps.from_pandas(pser)

# each of the following actions gets an unboxing error
psser.kurt()
psser.var()
psser.skew()

# set up for covariance test
pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
psdf = ps.from_pandas(pdf)

# this gets an unboxing error
psdf.cov()

# set up for stddev test
from pyspark.pandas.spark import functions as SF
from pyspark.sql.functions import col
from pyspark.sql import Row
df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), 
Row(a=8)])

# this gets an unboxing error
df.select(SF.stddev(col("a"), 1)).collect()
{noformat}
Exception from the first case ({{psser.kurt()}}) is
{noformat}
java.lang.ClassCastException: class java.lang.Integer cannot be cast to class 
java.lang.Double (java.lang.Integer and java.lang.Double are in module 
java.base of loader 'bootstrap')
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184)
at scala.math.Ordering.lt(Ordering.scala:98)
at scala.math.Ordering.lt$(Ordering.scala:98)
at 
org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184)
at 
org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196)
{noformat}
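
For readers unfamiliar with the unboxing error above, here is a minimal 
illustration of the same failure mode in plain Scala (an illustration only, 
not Spark internals):
{noformat}
// Casting a value statically typed as Any to Double compiles to
// scala.runtime.BoxesRunTime.unboxToDouble, the same call that fails in the stack trace.
val boxed: Any = Integer.valueOf(1)
boxed.asInstanceOf[Double]  // ClassCastException: Integer cannot be cast to Double
{noformat}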



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785234#comment-17785234
 ] 

Bruce Robbins commented on SPARK-45896:
---

I think I have a handle on this and will make a PR shortly.

> Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
> --
>
> Key: SPARK-45896
> URL: https://issues.apache.org/jira/browse/SPARK-45896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following action fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
> scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
> AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> However, it succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
> {noformat}
> Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: 
> externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), 
> assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
> lambdavariable(ExternalMapToCatalyst_value, ObjectType(class 
> java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -3), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
> ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
> ObjectType(class scala.Option))), None), input[0, 
> scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> As with the first example, this succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
> {noformat}
> Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3:
> - {{Seq[Option[Timestamp]]}}
> - {{Map[Option[Timestamp]]}}
> - {{Seq[Option[Date]]}}
> - {{Map[Option[Date]]}}
> - {{Seq[Option[BigDecimal]]}}
> - {{Map[Option[BigDecimal]]}}
> However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master:
> - {{Seq[Option[Map]]}}
> - {{Map[Option[Map]]}}
> - {{Seq[Option[<primitive type>]]}}
> - {{Map[Option[<primitive type>]]}}
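
A minimal sketch of one of the other failing cases listed above (Seq of 
Option[Timestamp]); per the description it fails the same way on 3.4.1/3.5.0 
and works on 3.3.3:
{noformat}
import java.sql.Timestamp

// Expected to hit the same EXPRESSION_ENCODING_FAILED / "not a valid external type"
// error on the affected versions.
val df = Seq(Seq(Some(new Timestamp(0L)))).toDF("a")
df.collect
{noformat}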



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Description: 
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}
Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3:
- {{Seq[Option[Timestamp]]}}
- {{Map[Option[Timestamp]]}}
- {{Seq[Option[Date]]}}
- {{Map[Option[Date]]}}
- {{Seq[Option[BigDecimal]]}}
- {{Map[Option[BigDecimal]]}}

However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master:

- {{Seq[Option[Map]]}}
- {{Map[Option[Map]]}}
- {{Seq[Option[<primitive type>]]}}
- {{Map[Option[<primitive type>]]}}

  was:
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.D

[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Summary: Expression encoding fails for Seq/Map of 
Option[Seq/Date/Timestamp/BigDecimal]  (was: Expression encoding fails for 
Seq/Map of Option[Seq])

> Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
> --
>
> Key: SPARK-45896
> URL: https://issues.apache.org/jira/browse/SPARK-45896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following action fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -1), 
> mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), 
> true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
> scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
> AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> However, it succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
> {noformat}
> Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed 
> to encode a value of the expressions: 
> externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), 
> assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
> ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
> lambdavariable(ExternalMapToCatalyst_value, ObjectType(class 
> java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, 
> ObjectType(class java.lang.Object), true, -3), 
> assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
> java.lang.Object), true, -3), IntegerType, IntegerType)), 
> unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
> validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
> ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
> ObjectType(class scala.Option))), None), input[0, 
> scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
> ...
> Caused by: java.lang.RuntimeException: scala.Some is not a valid external 
> type for schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
>  Source)
> ...
> {noformat}
> As with the first example, this succeeds on 3.3.3:
> {noformat}
> scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]

2023-11-11 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45896:
--
Description: 
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}

  was:
The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of option of sequence also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode

[jira] [Created] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]

2023-11-11 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45896:
-

 Summary: Expression encoding fails for Seq/Map of Option[Seq]
 Key: SPARK-45896
 URL: https://issues.apache.org/jira/browse/SPARK-45896
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1
Reporter: Bruce Robbins


The following action fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
val df = Seq(Seq(Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: mapobjects(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -1), 
mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, 
-2), assertnotnull(validateexternaltype(lambdavariable(MapObject, 
ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class 
scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) 
AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
However, it succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: array<array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))])
{noformat}
Map of option of sequence also fails on 3.4.1, 3.5.0, and master:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to 
encode a value of the expressions: 
externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), 
assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, 
ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), 
lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), 
true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), 
assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class 
java.lang.Object), true, -3), IntegerType, IntegerType)), 
unwrapoption(ObjectType(interface scala.collection.immutable.Seq), 
validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, 
ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), 
ObjectType(class scala.Option))), None), input[0, 
scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846
...
Caused by: java.lang.RuntimeException: scala.Some is not a valid external type 
for schema of array
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown
 Source)
...
{noformat}
As with the first example, this succeeds on 3.3.3:
{noformat}
scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))])
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45797) Discrepancies in PySpark DataFrame Results When Using Window Functions and Filters

2023-11-05 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783015#comment-17783015
 ] 

Bruce Robbins commented on SPARK-45797:
---

I wonder if this is the same as SPARK-45543, which had two window specs and 
then produced wrong answers when filtered on rank = 1.

> Discrepancies in PySpark DataFrame Results When Using Window Functions and 
> Filters
> --
>
> Key: SPARK-45797
> URL: https://issues.apache.org/jira/browse/SPARK-45797
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.0
> Environment: Python 3.10
> Pyspark 3.5.0
> Ubuntu 22.04.3 LTS
>Reporter: Daniel Diego Horcajuelo
>Priority: Major
> Fix For: 3.5.0
>
>
> When doing certain types of transformations on a dataframe that involve 
> window functions with filters, I am getting wrong results. Here is a 
> minimal example of the results I get with my code:
>  
> {code:java}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> from pyspark.sql.window import Window as w
> from datetime import datetime, date
> spark = SparkSession.builder.config("spark.sql.repl.eagerEval.enabled", 
> True).getOrCreate()
> # Base dataframe
> df = spark.createDataFrame(
> [
> (1, date(2023, 10, 1), date(2023, 10, 2), "open"),
> (1, date(2023, 10, 2), date(2023, 10, 3), "close"),
> (2, date(2023, 10, 1), date(2023, 10, 2), "close"),
> (2, date(2023, 10, 2), date(2023, 10, 4), "close"),
> (3, date(2023, 10, 2), date(2023, 10, 4), "open"),
> (3, date(2023, 10, 3), date(2023, 10, 6), "open"),
> ],
> schema="id integer, date_start date, date_end date, status string"
> )
> # We define two partition functions
> partition = w.partitionBy("id").orderBy("date_start", 
> "date_end").rowsBetween(w.unboundedPreceding, w.unboundedFollowing)
> partition2 = w.partitionBy("id").orderBy("date_start", "date_end")
> # Define dataframe A
> A = df.withColumn(
> "date_end_of_last_close",
> f.max(f.when(f.col("status") == "close", 
> f.col("date_end"))).over(partition)
> ).withColumn(
> "rank",
> f.row_number().over(partition2)
> )
> display(A)
> | id | date_start | date_end   | status | date_end_of_last_close | rank |
> |----|------------|------------|--------|------------------------|------|
> | 1  | 2023-10-01 | 2023-10-02 | open   | 2023-10-03             | 1    |
> | 1  | 2023-10-02 | 2023-10-03 | close  | 2023-10-03             | 2    |
> | 2  | 2023-10-01 | 2023-10-02 | close  | 2023-10-04             | 1    |
> | 2  | 2023-10-02 | 2023-10-04 | close  | 2023-10-04             | 2    |
> | 3  | 2023-10-02 | 2023-10-04 | open   | NULL                   | 1    |
> | 3  | 2023-10-03 | 2023-10-06 | open   | NULL                   | 2    |
> # When filtering by rank = 1, I get this weird result
> A_result = A.filter(f.col("rank") == 1).drop("rank")
> display(A_result)
> | id | date_start | date_end   | status | date_end_of_last_close |
> |----|------------|------------|--------|------------------------|
> | 1  | 2023-10-01 | 2023-10-02 | open   | NULL                   |
> | 2  | 2023-10-01 | 2023-10-02 | close  | 2023-10-02             |
> | 3  | 2023-10-02 | 2023-10-04 | open   | NULL                   |
> {code}
> I think the Spark engine might be managing the internal partitions 
> incorrectly. If the dataframe is created from scratch (without 
> transformations), the filtering operation returns the right result. In 
> PySpark 3.4.0 this error does not happen.
>  
> For more details, please check out this same question in stackoverflow: 
> [stackoverflow 
> question|https://stackoverflow.com/questions/77396807/discrepancies-in-pyspark-dataframe-results-when-using-window-functions-and-filte?noredirect=1#comment136446225_77396807]
>  
> I'll mark this issue as important because it affects some basic operations 
> which are used daily.
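
For reference, here is a rough Scala translation of the PySpark report above 
(a sketch only, not verified against the reporter's environment; names mirror 
the report):
{noformat}
import java.sql.Date
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (1, Date.valueOf("2023-10-01"), Date.valueOf("2023-10-02"), "open"),
  (1, Date.valueOf("2023-10-02"), Date.valueOf("2023-10-03"), "close"),
  (2, Date.valueOf("2023-10-01"), Date.valueOf("2023-10-02"), "close"),
  (2, Date.valueOf("2023-10-02"), Date.valueOf("2023-10-04"), "close"),
  (3, Date.valueOf("2023-10-02"), Date.valueOf("2023-10-04"), "open"),
  (3, Date.valueOf("2023-10-03"), Date.valueOf("2023-10-06"), "open")
).toDF("id", "date_start", "date_end", "status")

// Two window specs, as in the report: an unbounded one for the conditional max,
// and an ordered one for row_number.
val partition = Window.partitionBy("id").orderBy("date_start", "date_end")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val partition2 = Window.partitionBy("id").orderBy("date_start", "date_end")

val A = df
  .withColumn("date_end_of_last_close",
    max(when(col("status") === "close", col("date_end"))).over(partition))
  .withColumn("rank", row_number().over(partition2))

A.show()
// Per the report, filtering on rank = 1 should keep the date_end_of_last_close
// values shown in A, but on 3.5.0 the filtered result differs.
A.filter(col("rank") === 1).drop("rank").show()
{noformat}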



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-31 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781531#comment-17781531
 ] 

Bruce Robbins commented on SPARK-45644:
---

I will look into it and try to submit a fix. If I can't, I will ping someone 
who can.

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x.
> But after upgrading to 3.4.1 (as well as to 3.5.0), running the same job
> with the same data now always produces the following:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expression

[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-31 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781494#comment-17781494
 ] 

Bruce Robbins commented on SPARK-45644:
---

OK, I can reproduce. I will take a look. I will also try to get my reproduction 
example down to a minimal case and will post here later.

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x.
> But after upgrading to 3.4.1 (as well as to 3.5.0), running the same job
> with the same data now always produces the following:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>  

[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"

2023-10-30 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781091#comment-17781091
 ] 

Bruce Robbins commented on SPARK-45644:
---

You can turn on display of the generated code by adding the following to your 
log4j conf:
{noformat}
logger.codegen.name = 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator
logger.codegen.level = debug
{noformat}
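
As an alternative (a sketch, assuming the query can be run from a spark-shell 
or other driver-side Scala code, with {{df}} standing in for the Dataset in 
question), the generated code for a plan can also be dumped directly:
{noformat}
import org.apache.spark.sql.execution.debug._

// Prints the code generated for df's physical plan to the console.
df.debugCodegen()
{noformat}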
Do you have any application code you can share? It looks like the error happens 
at the start of the job (task 0 stage 0).

> After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException 
> "scala.Some is not a valid external type for schema of array"
> --
>
> Key: SPARK-45644
> URL: https://issues.apache.org/jira/browse/SPARK-45644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Adi Wehrli
>Priority: Major
>
> I do not really know if this is a bug, but I am at the end of my knowledge.
> A Spark job ran successfully with Spark 3.2.x and 3.3.x.
> But after upgrading to 3.4.1 (as well as to 3.5.0), running the same job
> with the same data now always produces the following:
> {code}
> scala.Some is not a valid external type for schema of array
> {code}
> The corresponding stacktrace is:
> {code}
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch 
> worker for task 0.0 in stage 0.0 (TID 0)"
> java.lang.RuntimeException: scala.Some is not a valid external type for 
> schema of array
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380)
>  ~[spark-sql_2.12-3.5.0.jar:3.5.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) 
> ~[scala-library-2.12.15.jar:?]
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.scheduler.Task.run(Task.scala:141) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>  ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>  ~[spark-common-utils_2.12-3.5.0.jar:3.5.0]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) 
> ~[spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) 
> [spark-core_2.12-3.5.0.jar:3.5.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor 
> msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch 
> worker for task 1.0 in stage 0.0 (TID 1)"
> java.lang.Runtime

[jira] [Updated] (SPARK-45580) Subquery changes the output schema of outer query

2023-10-21 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45580:
--
Summary: Subquery changes the output schema of outer query  (was: 
RewritePredicateSubquery unexpectedly changes the output schema of certain 
queries)

> Subquery changes the output schema of outer query
> -
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean 
> column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1 false
> 2 false
> 3 true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, 
> because the Dataset API truncates the right-side of the rows based on the 
> analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
> select c1
> from t2
> where a = c1
> or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
> ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}
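A quick way to observe the mismatch from the {{Dataset}} API (a sketch, assuming the temp views t1/t2/t3 above exist in the session) is to compare the analyzed schema that the Dataset reports with the optimized plan's schema, which is where the superfluous boolean column shows up:
{code:scala}
// Requires a spark-shell (or any SparkSession named `spark`) in which the
// temp views t1, t2 and t3 from the description have been created.
val df = spark.sql(
  """select *
    |from t1
    |where exists (
    |  select c1 from t2
    |  where a = c1 or a in (select col1 from t3)
    |)""".stripMargin)

println(df.schema.simpleString)                              // analyzed plan's schema
println(df.queryExecution.optimizedPlan.schema.simpleString) // optimized plan's schema; shows the extra column on affected versions
{code}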



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query

2023-10-21 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45580:
--
Summary: Subquery changes the output schema of the outer query  (was: 
Subquery changes the output schema of outer query)

> Subquery changes the output schema of the outer query
> -
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean 
> column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1 false
> 2 false
> 3 true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, 
> because the Dataset API truncates the right-side of the rows based on the 
> analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
> select c1
> from t2
> where a = c1
> or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
> ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.

2023-10-20 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-45583.
---
Resolution: Fixed

> Spark SQL returning incorrect values for full outer join on keys with the 
> same name.
> 
>
> Key: SPARK-45583
> URL: https://issues.apache.org/jira/browse/SPARK-45583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Huw
>Priority: Major
> Fix For: 3.5.0
>
>
> {{The following query gives the wrong results.}}
>  
> {code:sql}
> WITH people as (
>   SELECT * FROM (VALUES
>     (1, 'Peter'),
>     (2, 'Homer'),
>     (3, 'Ned'),
>     (3, 'Jenny')
>   ) AS Idiots(id, FirstName)
> ), location as (
>   SELECT * FROM (VALUES
>     (1, 'sample0'),
>     (1, 'sample1'),
>     (2, 'sample2')
>   ) as Locations(id, address)
> )
> SELECT
>   *
> FROM
>   people
> FULL OUTER JOIN
>   location
> ON
>   people.id = location.id
> {code}
> {{We find the following table:}}
> ||id: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |null|Ned|null|null|
> |null|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
> {{But clearly the first `id` column is wrong, the nulls should be 3.}}
> If we rename the id column in (only) the person table to pid we get the 
> correct results:
> ||pid: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |3|Ned|null|null|
> |3|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
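As the description notes, renaming the join key in one table avoids the problem; a sketch of a workaround that keeps the original views (assuming the same CTEs) is to project the two id columns under explicit aliases instead of SELECT *:
{code:scala}
// Sketch only: select the join keys with distinct aliases so the two `id`
// columns remain distinguishable in the output.
spark.sql(
  """WITH people AS (
    |  SELECT * FROM (VALUES (1, 'Peter'), (2, 'Homer'), (3, 'Ned'), (3, 'Jenny'))
    |    AS Idiots(id, FirstName)
    |), location AS (
    |  SELECT * FROM (VALUES (1, 'sample0'), (1, 'sample1'), (2, 'sample2'))
    |    AS Locations(id, address)
    |)
    |SELECT people.id AS person_id, FirstName, location.id AS location_id, address
    |FROM people
    |FULL OUTER JOIN location
    |ON people.id = location.id""".stripMargin
).show(false)
{code}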



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions

2023-10-19 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1304#comment-1304
 ] 

Bruce Robbins commented on SPARK-45601:
---

Possibly SPARK-38666

> stackoverflow when executing rule ExtractWindowExpressions
> --
>
> Key: SPARK-45601
> URL: https://issues.apache.org/jira/browse/SPARK-45601
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: JacobZheng
>Priority: Major
>
> I am encountering StackOverflowError while executing the following test case. 
> Looking at the source code, ExtractWindowExpressions does not extract the window 
> correctly and ends up in an infinite loop in resolveOperatorsDownWithPruning, 
> which causes the error.
> {code:scala}
>  test("agg filter contains window") {
> val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3")
>   .withColumn("test",
> expr("count(col1) filter (where min(col1) over(partition by col2 
> order by col3)>1)"))
> src.show()
>   }
> {code}
> Now my question is: is a window function inside an aggregate's filter clause, as 
> above, correct usage? Or should I add a check like Spark SQL does elsewhere and 
> throw an error such as "It is not allowed to use window functions inside WHERE 
> clause"?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.

2023-10-18 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776783#comment-17776783
 ] 

Bruce Robbins commented on SPARK-45583:
---

Strangely, I cannot reproduce. Is some setting required?
{noformat}
sql("select version()").show(false)
+----------------------------------------------+
|version()                                     |
+----------------------------------------------+
|3.5.0 ce5ddad990373636e94071e7cef2f31021add07b|
+----------------------------------------------+

scala> sql("""WITH people as (
  SELECT * FROM (VALUES 
(1, 'Peter'), 
(2, 'Homer'), 
(3, 'Ned'),
(3, 'Jenny')
  ) AS Idiots(id, FirstName)
), location as (
  SELECT * FROM (VALUES
(1, 'sample0'),
(1, 'sample1'),
(2, 'sample2')  
  ) as Locations(id, address)
)SELECT
  *
FROM
  people
FULL OUTER JOIN
  location
ON
  people.id = location.id""").show(false)
+---+---------+----+-------+
|id |FirstName|id  |address|
+---+---------+----+-------+
|1  |Peter    |1   |sample0|
|1  |Peter    |1   |sample1|
|2  |Homer    |2   |sample2|
|3  |Ned      |NULL|NULL   |
|3  |Jenny    |NULL|NULL   |
+---+---------+----+-------+

scala> 
{noformat}

> Spark SQL returning incorrect values for full outer join on keys with the 
> same name.
> 
>
> Key: SPARK-45583
> URL: https://issues.apache.org/jira/browse/SPARK-45583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Huw
>Priority: Major
>
> {{The following query gives the wrong results.}}
>  
> {code:sql}
> WITH people as (
>   SELECT * FROM (VALUES
>     (1, 'Peter'),
>     (2, 'Homer'),
>     (3, 'Ned'),
>     (3, 'Jenny')
>   ) AS Idiots(id, FirstName)
> ), location as (
>   SELECT * FROM (VALUES
>     (1, 'sample0'),
>     (1, 'sample1'),
>     (2, 'sample2')
>   ) as Locations(id, address)
> )
> SELECT
>   *
> FROM
>   people
> FULL OUTER JOIN
>   location
> ON
>   people.id = location.id
> {code}
> {{We find the following table:}}
> ||id: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |null|Ned|null|null|
> |null|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
> {{But clearly the first `id` column is wrong, the nulls should be 3.}}
> If we rename the id column in (only) the person table to pid we get the 
> correct results:
> ||pid: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |3|Ned|null|null|
> |3|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries

2023-10-17 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776401#comment-17776401
 ] 

Bruce Robbins commented on SPARK-45580:
---

I'll make a PR in the coming days.

> RewritePredicateSubquery unexpectedly changes the output schema of certain 
> queries
> --
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean 
> column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1 false
> 2 false
> 3 true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result, 
> because the Dataset API truncates the right-side of the rows based on the 
> analyzed plan's schema (it's the optimized plan's schema that goes wrong).
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
> select c1
> from t2
> where a = c1
> or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis
>   at scala.Predef$.assert(Predef.scala:279)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
> ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries

2023-10-17 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45580:
--
Description: 
A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see the incorrect result, 
because the Dataset API truncates the right-side of the rows based on the 
analyzed plan's schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
at scala.Predef$.assert(Predef.scala:279)
at 
org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}


  was:
A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see this result, because the 
Dataset API truncates the right-side of the rows based on the analyzed plan's 
schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
at scala.Predef$.assert(Predef.scala:279)
at 
org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}



> RewritePredicateSubquery unexpectedly changes the output schema of certain 
> queries
> --
>
> Key: SPARK-45580
> URL: https://issues.apache.org/jira/browse/SPARK-45580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace

[jira] [Created] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries

2023-10-17 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45580:
-

 Summary: RewritePredicateSubquery unexpectedly changes the output 
schema of certain queries
 Key: SPARK-45580
 URL: https://issues.apache.org/jira/browse/SPARK-45580
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1, 3.3.3
Reporter: Bruce Robbins


A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1   false
2   false
3   true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see this result, because the 
Dataset API truncates the right-side of the rows based on the analyzed plan's 
schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
select c1
from t2
where a = c1
or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
at scala.Predef$.assert(Predef.scala:279)
at 
org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45440) Incorrect summary counts from a CSV file

2023-10-06 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772724#comment-17772724
 ] 

Bruce Robbins commented on SPARK-45440:
---

I added {{inferSchema=true}} as a datasource option in your example and I got 
the expected answer. Otherwise it's doing a max and min on a string (not a 
number).
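For reference, a sketch of the suggested change (in Scala for brevity; the report uses PySpark, and the file path is a placeholder):
{code:scala}
// Without inferSchema (or an explicit schema) every CSV column is read as a string,
// so min/max compare lexicographically rather than numerically.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")  // the key change
  .csv("AAPL.csv")                // placeholder path to the file from the report
{code}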

> Incorrect summary counts from a CSV file
> 
>
> Key: SPARK-45440
> URL: https://issues.apache.org/jira/browse/SPARK-45440
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.0
> Environment: Pyspark version 3.5.0 
>Reporter: Evan Volgas
>Priority: Major
>  Labels: aggregation, bug, pyspark
>
> I am using pip-installed Pyspark version 3.5.0 inside the context of an 
> IPython shell. The task is straightforward: take [this CSV 
> file|https://gist.githubusercontent.com/evanvolgas/e5cb082673ec947239658291f2251de4/raw/a9c5e9866ac662a816f9f3828a2d184032f604f0/AAPL.csv]
>  of AAPL stock prices and compute the minimum and maximum volume weighted 
> average price for the entire file. 
> My code is [here. 
> |https://gist.github.com/evanvolgas/e4aa75fec4179bb7075a5283867f127c]I've 
> also performed the same computation in DuckDB because I noticed that the 
> results of the Spark code are wrong. 
> Literally, the exact same SQL in DuckDB and in Spark yields different results, 
> and Spark's are wrong. 
> I have never seen this behavior in a Spark release before. I'm very confused 
> by it, and curious if anyone else can replicate this behavior. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use

2023-09-14 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45171:
-

 Summary: GenerateExec fails to initialize non-deterministic 
expressions before use
 Key: SPARK-45171
 URL: https://issues.apache.org/jira/browse/SPARK-45171
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The following query fails:
{noformat}
select *
from explode(
  transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22)
);
{noformat}
The error is:
{noformat}
23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: requirement failed: Nondeterministic 
expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized 
before eval.
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
at 
org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
at 
org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
at 
org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543)
at 
org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384)
at 
org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062)
at 
org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275)
at 
org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274)
at 
org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308)
at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
...
{noformat}
However, this query succeeds:
{noformat}
select *
from explode(
  sequence(0, cast(rand()*1000 as int) + 1)
);
{noformat}
The difference is that {{transform}} turns off whole-stage codegen, which 
exposes a bug in {{GenerateExec}} where the non-deterministic expression passed 
to the generator function is not initialized before being used.

An even simpler repro case is:
{noformat}
set spark.sql.codegen.wholeStage=false;

select explode(array(rand()));
{noformat}
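For context, interpreted (non-codegen) evaluation requires non-deterministic expressions to be initialized with the partition index before the first eval; a sketch of that general pattern using Spark's internal Catalyst API, for illustration only (not the actual GenerateExec patch):
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, Nondeterministic}

// Walk the expression tree and initialize every non-deterministic node
// before any eval() call on this partition.
def initializeNondeterministic(expr: Expression, partitionIndex: Int): Unit =
  expr.foreach {
    case n: Nondeterministic => n.initialize(partitionIndex)
    case _ =>
  }
{code}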




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44912) Spark 3.4 multi-column sum slows with many columns

2023-09-10 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763455#comment-17763455
 ] 

Bruce Robbins commented on SPARK-44912:
---

It looks like this was fixed with SPARK-45071. Your issue was reported earlier, 
but missed somehow.

> Spark 3.4 multi-column sum slows with many columns
> --
>
> Key: SPARK-44912
> URL: https://issues.apache.org/jira/browse/SPARK-44912
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Brady Bickel
>Priority: Major
>
> The code below is a minimal reproducible example of an issue I discovered 
> with Pyspark 3.4.x. I want to sum the values of multiple columns and put the 
> sum of those columns (per row) into a new column. This code works and returns 
> in a reasonable amount of time in Pyspark 3.3.x, but is extremely slow in 
> Pyspark 3.4.x when the number of columns grows. See below for execution 
> timing summary as N varies.
> {code:java}
> import pyspark.sql.functions as F
> import random
> import string
> from functools import reduce
> from operator import add
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> # generate a dataframe N columns by M rows with random 8 digit column 
> # names and random integers in [-5,10]
> N = 30
> M = 100
> columns = [''.join(random.choices(string.ascii_uppercase +
>   string.digits, k=8))
>for _ in range(N)]
> data = [tuple([random.randint(-5,10) for _ in range(N)])
> for _ in range(M)]
> df = spark.sparkContext.parallelize(data).toDF(columns)
> # 3 ways to add a sum column, all of them slow for high N in spark 3.4
> df = df.withColumn("col_sum1", sum(df[col] for col in columns))
> df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
> df = df.withColumn("col_sum3", F.expr("+".join(columns))) {code}
> Timing results for Spark 3.3:
> ||N||Exe Time (s)||
> |5|0.514|
> |10|0.248|
> |15|0.327|
> |20|0.403|
> |25|0.279|
> |30|0.322|
> |50|0.430|
> Timing results for Spark 3.4:
> ||N||Exe Time (s)||
> |5|0.379|
> |10|0.318|
> |15|0.405|
> |20|1.32|
> |25|28.8|
> |30|448|
> |50|>1 (did not finish)|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check

2023-09-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45106:
--
Affects Version/s: 3.3.2

>  percentile_cont gets internal error when user input fails runtime 
> replacement's input type check
> -
>
> Key: SPARK-45106
> URL: https://issues.apache.org/jira/browse/SPARK-45106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> This query throws an internal error rather than producing a useful error 
> message:
> {noformat}
> select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x 
> from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);
> [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression 
> "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime 
> replaceable expression "percentile_cont(a, b)". The replacement is 
> unresolved: "percentile(a, b, 1)".
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:92)
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
> ...
> {noformat}
> It should instead inform the user that the input expression must be foldable.
> {{PercentileCont}} does not check the user's input. If the runtime 
> replacement (an instance of {{Percentile}}) rejects the user's input, the 
> runtime replacement ends up unresolved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check

2023-09-08 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45106:
-

 Summary:  percentile_cont gets internal error when user input 
fails runtime replacement's input type check
 Key: SPARK-45106
 URL: https://issues.apache.org/jira/browse/SPARK-45106
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1, 3.5.0, 4.0.0
Reporter: Bruce Robbins


This query throws an internal error rather than producing a useful error 
message:
{noformat}
select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x 
from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);

[INTERNAL_ERROR] Cannot resolve the runtime replaceable expression 
"percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime 
replaceable expression "percentile_cont(a, b)". The replacement is unresolved: 
"percentile(a, b, 1)".
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:92)
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:96)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
...
{noformat}
It should instead inform the user that the input expression must be foldable.

{{PercentileCont}} does not check the user's input. If the runtime replacement 
(an instance of {{Percentile}}) rejects the user's input, the runtime 
replacement ends up unresolved.
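For contrast, a sketch of a call that resolves cleanly (same inline table, but with a foldable percentage literal rather than the column b):
{code:scala}
// The percentage argument must be a foldable expression, e.g. a literal.
spark.sql(
  """select percentile_cont(0.25) within group (order by a desc) as x
    |from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b)""".stripMargin
).show()
{code}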




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-07 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44805:
--
Affects Version/s: 3.4.1

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.4.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-07 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762792#comment-17762792
 ] 

Bruce Robbins commented on SPARK-44805:
---

PR here: https://github.com/apache/spark/pull/42850

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-05 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762234#comment-17762234
 ] 

Bruce Robbins commented on SPARK-44805:
---

I looked at this yesterday and I think I have a handle on what's going on. I 
will make a PR in the coming days.

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-04 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44805:
--
Labels: correctness  (was: )

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-08-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754344#comment-17754344
 ] 

Bruce Robbins edited comment on SPARK-44805 at 8/15/23 12:26 AM:
-

[~sunchao] 

It seems to be some weird interaction between Parquet nested vectorization and 
the {{Cast}} expression:
{noformat}
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select value from t1;
{"f1":[1,2,3],"f2":[1,1,2]} <== this is expected
Time taken: 0.126 seconds, Fetched 1 row(s)

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}   <== this is not expected
Time taken: 0.102 seconds, Fetched 1 row(s)

set spark.sql.parquet.enableNestedColumnVectorizedReader=false;

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[1,1,2]}   <== now has expected value
Time taken: 0.244 seconds, Fetched 1 row(s)
{noformat}
The union operation adds this {{Cast}} expression because {{value}} has 
different datatypes between your two dataframes.


was (Author: bersprockets):
It seems to be some weird interaction between Parquet and the {{Cast}} 
expression:
{noformat}
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select value from t1;
{"f1":[1,2,3],"f2":[1,1,2]} <== this is expected
Time taken: 0.126 seconds, Fetched 1 row(s)

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}   <== this is not expected
Time taken: 0.102 seconds, Fetched 1 row(s)
{noformat}
The union operation adds this {{Cast}} expression because {{value}} has 
different datatypes between your two dataframes.

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types where one is double and second is integer) data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look int

[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-08-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754344#comment-17754344
 ] 

Bruce Robbins commented on SPARK-44805:
---

It seems to be some weird interaction between Parquet and the {{Cast}} 
expression:
{noformat}
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select value from t1;
{"f1":[1,2,3],"f2":[1,1,2]} <== this is expected
Time taken: 0.126 seconds, Fetched 1 row(s)

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}   <== this is not expected
Time taken: 0.102 seconds, Fetched 1 row(s)
{noformat}
The union operation adds this {{Cast}} expression because {{value}} has 
different datatypes between your two dataframes.
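
As a stopgap while this gets investigated, one thing to try is disabling the nested column vectorized reader named in the issue title before re-reading and unioning the data. The snippet below is only a sketch (the path is a placeholder), not a verified fix:
{code:scala}
// Sketch of a possible mitigation, not a verified fix: turn off the nested column
// vectorized reader (the config from the issue title) and redo the union.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "false")

val dataDir = "/tmp/spark-44805/"   // placeholder base path, adjust as needed
val parquet1 = spark.read.parquet(dataDir + "data1")
val parquet2 = spark.read.parquet(dataDir + "data2")
val out = parquet1.union(parquet2)
out.select("value.f2").distinct().show()   // expected to show [1, 1, 2] and [1, 1] with no zeroed rows, if the vectorized reader is the culprit
{code}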

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>
> When union-ing two DataFrames read from parquet containing nested structures 
> (2 fields of array types, where one is an array of doubles and the other an 
> array of integers), data from the second field seems to be lost (zeros are 
> returned instead). 
> This seems to happen only if the nested vectorised reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44477) CheckAnalysis uses error subclass as an error class

2023-07-18 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744314#comment-17744314
 ] 

Bruce Robbins commented on SPARK-44477:
---

PR here: https://github.com/apache/spark/pull/42064

> CheckAnalysis uses error subclass as an error class
> ---
>
> Key: SPARK-44477
> URL: https://issues.apache.org/jira/browse/SPARK-44477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, 
> but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}.
> {noformat}
> spark-sql (default)> select bitmap_count(12);
> [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT'
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error 
> class 'TYPE_CHECK_FAILURE_WITH_HINT'
> at org.apache.spark.SparkException$.internalError(SparkException.scala:83)
> at org.apache.spark.SparkException$.internalError(SparkException.scala:87)
> at 
> org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68)
> at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589)
> at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73)
> {noformat}
> This issue only occurs when an expression uses 
> {{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. 
> {{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of 
> {{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were 
> added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and 
> {{{}BitmapOrAgg{}}}.
> {{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use 
> {{{}TypeCheckResult.DataTypeMismatch{}}}. Regardless, the code in 
> {{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be 
> corrected (or removed).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44477) CheckAnalysis uses error subclass as an error class

2023-07-18 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-44477:
-

 Summary: CheckAnalysis uses error subclass as an error class
 Key: SPARK-44477
 URL: https://issues.apache.org/jira/browse/SPARK-44477
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


{{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, 
but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}.
{noformat}
spark-sql (default)> select bitmap_count(12);
[INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT'
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error class 
'TYPE_CHECK_FAILURE_WITH_HINT'
at org.apache.spark.SparkException$.internalError(SparkException.scala:83)
at org.apache.spark.SparkException$.internalError(SparkException.scala:87)
at 
org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68)
at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361)
at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594)
at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589)
at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73)
{noformat}
This issue only occurs when an expression uses 
{{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. 
{{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of 
{{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were 
added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and 
{{{}BitmapOrAgg{}}}.

{{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use 
{{{}TypeCheckResult.DataTypeMismatch{}}}. Regardless, the code in 
{{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be 
corrected (or removed).
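
For what it's worth, the shape of the change would be roughly as sketched below. This is illustration only, not the actual {{BitmapCount}} code; the error subclass and message parameter names are assumptions:
{code:scala}
// Sketch only: returning DataTypeMismatch instead of the deprecated TypeCheckFailure.
// The subclass and message parameters below are illustrative assumptions.
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch
import org.apache.spark.sql.types.{BinaryType, DataType}

def checkBitmapInput(inputType: DataType): TypeCheckResult = {
  if (inputType == BinaryType) {
    TypeCheckResult.TypeCheckSuccess
  } else {
    DataTypeMismatch(
      errorSubClass = "UNEXPECTED_INPUT_TYPE",   // assumed subclass name
      messageParameters = Map(
        "requiredType" -> BinaryType.simpleString,
        "inputType" -> inputType.simpleString))
  }
}
{code}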



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-07-01 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Labels: correctness  (was: )

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-30 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Affects Version/s: 3.3.2

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-30 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Affects Version/s: 3.4.1

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-30 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739180#comment-17739180
 ] 

Bruce Robbins commented on SPARK-44251:
---

PR can be found here: https://github.com/apache/spark/pull/41809

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738762#comment-17738762
 ] 

Bruce Robbins commented on SPARK-44251:
---

This is similar to, but not quite the same as SPARK-43718, and the fix will be 
similar too.

I will make a PR shortly.
 

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Summary: Potential for incorrect results or NPE when full outer USING join 
has null key value  (was: Potentially incorrect results or NPE when full outer 
USING join has null key value)

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44251) Potentially incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-44251:
-

 Summary: Potentially incorrect results or NPE when full outer 
USING join has null key value
 Key: SPARK-44251
 URL: https://issues.apache.org/jira/browse/SPARK-44251
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The following query produces incorrect results:
{noformat}
create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values (2, 3) as (c1, c2);

select explode(array(c1)) as x
from v1
full outer join v2
using (c1);

-1   <== should be null
1
2
{noformat}
The following query fails with a {{NullPointerException}}:
{noformat}
create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values ('2', 3) as (c1, c2);

select explode(array(c1)) as x
from v1
full outer join v2
using (c1);

23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
...
{noformat}
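
As an unverified workaround sketch (not a fix for the underlying bug), the same query can be written without the USING clause by joining on an explicit condition and coalescing the keys by hand:
{code:scala}
// Unverified workaround sketch: the same full outer join with an explicit ON
// condition and a manual coalesce of the keys, avoiding the USING clause.
spark.sql("create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2)")
spark.sql("create or replace temp view v2 as values (2, 3) as (c1, c2)")

val rewritten = spark.sql("""
  SELECT explode(array(coalesce(v1.c1, v2.c1))) AS x
  FROM v1
  FULL OUTER JOIN v2
  ON v1.c1 = v2.c1
""")
rewritten.show()   // expected: 1, null, 2 (the null key preserved), if nullability is tracked correctly here
{code}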




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735976#comment-17735976
 ] 

Bruce Robbins commented on SPARK-44132:
---

[~steven.aerts] Go for it!

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator and, depending on the data 
> used, can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null 
> check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes 
> NPE on right parameter here{code}
> It is as if the nesting of 2 full outer joins is confusing the code 
> generator and as such generating invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> which were using the java bean encoder.  We tried to reproduce this in the 
> spark shell or using scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stacktrace 
> above) on the spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] to this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735944#comment-17735944
 ] 

Bruce Robbins edited comment on SPARK-44132 at 6/22/23 1:51 AM:


You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where it's provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221, SPARK-26680).


was (Author: bersprockets):
You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where it's provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221).

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator and, depending on the data 
> used, can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null 
> check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes 
> NPE on right parameter here{code}
> It is as if the nesting of 2 full outer joins is confusing the code 
> generator and as such generating invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> which were using the java bean encoder.  We tried to reproduce this in the 
> spark shell or using scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stacktrace 
> above) on the spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] to this case.

[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735944#comment-17735944
 ] 

Bruce Robbins commented on SPARK-44132:
---

You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where it's provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221).
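
If the lazy {{Stream}} really is the trigger, a possible workaround (sketch only, untested) is to materialize the join-column sequence into a strict collection before joining:
{code:scala}
// Sketch, assuming the Stream of join columns is the trigger: force it into a
// strict List before passing it to join. Run in spark-shell (implicits in scope).
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val idCols: Seq[String] = Stream("id").toList   // .toList materializes the lazy Stream

val joined = dsA.join(dsB, idCols, "full_outer").join(dsC, idCols, "full_outer")
joined.collectAsList
{code}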

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator and, depending on the data 
> used, can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null 
> check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes 
> NPE on right parameter here{code}
> It is as if the nesting of 2 full outer joins is confusing the code 
> generator and as such generating invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> which were using the java bean encoder.  We tried to reproduce this in the 
> spark shell or using scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stacktrace 
> above) on the spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] to this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44040) Incorrect result after count distinct

2023-06-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732163#comment-17732163
 ] 

Bruce Robbins commented on SPARK-44040:
---

It seems this can be reproduced in {{spark-sql}} as well.

Interestingly, turning off AQE seems to fix the issue (for both the above 
dataframe version and the below SQL version):
{noformat}
spark-sql (default)> create or replace temp view v1 as
select 1 as c1 limit 0;
Time taken: 0.959 seconds
spark-sql (default)> create or replace temp view agg1 as
select sum(c1) as c1, "agg1" as name
from v1;
Time taken: 0.16 seconds
spark-sql (default)> create or replace temp view agg2 as
select sum(c1) as c1, "agg2" as name
from v1;
Time taken: 0.035 seconds
spark-sql (default)> create or replace temp view union1 as
select * from agg1
union
select * from agg2;
Time taken: 0.088 seconds
spark-sql (default)> -- the following incorrectly produces 2 rows
select distinct c1 from union1;
NULL
NULL
Time taken: 1.649 seconds, Fetched 2 row(s)
spark-sql (default)> set spark.sql.adaptive.enabled=false;
spark.sql.adaptive.enabled  false
Time taken: 0.019 seconds, Fetched 1 row(s)
spark-sql (default)> -- the following correctly produces 1 row
select distinct c1 from union1;
NULL
Time taken: 1.372 seconds, Fetched 1 row(s)
spark-sql (default)> 
{noformat}
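
For the dataframe version in the description, the same observation can be expressed as below. This only mirrors the AQE toggle above; it is not a fix for the underlying bug:
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, sum}
import org.apache.spark.sql.types._

// Rebuild the empty-decimal union from the description, then count distinct
// with AQE disabled (the toggle observed above).
spark.conf.set("spark.sql.adaptive.enabled", "false")

val schema = StructType(Array(
  StructField("money", DecimalType(38, 6), true),
  StructField("reference_id", StringType, true)))
val payDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
val aggDf  = payDf.agg(sum("money").as("money")).withColumn("name", lit("df1"))
val aggDf1 = payDf.agg(sum("money").as("money")).withColumn("name", lit("df2"))
val unionDF = aggDf.union(aggDf1)

unionDF.select("money").distinct.count   // expected to return 1 with AQE off
{code}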

> Incorrect result after count distinct
> -
>
> Key: SPARK-44040
> URL: https://issues.apache.org/jira/browse/SPARK-44040
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Aleksandr Aleksandrov
>Priority: Critical
>
> When I try to call count after the distinct function on a Decimal null field, 
> Spark returns an incorrect result starting from Spark 3.4.0.
> A minimal example to reproduce:
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.\{Column, DataFrame, Dataset, Row, SparkSession}
> import org.apache.spark.sql.types.\{StringType, StructField, StructType}
> val schema = StructType( Array(
> StructField("money", DecimalType(38,6), true),
> StructField("reference_id", StringType, true)
> ))
> val payDf = spark.createDataFrame(sc.emptyRDD[Row], schema)
> val aggDf = payDf.agg(sum("money").as("money")).withColumn("name", lit("df1"))
> val aggDf1 = payDf.agg(sum("money").as("money")).withColumn("name", 
> lit("df2"))
> val unionDF: DataFrame = aggDf.union(aggDf1)
> unionDF.select("money").distinct.show // return correct result
> unionDF.select("money").distinct.count // return 2 instead of 1
> unionDF.select("money").distinct.count == 1 // return false
> This block of code returns an assertion error and after that an incorrect 
> count (in Spark 3.2.1 everything works fine and I get the correct result = 1):
> *scala> unionDF.select("money").distinct.show // return correct result*
> java.lang.AssertionError: assertion failed:
> Decimal$DecimalIsFractional
> while compiling: 
> during phase: globalPhase=terminal, enteringPhase=jvm
> library version: version 2.12.17
> compiler version: version 2.12.17
> reconstructed args: -classpath 
> /Users/aleksandrov/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-core_2.12-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-storage-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar:/Users/aleksandrov/.ivy2/jars/org.antlr_antlr4-runtime-4.9.3.jar
>  -Yrepl-class-based -Yrepl-outdir 
> /private/var/folders/qj/_dn4xbp14jn37qmdk7ylyfwcgr/T/spark-f37bb154-75f3-4db7-aea8-3c4363377bd8/repl-350f37a1-1df1-4816-bd62-97929c60a6c1
> last tree to typer: TypeTree(class Byte)
> tree position: line 6 of 
> tree tpe: Byte
> symbol: (final abstract) class Byte in package scala
> symbol definition: final abstract class Byte extends (a ClassSymbol)
> symbol package: scala
> symbol owners: class Byte
> call site: constructor $eval in object $eval in package $line19
> == Source file context for tree position ==
> 3
> 4object $eval {
> 5lazyval $result = 
> $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
> 6lazyval $print: {_}root{_}.java.lang.String = {
> 7 $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
> 8
> 9""
> at 
> scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
> at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
> at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
> at 
> scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
> at 
> scala.reflect.internal.S

[jira] [Resolved] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError

2023-05-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-43843.
---
Resolution: Invalid

> Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
> ---
>
> Key: SPARK-43843
> URL: https://issues.apache.org/jira/browse/SPARK-43843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, 
> Java 11.0.12)
>Reporter: Bruce Robbins
>Priority: Major
>
> I launched spark-shell as follows:
> {noformat}
> bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | 
> grep -v test | head -1`
> {noformat}
> I got the below error trying to create an AVRO file:
> {noformat}
> scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> df.write.mode("overwrite").format("avro").save("avro_file")
> df.write.mode("overwrite").format("avro").save("avro_file")
> java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps
>   at 
> org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120)
> ...
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError

2023-05-28 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726988#comment-17726988
 ] 

Bruce Robbins commented on SPARK-43843:
---

Never mind, I had an old {{spark-avro_2.12-3.5.0-SNAPSHOT.jar}} lying around in 
my {{work}} directory, which the {{find}} in my {{--jars}} value picked up first.

> Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
> ---
>
> Key: SPARK-43843
> URL: https://issues.apache.org/jira/browse/SPARK-43843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, 
> Java 11.0.12)
>Reporter: Bruce Robbins
>Priority: Major
>
> I launched spark-shell as follows:
> {noformat}
> bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | 
> grep -v test | head -1`
> {noformat}
> I got the below error trying to create an AVRO file:
> {noformat}
> scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> df.write.mode("overwrite").format("avro").save("avro_file")
> df.write.mode("overwrite").format("avro").save("avro_file")
> java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps
>   at 
> org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120)
> ...
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43841) Non-existent column in projection of full outer join with USING results in StringIndexOutOfBoundsException

2023-05-28 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726980#comment-17726980
 ] 

Bruce Robbins commented on SPARK-43841:
---

PR at https://github.com/apache/spark/pull/41353

> Non-existent column in projection of full outer join with USING results in 
> StringIndexOutOfBoundsException
> --
>
> Key: SPARK-43841
> URL: https://issues.apache.org/jira/browse/SPARK-43841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> The following query throws a {{StringIndexOutOfBoundsException}}:
> {noformat}
> with v1 as (
>  select * from values (1, 2) as (c1, c2)
> ),
> v2 as (
>   select * from values (2, 3) as (c1, c2)
> )
> select v1.c1, v1.c2, v2.c1, v2.c2, b
> from v1
> full outer join v2
> using (c1);
> {noformat}
> The query should fail anyway, since {{b}} refers to a non-existent column. 
> But it should fail with a helpful error message, not with a 
> {{StringIndexOutOfBoundsException}}.
> The issue seems to be in 
> {{StringUtils#orderSuggestedIdentifiersBySimilarity}}. 
> {{orderSuggestedIdentifiersBySimilarity}} assumes that a list of candidate 
> attributes with a mix of prefixes will never have an attribute name with an 
> empty prefix. But in this case it does ({{c1}} from the {{coalesce}} has no 
> prefix, since it is not associated with any relation or subquery):
> {noformat}
> +- 'Project [c1#5, c2#6, c1#7, c2#8, 'b]
>+- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no 
> prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2)
>   +- Join FullOuter, (c1#5 = c1#7)
>  :- SubqueryAlias v1
>  :  +- CTERelationRef 0, true, [c1#5, c2#6]
>  +- SubqueryAlias v2
> +- CTERelationRef 1, true, [c1#7, c2#8]
> {noformat}
> Because of this, {{orderSuggestedIdentifiersBySimilarity}} returns a sorted 
> list of suggestions like this:
> {noformat}
> ArrayBuffer(.c1, v1.c2, v2.c2)
> {noformat}
> {{UnresolvedAttribute.parseAttributeName}} chokes on an attribute name that 
> starts with a namespace separator ('.').



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError

2023-05-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-43843:
--
Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, Java 
11.0.12)

> Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
> ---
>
> Key: SPARK-43843
> URL: https://issues.apache.org/jira/browse/SPARK-43843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, 
> Java 11.0.12)
>Reporter: Bruce Robbins
>Priority: Major
>
> I launched spark-shell as follows:
> {noformat}
> bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | 
> grep -v test | head -1`
> {noformat}
> I got the below error trying to create an AVRO file:
> {noformat}
> scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
> val df: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> df.write.mode("overwrite").format("avro").save("avro_file")
> df.write.mode("overwrite").format("avro").save("avro_file")
> java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps
>   at 
> org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74)
>   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
>   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120)
> ...
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError

2023-05-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-43843:
-

 Summary: Saving an AVRO file with Scala 2.13 results in 
NoClassDefFoundError
 Key: SPARK-43843
 URL: https://issues.apache.org/jira/browse/SPARK-43843
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


I launched spark-shell as follows:
{noformat}
bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | 
grep -v test | head -1`
{noformat}
I got the below error trying to create an AVRO file:
{noformat}
scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
val df = Seq((1, 2), (3, 4)).toDF("a", "b")
val df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.write.mode("overwrite").format("avro").save("avro_file")
df.write.mode("overwrite").format("avro").save("avro_file")
java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps
  at 
org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160)
  at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75)
  at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74)
  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
  at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105)
  at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120)
...
scala> 
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43841) Non-existent column in projection of full outer join with USING results in StringIndexOutOfBoundsException

2023-05-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-43841:
-

 Summary: Non-existent column in projection of full outer join with 
USING results in StringIndexOutOfBoundsException
 Key: SPARK-43841
 URL: https://issues.apache.org/jira/browse/SPARK-43841
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The following query throws a {{StringIndexOutOfBoundsException}}:
{noformat}
with v1 as (
 select * from values (1, 2) as (c1, c2)
),
v2 as (
  select * from values (2, 3) as (c1, c2)
)
select v1.c1, v1.c2, v2.c1, v2.c2, b
from v1
full outer join v2
using (c1);
{noformat}
The query should fail anyway, since {{b}} refers to a non-existent column. But 
it should fail with a helpful error message, not with a 
{{StringIndexOutOfBoundsException}}.

The issue seems to be in {{StringUtils#orderSuggestedIdentifiersBySimilarity}}. 
{{orderSuggestedIdentifiersBySimilarity}} assumes that a list of candidate 
attributes with a mix of prefixes will never have an attribute name with an 
empty prefix. But in this case it does ({{c1}} from the {{coalesce}} has no 
prefix, since it is not associated with any relation or subquery):
{noformat}
+- 'Project [c1#5, c2#6, c1#7, c2#8, 'b]
   +- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no 
prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2)
  +- Join FullOuter, (c1#5 = c1#7)
 :- SubqueryAlias v1
 :  +- CTERelationRef 0, true, [c1#5, c2#6]
 +- SubqueryAlias v2
+- CTERelationRef 1, true, [c1#7, c2#8]
{noformat}
Because of this, {{orderSuggestedIdentifiersBySimilarity}} returns a sorted 
list of suggestions like this:
{noformat}
ArrayBuffer(.c1, v1.c2, v2.c2)
{noformat}
{{UnresolvedAttribute.parseAttributeName}} chokes on an attribute name that 
starts with a namespace separator ('.').
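
As a quick illustration of that last point (sketch only; it assumes {{UnresolvedAttribute.parseAttributeName}} behaves as described in this report):
{code:scala}
// Illustration only: a suggested identifier with an empty prefix starts with the
// namespace separator. Per this report, parseAttributeName chokes on it
// (StringIndexOutOfBoundsException) instead of failing with a helpful message.
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

UnresolvedAttribute.parseAttributeName(".c1")   // expected (per the report) to throw
{code}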




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability

2023-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725143#comment-17725143
 ] 

Bruce Robbins commented on SPARK-43718:
---

PR here: https://github.com/apache/spark/pull/41267

> References to a specific side's key in a USING join can have wrong nullability
> --
>
> Key: SPARK-43718
> URL: https://issues.apache.org/jira/browse/SPARK-43718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view t1 as values (1), (2), (3) as (c1);
> create or replace temp view t2 as values (2), (3), (4) as (c1);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1
> from t1
> full outer join t2
> using (c1);
> 1
> -1  <== should be null
> 2
> 2
> 3
> 3
> -1  <== should be null
> 4
> Time taken: 0.663 seconds, Fetched 8 row(s)
> spark-sql (default)> 
> {noformat}
> Similar issues occur with right outer join and left outer join.
> {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is 
> resolved, so the array's {{containsNull}} value is incorrect.
> Queries that don't use arrays also can get wrong results. Assume this data:
> {noformat}
> create or replace temp view t1 as values (0), (1), (2) as (c1);
> create or replace temp view t2 as values (1), (2), (3) as (c1);
> create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b);
> {noformat}
> The following query produces incorrect results:
> {noformat}
> select t1.c1 as t1_c1, t2.c1 as t2_c1, b
> from t1
> full outer join t2
> using (c1),
> lateral (
>   select b
>   from t3
>   where a = coalesce(t2.c1, 1)
> ) lt3;
> 1 1   2
> NULL  3   4
> Time taken: 2.395 seconds, Fetched 2 row(s)
> spark-sql (default)> 
> {noformat}
> The result should be the following:
> {noformat}
> 0 NULL2
> 1 1   2
> NULL  3   4
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


