[ https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849785#comment-17849785 ]
Bruce Robbins commented on SPARK-47193: --------------------------------------- Thanks for the update. This issue is seemingly reproducible without a join, running {{spark-shell}} with {{spark.sql.legacy.timeParserPolicy=LEGACY}}: {noformat} val user = { spark .read .option("header", "true") .option("comment", "#") .option("nullValue", "null") .schema("UserId int, Created timestamp, deleted timestamp, Active boolean, ActivatedDate timestamp") .csv("/tmp/user.csv") } user.createOrReplaceTempView("user") sql("select * from user where ActivatedDate is not null").count sql("select * from user where ActivatedDate is not null").rdd.count {noformat} The first count operation returns 4. The second returns 0. A collect action shows why: all the timestamps are null when {{rdd}} is specified: {noformat} scala> sql("select * from user").collect warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation` val res10: Array[org.apache.spark.sql.Row] = Array( [1,2021-11-22 11:27:27.0,2021-11-25 11:27:27.0,false,2021-11-22 11:27:27.0], [2,2021-11-22 11:27:27.0,null,true,2021-11-22 11:27:27.0], [3,2021-11-22 11:27:27.0,null,true,2021-11-22 11:27:27.0], [4,2021-11-22 11:27:27.0,null,null,2021-11-22 11:27:27.0], [5,2021-11-22 11:27:27.0,null,false,null]) scala> sql("select * from user").rdd.collect warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation` val res11: Array[org.apache.spark.sql.Row] = Array( [1,null,null,false,null], [2,null,null,true,null], [3,null,null,true,null], [4,null,null,null,null], [5,null,null,false,null]) scala> {noformat} > Converting dataframe to rdd results in data loss > ------------------------------------------------ > > Key: SPARK-47193 > URL: https://issues.apache.org/jira/browse/SPARK-47193 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.5.0, 3.5.1 > Reporter: Ivan Bova > Priority: Critical > Labels: correctness > Attachments: device.csv, deviceClass.csv, deviceType.csv, > language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, > userLocation.csv, userProfile.csv > > > I have 10 csv files and need to create mapping from them. After all of the > joins dataframe contains all expected rows but rdd from this dataframe > contains only half of them. > {code:java} > case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: > String, LastName: String, LanguageId: Option[Int]) > case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String) > case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], > UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: > Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: > Option[Int]) > case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String) > case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String) > case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: > Option[Double], Longitude: Option[Double], Radius: Option[Double], > CreatedDate: Timestamp) > case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String) > case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: > String, Status: Int, CreatedDate: Timestamp) > case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: > Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp]) > case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], > Address1: String, Address2: String, City: String, State: String, Country: > String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: > Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, > UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: > Option[Boolean], Level: Option[Int], TimeZone: Option[Int]) > val userProfile = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage] > val language = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage] > val device = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage] > val deviceClass = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage] > val deviceType = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage] > val location1 = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1] > val timeZoneLookup = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage] > val userLocation = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage] > val user = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserMessage].schema).csv("user.csv").as[MyUserMessage] > val location = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLocationMessage].schema).csv("location.csv").as[MyLocationMessage] > val result = user > .join(userProfile, user("UserId") === userProfile("UserId"), "inner") > .join(language, userProfile("LanguageId") === language("LanguageId"), > "left") > .join(userLocation, user("UserId") === userLocation("UserId"), "inner") > .join(location, userLocation("LocationId") === location("LocationId"), > "inner") > .join(device, location("LocationId") === device("LocationId"), "inner") > .join(deviceType, device("DeviceTypeId") === deviceType("DeviceTypeId"), > "inner") > .join( > deviceClass, > device("DeviceClassId") === deviceClass("DeviceClassId"), > "inner") > .join( > timeZoneLookup, > timeZoneLookup("TimeZoneId") === location("TimeZone"), > "left") > .join(location1, location("LocationId") === location1("LocationId"), "left") > .where( > device("UserId1").isNull > && (user("Active") === lit(true) || user("ActivatedDate").isNotNull) > ) > .dropDuplicates() > println("df count = " + result.count()) > println("rdd count = "+ result.rdd.count()) > result.show(false) > println("------") > result.rdd.foreach(println) {code} > output: > {code:java} > df count = 8 > rdd count = 4 > +------+-------------------+-------------------+------+-------------------+------+------+----------+---------+----------+----------+----------------+------+----------+------------+------+-----------+----------+---------+----------+----------+-----+------+--------+-------+---------------+--------------+----------------+-----------+-------------------+-------------------+---------------+-----+--------+---------+-------------------+-------------------+----------+-----------+----------+------------+-------------+-------+------------+--------------+-------------+---------------+----------+--------+-----------+----------+--------+---------+------+-------------------+ > |UserId|Created |Deleted |Active|ActivatedDate > |UserId|Email |FirstName |LastName > |LanguageId|LanguageId|LanguageLocaleId|UserId|LocationId|LocationName|Status|CreatedDate|LocationId|IsDeleted|Address1 > |Address2 |City |State |Country > |ZipCode|Feature2Enabled|LocationStatus|Location1Enabled|LocationKey|UpdatedDateTime > |CreatedDate |Feature1Enabled|Level|TimeZone|DeviceId1|Created > |UpdatedDate |DeviceId2 |DeviceName > |LocationId|DeviceTypeId|DeviceClassId|UserId1|DeviceTypeId|DeviceTypeName|DeviceClassId|DeviceClassName|TimeZoneId|ZoneName|LocationId1|LocationId|Latitude|Longitude|Radius|CreatedDate > | > +------+-------------------+-------------------+------+-------------------+------+------+----------+---------+----------+----------+----------------+------+----------+------------+------+-----------+----------+---------+----------+----------+-----+------+--------+-------+---------------+--------------+----------------+-----------+-------------------+-------------------+---------------+-----+--------+---------+-------------------+-------------------+----------+-----------+----------+------------+-------------+-------+------------+--------------+-------------+---------------+----------+--------+-----------+----------+--------+---------+------+-------------------+ > |1 |2021-11-22 11:27:27|2021-11-25 11:27:27|false |2021-11-22 11:27:27|1 > |email1|firstName1|lastName1|1 |1 |It |1 > |1 |Location1 |NULL |NULL |1 |false > |address1_1|address2_1|City1|State1|Country1|code1 |true |1 > |true |LocKey1 |2021-11-16 11:27:27|2021-11-16 > 11:27:27|false |1 |1 |device3 |2021-11-18 > 11:27:27|2021-11-19 11:27:27|DeviceId23|DeviceName3|1 |3 |3 > |NULL |3 |type3 |3 |class3 > |1 |Zone1 |1 |1 |12.32 |43.23 |14.2 > |2021-11-21 11:27:27| > |1 |2021-11-22 11:27:27|2021-11-25 11:27:27|false |2021-11-22 11:27:27|1 > |email1|firstName1|lastName1|1 |1 |It |1 > |1 |Location1 |NULL |NULL |1 |false > |address1_1|address2_1|City1|State1|Country1|code1 |true |1 > |true |LocKey1 |2021-11-16 11:27:27|2021-11-16 > 11:27:27|false |1 |1 |device1 |2021-11-16 > 11:27:27|2021-11-17 11:27:27|DeviceId21|DeviceName1|1 |1 |1 > |NULL |1 |type1 |1 |class1 > |1 |Zone1 |1 |1 |12.32 |43.23 |14.2 > |2021-11-21 11:27:27| > |2 |2021-11-22 11:27:27|NULL |true |2021-11-22 11:27:27|2 > |email2|firstName2|lastName2|2 |2 |En |2 > |1 |Location1 |NULL |NULL |1 |false > |address1_1|address2_1|City1|State1|Country1|code1 |true |1 > |true |LocKey1 |2021-11-16 11:27:27|2021-11-16 > 11:27:27|false |1 |1 |device3 |2021-11-18 > 11:27:27|2021-11-19 11:27:27|DeviceId23|DeviceName3|1 |3 |3 > |NULL |3 |type3 |3 |class3 > |1 |Zone1 |1 |1 |12.32 |43.23 |14.2 > |2021-11-21 11:27:27| > |2 |2021-11-22 11:27:27|NULL |true |2021-11-22 11:27:27|2 > |email2|firstName2|lastName2|2 |2 |En |2 > |1 |Location1 |NULL |NULL |1 |false > |address1_1|address2_1|City1|State1|Country1|code1 |true |1 > |true |LocKey1 |2021-11-16 11:27:27|2021-11-16 > 11:27:27|false |1 |1 |device1 |2021-11-16 > 11:27:27|2021-11-17 11:27:27|DeviceId21|DeviceName1|1 |1 |1 > |NULL |1 |type1 |1 |class1 > |1 |Zone1 |1 |1 |12.32 |43.23 |14.2 > |2021-11-21 11:27:27| > |3 |2021-11-22 11:27:27|NULL |true |2021-11-22 11:27:27|3 > |email3|firstName3|lastName3|3 |3 |DE |3 > |2 |Location2 |NULL |NULL |2 |false > |address1_2|address2_2|City2|State2|Country2|code2 |true |2 > |true |LocKey2 |2021-11-17 11:27:27|2021-11-17 > 11:27:27|false |1 |1 |device4 |2021-11-25 11:27:27|NULL > |DeviceId24|DeviceName4|2 |1 |2 |NULL > |1 |type1 |2 |class2 |1 |Zone1 > |3 |2 |14.32 |45.23 |16.2 |2021-11-23 11:27:27| > |3 |2021-11-22 11:27:27|NULL |true |2021-11-22 11:27:27|3 > |email3|firstName3|lastName3|3 |3 |DE |3 > |2 |Location2 |NULL |NULL |2 |false > |address1_2|address2_2|City2|State2|Country2|code2 |true |2 > |true |LocKey2 |2021-11-17 11:27:27|2021-11-17 > 11:27:27|false |1 |1 |device2 |2021-11-17 > 11:27:27|2021-11-18 11:27:27|DeviceId22|DeviceName2|2 |2 |2 > |NULL |2 |type2 |2 |class2 > |1 |Zone1 |3 |2 |14.32 |45.23 |16.2 > |2021-11-23 11:27:27| > |4 |2021-11-22 11:27:27|NULL |NULL |2021-11-22 11:27:27|4 > |email4|firstName4|lastName4|NULL |NULL |NULL |4 > |1 |Location1 |NULL |NULL |1 |false > |address1_1|address2_1|City1|State1|Country1|code1 |true |1 > |true |LocKey1 |2021-11-16 11:27:27|2021-11-16 > 11:27:27|false |1 |1 |device3 |2021-11-18 > 11:27:27|2021-11-19 11:27:27|DeviceId23|DeviceName3|1 |3 |3 > |NULL |3 |type3 |3 |class3 > |1 |Zone1 |1 |1 |12.32 |43.23 |14.2 > |2021-11-21 11:27:27| > |4 |2021-11-22 11:27:27|NULL |NULL |2021-11-22 11:27:27|4 > |email4|firstName4|lastName4|NULL |NULL |NULL |4 > |1 |Location1 |NULL |NULL |1 |false > |address1_1|address2_1|City1|State1|Country1|code1 |true |1 > |true |LocKey1 |2021-11-16 11:27:27|2021-11-16 > 11:27:27|false |1 |1 |device1 |2021-11-16 > 11:27:27|2021-11-17 11:27:27|DeviceId21|DeviceName1|1 |1 |1 > |NULL |1 |type1 |1 |class1 > |1 |Zone1 |1 |1 |12.32 |43.23 |14.2 > |2021-11-21 11:27:27| > +------+-------------------+-------------------+------+-------------------+------+------+----------+---------+----------+----------+----------------+------+----------+------------+------+-----------+----------+---------+----------+----------+-----+------+--------+-------+---------------+--------------+----------------+-----------+-------------------+-------------------+---------------+-----+--------+---------+-------------------+-------------------+----------+-----------+----------+------------+-------------+-------+------------+--------------+-------------+---------------+----------+--------+-----------+----------+--------+---------+------+-------------------+------ > [2,null,null,true,null,2,email2,firstName2,lastName2,2,2,En,2,1,Location1,null,null,1,false,address1_1,address2_1,City1,State1,Country1,code1,true,1,true,LocKey1,null,null,false,1,1,device3,null,null,DeviceId23,DeviceName3,1,3,3,null,3,type3,3,class3,1,Zone1,1,1,12.32,43.23,14.2,null] > [2,null,null,true,null,2,email2,firstName2,lastName2,2,2,En,2,1,Location1,null,null,1,false,address1_1,address2_1,City1,State1,Country1,code1,true,1,true,LocKey1,null,null,false,1,1,device1,null,null,DeviceId21,DeviceName1,1,1,1,null,1,type1,1,class1,1,Zone1,1,1,12.32,43.23,14.2,null] > [3,null,null,true,null,3,email3,firstName3,lastName3,3,3,DE,3,2,Location2,null,null,2,false,address1_2,address2_2,City2,State2,Country2,code2,true,2,true,LocKey2,null,null,false,1,1,device4,null,null,DeviceId24,DeviceName4,2,1,2,null,1,type1,2,class2,1,Zone1,3,2,14.32,45.23,16.2,null] > [3,null,null,true,null,3,email3,firstName3,lastName3,3,3,DE,3,2,Location2,null,null,2,false,address1_2,address2_2,City2,State2,Country2,code2,true,2,true,LocKey2,null,null,false,1,1,device2,null,null,DeviceId22,DeviceName2,2,2,2,null,2,type2,2,class2,1,Zone1,3,2,14.32,45.23,16.2,null] > {code} > Dataframe count and show work as expected, dut rdd does not fave first two > and last two records -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org