[jira] [Resolved] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saniya Tech resolved SPARK-22460.
---------------------------------
    Resolution: Not A Problem

> Spark De-serialization of Timestamp field is Incorrect
> ------------------------------------------------------
>
>                 Key: SPARK-22460
>                 URL: https://issues.apache.org/jira/browse/SPARK-22460
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.1
>            Reporter: Saniya Tech
>
> We are trying to serialize Timestamp fields to Avro using the spark-avro
> connector. The Timestamp fields are correctly serialized as a long
> (milliseconds since the Epoch), and I verified that the data is correctly
> read back from the Avro files. It is when we encode the Dataset as a case
> class that the timestamp field's long value is incorrectly interpreted as
> seconds since the Epoch. As can be seen below, this shifts the timestamp
> many years into the future.
> Code used to reproduce the issue:
> {code:java}
> import java.sql.Timestamp
> import com.databricks.spark.avro._
> import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
>
> case class TestRecord(name: String, modified: Timestamp)
>
> import spark.implicits._
> val data = Seq(
>   TestRecord("One", new Timestamp(System.currentTimeMillis()))
> )
>
> // Serialize:
> val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain")
> val path = s"s3a://some-bucket/output/"
> val ds = spark.createDataset(data)
> ds.write
>   .options(parameters)
>   .mode(SaveMode.Overwrite)
>   .avro(path)
>
> // De-serialize
> val output = spark.read.avro(path).as[TestRecord]
> {code}
> Output from the test:
> {code:java}
> scala> data.head
> res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)
>
> scala> output.collect().head
> res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
> {code}
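Since the ticket was resolved as Not A Problem, readers of such Avro files have to convert the raw millisecond value themselves. Below is a minimal workaround sketch, reusing `path` and `TestRecord` from the reproduction above; the divide-then-cast approach is an editorial suggestion, not something proposed in this thread:

{code:java}
import com.databricks.spark.avro._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

// Read the raw Avro data first: `modified` comes back as a plain Long
// holding milliseconds since the Epoch.
val rawOutput = spark.read.avro(path)

// Convert milliseconds to (fractional) seconds before casting, because
// Spark's Cast interprets a numeric value as seconds since the Epoch.
// The fractional part survives the cast, so millisecond precision is kept.
val output = rawOutput
  .withColumn("modified", (col("modified") / 1000).cast(TimestampType))
  .as[TestRecord]
{code}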
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245163#comment-16245163 ]

Saniya Tech commented on SPARK-22460:
--------------------------------------

Based on the feedback I am going to close this ticket and try to resolve the issue in the spark-avro code-base. Thanks!
[jira] [Reopened] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saniya Tech reopened SPARK-22460:
---------------------------------

See my last comment.
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242109#comment-16242109 ]

Saniya Tech commented on SPARK-22460:
--------------------------------------

No, it is not a spark-avro issue. spark-avro serializes and deserializes the long value (the time in milliseconds since the Epoch) correctly. It is the Spark encoder, which converts the Dataset to the case class, that incorrectly interprets the long value as the time in seconds since the Epoch. I have broken the code into steps to show the results after each step:

{code:java}
// De-serialize
// rawOutput is deserialized using the spark-avro connector
val rawOutput = spark.read.avro(path)
// output is encoded using Spark's `as`
val output = rawOutput.as[TestRecord]
{code}

Print-out of results for each step:

{code:java}
scala> data.head
res3: TestRecord = TestRecord(One,2017-11-07 14:19:42.427)

scala> data.head.modified.getTime
res4: Long = 1510064382427

scala> rawOutput.collect().head
res5: org.apache.spark.sql.Row = [One,1510064382427]

scala> output.collect().head
res6: TestRecord = TestRecord(One,49822-01-14 00:27:07.0)
{code}

This is the relevant code in Spark, where it assumes the long value is in seconds: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L285

{code:java}
// converting seconds to us
private[this] def longToTimestamp(t: Long): Long = t * 1000000L
{code}

The Java API specifies that Timestamp.getTime() returns a long in milliseconds: https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html#getTime--
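The magnitude of the shift follows directly from that multiplication: treating a millisecond count as seconds scales the instant by 1000. A small plain-JVM check using the value printed above (illustrative only, no Spark session needed):

{code:java}
import java.sql.Timestamp

// The raw long spark-avro wrote: milliseconds since the Epoch.
val millis = 1510064382427L                  // 2017-11-07 14:19:42.427

// Interpreting that value as seconds scales the instant by 1000,
// which is what the cast above effectively does on read-back.
val misread = new Timestamp(millis * 1000L)  // year ~49822, matching res6
{code}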
[jira] [Updated] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saniya Tech updated SPARK-22460:
--------------------------------
    Description:
We are trying to serialize Timestamp fields to Avro using spark-avro connector. I can see the Timestamp fields are getting correctly serialized as long (milliseconds since Epoch). I verified that the data is correctly read back from the Avro files. It is when we encode the Dataset as a case class that timestamp field is incorrectly converted to a long value as seconds since Epoch. As can be seen below, this shifts the timestamp many years in the future.

Code used to reproduce the issue:
{code:java}
import java.sql.Timestamp
import com.databricks.spark.avro._
import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}

case class TestRecord(name: String, modified: Timestamp)

import spark.implicits._
val data = Seq(
  TestRecord("One", new Timestamp(System.currentTimeMillis()))
)

// Serialize:
val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain")
val path = s"s3a://some-bucket/output/"
val ds = spark.createDataset(data)
ds.write
  .options(parameters)
  .mode(SaveMode.Overwrite)
  .avro(path)

// De-serialize
val output = spark.read.avro(path).as[TestRecord]
{code}

Output from the test:
{code:java}
scala> data.head
res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)

scala> output.collect().head
res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
{code}

  was:
We are trying to serialize Timestamp fields to Avro using spark-avro connector. I can see the Timestamp fields are getting correctly serialized as long (milliseconds since Epoch). I verified that the data is correctly read back from the Avro files. It is when we encode the Dataset as a case class that timestamp field is incorrectly converted to as long value as seconds since Epoch. As can be seen below, this shifts the timestamp many years in the future.

Code used to reproduce the issue:
{code:java}
import java.sql.Timestamp
import com.databricks.spark.avro._
import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}

case class TestRecord(name: String, modified: Timestamp)

import spark.implicits._
val data = Seq(
  TestRecord("One", new Timestamp(System.currentTimeMillis()))
)

// Serialize:
val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain")
val path = s"s3a://some-bucket/output/"
val ds = spark.createDataset(data)
ds.write
  .options(parameters)
  .mode(SaveMode.Overwrite)
  .avro(path)

// De-serialize
val output = spark.read.avro(path).as[TestRecord]
{code}

Output from the test:
{code:java}
scala> data.head
res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)

scala> output.collect().head
res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
{code}
[jira] [Created] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
Saniya Tech created SPARK-22460:
-----------------------------------

             Summary: Spark De-serialization of Timestamp field is Incorrect
                 Key: SPARK-22460
                 URL: https://issues.apache.org/jira/browse/SPARK-22460
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.1.1
            Reporter: Saniya Tech


We are trying to serialize Timestamp fields to Avro using spark-avro connector. I can see the Timestamp fields are getting correctly serialized as long (milliseconds since Epoch). I verified that the data is correctly read back from the Avro files. It is when we encode the Dataset as a case class that timestamp field is incorrectly converted to as long value as seconds since Epoch. As can be seen below, this shifts the timestamp many years in the future.

Code used to reproduce the issue:
{code:java}
import java.sql.Timestamp
import com.databricks.spark.avro._
import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}

case class TestRecord(name: String, modified: Timestamp)

import spark.implicits._
val data = Seq(
  TestRecord("One", new Timestamp(System.currentTimeMillis()))
)

// Serialize:
val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain")
val path = s"s3a://some-bucket/output/"
val ds = spark.createDataset(data)
ds.write
  .options(parameters)
  .mode(SaveMode.Overwrite)
  .avro(path)

// De-serialize
val output = spark.read.avro(path).as[TestRecord]
{code}

Output from the test:
{code:java}
scala> data.head
res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)

scala> output.collect().head
res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
{code}