[ https://issues.apache.org/jira/browse/SPARK-32016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li updated SPARK-32016:
----------------------------
    Fix Version/s:     (was: 3.0.0)

> Why spark does not preserve the original timestamp format while writing dataset to file or hdfs
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32016
>                 URL: https://issues.apache.org/jira/browse/SPARK-32016
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Structured Streaming
>    Affects Versions: 2.3.0, 2.4.0, 2.4.3
>         Environment: Apache Spark 2.3 and Spark 2.4. May happen in other versions as well
>            Reporter: Anupam Jain
>            Priority: Major
>
> I want to write a Spark dataset that has a few timestamp columns to HDFS.
> * While reading, Spark by default infers a column as timestamp if its format is similar to "yyyy-MM-dd HH:mm:ss".
> * But while writing to a file, it saves the values in the format "yyyy-MM-dd'T'HH:mm:ss.SSSXXX".
> * For example, the source value 2020-06-01 12:10:03 is written as 2020-06-01T12:10:03.000+05:30.
> * The expected behavior is to preserve the original timestamp format when writing.
>
> Why does Spark not preserve the original timestamp format while writing a dataset to a file or HDFS?
>
> Using simple Java code like:
>
>     Dataset<Row> ds = spark.read().format("csv").option("path",the_path).option("inferSchema","true").load();
>     ds.write().format("csv").save("path_to_save");
>
> I know the workarounds:
> * Use the "timestampFormat" option before save.
> * But this may have performance overhead, and it is global for all columns.
> * So let's say there are 2 columns with formats "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd HH". Both can be inferred as timestamp by default, but both are output in the single specified "timestampFormat".
> * Another way is to use date_format(col, format). But that may also have performance overhead and requires extra operations to apply, whereas I expect Spark to preserve the original format.
>
> Tried with Spark 2.3 and Spark 2.4.
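
The formatting behavior described in the report can be illustrated without Spark. The following is a minimal sketch using only java.time, not Spark's actual implementation. It assumes the two patterns quoted in the report and a fixed +05:30 offset (taken from the example output); the class and method names are hypothetical. It shows that once text is parsed into a timestamp value, the value no longer carries its source pattern, so the output text depends only on the pattern supplied at write time.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class; none of these names come from Spark.
class TimestampFormatDemo {
    // Pattern of the source data, as quoted in the report.
    static final String SOURCE_PATTERN = "yyyy-MM-dd HH:mm:ss";
    // Output pattern the report observes when no timestampFormat is set.
    static final String DEFAULT_OUTPUT_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSSXXX";
    // Assumption: the offset seen in the report's example output.
    static final ZoneOffset ASSUMED_OFFSET = ZoneOffset.ofHoursMinutes(5, 30);

    // Parse source text into a timestamp value; the pattern is not retained.
    static OffsetDateTime parse(String text) {
        return LocalDateTime
                .parse(text, DateTimeFormatter.ofPattern(SOURCE_PATTERN))
                .atOffset(ASSUMED_OFFSET);
    }

    // Render the parsed value with whatever pattern the writer is given.
    static String rewrite(String text, String outputPattern) {
        return parse(text).format(DateTimeFormatter.ofPattern(outputPattern));
    }

    public static void main(String[] args) {
        String source = "2020-06-01 12:10:03";
        // Writer-side default pattern: the text changes shape.
        System.out.println(rewrite(source, DEFAULT_OUTPUT_PATTERN));
        // Explicit pattern, analogous to setting "timestampFormat": round-trips.
        System.out.println(rewrite(source, SOURCE_PATTERN));
        // A second column would need its own explicit pattern.
        System.out.println(rewrite(source, "yyyy-MM-dd HH"));
    }
}
```

Under these assumptions, keeping each column's original text form means either applying a pattern per column at write time (the date_format workaround mentioned in the report) or never converting the column to a timestamp at all, for example by reading it as a string.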