[ 
https://issues.apache.org/jira/browse/SPARK-32016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141265#comment-17141265
 ] 

Hyukjin Kwon commented on SPARK-32016:
--------------------------------------

I am not sure which original format you mean. Once the data is parsed from CSV 
into the JVM, there is no original format any more; the values are just 
timestamp instances on the JVM.
If you want to keep the original format, you might have to handle the values 
as strings, as they are.
The workarounds you mentioned look fair enough to me.
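
To illustrate the point about parsed values, here is a minimal sketch in plain 
java.time rather than Spark itself (Spark's internal representation and CSV 
writer differ). The class name TimestampFormatDemo, the helper names, and the 
+05:30 offset (taken from the sample output in the report) are assumptions made 
for illustration only. It shows that a parsed value carries no pattern, so the 
written text depends only on the formatter applied at write time.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class; not part of Spark.
class TimestampFormatDemo {
    // Pattern used by the source data in the report.
    static final DateTimeFormatter SOURCE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    // Pattern observed in the written output in the report.
    static final DateTimeFormatter WRITTEN =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");

    // Parsing keeps only the date-time fields; the input pattern is discarded.
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, SOURCE);
    }

    // Rendering needs an explicit pattern (and an offset for the XXX field).
    static String render(LocalDateTime value, ZoneOffset offset, DateTimeFormatter fmt) {
        return value.atOffset(offset).format(fmt);
    }

    public static void main(String[] args) {
        ZoneOffset offset = ZoneOffset.ofHoursMinutes(5, 30); // assumed session offset
        LocalDateTime ts = parse("2020-06-01 12:10:03");

        System.out.println(render(ts, offset, WRITTEN)); // 2020-06-01T12:10:03.000+05:30
        System.out.println(render(ts, offset, SOURCE));  // 2020-06-01 12:10:03
    }
}
```

The same value renders differently depending only on the formatter passed in. 
Keeping the column as a string is the only way to carry the input text through 
unchanged.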

> Why spark does not preserve the original timestamp format while writing 
> dataset to file or hdfs
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32016
>                 URL: https://issues.apache.org/jira/browse/SPARK-32016
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Structured Streaming
>    Affects Versions: 2.3.0, 2.4.0, 2.4.3
>         Environment: Apache Spark 2.3 and Spark 2.4. May happen in other 
> versions as well
>            Reporter: Anupam Jain
>            Priority: Major
>
> I want to write a Spark dataset that has a few timestamp columns to HDFS.
>  * While reading, Spark by default infers data as timestamp if its format is 
> similar to "*yyyy-MM-dd HH:mm:ss*".
>  * But while writing to a file, it saves the value in the format 
> "*yyyy-MM-dd'T'HH:mm:ss.SSSXXX*".
>  * E.g. the source value *2020-06-01 12:10:03* is written as 
> *2020-06-01T12:10:03.000+05:30*.
>  * The expectation is that the original timestamp format is preserved when 
> writing.
> Why does Spark not preserve the original timestamp format while writing a 
> dataset to a file or HDFS?
> Using simple Java code like:
> Dataset<Row> ds = 
> spark.read().format("csv").option("path",the_path).option("inferSchema","true").load();
> ds.write().format("csv").save("path_to_save");
> I know the workarounds:
>  * Use the "*timestampFormat*" option before save.
>  * But it may have a performance overhead, and it is also global for all 
> columns.
>  * So let's say there are 2 columns with formats "*yyyy-MM-dd HH:mm:ss*" and 
> "*yyyy-MM-dd HH*". Both can be inferred as timestamp by default, but both are 
> written in the single specified "timestampFormat".
>  * Another way is to use date_format(col, format). But that may also have a 
> performance overhead and adds operations to apply, whereas I expect Spark 
> to preserve the original format.
> Tried with Spark 2.3 and Spark 2.4.



