[ https://issues.apache.org/jira/browse/SPARK-31414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-31414:
-----------------------------------

    Assignee: Kent Yao

> Performance regression with new TimestampFormatter for json and csv
> -------------------------------------------------------------------
>
>                 Key: SPARK-31414
>                 URL: https://issues.apache.org/jira/browse/SPARK-31414
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Major
>
> With the original benchmark, where the timestamp values are valid for the
> new parser, the result is
> {code:java}
> [info] Running benchmark: Read dates and timestamps
> [info]   Running case: timestamp strings
> [info]   Stopped after 3 iterations, 5781 ms
> [info]   Running case: parse timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 44764 ms
> [info]   Running case: infer timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 93764 ms
> [info]   Running case: from_json(timestamp)
> [info]   Stopped after 3 iterations, 59021 ms
> {code}
> When we modify the benchmark to
> {code:java}
> def timestampStr: Dataset[String] = {
>   spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
>     iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
>   }.select($"value".as("timestamp")).as[String]
> }
> readBench.addCase("timestamp strings", numIters) { _ =>
>   timestampStr.noop()
> }
> readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
>   spark.read.schema(tsSchema).json(timestampStr).noop()
> }
> readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
>   spark.read.json(timestampStr).noop()
> }
> {code}
> the timestamp values are invalid for the new parser, which causes a
> fallback to the legacy parser.
> The result with the modified benchmark is
> {code:java}
> [info] Running benchmark: Read dates and timestamps
> [info]   Running case: timestamp strings
> [info]   Stopped after 3 iterations, 5623 ms
> [info]   Running case: parse timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 506637 ms
> [info]   Running case: infer timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 509076 ms
> {code}
> That is roughly a 10x performance regression for the parse case
> (44764 ms vs. 506637 ms).
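
For readers who want to see the "try the new parser, fall back to the legacy parser" control flow described in the issue, here is a minimal sketch in plain Java. It is not Spark's code: the class name, the counters, and the fixed three-digit fraction pattern are all assumptions made for illustration. It only shows how inputs shaped like the modified benchmark's (one or two fractional digits from `i % 100`) can miss a strict java.time pattern and end up on a java.text.SimpleDateFormat path.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch; none of these names come from Spark.
final class FallbackParseSketch {
    // Assumed pattern: a fixed three-digit fraction of a second.
    static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSS";
    static final DateTimeFormatter NEW_PARSER =
        DateTimeFormatter.ofPattern(PATTERN, Locale.US);

    // Counters so a caller can see which path each input took.
    static int newParserHits = 0;
    static int fallbackHits = 0;

    // Returns epoch milliseconds, interpreting the input as UTC.
    static long parseMillis(String s) {
        try {
            // java.time requires exactly three fraction digits for "SSS".
            long millis = LocalDateTime.parse(s, NEW_PARSER)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();
            newParserHits++;
            return millis;
        } catch (DateTimeParseException e) {
            // Fallback path: an exception was raised and a second parse is needed.
            fallbackHits++;
            SimpleDateFormat legacy = new SimpleDateFormat(PATTERN, Locale.US);
            legacy.setTimeZone(TimeZone.getTimeZone("UTC"));
            try {
                return legacy.parse(s).getTime();
            } catch (ParseException pe) {
                throw new IllegalArgumentException("unparseable timestamp: " + s, pe);
            }
        }
    }

    public static void main(String[] args) {
        // Same value shape as the modified benchmark: fraction is i % 100.
        for (int i = 0; i < 1000; i++) {
            parseMillis("1970-01-01T01:02:03." + (i % 100));
        }
        System.out.println("new parser: " + newParserHits
            + ", fallback: " + fallbackHits);
    }
}
```

Under the assumed pattern, every generated string has fewer than three fractional digits, so each call should pay for a failed parse plus a second parse. Whether this matches the exact cost profile inside Spark's TimestampFormatter is not something the sketch can show; the benchmark numbers in the issue are the authoritative measurement.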