cloud-fan commented on a change in pull request #27710: [SPARK-30960][SQL] add back the legacy date/timestamp format support in CSV/JSON parser URL: https://github.com/apache/spark/pull/27710#discussion_r384679403
########## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ########## @@ -175,10 +175,30 @@ class UnivocityParser( } case _: TimestampType => (d: String) => - nullSafeDatum(d, name, nullable, options)(timestampFormatter.parse) + nullSafeDatum(d, name, nullable, options) { datum => + try { + timestampFormatter.parse(datum) + } catch { + case NonFatal(e) => + // If fails to parse, then tries the way used in 2.0 and 1.x for backwards + // compatibility. + val str = UTF8String.fromString(datum) + DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e) Review comment: We switch the calendar from "hybrid calendar (Julian + Gregorian)" to "ISO chronology (Proleptic Gregorian)", which affects date/timestamp values before October 15, 1582. For example, some dates are invalid according to the new calendar but the old calendar accepts it. When we add days to a date, the calendar skips invalid dates. The new calendar is widely used by RDBMS and others like parquet, snowflake, but we should have a better rollout plan for the calendar switch, like making the calendar configurable for Spark. Unfortunately, we catch this too late. However, the more serious problem is the switch of the pattern string(as we switch the JDK datetime API internally), and 3.0 can't parse some date/timestamp strings suddenly. This is the first step to fix this issue by adding back the legacy fallback, we are still working on fixing the 1% when users set a custom pattern string: 1. try to support the previous pattern string with the new calendar. If can't, go 2) 2. detect the conflicts between old and new pattern string and throw an exception to tell users what to do (turn on the legacy config or accept the new pattern string knowing what it means) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org