[ https://issues.apache.org/jira/browse/SPARK-36958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-36958.
----------------------------------
    Resolution: Not A Problem

> Reading of legacy timestamps from Parquet confusing in Spark 3, related config values don't seem to work
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36958
>                 URL: https://issues.apache.org/jira/browse/SPARK-36958
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>         Environment: emr-6.4.0
> spark 3.1.2
>            Reporter: Dmitry Goldenberg
>            Priority: Major
>
> I'm having a major issue trying, in Spark 3, to read Parquet data that was generated with Spark 2.4.
> The full stack trace is below.
> The error message is very confusing:
> # I do not have dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z.
> # The documentation does not clearly state how to work around or fix this issue. What exactly is the difference between the LEGACY and CORRECTED values of the config settings?
> # Which of the following would I want to set, and to what values?
> - spark.sql.legacy.parquet.datetimeRebaseModeInWrite
> - spark.sql.legacy.parquet.datetimeRebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.timeParserPolicy
> # I've tried setting these to CORRECTED, CORRECTED, CORRECTED, CORRECTED, and LEGACY, respectively, and got the same error (see the stack trace).
> The issues that I see with this:
> # Lack of thorough, clear documentation on what this is and how it's meant to work.
> # The confusing error message.
> # The fact that the error still occurs even when you set the config values.
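For context, here is a minimal sketch (not taken from the issue) of setting the read-side configs listed above on a PySpark 3.1 session before reading files written by Spark 2.x. Per the error message, LEGACY rebases datetime values from the legacy hybrid Julian+Gregorian calendar, CORRECTED reads the stored values as-is, and the default, EXCEPTION, raises the SparkUpgradeException quoted below. The app name and input path here are hypothetical placeholders.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-spark2-parquet").getOrCreate()

# These are runtime SQL configs, so they can be set on an existing session,
# but they must be in effect before the read is actually executed.
# Spark 2.x wrote timestamps as INT96 using the hybrid calendar, hence LEGACY.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")

# Hypothetical input path; substitute the real location of the old files.
df = spark.read.parquet("s3://my-bucket/spark2-output/")
print(df.count())
{code}

Note that the ...InWrite settings only affect files being written, and spark.sql.legacy.timeParserPolicy governs parsing and formatting of date/timestamp strings (e.g. in to_date and CSV/JSON sources), not Parquet reads, so neither changes the behavior of this read path.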
> {quote}
> py4j.protocol.Py4JJavaError: An error occurred while calling o1134.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 36.0 failed 4 times, most recent failure: Lost task 8.3 in stage 36.0 (TID 619) (ip-10-2-251-59.awsinternal.audiomack.com executor 2): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
> at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
> at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseTimestamp(VectorizedColumnReader.java:228)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseInt96(VectorizedColumnReader.java:242)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:662)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:300)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
> at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
> at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
> at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {quote}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org