lava created SPARK-39953:
----------------------------

             Summary: SparkUpgradeException on date/timestamp columns after upgrading Hudi spark-submits from EMR 5.33 to EMR 6.5
                 Key: SPARK-39953
                 URL: https://issues.apache.org/jira/browse/SPARK-39953
             Project: Spark
          Issue Type: Question
          Components: Input/Output
    Affects Versions: 3.1.2
            Reporter: lava


We upgraded our Hudi spark-submit jobs from EMR 5.33 to EMR 6.5, which ships 
Spark 3.x, and then started running into the errors below with date and 
timestamp columns. Please let us know if anyone has faced a similar issue and 
whether there is a resolution.

spark-submit \
--deploy-mode client \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.shuffle.service.enabled=true \
--conf spark.default.parallelism=500 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=3 \
--conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.app.name=ETS_CUST \
--jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type MERGE_ON_READ \
--op INSERT \
--hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
--source-ordering-field dms_seq_no \
--props s3://ets-aws-daas-prod-resource/config/TOEFL/DMEREG02/ETS_CUST/ets_cust_full.properties \
--hoodie-conf hoodie.datasource.hive_sync.database=ets_aws_daas_raw_toefl_dmereg02 \
--target-base-path s3://ets-aws-daas-prod-raw/TOEFL/DMEREG02/ETS_CUST \
--target-table ETS_CUST \
--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://ets-aws-daas-prod-landing/DMS/FULL/DMEREG02/ETS_CUST/ \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--enable-sync

22/08/02 16:24:48 INFO DAGScheduler: ShuffleMapStage 3 (countByKey
at BaseSparkCommitActionExecutor.java:175) failed in 27.903 s due to Job 
aborted due to stage failure: Task 53 in stage 3.0 failed 4 times, most recent 
failure: Lost task 53.3 in stage 3.0 (TID 105) (ip-172-31-26-128.ec2.internal 
executor 3): org.apache.spark.SparkUpgradeException: You may get a different 
result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or 
timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as 
the files may be written by Spark 2.x or legacy versions of Hive, which uses a 
legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
calendar. See more details in SPARK-31404. You can set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
datetime values w.r.t. the calendar difference during reading. Or set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
datetime values as it is.
    at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
    at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
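
Based on the exception text, we believe the workaround is to add the legacy 
rebase configs to the same spark-submit. A minimal sketch of what we plan to 
try (untested; the exception itself only names datetimeRebaseModeInRead, and 
the InWrite/int96 variants are our assumption for Spark 3.1.2 on EMR 6.5):

--conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=LEGACY \
--conf spark.sql.legacy.parquet.int96RebaseModeInRead=LEGACY \
--conf spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY \

Is LEGACY the right mode here, or should it be CORRECTED when the source 
Parquet files were not written by Spark 2.x or legacy Hive?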
