Hi,
I am new to Hudi; my first attempt is to convert an existing DataFrame to a
Hudi-managed dataset. I followed the Quick Start guide and Option (2) or (3)
in the Migration Guide, and ran into two issues.
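For context, here is roughly the write path I am using (a sketch; the record
key, precombine, and partition field names below are placeholders for my
actual columns):

import org.apache.spark.sql.SaveMode

// Initial load (Migration Guide Option 2/3): write the existing
// DataFrame out as a new Hudi dataset.
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.table.name", "my_table").
  mode(SaveMode.Overwrite).
  save("/path/to/hudi_table")

// Later upsert in Append mode -- this is the step that fails.
updatesDf.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.table.name", "my_table").
  mode(SaveMode.Append).
  save("/path/to/hudi_table")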

1) I got the following error when I later used Append mode to upsert the data:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 23.0 failed 4 times, most recent failure: Lost task 4.3 in stage 23.0 (TID 74, tkcnode49.alphonso.tv, executor 7): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :4
        at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:261)
        at org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:428)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
        at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I noticed that the "Date" type is converted to a "Long" type in the Hudi
dataset.

As a workaround, I saved my DataFrame to JSONL and read it back before
writing it to the Hudi-managed dataset.
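The JSONL round trip effectively stringifies the Date column, so casting it
explicitly before the write should be equivalent (a sketch; "event_date" is a
placeholder for my actual Date column):

import org.apache.spark.sql.functions.col

// Cast the DateType column to a string before writing to Hudi,
// which is what the JSONL round trip does implicitly.
val dfCasted = df.withColumn("event_date", col("event_date").cast("string"))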

Is there any requirement to explicitly convert the schema of my original
DataFrame before writing?

2) Even after getting around the first issue, the number of records in the
Hudi-managed dataset is far lower than in my original DataFrame.
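For reference, this is roughly how I compare the counts (a sketch; the base
path and the partition glob depth are placeholders for my actual layout):

// Read the Hudi dataset back and compare record counts with the
// source DataFrame. The glob needs one "*" per partition level plus
// one for the files; adjust to the actual layout.
val hudiDf = spark.read.format("org.apache.hudi").load("/path/to/hudi_table/*/*")
println(s"original=${df.count()}, hudi=${hudiDf.count()}")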

Is there any size limitation on a Hudi dataset?

Thanks
