Hi,

The snippet for the issue is here: https://gist.github.com/zxpan/c5e989958d7688026f1679e53d2fca44

1) The write script simulates migrating the existing data frame (saved as parquet in /tmp/hudi-testing/inserts).
2) The update script simulates an incremental update of the existing dataset (saved as parquet in /tmp/hudi-testing/updates); this is where the issue occurs.
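For context, the two scripts follow the standard Hudi datasource write pattern: a one-time bulk write, then an append-mode upsert. Below is a minimal plain-Python sketch of that pattern, not the gist itself; the field names (`uuid`, `ts`, `partitionpath`) and table name are illustrative assumptions, and the real values are in the linked gist.

```python
# Sketch of the migrate (bulk write) + incremental upsert pattern.
# Field names (uuid, ts, partitionpath) and the table name are assumed
# for illustration; the actual values live in the linked gist.
hudi_options = {
    "hoodie.table.name": "hudi_testing",                     # assumed
    "hoodie.datasource.write.recordkey.field": "uuid",       # assumed
    "hoodie.datasource.write.precombine.field": "ts",        # assumed
    "hoodie.datasource.write.partitionpath.field": "partitionpath",
}

def write_args(operation, mode):
    """Bundle the shared Hudi options with a per-step operation and save mode."""
    return {**hudi_options, "hoodie.datasource.write.operation": operation}, mode

# Step 1 (write script): migrate the existing data frame.
migrate_opts, migrate_mode = write_args("bulk_insert", "overwrite")

# Step 2 (update script): upsert the incremental updates -- where the error appears.
update_opts, update_mode = write_args("upsert", "append")

# With a live SparkSession, the write would look roughly like (not executed here):
#   spark.read.parquet("/tmp/hudi-testing/inserts") \
#        .write.format("org.apache.hudi").options(**migrate_opts) \
#        .mode(migrate_mode).save(base_path)
```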
See attached inserts parquet file and updates parquet file. Your help is appreciated. Thanks

On Mon, Nov 11, 2019 at 11:23 AM Zhengxiang Pan <[email protected]> wrote:

> Thanks for the quick response. Will try to create a snippet to reproduce the issue.
>
> For number 2), I am aware of the de-dup behavior. Pretty sure the precombine key is unique.
>
> Thanks
>
> On Mon, Nov 11, 2019 at 8:46 AM Vinoth Chandar <[email protected]> wrote:
>
>> Hi,
>>
>> On 1, I am wondering if it's related to https://issues.apache.org/jira/browse/HUDI-83, i.e. support for timestamps. If you can give us a small snippet to reproduce the problem, that would be great.
>>
>> On 2, not sure what's going on. There are no size limitations. Please check whether your precombine field and keys are correct. For example, if you pick a field/value that is the same in all records, then precombine will crunch it down to just one record, because that's what we ask it to do.
>>
>> On Sun, Nov 10, 2019 at 6:46 PM Zhengxiang Pan <[email protected]> wrote:
>>
>>> Hi,
>>> I am new to Hudi; my first attempt is to convert my existing dataframe to a Hudi-managed dataset. I followed the Quick Start guide and Option (2) or (3) in the Migration Guide.
>>> Got two issues:
>>>
>>> 1) Got the following error when using Append mode afterward to upsert the data:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 23.0 failed 4 times, most recent failure: Lost task 4.3 in stage 23.0 (TID 74, tkcnode49.alphonso.tv, executor 7):
>>> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :4
>>>     at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:261)
>>>     at org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:428)
>>>     at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>>>     at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>>>     at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>>>     at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>>>     at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>>>     at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>>>     at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>>>     at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>>>     at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>>>     at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>     at org.apache.spark.scheduler.Task.run(Task.scala:121)
>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>>
>>> I noticed that the "Date" type is converted to "Long" type in the Hudi dataset.
>>>
>>> I worked around this by saving my dataframe to JSONL and reading it back before saving it to the Hudi-managed dataset.
>>>
>>> Are there any requirements for explicit schema conversion of my original data frame?
>>>
>>> 2) Even if I manage to get around the first issue, the number of records in the Hudi-managed dataset is way less than in my original data frame.
>>>
>>> Is there any size limitation in a Hudi dataset?
>>>
>>> Thanks
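On the record-count question in issue 2), Vinoth's precombine point upthread is worth illustrating. The sketch below is plain Python with hypothetical records, not Hudi code: it mimics how upsert keeps one row per record key, preferring the row with the largest precombine value, so a record key that is not actually unique silently shrinks the dataset.

```python
# Plain-Python illustration (not Hudi internals) of precombine/dedup semantics:
# within each record key, only the row with the greatest precombine value survives.
def precombine(records, key_field, precombine_field):
    best = {}
    for rec in records:
        k = rec[key_field]
        if k not in best or rec[precombine_field] > best[k][precombine_field]:
            best[k] = rec
    return list(best.values())

rows = [
    {"uuid": "a", "ts": 1, "val": 10},
    {"uuid": "a", "ts": 2, "val": 11},   # duplicate key: wins on higher ts
    {"uuid": "b", "ts": 1, "val": 20},
]
deduped = precombine(rows, "uuid", "ts")
# Three input rows collapse to two; if every row shared one key value,
# they would collapse to a single record, as described upthread.
```

This is why checking that the record key is genuinely unique (e.g. comparing `df.count()` with `df.select(key).distinct().count()` in Spark) is the first thing to verify when rows go missing after an upsert.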
