Hi lamber-ken,

Thanks for this. I am not quite following the proposal. What do you mean by
Spark built-in operators? Don't we already use the RDD-based Spark operations?

Are you suggesting that we perform the merging in SQL? Please clarify.

On Wed, Feb 26, 2020 at 10:08 AM lamberken <lamber...@163.com> wrote:

>
>
> Hi guys,
>
>
> Motivation
> Improve the merge performance for COW tables on upsert by handling the merge
> operation with Spark built-in operators.
>
>
> Background
> When doing an upsert operation, for each bucket, Hudi needs to put the new
> input elements into an in-memory cache map, and will
> need an external map that spills content to disk when there is
> insufficient space for it to grow.
>
>
> There are several performance issues:
> 1. We may need an external disk map, which has to serialize / deserialize records
> 2. Only a single thread does the I/O operation when checking
> 3. We can't take advantage of built-in Spark operators
>
>
> Based on the above, I reworked the merge logic and did a draft test.
> If you are also interested in this, please take a look at this doc[1]; any
> suggestions are welcome. :)
>
>
>
>
> Thanks,
> Lamber-Ken
>
>
> [1]
> https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing
>
>
