Re: Is it possible to run compaction asynchronously while upserting via Spark DataSource writer

Vinoth Chandar Thu, 25 Jun 2020 06:57:25 -0700

Hi Anton,

https://github.com/apache/hudi/pull/1752 brings self-managed compaction
to Spark Streaming as well. Would you be interested in testing this out?
This is a highly requested feature that we are trying to get into the next
release.


Thanks
Vinoth

On Wed, Jun 24, 2020 at 3:56 PM Zuyeu, Anton <[email protected]>
wrote:

> Hi team,
>
> We are upserting incremental DataFrames into our MoR Spark table using the
> Spark DataSource writer. Currently we run compaction inline, but we
> would like compaction to run asynchronously. As far as I
> understand, our only option for that is DeltaStreamer. The
> problem is that DeltaStreamer seems to have been built to
> orchestrate all writes to the table (upserts and compaction), whereas we
> want our own job to handle upserts and DeltaStreamer to handle
> compaction only (scheduling compaction, running it, rerunning failed
> compactions, etc.). So the question is: is this even possible? After looking
> into the DeltaStreamer parameters: what if we supply a mock class as
> --source-class, so that DeltaStreamer pulls empty incremental data and
> therefore doesn't upsert anything but still runs compactions on its
> schedule? Would that work? Please also share if there are other ways to
> achieve async compaction without using DeltaStreamer.
>
> Thank you,
> Anton Zuyeu
>
>
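
For context, the inline-compaction setup described in the quoted message could look roughly like the sketch below. This is an illustration only, not Anton's actual job: the helper name, table name, field names, and the delta-commit threshold are all hypothetical, and the option keys are the Hudi DataSource/compaction settings as I understand them for releases of this period (older releases may use a different key for the table type).

```python
def inline_compaction_options(table_name, record_key, precombine_field,
                              max_delta_commits=5):
    """Build Hudi write options for MoR upserts with inline compaction.

    Hypothetical helper: every argument value is a placeholder to be
    replaced with the real table's settings.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        # Merge-on-Read table, as in the thread.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Compaction runs inside the write job itself (not asynchronously).
        "hoodie.compact.inline": "true",
        # Compact once this many delta commits have accumulated.
        "hoodie.compact.inline.max.delta.commits": str(max_delta_commits),
    }


# Assumed usage with an existing Spark DataFrame `df` and a base path
# (both hypothetical, so left as comments):
#
#   opts = inline_compaction_options("events", "id", "ts")
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

With these settings the upsert job blocks on compaction whenever the threshold is reached, which is the behaviour the question is trying to get away from.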
