Hi Anton,

https://github.com/apache/hudi/pull/1752 brings self-managed compaction to Spark Streaming as well. Would you be interested in testing this out? This is a highly requested feature that we are trying to get into the next release.

Thanks
Vinoth

On Wed, Jun 24, 2020 at 3:56 PM Zuyeu, Anton <[email protected]> wrote:

> Hi team,
>
> We are upserting incremental DataFrames into our MoR Spark table using the
> Spark datasource writer. Currently we are running compaction inline. We
> would like to have our compaction running asynchronously. As far as I
> understand, our only option for doing so is to use DeltaStreamer. The
> problem is that DeltaStreamer seems to have been built to orchestrate all
> writes to the table (upserts and compaction), whereas we want our own job
> to take care of upserting and DeltaStreamer to take care of compaction
> only (scheduling compaction, running it, rerunning failed compactions,
> etc.). So the question is: is this even possible? After looking into the
> DeltaStreamer parameters: what if we supply some mock class as
> --source-class, so that DeltaStreamer pulls empty incremental data and
> therefore doesn't upsert anything but still runs compactions based on its
> schedule? Will that work? Please also share if there are other ways to
> achieve async compaction without using DeltaStreamer.
>
> Thank you,
> Anton Zuyeu
