Re: [PROPOSAL] Asynchronous & Reliable Tasks

Pierre Laporte Tue, 16 Dec 2025 07:30:00 -0800

Thanks for the heads up, Robert.  I reviewed the API/SPI/Store PR and, apart
from minor Javadoc comments, that is a +1 from me.  I especially like the
clear SPI definition that allows for using the data store of choice.  This is
really good work.


Go team!

--

Pierre


On Fri, Dec 12, 2025 at 12:53 PM Robert Stupp <[email protected]> wrote:

> Hi all,
>
> Thanks for the discussion so far.
> Apologies for the period of silence; I would like to revive this
> discussion!
>
> The context, for those who haven't followed the thread since the
> beginning, is to provide a resilient framework to submit long-running
> tasks and to eventually execute those on "any" live Polaris instance.
> (You can find the "full version" in the initial email of this thread
> [1] from May 2025.)
>
> For some context, a related proposal [2] was made in June 2025 to
> keep the existing implementations but move the task execution to a
> separate service, with the current, local behavior as the fallback (or
> default if you like).
>
> The "Async & reliable tasks" proposal allows instances to choose
> whether tasks can only be submitted or whether tasks can also be
> executed. In other words, support for delegation is built-in.
>
> Related to the overall effort is the "Object store functionality"
> proposal [3] (via PR [3256]) to provide a CPU- and heap-friendly API and
> implementation to work against object stores. It is built in a way to
> provide "pluggable" functions.
>
> The "object store functionality" proposal implicitly addresses the
> current issue of running into out-of-memory errors when purging
> Iceberg tables. Details about that issue can be found in [4].
>
> I would like to bring the whole effort "back to life" and propose to
> 1. Start with [2180] and [3256]. Those two are orthogonal and do not
> depend on each other.
> 2. Continue with implementation PR(s) building on top of [2180] for
> both NoSQL and "DB native" persistence.
> 3. Provide a task behavior implementation using "object store
> functionality" to purge Iceberg tables and views.
> 4. Wire the behavior into the existing code base.
> The code base of the existing tasks implementation is not touched by
> this effort.
>
> If all that's in, we could even think about a more intelligent and
> fully automatic approach to purge unreferenced files in object stores
> to keep object store usage at a reasonable size.
>
> Looking forward to hearing your thoughts and a friendly and
> constructive collaboration!
>
> Robert
>
> [1] https://lists.apache.org/thread/gg0kn89vmblmjgllxn7jkn8ky2k28f5l
> (initial email of this thread)
> [2] https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
> (delegation service proposal)
> [3] https://lists.apache.org/thread/0z8nb3w58zb9s617gsoyhzlnz53rt9zx
> (object storage operations proposal)
> [3256] https://github.com/apache/polaris/pull/3256 (Object storage
> operations PR)
> [4] https://lists.apache.org/thread/9pgvhr9btfgzofbm6qhyfyqnk62hzp4m (OOM)
> [2180] https://github.com/apache/polaris/pull/2180 (Async & reliable
> tasks PR)
>
>
>
>
> On Mon, Aug 4, 2025 at 11:15 AM Robert Stupp <[email protected]> wrote:
> >
> > Right, the idea is to have a "common abstraction" first.
> > I'm actively looking into exactly that at the moment. Will come up
> > with a couple of PRs to enable this.
> > Some of it is implicitly covered by the work that Christopher's
> > contributing, although it's rather orthogonal.
> >
> > On Fri, Aug 1, 2025 at 6:54 PM Eric Maynard <[email protected]> wrote:
> > >
> > > I agree with Robert that the current implementation is not good and should
> > > be ripped out ASAP. However, I see this effort as complementary to Will's
> > > refactor, not as a dependency. We should first add a layer of abstraction
> > > between the business logic in Polaris and the task execution -- once that's
> > > in place, we can replace the existing task implementation behind that
> > > abstraction. At the same time, adding this abstraction will unlock the
> > > ability for us to implement remote task execution as well.
> > >
> > > --EM
> > >
> > > On Fri, Aug 1, 2025 at 6:31 AM Yufei Gu <[email protected]> wrote:
> > >
> > > > Thanks for the async task proposal. I think it's the right direction
> > > > for async light tasks. Meanwhile, we will still need other models:
> > > > 1. A scalable way to execute synchronous tasks
> > > > 2. A scalable way to execute heavy async tasks, e.g., table maintenance
> > > > tasks.
> > > >
> > > > The delegation service [1] is a good candidate for that.
> > > >
> > > > [1] https://docs.google.com/document/d/1AhR-cZ6WW6M-z8v53txOfcWvkDXvS-0xcMe3zjLMLj8/edit?tab=t.0#heading=h.xjibr7sfbv6a
> > > >
> > > > Yufei
> > > >
> > > >
> > > > On Thu, Jul 31, 2025 at 11:37 AM Russell Spitzer <[email protected]> wrote:
> > > >
> > > > > I'm fine with the plan although I think we should probably change step 4
> > > > > to allow both the current implementation and the new implementation to
> > > > > exist at the same time with a flag for switching over to the new task
> > > > > implementation. While the new implementation may be much better, it is a
> > > > > pretty significant behavior change that I think should be opt in until it's
> > > > > been in Polaris for a release or two. After that we could force all users
> > > > > to switch once it's been out in the wild for a bit.
> > > > >
> > > > > On 2025/07/30 01:30:43 William Hyun wrote:
> > > > > > >
> > > > > > > Considering the current issues, I don't think it's worth the effort to
> > > > > > > keep the current implementation.
> > > > > >
> > > > > >
> > > > > > It seems risky to me to not support the current implementation at least for
> > > > > > the period where the new tasks implementation is unstable.
> > > > > >
> > > > > > Bests,
> > > > > > William
> > > > > >
> > > > > > On Tue, Jul 29, 2025 at 3:49 AM Robert Stupp <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > (starting w/ a recap for everybody watching this thread)
> > > > > > > The goal of this is to have a mechanism to guarantee the _eventual_
> > > > > > > execution of a task. That may happen immediately on the same node or
> > > > > > > at a later time on another node.
> > > > > > > This particular "async reliable tasks" is to ensure that tasks run
> > > > > > > eventually in any Polaris node. The related "Delegation Service"
> > > > > > > proposal is to let tasks run in a separate, different remote service.
> > > > > > > But it requires a "local fallback" in case the remote service is not
> > > > > > > available, which would be provided by this proposal.
> > > > > > >
> > > > > > > Currently, all scheduled and running tasks are "lost", if Polaris is
> > > > > > > stopped, killed or crashed. So I'd prefer to get this proposal in
> > > > > > > first to address the current issues and have a reliable fallback for
> > > > > > > the Delegation Service.
> > > > > > >
> > > > > > > Considering the current issues, I don't think it's worth the effort to
> > > > > > > keep the current implementation.
> > > > > > >
> > > > > > > Both, this proposal and the Delegation Service, shouldn't rely on
> > > > > > > Polaris entities but rather have targeted definitions for the tasks to
> > > > > > > execute, which contain exactly (and not more) what the tasks need to
> > > > > > > be executed.
> > > > > > >
> > > > > > > So I think the following steps (approx 1 PR for each) would be:
> > > > > > > 1. Add the tasks API (the draft PR [1])
> > > > > > > 2. Add the tasks implementation, w/o any persistence integration but
> > > > > > > with mock testing
> > > > > > > 3. Add persistence integration
> > > > > > > 4. Replace current task implementation with the new one
> > > > > > >
> > > > > > > I'll probably have more details soon-ish.
> > > > > > >
> > > > > > > Robert
> > > > > > >
> > > > > > > [1] https://github.com/apache/polaris/pull/2180
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jul 28, 2025 at 6:22 AM William Hyun <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hey Robert!
> > > > > > > >
> > > > > > > > Thank you for the draft PR.
> > > > > > > > I have taken a look and the general approach seems good to me.
> > > > > > > > However, one of my concerns would be the timeline to deliver this new
> > > > > > > > task framework refactoring as this could be intrusive due to the scope
> > > > > > > > of the change.
> > > > > > > > What do you plan as the ETA for delivering this change?
> > > > > > > >
> > > > > > > > It seems we need to support both the pre-existing (v1) and new task
> > > > > > > > framework (v2) until we are sure that v2 is stabilized so that we can
> > > > > > > > delete v1.
> > > > > > > > With the Delegation Service proposal being a new feature for users, I
> > > > > > > > am proposing to include it within the 1.1 release as a small, optional
> > > > > > > > extension and also support it in v2 by reusing via implementing v2's
> > > > > > > > SPI module as we previously discussed.
> > > > > > > > I also have opened a PR demonstrating what the Delegation Service
> > > > > > > > looks like here:
> > > > > > > >
> > > > > > > > - https://github.com/apache/polaris/pull/2193
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > Bests,
> > > > > > > > William
> > > > > > > >
> > > > > > > > On Thu, Jul 24, 2025 at 11:18 AM Robert Stupp <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > As discussed on the Polaris Community Sync today, we're aligned that
> > > > > > > > > > the current tasks handling needs some refactoring.
> > > > > > > > > >
> > > > > > > > > > This proposal focuses on the "eventual execution" of a task.
> > > > > > > > > > Implementations would follow.
> > > > > > > > > > The "Delegation Service" [1] proposal focuses on the execution of
> > > > > > > > > > tasks "outside" of Polaris.
> > > > > > > > > >
> > > > > > > > > > I've pushed a draft PR [2] with the Java interfaces and value types
> > > > > > > > > > for the API, the SPI (behavior implementation) and store (used by
> > > > > > > > > > tasks implementations).
> > > > > > > > > >
> > > > > > > > > > The only entry point is the `org.apache.polaris.tasks.api.Tasks`
> > > > > > > > > > interface with a function defining the behavior and providing a
> > > > > > > > > > parameter object (if necessary), returning a `TaskSubmission`. Call
> > > > > > > > > > sites _may_ subscribe to a `CompletionStage`, but the idea is that
> > > > > > > > > > it's rather "fire and forget" and the task behavior does "everything
> > > > > > > > > > that's needed". This allows the task to be executed on any node.
> > > > > > > > > > There's no guarantee in any form that a task will run "locally" or on
> > > > > > > > > > any other specific node. Every Polaris node can handle task execution
> > > > > > > > > > and perform failure/retry handling. Polaris nodes may use a "server"
> > > > > > > > > > implementation or a "client" implementation or a "remote"
> > > > > > > > > > implementation - that's defined upon deployment or by configuration
> > > > > > > > > > (TBD).
> > > > > > > > > >
> > > > > > > > > > I think that we can get to a Polaris internal API/SPI that can be
> > > > > > > > > > leveraged by both proposals.
> > > > > > > > > > This proposal is implementation and persistence backend agnostic.
> > > > > > > > > > There could be a "server" implementation that can run tasks, a
> > > > > > > > > > "client" implementation that can only submit tasks (think: from the
> > > > > > > > > > polaris-admin tool), and an implementation for the delegation service
> > > > > > > > > > to execute tasks remotely.
> > > > > > > > > >
> > > > > > > > > > I do have a working implementation sitting around locally that's
> > > > > > > > > > passing tests exercising concurrency, multi-node and failure
> > > > > > > > > > scenarios. Since there's only a store-implementation for NoSQL, I
> > > > > > > > > > haven't pushed that yet. Adding a store-implementation that solely
> > > > > > > > > > uses `BasePersistence` (JDBC) is not such a big deal.
> > > > > > > > > >
> > > > > > > > > > If we're okay with the approach in general, I can follow up with a
> > > > > > > > > > more concrete implementation including the "purge table" use case and
> > > > > > > > > > maybe another example use case.
> > > > > > > > > >
> > > > > > > > > > Robert
> > > > > > > > > >
> > > > > > > > > > [1] https://lists.apache.org/thread/ph10th4ocjczpf5gz17mqys4fkp5qrzw
> > > > > > > > > > [2] https://github.com/apache/polaris/pull/2180
> > > > > > > > >
> > > > > > > > > > On Mon, May 19, 2025 at 12:05 PM Robert Stupp <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Yes, each "task behavior" has an ID. I've chosen the term "task
> > > > > > > > > > > behavior" over "type", because it doesn't only define "what's done" but
> > > > > > > > > > > also "when" it's done (delay) and "how it behaves" (retries on failures).
> > > > > > > > > >
> > > > > > > > > > On 14.05.25 04:25, Adnan Hemani wrote:
> > > > > > > > > > > Hi Robert,
> > > > > > > > > > >
> > > > > > > > > > > > Firstly, thanks for this document. One quick question: is the
> > > > > > > > > > > > `behavior ID` basically the task type? This part was slightly unclear
> > > > > > > > > > > > to me.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Adnan Hemani
> > > > > > > > > > >
> > > > > > > > > > > >> On May 9, 2025, at 6:07 AM, Robert Stupp <[email protected]> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> Hi,
> > > > > > > > > > > >>
> > > > > > > > > > > >> Polaris is a service, which has to eventually perform operations
> > > > > > > > > > > >> asynchronously. Polaris is also meant to be backed by multiple server
> > > > > > > > > > > >> instances (think: high-availability & load-balancing setups).
> > > > > > > > > > > >>
> > > > > > > > > > > >> During runtime, things can go sideways in many ways. Server
> > > > > > > > > > > >> instances may crash, be killed or whatever... Task executions may fail,
> > > > > > > > > > > >> because some other remote service fails, configuration values (and
> > > > > > > > > > > >> credentials) may be wrong or other error situations.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Task execution should be resilient to both kinds of scenarios:
> > > > > > > > > > > >> being able to eventually recover from a "dead/lost node" scenario and
> > > > > > > > > > > >> to retry failed tasks.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Each individual task should also be executed only once.
> > > > > > > > > > > >>
> > > > > > > > > > > >> There are also different kinds of tasks with different behaviors:
> > > > > > > > > > > >> the "function" being executed and the retry behavior.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Proposal doc for this:
> > > > > > > > > > > >> https://www.google.com/url?q=https://docs.google.com/document/d/17D28E2ne5dzOHWc9DJ91Yz3lnQOtgmWaA_TBNdXv0sY/edit?tab%3Dt.0&source=gmail-imap&ust=1747400861000000&usg=AOvVaw3x56ChuB1ga0MelG6URxxi
> > > > > > > > > > >>
> > > > > > > > > > >> Robert
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> --
> > > > > > > > > > >> Robert Stupp
> > > > > > > > > > >> @snazy
> > > > > > > > > > >>
> > > > > > > > > > --
> > > > > > > > > > Robert Stupp
> > > > > > > > > > @snazy
> > > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
>
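
For readers skimming the thread, here is a minimal sketch of the submission API as it is described in the 24 Jul 2025 and 19 May 2025 messages quoted above: a single entry point that takes a task behavior plus a parameter object and returns a submission whose `CompletionStage` callers may optionally observe. Everything in it is an assumption made for illustration. The interface shapes, method names, the `InMemoryTasks` stand-in and the `FlakyPurgeBehavior` example are hypothetical and are not the interfaces from PR 2180; in particular, this sketch runs on the calling thread and has no persistence, so it does not survive a node crash.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical task behavior: "what" is done and "how it behaves" on failure. */
interface TaskBehavior<P, R> {
  String id(); // the behavior ID mentioned in the thread

  R execute(P param) throws Exception;

  /** Delay before the next attempt, or empty to give up. */
  Optional<Duration> retryDelay(int failedAttempts);
}

/** Hypothetical handle returned on submission; subscribing is optional. */
interface TaskSubmission<R> {
  String behaviorId();

  CompletionStage<R> completion();
}

/** Hypothetical single entry point for submitting tasks. */
interface Tasks {
  <P, R> TaskSubmission<R> submit(TaskBehavior<P, R> behavior, P param);
}

/** Illustrative stand-in only: runs synchronously, keeps nothing in a store. */
final class InMemoryTasks implements Tasks {
  @Override
  public <P, R> TaskSubmission<R> submit(TaskBehavior<P, R> behavior, P param) {
    CompletableFuture<R> result = new CompletableFuture<>();
    int failures = 0;
    while (!result.isDone()) {
      try {
        result.complete(behavior.execute(param));
      } catch (Exception e) {
        failures++;
        Optional<Duration> delay = behavior.retryDelay(failures);
        if (!delay.isPresent()) {
          result.completeExceptionally(e); // retries exhausted
        } else {
          try {
            Thread.sleep(delay.get().toMillis());
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            result.completeExceptionally(ie);
          }
        }
      }
    }
    return new TaskSubmission<R>() {
      @Override
      public String behaviorId() {
        return behavior.id();
      }

      @Override
      public CompletionStage<R> completion() {
        return result;
      }
    };
  }
}

/** Example behavior that fails twice before succeeding, to exercise retries. */
final class FlakyPurgeBehavior implements TaskBehavior<String, String> {
  final AtomicInteger attempts = new AtomicInteger();

  @Override
  public String id() {
    return "purge-table";
  }

  @Override
  public String execute(String table) throws Exception {
    if (attempts.incrementAndGet() < 3) {
      throw new IOException("object store unavailable");
    }
    return "purged " + table;
  }

  @Override
  public Optional<Duration> retryDelay(int failedAttempts) {
    return failedAttempts < 5 ? Optional.of(Duration.ofMillis(1)) : Optional.empty();
  }
}
```

A "fire and forget" call site would simply ignore the returned `TaskSubmission`. A "server" implementation in the sense of the thread would additionally record the submission in a store so that another node can pick it up, which this sketch deliberately leaves out.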
