[ https://issues.apache.org/jira/browse/JAMES-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714418#comment-17714418 ]

Benoit Tellier commented on JAMES-3900:
---------------------------------------

https://github.com/apache/james-project/pull/1523 covers the error management 
part of the topic.

https://github.com/apache/james-project/pull/1533 proposes to avoid the timeout 
in the first place...

It uses snapshots (introduced by JAMES-3777) upon polled updates: they fully 
capture the state of the aggregate.
This prevents loading the full history for long-running tasks, and thus prevents 
timeouts upon polling updates.

Before this change, the aggregate contained a lot of events that were 
periodically loaded:

{code:java}
Widest Partitions:
   [Task/2d534232-fade-47aa-8c6f-1ec2d0f238f9] 1855
   [Task/d8499ec7-a62a-4541-970a-7339fccf23e8] 779
{code}


This eventually resulted in timeouts, as the full partition was loaded:

{code:java}
reactor.core.Exceptions$ErrorCallbackNotImplemented: 
com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after 
PT5S
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed 
out after PT5S
        at 
com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:207)
        at 
io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
        at 
io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
        at 
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
        at 
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
{code}

With this work, this is mitigated by loading only the latest event...
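
A minimal Reactor sketch of that snapshot-based load; all type and method names 
below are illustrative, not the actual james-project classes:

{code:java}
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Illustrative types only: the real task aggregate, event store and snapshot
// store live in the james-project event-sourcing modules.
interface TaskEvent {}

interface TaskAggregate {
    TaskAggregate apply(TaskEvent event);
}

interface TaskSnapshot {
    long lastEventId();
    TaskAggregate toAggregate();
}

interface SnapshotStore {
    Mono<TaskSnapshot> latestSnapshot(String aggregateId);
}

interface EventStore {
    Flux<TaskEvent> eventsAfter(String aggregateId, long afterEventId);
    Flux<TaskEvent> allEvents(String aggregateId);
}

class PollingAggregateLoader {
    private final SnapshotStore snapshotStore;
    private final EventStore eventStore;

    PollingAggregateLoader(SnapshotStore snapshotStore, EventStore eventStore) {
        this.snapshotStore = snapshotStore;
        this.eventStore = eventStore;
    }

    Mono<TaskAggregate> load(String aggregateId, TaskAggregate empty) {
        // Read the latest snapshot (it fully captures the aggregate state) and
        // replay only the events appended after it, instead of the whole
        // partition on every poll.
        return snapshotStore.latestSnapshot(aggregateId)
            .flatMap(snapshot -> eventStore.eventsAfter(aggregateId, snapshot.lastEventId())
                .reduce(snapshot.toAggregate(), TaskAggregate::apply))
            // No snapshot yet: fall back to a full replay.
            .switchIfEmpty(eventStore.allEvents(aggregateId)
                .reduce(empty, TaskAggregate::apply));
    }
}
{code}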

> Running task updates stalled on the Distributed task manager
> ------------------------------------------------------------
>
>                 Key: JAMES-3900
>                 URL: https://issues.apache.org/jira/browse/JAMES-3900
>             Project: James Server
>          Issue Type: Improvement
>          Components: task
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Upon performing a long reindexing, we were prompted with the following error:
> {code:java}
> reactor.core.Exceptions$ErrorCallbackNotImplemented: 
> com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out 
> after PT5S
> Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query 
> timed out after PT5S
>       at 
> com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:207)
>       at 
> io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
>       at 
> io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
>       at 
> io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
>       at 
> io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
>       at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
>       at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> After which, scheduled updates for the task no longer happen.
> After investigation: errors upon polling updates within SerialTaskManager are 
> not handled, so cancelling the whole subscription is the default Reactor 
> behaviour.
> We should likely manage this error and prevent it from aborting the overall 
> process. I will propose a PR doing just this.
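>
> For illustration, a minimal Reactor sketch of that error management, assuming 
> a periodic poll; the names are illustrative and this is not the actual PR's 
> code:
>
> {code:java}
> import java.time.Duration;
>
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> import com.datastax.oss.driver.api.core.DriverTimeoutException;
>
> import reactor.core.publisher.Flux;
> import reactor.core.publisher.Mono;
>
> class PollingErrorHandlingSketch {
>     private static final Logger LOGGER = LoggerFactory.getLogger(PollingErrorHandlingSketch.class);
>
>     // Hypothetical stand-in for the SerialTaskManager update poll.
>     private Mono<String> pollTaskUpdates(String taskId) {
>         return Mono.just("latest update for " + taskId);
>     }
>
>     void keepPollingAlive(String taskId) {
>         Flux.interval(Duration.ofSeconds(30))
>             .concatMap(tick -> pollTaskUpdates(taskId)
>                 // Without an error handler, a single DriverTimeoutException
>                 // cancels the whole subscription and updates stall for good.
>                 // Swallow the failed poll instead; the next tick retries.
>                 .onErrorResume(DriverTimeoutException.class, e -> {
>                     LOGGER.warn("Polling updates for task {} timed out, retrying on next tick", taskId, e);
>                     return Mono.empty();
>                 }))
>             .subscribe(update -> LOGGER.info("Task {} update: {}", taskId, update));
>     }
> }
> {code}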
> Also, using event sourcing for managing task updates is a somewhat debatable 
> choice... At one update every 30s, a task generating 10KB of JSON per update 
> (not uncommon, e.g. if a task generates a large error report...) and running 
> for a week could easily generate 200MB of data being read at consistency 
> level SERIAL from Cassandra, which is likely too much of an expectation to be 
> honest... (not mentioning the *massive* deserialization effort...)
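>
> A back-of-the-envelope check of that figure:
>
> {code}
> 1 update every 30s for a week: 7 * 86400s / 30s ≈ 20160 updates
> 20160 updates * 10KB each     ≈ 200MB replayed on a single poll near the end
> {code}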
> As such, I propose to move polling updates management out of the aggregate and 
> have a dedicated storage API for it. I will likely do it in a follow-up of this 
> ticket...
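>
> For illustration only, such a dedicated storage API could be as small as the 
> following (hypothetical, not an existing james-project interface):
>
> {code:java}
> import reactor.core.publisher.Mono;
>
> // Purely illustrative: latest-state storage for task progress, kept outside
> // the event-sourced aggregate.
> interface TaskUpdateStore {
>     // Overwrite the latest progress for a task: a single small row to read
>     // back on each poll, however long the task has been running.
>     Mono<Void> saveLatest(String taskId, String additionalInformationJson);
>
>     Mono<String> readLatest(String taskId);
> }
> {code}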


