Benoit Tellier created JAMES-3900:
-------------------------------------
Summary: Running task updates stalled on the Distributed task manager
Key: JAMES-3900
URL: https://issues.apache.org/jira/browse/JAMES-3900
Project: James Server
Issue Type: Improvement
Components: task
Reporter: Benoit Tellier
Fix For: 3.8.0
While performing a long reindexing, we encountered the following error:
{code:java}
reactor.core.Exceptions$ErrorCallbackNotImplemented: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT5S
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT5S
    at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:207)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
    at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
{code}
After that, scheduled updates for the task no longer happen.
After investigation, errors raised while polling updates within
SerialTaskManager are not handled, so the whole subscription is cancelled,
which is the default Reactor behaviour.
We should likely handle this error and prevent it from aborting the overall
process. I will propose a PR doing just this.
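To illustrate the idea of handling the error per poll, here is a minimal
sketch in plain Java. It is not Reactor code and not the actual James fix:
the class name ResilientPollingSketch, the pollAll helper and the flaky poll
are all hypothetical. It only shows that catching the failure around each
individual poll lets later polls still run. In Reactor, the analogous approach
would presumably be to recover on the per-poll publisher (for instance with
onErrorResume) so that the periodic subscription itself is not cancelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the polling loop; none of these names come from James.
final class ResilientPollingSketch {

    // Runs `ticks` polls. A failing poll is recorded and skipped
    // instead of terminating the whole loop.
    static List<String> pollAll(int ticks, Supplier<String> poll) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < ticks; i++) {
            try {
                results.add(poll.get());
            } catch (RuntimeException e) {
                // Error is contained to this tick; the next tick still runs.
                results.add("skipped: " + e.getMessage());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails on the second call only, mimicking a transient query timeout.
        Supplier<String> flakyPoll = () -> {
            if (++calls[0] == 2) {
                throw new IllegalStateException("Query timed out after PT5S");
            }
            return "update " + calls[0];
        };
        // prints [update 1, skipped: Query timed out after PT5S, update 3]
        System.out.println(pollAll(3, flakyPoll));
    }
}
```

Without the try/catch around the individual poll, the first timeout would end
the loop, which mirrors the stalled updates described above.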
Also, using event sourcing to manage task updates is a somewhat debatable
choice... At one update every 30s, a task generating 10KB of JSON per update
(not uncommon, e.g. if a task generates a large error report...) and running
for a week could easily generate 200MB of data, read at consistency level
SERIAL from Cassandra, which is likely too much to expect, to be honest...
(not to mention the *massive* deserialization effort...)
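The 200MB figure can be checked with a quick back-of-the-envelope
computation. The sketch below assumes 1KB = 1000 bytes and a constant 10KB
per update; the class and method names are hypothetical.

```java
import java.time.Duration;

// Back-of-the-envelope estimate of the update history size; names are illustrative only.
final class TaskUpdateVolumeEstimate {

    // Number of updates emitted over the run, at one update per interval.
    static long updateCount(Duration runTime, Duration interval) {
        return runTime.getSeconds() / interval.getSeconds();
    }

    // Total bytes accumulated if every update has the same size.
    static long totalBytes(Duration runTime, Duration interval, long bytesPerUpdate) {
        return updateCount(runTime, interval) * bytesPerUpdate;
    }

    public static void main(String[] args) {
        Duration week = Duration.ofDays(7);
        Duration period = Duration.ofSeconds(30);
        long bytes = totalBytes(week, period, 10_000L);
        // prints "20160 updates, 201 MB"
        System.out.println(updateCount(week, period) + " updates, " + bytes / 1_000_000 + " MB");
    }
}
```

That is 20160 updates of 10KB each, so roughly 200MB of history for a single
task.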
As such, I propose to move polling updates management out of the aggregate
and to have a dedicated storage API for it. I will likely do it in a follow-up
of this ticket...
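As a rough idea of what such a dedicated storage API could look like, here is
a purely hypothetical sketch. TaskUpdateStore, InMemoryTaskUpdateStore and
their methods are invented for illustration and do not exist in James. It
also assumes that only the latest update per task needs to be kept, which
this ticket does not state.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical API: none of these names come from James.
interface TaskUpdateStore {
    // Stores the latest update for a task, replacing any previous one.
    void save(String taskId, String additionalInformationJson);

    // Returns the latest stored update for a task, if any.
    Optional<String> latest(String taskId);
}

// In-memory implementation, for illustration only.
final class InMemoryTaskUpdateStore implements TaskUpdateStore {
    private final Map<String, String> latestByTask = new ConcurrentHashMap<>();

    @Override
    public void save(String taskId, String additionalInformationJson) {
        // Overwrites the previous value: no history accumulates.
        latestByTask.put(taskId, additionalInformationJson);
    }

    @Override
    public Optional<String> latest(String taskId) {
        return Optional.ofNullable(latestByTask.get(taskId));
    }
}
```

Under that assumption, reading the current state of a task is a single lookup
rather than a replay of every update stored in the event history.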