Ilja Weis created JAMES-3710:
--------------------------------

             Summary: Restarting James while deleting using POP3 causes 
inconsistency
                 Key: JAMES-3710
                 URL: https://issues.apache.org/jira/browse/JAMES-3710
             Project: James Server
          Issue Type: Bug
    Affects Versions: master
            Reporter: Ilja Weis
         Attachments: logs.zip

Running James (distributed-pop3-app) and restarting it while
it is processing a lot of POP3 sessions that are deleting messages
causes the tables messagev3, pop3metadata and imapuidtable to be
inconsistent, leading to the problems discussed in JAMES-3709.

The effect after applying the changes from 
https://github.com/apache/james-project/pull/861 is that POP3 sessions TOPing 
messages that no longer exist
but still are in pop3metadata now cause "-ERR Message (12) does not exist."
as expected but no longer destroy the whole session.

The interesting part is why the data becomes inconsistent. This is probably
an edge case because we're probably going to restart James all the time
but perhaps there's a problem somewhere that's just more likely to happen
when restarting?

My setup: 4 James instances in Kubernetes, S3 storage, Cassandra
cluster, 2 datacenters with 3 nodes each, replication DC1:3,DC2:3. Relevant
cassandra properties:

cassandra.consistency_level.regular=QUORUM
cassandra.consistency_level.lightweight_transaction=SERIAL
message.read.strong.consistency=false
message.write.strong.consistency.unsafe=false
mailbox.read.strong.consistency=false
mailbox.read.repair.chance=0.00
mailbox.counters.read.repair.chance.max=0.000
mailbox.counters.read.repair.chance.one.hundred=0.000

What I did:

- Send a number of mails to 200 mailboxes. After this:

> select count(*) from messagev3;

 count
-------
 16837

> select count(*) from imapuidtable;

 count
-------
 16837

> select count(*) from pop3metadata ;

 count
-------
 16837

Then, start deleting all those messages with 40 parallel sessions.
There are no concurrent sessions to the same account.

While the deletes are running, restart all the james instances.
After a moment, we have:

> select count(*) from messagev3;

 count
-------
 14669

> select count(*) from imapuidtable;

 count
-------
 14194

> select count(*) from pop3metadata ;

 count
-------
 14669

Not sure if messagev3 is relevant here, just adding it for completeness.

Now if I'm accessing the mailboxes, this will touch some of the
messages that are no longer in imapuidtable which of course leads to

-ERR Message (12) does not exist.

and

16:32:48.175 [WARN ] o.a.j.p.m.DistributedMailboxAdapter - Removing 
cc9b9f70-7f8d-11ec-b659-0f643f8b11e7 from 8870f760-7f87-11ec-9109-5fc36fc8b18d 
POP3 projection for user james-032@james.testing at it is not backed by a 
MailboxMessage

Running the fixPop3Inconsistencies task in such a situation would then
clean up all the pop3metadata messages as "stale".

I have attached the James log while deleting/restarting and after
restarting for all 4 instances.

What do we make of this? Is this something relevant or more like something
that just can happen if we restart James at the wrong time?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org

Reply via email to