https://bugs.exim.org/show_bug.cgi?id=3142

            Bug ID: 3142
           Summary: Exim Performance Regression: 200ms Delay on Connection
                    Closure
           Product: Exim
           Version: 4.96
          Hardware: x86-64
                OS: Linux
            Status: NEW
          Severity: bug
          Priority: medium
         Component: ACLs
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected]

Hello, 

A performance regression has been identified when running Exim 4.96 on Linux
kernel 6.1 compared to Exim 4.94.2 on kernel 5.10. Every SMTP connection
experiences a consistent 200ms delay during connection closure.

Environment:

Working configuration: Exim 4.94.2-7+deb11u4 on Linux kernel 5.10 (using select
syscall)
Problematic configuration: Exim 4.96-15+deb12u6 on Linux kernel 6.1 (using poll
syscall)
Distribution: Debian 11 (working) and Debian 12 (problematic)


Issue Details

When handling SMTP connections, Exim 4.96 deliberately calls poll() with a
200ms timeout right before closing connections. This is visible in smtp_in.c:

    (void) poll_one_fd(smtp_in_fd, POLLIN, 200);
    DEBUG(D_any) debug_printf_indent("SMTP(close)>>\n");
    smtp_inout_close();


The comment in the code explains the purpose:

"An overenthusiastic fail2ban/iptables implementation has been seen to result
in the TCP conn staying open, and retrying, despite this process exiting. A
malicious client could possibly do the same, tying up server networking
resources."

On kernel 5.10, with Exim 4.94.2, similar functionality used the select()
system call and did not cause significant delays. On kernel 6.1, with Exim
4.96, poll() consistently waits for the full 200ms timeout, introducing a
delay for every SMTP transaction.

This poll delay occurs both in normal connection closures and in
security-related code that handles ACL failures, suggesting it was implemented
as a security measure, but the behavior change in kernel 6.1 has transformed it
into a performance bottleneck.


strace Output Showing the Delay

[pid 1231632]      0.000 read(7, "QUIT\r\n", 8191) = 6
[pid 1231632]      0.000 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0644, st_size=114, ...}, 0) = 0
[pid 1231632]      0.000 poll([{fd=7, events=POLLIN}], 1, 200) = 0 (Timeout)
[pid 1231632]      0.200 write(6, "221 server_name closing"..., 45) = 45



The previous working version (Exim 4.94.2) does not have this delay:

[pid 1132755]      0.000 read(7, "QUIT\r\n", 8191) = 6
[pid 1132755]      0.000 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
[pid 1132755]      0.000 write(6, "221 server_name closing"..., 45) = 45
[pid 1132755]      0.000 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid 1132755]      0.000 read(7, 0x7ffd1eb13db0, 128) = -1 EAGAIN (Resource temporarily unavailable)
[pid 1132755]      0.000 exit_group(0)  = ?
[pid 1132755]      0.000 +++ exited with 0 +++
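
The fcntl()/read() pair in the 4.94.2 trace looks like a non-blocking drain of
the input socket before exit. The following is a rough sketch of that pattern
under stated assumptions, not the actual Exim source: drain_nonblocking() is a
hypothetical name, the 128-byte buffer size is taken from the trace, and the
loop around read() is an addition for illustration (the trace shows a single
read that returned EAGAIN).

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: discard whatever input is already queued on fd
   without ever blocking. Returns the number of bytes discarded, or -1
   on an error other than "no data available". */
static long drain_nonblocking(int fd)
{
    char buf[128];
    long total = 0;

    int flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {            /* queued data: count it and keep reading */
            total += n;
            continue;
        }
        if (n == 0)             /* peer closed: nothing more to drain */
            return total;
        if (errno == EINTR)     /* interrupted by a signal: retry */
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;       /* queue is empty: done, no waiting */
        return -1;
    }
}
```

Unlike a poll() with a timeout, this returns as soon as the queue is empty,
which matches the 0.000 deltas in the trace above.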


ftrace Output Confirming Poll Behavior
  0)               |  __x64_sys_poll() {
  0)               |    do_sys_poll() {
  0)               |      sock_poll() {
  0)               |        tcp_poll() {
  0)   0.421 us    |          __pollwait();
  0)   1.182 us    |        }
  0)   1.783 us    |      }

  ... [other socket polls] ...

  0)   0.641 us    |      poll_freewait();
  0) @ 204432.8 us |    }
  0) @ 204435.1 us |  }

The ftrace output shows the poll() call taking roughly 204ms (204432.8 us in
do_sys_poll), confirming that the kernel waits for the full timeout period.


Thank you for your attention to this issue.

Best,
Konstantin

-- 
You are receiving this mail because:
You are on the CC list for the bug.

-- 
## subscription configuration (requires account):
##   https://lists.exim.org/mailman3/postorius/lists/exim-dev.lists.exim.org/
## unsubscribe (doesn't require an account):
##   [email protected]
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
