[
https://issues.apache.org/jira/browse/QPID-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079309#comment-14079309
]
Kim van der Riet commented on QPID-5924:
----------------------------------------
I have examined the source code and made some measurements using {{stap}}, and
find that at the moment the linearstore is consuming 2 file descriptors per
queue during recovery. Once recovery is complete, this number drops to the
expected value of 1 per queue.
The first of these file descriptors is a dedicated file handle that each queue
keeps open. It is held by the leading {{JournalFile}} object and is used for
AIO write operations.
During recovery only, a second file handle is used for reading the
journals. Recovery is performed serially, so a single read handle should
suffice for the entire recovery. However, because of an error in the code, the
read handle is not closed explicitly when each queue has completed
recovery. The handles therefore remain open until the end of the entire
recovery phase, at which point they are all released together. This causes an
additional file handle to be consumed for each queue (temporarily, until the
end of the recovery).
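The intended behaviour (one read handle at a time during serial recovery) can
be sketched as follows. This is only an illustration: {{ScopedFd}} and
{{recoverSerially}} are hypothetical names, not classes or functions from the
linearstore source, and the actual record reading is elided.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <vector>

// Hypothetical RAII wrapper (not part of the linearstore): the descriptor is
// closed as soon as the object goes out of scope.
class ScopedFd {
public:
    explicit ScopedFd(const std::string& path)
        : fd_(::open(path.c_str(), O_RDONLY)) {}
    ~ScopedFd() { if (fd_ >= 0) ::close(fd_); }
    ScopedFd(const ScopedFd&) = delete;
    ScopedFd& operator=(const ScopedFd&) = delete;

    bool ok() const { return fd_ >= 0; }
    int get() const { assert(fd_ >= 0); return fd_; }

private:
    int fd_;
};

// Recovers the journals one after another and returns how many could be
// opened. At most one read descriptor is open at any time, because the
// ScopedFd is destroyed at the end of each loop iteration.
std::size_t recoverSerially(const std::vector<std::string>& journalPaths) {
    std::size_t recovered = 0;
    for (const std::string& path : journalPaths) {
        ScopedFd reader(path);
        if (!reader.ok()) continue;  // a real store would report the error
        // ... read and replay records from reader.get() here ...
        ++recovered;
    }
    return recovered;
}
```

With this structure the number of descriptors used for reading stays at one
regardless of how many queues are recovered.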
I have a proposed patch which does the following:
# Fixes the bug which keeps a file handle open for each queue during recovery,
so that only one file handle is used during the entire recovery process.
# No longer opens the dedicated per-queue file handle in the {{JournalFile}}
objects on initialization. Instead, it is opened on first use. This delays the
consumption of file handles until necessary, and indefinitely on queues which
are not in active use.
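The open-on-first-use idea in the second item can be sketched as below. The
class {{LazyJournalFile}} is a hypothetical stand-in and does not reflect the
real {{JournalFile}} interface. It also uses a plain {{open()}} with
{{O_WRONLY}}, which is an assumption made for brevity; the flags the store
needs for AIO may differ.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <utility>

// Hypothetical stand-in for a journal file object (not the real JournalFile).
// No descriptor is consumed until writeHandle() is first called.
class LazyJournalFile {
public:
    explicit LazyJournalFile(std::string path) : path_(std::move(path)) {
        assert(!path_.empty());
    }
    ~LazyJournalFile() { close(); }
    LazyJournalFile(const LazyJournalFile&) = delete;
    LazyJournalFile& operator=(const LazyJournalFile&) = delete;

    bool isOpen() const { return fd_ >= 0; }

    // Opens the file on the first call and returns the cached descriptor on
    // later calls. Returns -1 if the file cannot be opened.
    int writeHandle() {
        if (fd_ < 0) fd_ = ::open(path_.c_str(), O_WRONLY);
        return fd_;
    }

    // Releases the descriptor; a later writeHandle() call reopens the file.
    void close() {
        if (fd_ >= 0) {
            ::close(fd_);
            fd_ = -1;
        }
    }

private:
    std::string path_;
    int fd_ = -1;
};
```

A queue that is declared but never written to would, in this sketch, never
consume a descriptor at all.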
These changes would optimise the consumption of file handles and, in addition,
allow recovery to take place using a single file handle. However, there is
still a limit to the number of file handles that may practically be used on a
given hardware configuration. The larger goal of using a pool of file handles,
so that the number of queues that can be handled may be well in excess of the
maximum number of available file descriptors, will have to be tackled as a
later enhancement.
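For discussion purposes, one possible shape for such a pool is a bounded
least-recently-used cache of descriptors. The sketch below is purely
hypothetical: {{FdPool}} does not exist in the linearstore, the LRU policy is
my assumption, and it ignores the question of write operations that are still
in flight when a handle is evicted.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical bounded pool of write descriptors with LRU eviction.
class FdPool {
public:
    explicit FdPool(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    ~FdPool() {
        for (const Entry& e : lru_) ::close(e.second);
    }
    FdPool(const FdPool&) = delete;
    FdPool& operator=(const FdPool&) = delete;

    // Returns an open descriptor for path, opening it (and evicting the
    // least recently used entry if the pool is full) when necessary.
    // Returns -1 if the file cannot be opened.
    int acquire(const std::string& path) {
        auto it = index_.find(path);
        if (it != index_.end()) {
            // Already open: move the entry to the front (most recently used).
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        if (lru_.size() == capacity_) evictOldest();
        int fd = ::open(path.c_str(), O_WRONLY);
        if (fd < 0) return -1;
        lru_.emplace_front(path, fd);
        index_[path] = lru_.begin();
        return fd;
    }

    std::size_t openCount() const { return lru_.size(); }
    bool holds(const std::string& path) const { return index_.count(path) != 0; }

private:
    using Entry = std::pair<std::string, int>;  // (path, descriptor)

    // Closes and forgets the entry at the back of the list (least recent).
    void evictOldest() {
        const Entry& victim = lru_.back();
        ::close(victim.second);
        index_.erase(victim.first);
        lru_.pop_back();
    }

    std::size_t capacity_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

In this sketch a descriptor returned by {{acquire()}} is only safe to use
until the next {{acquire()}} call, since that call may evict it. A real pool
would need pinning or reference counting around outstanding AIO operations.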
> [linearstore] Qpidd Will Not Start with Large Number of Queues
> --------------------------------------------------------------
>
> Key: QPID-5924
> URL: https://issues.apache.org/jira/browse/QPID-5924
> Project: Qpid
> Issue Type: Bug
> Components: C++ Broker
> Affects Versions: 0.22
> Environment: qpid-cpp-server-0.22-42
> qpid-cpp-server-linearstore-0.22-42
> Reporter: Brian Bouterse
> Assignee: Kim van der Riet
> Priority: Critical
>
> Pulp is an open source project that uses Qpidd. Pulp needs a large
> number of queues (10K+), and these queues need to be durable. When creating a
> large number of queues (thousands), if you restart qpidd, it won't start.
> Here is how to reproduce:
> 1. Install qpid-cpp-server and qpid-cpp-server-store
> 2. Start qpidd
> 3. Create a crazy number of unique queues (10K) with durability
> 4. Restart Qpidd
> 5. Observe an error message such as the following
> Starting Qpid AMQP daemon: Daemon startup failed: Queue
> pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119: recoverMessages() failed:
> jexception 0x0104 RecoveryManager::getFile() threw JERR__FILEIO: File read or
> write failure.
> (/var/lib/qpidd/qls/jrnl/pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119/818fa4b0-3319-4478-b2b0-d2195f90f695.jrnl)
>
> (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)
> Looking at the /var/lib/qpidd/qls/jrnl/ directory, there are 2676 jrnl files,
> 2640 of which start with pulp.agent. In our case most of the queues are named
> 'pulp.agent.<UUID>'.
> The expected behavior is that qpidd would start and run awesome with a crazy
> number of queues (1 Million +).
> Raising the number of file descriptors is a viable workaround, but eventually
> those will run out too. It would be an architectural win if a constant number
> of file descriptors were used that are not affected by the number of queues
> qpidd is managing.
> Perhaps this could be introduced as a new journal type that would run slower
> but be more scalable. It could be introduced as
> qpid-cpp-server-crazy-scalable-but-slower-store.