Kim, I see these two settings in the code; can they be made configurable?

#define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
#define AIO_SLEEP_TIME_US 10 // 0.01 ms

Ram

On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:

> Thank you Kim, I will try your suggestions.
>
> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <kvand...@redhat.com wrote:
>
>> This error is a linearstore issue. It looks as though there is a single
>> write operation to disk that has become stuck, and is holding up all
>> further write operations. This happens because there is a fixed circular
>> pool of memory pages used for the AIO operations to disk, and when one
>> of these is "busy" (indicated by the A letter in the page state map),
>> write operations cannot continue until it is cleared. If it does not
>> clear within a certain time, then an exception is thrown, which usually
>> results in the broker closing the connection.
>>
>> The events leading up to a "stuck" write operation are complex and
>> sometimes difficult to reproduce. If you have a reproducer, then I would
>> be interested to see it! Even so, the ability to reproduce on another
>> machine is hard as it depends on such things as disk write speed, the
>> disk controller characteristics, the number of threads in the thread
>> pool (i.e. CPU type), memory and other hardware-related things.
>>
>> There are two linearstore parameters that you can try playing with to
>> see if you can change the behavior of the store:
>>
>> wcache-page-size: This sets the size of each page in the write buffer.
>> Larger page size is good for large messages, a smaller size will help if
>> you have small messages.
>>
>> wcache-num-pages: The total number of pages in the write buffer.
>>
>> Use the --help on the broker with the linearstore loaded to see more
>> details on this. I hope that helps a little.
>>
>> Kim van der Riet
>>
>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
>> > Any help in understanding why/when the broker throws those errors and
>> > stops receiving messages would be appreciated.
>> >
>> > Not sure if any kernel tuning or broker tuning needs to be done to
>> > solve this issue.
>> >
>> > Thanks in advance,
>> > Ram
>> >
>> > On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>> >
>> >> Also, from this log message (store level) it seems to be waiting for
>> >> AIO to complete.
>> >>
>> >> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal
>> >> "<journal name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
>> >> wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
>> >> ps=[-------------------------A------]
>> >>
>> >> page_state ps=[-------------------------A------] where A is AIO_PENDING
>> >> aer=1 _aio_evt_rem; ///< Remaining AIO events
>> >>
>> >> When there are pending AIOs, does the broker close the connection?
>> >> Is there any tuning that can be done to resolve this?
>> >>
>> >> Thanks,
>> >> Ram
>> >>
>> >> On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>> >>
>> >>> I was checking the code and I see these lines for that AIO timeout.
>> >>>
>> >>>     case qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
>> >>>         if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
>> >>>             THROW_STORE_EXCEPTION("Timeout waiting for AIO in MessageStoreImpl::recoverMessages()");
>> >>>         ::usleep(AIO_SLEEP_TIME_US);
>> >>>         break;
>> >>>
>> >>> And these are the defaults:
>> >>>
>> >>>     #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
>> >>>     #define AIO_SLEEP_TIME_US 10 // 0.01 ms
>> >>>
>> >>> RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page is
>> >>> waiting for AIO.
>> >>>
>> >>> So did the page get blocked, and is it waiting for page availability?
>> >>>
>> >>> Ram
>> >>>
>> >>> On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>> >>>
>> >>>> Actually we have upgraded from qpid-cpp 0.28 to 1.35 and after that we
>> >>>> see this message
>> >>>>
>> >>>> 2018-10-27 18:58:25 [Store] warning Linear Store: Journal
>> >>>> "<journal-name>": Bad record alignment found at fid=0x4605b offs=0x107680
>> >>>> (likely journal overwrite boundary); 19 filler record(s) required.
>> >>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal
>> >>>> "<journal-name>": Recover phase write: Wrote filler record: fid=0x4605b
>> >>>> offs=0x107680
>> >>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal
>> >>>> "<journal-name>": Recover phase write: Wr... few more Recover phase logs
>> >>>>
>> >>>> It worked fine for a day and started throwing this message:
>> >>>>
>> >>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<name>":
>> >>>> get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8
>> >>>> po=0 aer=1 edac=TFFF ps=[-------------------------A------]
>> >>>> 2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot deliver to
>> >>>> queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() failed:
>> >>>> jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT:
>> >>>> Timeout waiting for AIOs to complete.
>> >>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
>> >>>> 2018-10-28 12:27:01 [Broker] error Connection exception: framing-error:
>> >>>> Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202
>> >>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for
>> >>>> AIOs to complete.
>> >>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
>> >>>> 2018-10-28 12:27:01 [Protocol] error Connection
>> >>>> qpid.server-ip:5672-client-ip:44457 closed by error: Queue <queue-name>:
>> >>>> MessageStoreImpl::store() failed: jexception 0x0202
>> >>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for
>> >>>> AIOs to complete.
>> >>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)
>> >>>> 2018-10-28 12:27:01 [Protocol] error Connection
>> >>>> qpid.server-ip:5672-client-ip:44457 closed by error: illegal-argument:
>> >>>> Value for replyText is too large(320)
>> >>>>
>> >>>> Thanks,
>> >>>> Ram
>> >>>>
>> >>>> On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>> >>>>
>> >>>>> No, local disk.
>> >>>>>
>> >>>>> On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <g...@redhat.com> wrote:
>> >>>>>
>> >>>>>> On 05/11/18 22:58, rammohan ganapavarapu wrote:
>> >>>>>>> Gordon,
>> >>>>>>>
>> >>>>>>> We are using Java client version 0.28 and qpid-cpp version 1.35
>> >>>>>>> (qpid-cpp-server-1.35.0-1.el7.x86_64). I don't know in what scenario
>> >>>>>>> it happens, but after I restart the broker and we wait a few days, it
>> >>>>>>> happens again. From the above logs do you have any pointers to check?
>> >>>>>>
>> >>>>>> Are you using NFS?
>> >>>>>>
>> >>>>>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
>> >>>>>> For additional commands, e-mail: users-h...@qpid.apache.org
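
For anyone following the thread, here is a rough sketch of what the two defines quoted in this thread amount to, and of what a runtime-configurable version of that wait loop might look like. The loop sleeps AIO_SLEEP_TIME_US microseconds up to MAX_AIO_SLEEPS times, so the minimum wait before the exception is 100000 x 10 us = 1,000,000 us, i.e. about 1 second. Everything named below (AioWaitConfig, minTimeoutUs, waitForAio, the pagePending callback) is hypothetical and invented for illustration; none of it is part of qpid-cpp, and this is not how the linearstore is actually structured.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical runtime replacement for the two compile-time defines.
// The defaults mirror the values quoted in this thread.
struct AioWaitConfig {
    std::uint32_t maxSleeps = 100000;  // corresponds to MAX_AIO_SLEEPS
    std::uint32_t sleepTimeUs = 10;    // corresponds to AIO_SLEEP_TIME_US (0.01 ms)
};

// Minimum total wait, in microseconds, before a timeout would be raised.
inline std::uint64_t minTimeoutUs(const AioWaitConfig& cfg) {
    return static_cast<std::uint64_t>(cfg.maxSleeps) * cfg.sleepTimeUs;
}

// Polls pagePending() until it returns false, sleeping between polls.
// Throws once more than cfg.maxSleeps sleeps would be needed, mirroring the
// `++aio_sleep_cnt > MAX_AIO_SLEEPS` check in the quoted snippet.
// Returns the number of sleeps that were performed.
inline std::uint32_t waitForAio(const std::function<bool()>& pagePending,
                                const AioWaitConfig& cfg = AioWaitConfig()) {
    assert(pagePending);  // an empty callback is a programming error
    std::uint32_t sleepCount = 0;
    while (pagePending()) {
        if (++sleepCount > cfg.maxSleeps)
            throw std::runtime_error("Timeout waiting for AIO");
        std::this_thread::sleep_for(std::chrono::microseconds(cfg.sleepTimeUs));
    }
    return sleepCount;
}
```

With the defaults, minTimeoutUs() gives 1000000, which matches the "~1 sec" comment on the define. A sleep only guarantees a minimum duration, so the real elapsed time before a timeout is likely to be somewhat longer than that figure. Note also that the snippet quoted in the thread is from recoverMessages(), while the logged exception comes from jcntl::handle_aio_wait(), so whether the same two constants govern that path is something to check in the source.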