Do you have any kernel (net/disk) tuning recommendations for qpid-cpp with linearstore?
Ram

On Thu, Nov 8, 2018 at 8:56 AM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:

> Kim/Gordon,
>
> I was wrong about NFS: it looks like the qpid journal files are on
> NFS after all. Could NFS be causing this issue?
>
> Ram
>
> On Wed, Nov 7, 2018 at 12:18 PM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>
>> Kim,
>>
>> OK, I am still trying to see what part of my Java application is causing
>> that issue; yes, the issue is happening intermittently. Regarding the
>> "JERR_WMGR_ENQDISCONT" error, maybe these are chained exceptions from the
>> previous error, JERR_JCNTL_AIOCMPLWAIT?
>>
>> Does message size contribute to this issue?
>>
>> Thanks,
>> Ram
>>
>> On Wed, Nov 7, 2018 at 11:37 AM Kim van der Riet <kvand...@redhat.com> wrote:
>>
>>> No, they are not.
>>>
>>> These two defines govern the number of sleeps and the sleep time while
>>> waiting before throwing an exception, during recovery only. They do
>>> not play a role during normal operation.
>>>
>>> If you are able to compile the broker code, you can try playing with
>>> these values. But I don't think they will make much difference to the
>>> overall problem. I think some of the other errors you have been seeing
>>> prior to this one are closer to where the real problem lies, such as
>>> the JERR_WMGR_ENQDISCONT error.
>>>
>>> Do you have a reproducer of any kind? Does this error occur predictably
>>> under some or other conditions?
>>>
>>> Thanks,
>>>
>>> Kim van der Riet
>>>
>>> On 11/7/18 12:51 PM, rammohan ganapavarapu wrote:
>>> > Kim,
>>> >
>>> > I see these two settings in the code; can they be made configurable?
>>> >
>>> > #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
>>> >
>>> > #define AIO_SLEEP_TIME_US 10 // 0.01 ms
>>> >
>>> >
>>> > Ram
>>> >
>>> > On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>>> >
>>> >> Thank you Kim, I will try your suggestions.
>>> >>
>>> >> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <kvand...@redhat.com> wrote:
>>> >>
>>> >>> This error is a linearstore issue. It looks as though there is a single
>>> >>> write operation to disk that has become stuck, and is holding up all
>>> >>> further write operations. This happens because there is a fixed circular
>>> >>> pool of memory pages used for the AIO operations to disk, and when one
>>> >>> of these is "busy" (indicated by the letter A in the page state map),
>>> >>> write operations cannot continue until it is cleared. If it does not
>>> >>> clear within a certain time, then an exception is thrown, which usually
>>> >>> results in the broker closing the connection.
>>> >>>
>>> >>> The events leading up to a "stuck" write operation are complex and
>>> >>> sometimes difficult to reproduce. If you have a reproducer, then I would
>>> >>> be interested to see it! Even so, reproducing it on another
>>> >>> machine is hard, as it depends on such things as disk write speed, the
>>> >>> disk controller characteristics, the number of threads in the thread
>>> >>> pool (i.e. CPU type), memory and other hardware-related things.
>>> >>>
>>> >>> There are two linearstore parameters that you can try playing with to
>>> >>> see if you can change the behavior of the store:
>>> >>>
>>> >>> wcache-page-size: This sets the size of each page in the write buffer.
>>> >>> A larger page size is good for large messages; a smaller size will help if
>>> >>> you have small messages.
>>> >>>
>>> >>> wcache-num-pages: The total number of pages in the write buffer.
>>> >>>
>>> >>> Use the --help on the broker with the linearstore loaded to see more
>>> >>> details on this. I hope that helps a little.
>>> >>>
>>> >>> Kim van der Riet
>>> >>>
>>> >>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
>>> >>>> Any help in understanding why/when the broker throws those errors and stops
>>> >>>> receiving messages would be appreciated.
>>> >>>>
>>> >>>> Not sure if any kernel tuning or broker tuning needs to be done to
>>> >>>> solve this issue.
>>> >>>>
>>> >>>> Thanks in advance,
>>> >>>> Ram
>>> >>>>
>>> >>>> On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>>> >>>>
>>> >>>>> Also, from this log message (store level), it seems to be waiting for AIO to
>>> >>>>> complete.
>>> >>>>>
>>> >>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal
>>> >>>>> name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT;
>>> >>>>> wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF
>>> >>>>> ps=[-------------------------A------]
>>> >>>>>
>>> >>>>> page_state ps=[-------------------------A------] where A is AIO_PENDING
>>> >>>>> aer=1 _aio_evt_rem; ///< Remaining AIO events
>>> >>>>>
>>> >>>>> When there are one or more pending AIOs, does the broker close the connection?
>>> >>>>> Is there any tuning that can be done to resolve this?
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Ram
>>> >>>>>
>>> >>>>> On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> I was checking the code and I see these lines for that AIO timeout:
>>> >>>>>>
>>> >>>>>> case qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
>>> >>>>>>     if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
>>> >>>>>>         THROW_STORE_EXCEPTION("Timeout waiting for AIO in MessageStoreImpl::recoverMessages()");
>>> >>>>>>     ::usleep(AIO_SLEEP_TIME_US);
>>> >>>>>>     break;
>>> >>>>>>
>>> >>>>>> And these are the defaults:
>>> >>>>>>
>>> >>>>>> #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
>>> >>>>>>
>>> >>>>>> #define AIO_SLEEP_TIME_US 10 // 0.01 ms
>>> >>>>>>
>>> >>>>>> RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page is
>>> >>>>>> waiting for AIO.
>>> >>>>>>
>>> >>>>>> So did a page get blocked, and is the store waiting for a page to become available?
>>> >>>>>>
>>> >>>>>> Ram
>>> >>>>>>
>>> >>>>>> On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>>> Actually we have upgraded from qpid-cpp 0.28 to 1.35 and after that we
>>> >>>>>>> see this message
>>> >>>>>>>
>>> >>>>>>> 2018-10-27 18:58:25 [Store] warning Linear Store: Journal
>>> >>>>>>> "<journal-name>": Bad record alignment found at fid=0x4605b offs=0x107680
>>> >>>>>>> (likely journal overwrite boundary); 19 filler record(s) required.
>>> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal
>>> >>>>>>> "<journal-name>": Recover phase write: Wrote filler record: fid=0x4605b
>>> >>>>>>> offs=0x107680
>>> >>>>>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal
>>> >>>>>>> "<journal-name>": Recover phase write: Wr... few more Recover phase logs
>>> >>>>>>>
>>> >>>>>>> It worked fine for a day and started throwing this message:
>>> >>>>>>>
>>> >>>>>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<name>":
>>> >>>>>>> get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8
>>> >>>>>>> po=0 aer=1 edac=TFFF ps=[-------------------------A------]
>>> >>>>>>> 2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot deliver to
>>> >>>>>>> queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() failed:
>>> >>>>>>> jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT:
>>> >>>>>>> Timeout waiting for AIOs to complete.
>>> >>>>>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
>>> >>>>>>> 2018-10-28 12:27:01 [Broker] error Connection exception: framing-error:
>>> >>>>>>> Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202
>>> >>>>>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for
>>> >>>>>>> AIOs to complete.
>>> >>>>>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
>>> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection
>>> >>>>>>> qpid.server-ip:5672-client-ip:44457 closed by error: Queue <queue-name>:
>>> >>>>>>> MessageStoreImpl::store() failed: jexception 0x0202
>>> >>>>>>> jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for
>>> >>>>>>> AIOs to complete.
>>> >>>>>>> (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)
>>> >>>>>>> 2018-10-28 12:27:01 [Protocol] error Connection
>>> >>>>>>> qpid.server-ip:5672-client-ip:44457 closed by error: illegal-argument:
>>> >>>>>>> Value for replyText is too large(320)
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>> Ram
>>> >>>>>>>
>>> >>>>>>> On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <rammohanga...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>>> No, local disk.
>>> >>>>>>>>
>>> >>>>>>>> On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <g...@redhat.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> On 05/11/18 22:58, rammohan ganapavarapu wrote:
>>> >>>>>>>>>> Gordon,
>>> >>>>>>>>>>
>>> >>>>>>>>>> We are using Java client version 0.28 and qpid-cpp version 1.35
>>> >>>>>>>>>> (qpid-cpp-server-1.35.0-1.el7.x86_64). I don't know in what scenario it's
>>> >>>>>>>>>> happening, but after I restart the broker and we wait a few days, it's
>>> >>>>>>>>>> happening again.
>>> >>>>>>>>>> From the above logs, do you have any pointers on what to check?
>>> >>>>>>>>>
>>> >>>>>>>>> Are you using NFS?
>>> >>>>>>>>>
>>> >>>>>>>>> ---------------------------------------------------------------------
>>> >>>>>>>>> To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
>>> >>>>>>>>> For additional commands, e-mail: users-h...@qpid.apache.org
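
For anyone reading the wmgr_status line and the two defines quoted above, here is a minimal C++ sketch that works through both. It is not code from qpid-cpp: the helper names recovery_timeout_us and pending_aio_pages are made up for illustration, the two constants are copied from the quoted source, and the reading of the ps=[...] map (one character per write-cache page, with A meaning AIO pending) is taken from the messages in this thread rather than checked against the linearstore source.

```cpp
#include <cassert>   // only needed for assert-based checks you add yourself
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Values copied from the defines quoted in the thread.
constexpr std::uint64_t MAX_AIO_SLEEPS = 100000;
constexpr std::uint64_t AIO_SLEEP_TIME_US = 10;

// Hypothetical helper: total time the quoted recovery loop sleeps
// before it gives up and throws.
constexpr std::uint64_t recovery_timeout_us() {
    return MAX_AIO_SLEEPS * AIO_SLEEP_TIME_US;
}

// 100000 sleeps of 10 us each is 1,000,000 us, matching the "~1 sec" comment.
static_assert(recovery_timeout_us() == 1000000, "expected about one second");

// Hypothetical helper: given log text containing "ps=[...]", return the
// zero-based positions of pages marked 'A' (assumed to mean AIO pending).
std::vector<std::size_t> pending_aio_pages(const std::string& status) {
    std::vector<std::size_t> pending;
    const std::string key = "ps=[";
    const std::size_t start = status.find(key);
    if (start == std::string::npos) return pending;  // no page state map found
    const std::size_t first = start + key.size();
    const std::size_t end = status.find(']', first);
    if (end == std::string::npos) return pending;    // map is not terminated
    for (std::size_t i = first; i < end; ++i) {
        if (status[i] == 'A') pending.push_back(i - first);
    }
    return pending;
}
```

Passing the wmgr_status text from the log to pending_aio_pages() gives the positions of the stuck pages, which can then be compared with the pi= field by eye. The size of the returned vector relative to wcache-num-pages is one rough way to see how much of the write buffer is blocked when the timeout fires.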