Kim,

I see these two settings in the code. Can they be made configurable?

#define MAX_AIO_SLEEPS 100000 // tot: ~1 sec

#define AIO_SLEEP_TIME_US  10 // 0.01 ms
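
To make the question concrete, here is a minimal sketch of what a runtime-configurable version could look like. It is not the linearstore code: AioWaitConfig, total_timeout_us() and wait_for_aio() are hypothetical names, and the polling callback stands in for the real page-state check. The defaults are the two values above, and their product gives the nominal 1 second budget (100000 x 10 us). In practice each sleep probably overshoots 10 us because of scheduler granularity, so the real wait is likely to be longer.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical runtime replacement for the two compile-time macros.
// The defaults mirror the #define values quoted above.
struct AioWaitConfig {
    std::uint32_t max_sleeps = 100000;  // MAX_AIO_SLEEPS
    std::uint32_t sleep_time_us = 10;   // AIO_SLEEP_TIME_US
};

// Nominal wait budget in microseconds: number of sleeps times sleep length.
inline std::uint64_t total_timeout_us(const AioWaitConfig& cfg) {
    return static_cast<std::uint64_t>(cfg.max_sleeps) * cfg.sleep_time_us;
}

// Polls `page_is_free` until it returns true or the sleep budget is spent.
// Returns the number of sleeps taken and throws on timeout.
// `page_is_free` is a stand-in for the real AIO page-state check.
inline std::uint32_t wait_for_aio(const AioWaitConfig& cfg,
                                  const std::function<bool()>& page_is_free) {
    assert(cfg.sleep_time_us > 0);  // a zero sleep would become a busy spin
    std::uint32_t sleeps = 0;
    while (!page_is_free()) {
        if (++sleeps > cfg.max_sleeps)
            throw std::runtime_error("Timeout waiting for AIO");
        std::this_thread::sleep_for(
            std::chrono::microseconds(cfg.sleep_time_us));
    }
    return sleeps;
}
```

With something like this, the two values could be read from broker options at startup instead of being fixed at build time. Whether a longer timeout would actually help with a stuck write is a separate question.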


Ram

On Wed, Nov 7, 2018 at 7:04 AM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:

> Thank you Kim, I will try your suggestions.
>
> On Wed, Nov 7, 2018, 6:58 AM Kim van der Riet <kvand...@redhat.com wrote:
>
>> This error is a linearstore issue. It looks as though there is a single
>> write operation to disk that has become stuck, and is holding up all
>> further write operations. This happens because there is a fixed circular
>> pool of memory pages used for the AIO operations to disk, and when one
>> of these is "busy" (indicated by the letter A in the page state map),
>> write operations cannot continue until it is cleared. If it does not
>> clear within a certain time, then an exception is thrown, which usually
>> results in the broker closing the connection.
>>
>> The events leading up to a "stuck" write operation are complex and
>> sometimes difficult to reproduce. If you have a reproducer, then I would
>> be interested to see it! Even so, the ability to reproduce on another
>> machine is hard as it depends on such things as disk write speed, the
>> disk controller characteristics, the number of threads in the thread
>> pool (i.e. CPU type), memory and other hardware-related things.
>>
>> There are two linearstore parameters that you can try playing with to
>> see if you can change the behavior of the store:
>>
>> wcache-page-size: This sets the size of each page in the write buffer.
>> Larger page size is good for large messages, a smaller size will help if
>> you have small messages.
>>
>> wcache-num-pages: The total number of pages in the write buffer.
>>
>> Use the --help on the broker with the linearstore loaded to see more
>> details on this. I hope that helps a little.
>>
>> Kim van der Riet
>>
>> On 11/6/18 2:12 PM, rammohan ganapavarapu wrote:
>> > Any help in understanding why/when the broker throws these errors and
>> > stops receiving messages would be appreciated.
>> >
>> > Not sure if any kernel tuning or broker tuning needs to be done to
>> > solve this issue.
>> >
>> > Thanks in advance,
>> > Ram
>> >
>> > On Tue, Nov 6, 2018 at 8:35 AM rammohan ganapavarapu <
>> > rammohanga...@gmail.com> wrote:
>> >
>> >> Also, from this log message (store level), it seems to be waiting for
>> >> AIO to complete.
>> >>
>> >> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<journal name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF ps=[-------------------------A------]
>> >>
>> >> page_state ps=[-------------------------A------]  where A is AIO_PENDING
>> >> aer=1 _aio_evt_rem;          ///< Remaining AIO events
>> >>
>> >> When one or more AIO operations are pending, does the broker close the
>> >> connection? Is there any tuning that can be done to resolve this?
>> >>
>> >> Thanks,
>> >> Ram
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Nov 5, 2018 at 8:55 PM rammohan ganapavarapu <
>> >> rammohanga...@gmail.com> wrote:
>> >>
>> >>> I was checking the code and I see these lines for that AIO timeout.
>> >>>
>> >>>                case qpid::linearstore::journal::RHM_IORES_PAGE_AIOWAIT:
>> >>>                  if (++aio_sleep_cnt > MAX_AIO_SLEEPS)
>> >>>                      THROW_STORE_EXCEPTION("Timeout waiting for AIO in MessageStoreImpl::recoverMessages()");
>> >>>                  ::usleep(AIO_SLEEP_TIME_US);
>> >>>                  break;
>> >>>
>> >>> And these are the defaults
>> >>>
>> >>> #define MAX_AIO_SLEEPS 100000 // tot: ~1 sec
>> >>>
>> >>> #define AIO_SLEEP_TIME_US  10 // 0.01 ms
>> >>>
>> >>>
>> >>>    RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page is waiting for AIO.
>> >>>
>> >>>
>> >>>
>> >>> So did the page get blocked, and is it waiting for page availability?
>> >>>
>> >>>
>> >>> Ram
>> >>>
>> >>> On Mon, Nov 5, 2018 at 8:00 PM rammohan ganapavarapu <
>> >>> rammohanga...@gmail.com> wrote:
>> >>>
>> >>>> Actually, we upgraded from qpid-cpp 0.28 to 1.35, and after that we
>> >>>> see this message:
>> >>>>
>> >>>> 2018-10-27 18:58:25 [Store] warning Linear Store: Journal "<journal-name>": Bad record alignment found at fid=0x4605b offs=0x107680 (likely journal overwrite boundary); 19 filler record(s) required.
>> >>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal "<journal-name>": Recover phase write: Wrote filler record: fid=0x4605b offs=0x107680
>> >>>> 2018-10-27 18:58:25 [Store] notice Linear Store: Journal "<journal-name>": Recover phase write: Wr... few more Recover phase logs
>> >>>>
>> >>>> It worked fine for a day and started throwing this message:
>> >>>>
>> >>>> 2018-10-28 12:27:01 [Store] critical Linear Store: Journal "<name>": get_events() returned JERR_JCNTL_AIOCMPLWAIT; wmgr_status: wmgr: pi=25 pc=8 po=0 aer=1 edac=TFFF ps=[-------------------------A------]
>> >>>> 2018-10-28 12:27:01 [Broker] warning Exchange <name> cannot deliver to queue <queue-name>: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
>> >>>> 2018-10-28 12:27:01 [Broker] error Connection exception: framing-error: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)
>> >>>> 2018-10-28 12:27:01 [Protocol] error Connection qpid.server-ip:5672-client-ip:44457 closed by error: Queue <queue-name>: MessageStoreImpl::store() failed: jexception 0x0202 jcntl::handle_aio_wait() threw JERR_JCNTL_AIOCMPLWAIT: Timeout waiting for AIOs to complete. (/home/rganapavarapu/rpmbuild/BUILD/qpid-cpp-1.35.0/src/qpid/linearstore/MessageStoreImpl.cpp:1211)(501)
>> >>>> 2018-10-28 12:27:01 [Protocol] error Connection qpid.server-ip:5672-client-ip:44457 closed by error: illegal-argument: Value for replyText is too large(320)
>> >>>>
>> >>>> Thanks,
>> >>>> Ram
>> >>>>
>> >>>>
>> >>>> On Mon, Nov 5, 2018 at 3:34 PM rammohan ganapavarapu <
>> >>>> rammohanga...@gmail.com> wrote:
>> >>>>
>> >>>>> No, local disk.
>> >>>>>
>> >>>>> On Mon, Nov 5, 2018 at 3:26 PM Gordon Sim <g...@redhat.com> wrote:
>> >>>>>
>> >>>>>> On 05/11/18 22:58, rammohan ganapavarapu wrote:
>> >>>>>>> Gordon,
>> >>>>>>>
>> >>>>>>> We are using Java client version 0.28 and qpid-cpp version 1.35
>> >>>>>>> (qpid-cpp-server-1.35.0-1.el7.x86_64). I don't know in what scenario
>> >>>>>>> it happens, but after I restart the broker it happens again within a
>> >>>>>>> few days. From the above logs, do you have any pointers on what to
>> >>>>>>> check?
>> >>>>>>
>> >>>>>> Are you using NFS?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
>> >>>>>> For additional commands, e-mail: users-h...@qpid.apache.org
>> >>>>>>
>> >>>>>>