2015-10-16 10:02 GMT+02:00 Kendall Green <kendallar...@gmail.com>:
> Thank you for the clarifications in this and other rsyslog message threads.
>
> After realizing what causes rsyslog to attempt the random-access write
> operation that fails, the issue became reproducible, and I was able to
> isolate the relevant debug output, transcribed below from separate tests.
> The problem persists regardless of configuration: with a queue and a
> regular output file, without queues, and also with a dynaFile output.
>
> When rsyslog starts and logger traffic is written through the omfile
> output (regular file or dynaFile) to the dfs-fuse mount, there are no
> issues until another command or program accesses the same path, which
> seems to interrupt some sort of lock on the filesystem.
>
> If rsyslog is started and I then simply list the directory contents at
> the same hierarchy level as the omfile output, e.g. run
> "ls /dfs-fuse-data/test" -- even though there are no files there yet --
> every subsequent write generates the error. The failing check is part of
> the dfs-fuse library's validation that access is sequential.
>
> 1234.567891012:dfs-test-file queue:Reg/w0: omfile: write to stream, pData->pStrm 0x7f1b1234567a0, lenBuf 61, strt data Oct 9 12:00:00 testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 1/2
> 1234.567891024:dfs-test-file queue:Reg/w0: strm 0x7f1b1234567a0: file 44(dfs-appliance_test.log) flush, buflen 61
> 1234.567891048:dfs-test-file queue:Reg/w0: strmPhysWrite, stream 0x0a1b1234567a0, len 61
> 1234.567892537:dfs-test-file queue:Reg/w0: log file (44) write error 95: Operation not supported
> 1234.567892559:dfs-test-file queue:Reg/w0: Action 64 transitioned to state: rdy
> 1234.567892570:dfs-test-file queue:Reg/w0: actionCommit, in retry loop, iRet 0
> 1234.567892581:dfs-test-file queue:Reg/w0: regular consumer finished, iret=0, szlog 0 sz phys 1
> 1234.567892592:dfs-test-file queue:Reg/w0: DeleteProcessedBatch: we deleted 1 objects and enqueued 0 objects
> 1234.567892603:dfs-test-file queue:Reg/w0: doDeleteBatch: delete batch from store, new sizes: log 0, phys 0
> 1234.567892606:dfs-test-file queue:Reg/w0: regular consumer finished, iret=4, szlog 0 sz phys 0
> 1234.567892627:dfs-test-file queue:Reg/w0: dfs-test-file queue:Reg/w0: worker IDLE, waiting for work.
>
> The next log parsed after this write error is the message that then
> appears in /var/log/messages:
>
> <11>Oct 9 12:00:00 fuse_dfs: ERROR /src/main/native/fuse-dfs/fuse_impls_write.c:57 User trying to random access write to a file 61 != 0 for /dfs-fuse-data/test/dfs-appliance_test.log
>
> If a lot of logs attempt to write afterward then, depending on the rate,
> there is one such error for each message that was attempted, failed, and
> was deleted instead of written.
>
> After the failure, a rsyslog service restart or a HUP will flush the logs
> that accumulated in the queue before the point of the write failure, but
> any log arriving after the error is generated is never written and is
> deleted, as described by the debug output. When a high volume of logs is
> sent in, writing appears to work without issue until the error is
> introduced, e.g. by running wc on the file to check the character count
> -- which matches the exact offset in the dfsTell validation of the error
> message (61 in the example above).
>
> As you can imagine, it is a problem to have to avoid running any command
> against the directory that rsyslog's omfile is writing to, for fear of
> breaking the stream process.
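> To recap, the reproduction boils down to this (a sketch; the logger tag
> and message text are stand-ins for my actual test traffic):
>
>     # rsyslog already running, omfile pointed at the dfs-fuse mount
>     logger -t testusr "DFS_FUSE_DRIVER_TEST 1/2"  # first write succeeds
>     ls /dfs-fuse-data/test                        # merely listing the path...
>     logger -t testusr "DFS_FUSE_DRIVER_TEST 2/2"  # ...makes this write fail
>     grep fuse_dfs /var/log/messages               # random-access error appears
>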
> DynaFile was tested to see whether the entire path is affected, or only
> the top level where files are actively being written. Unfortunately, any
> directory rsyslog has written to since startup appears to retain the
> problem, even though the template separates the tree structure by date
> and time as a workaround.
>
> 5678.910111111:dfs-test-file main Q:Reg/w0: omfile: file to log to: /dfs-fuse-data/test/2015/10/10/12_00_dfs-appliance_test.log
> 5678.910111213:dfs-test-file queue:Reg/w0: omfile: write to stream, pData->pStrm 0x7f2c7654321b1, lenBuf 62, strt data Oct 10 12:00:00 testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 2/2
> 5678.910111314:dfs-test-file main Q:Reg/w0: strm 0x7f2c7654321b1: file 44(12_00_dfs-appliance_test.log) flush, buflen 62
> 5678.910111415:dfs-test-file main Q:Reg/w0: strmPhysWrite, stream 0x7f2c7654321b1, len 62
> 5678.910111516:dfs-test-file main Q:Reg/w0: log file (44) write error 95: Operation not supported
> 5678.910111617:dfs-test-file main Q:Reg/w0: Action 60 transitioned to state: rdy
> 5678.910111718:dfs-test-file main Q:Reg/w0: actionCommit, in retry loop, iRet 0
> 5678.910111819:dfs-test-file main Q:Reg/w0: actionCommitAll: action 62, state 1, nbr to commit 0 isTransactional 1
> 5678.910111920:dfs-test-file main Q:Reg/w0: doTransaction: have commitTransaction IF, using that, pWrkrInfo, 0x7f2c775432160
> 5678.910112021:dfs-test-file main Q:Reg/w0: entering actionCallCommitTransaction(), state: itx, actionNbr, 62, nMsgs 1
> 5678.910112122:dfs-test-file main Q:Reg/w0: omfile: write to stream, pData->pStrm 0x7f2c7654345c1, len 62, strt data Oct 10 12:00:00 testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 2/2
> 5678.910112223:dfs-test-file main Q:Reg/w0: strm 0x7f2c7654345c1: file 42(messages) flush, buflen 62
> 5678.910112324:dfs-test-file main Q:Reg/w0: strmPhysWrite, stream 0x7f2c7654345c1, len 62
> 5678.910112326:dfs-test-file main Q:Reg/w0: strm 0x7f2c7654345c1: file 42 write wrote 62 bytes
> 5678.910112328:dfs-test-file main Q:Reg/w0: Action 62 transition to state: rdy
> 5678.910112625:dfs-test-file main Q:Reg/w0: Action 62 transition to state: itx
> 5678.910112626:dfs-test-file main Q:Reg/w0: Action 62 transition to state: rdy
> 5678.910112628:dfs-test-file main Q:Reg/w0: actionCommit, in retry loop, iRet 0
> 5678.910112629:dfs-test-file main Q:Reg/w0: actionCommitAll: action 66, state 1, nbr to commit 0 isTransactional 1
> 5678.910112654:dfs-test-file main Q:Reg/w0: processBATCH: batch of 1 elements has been processed.
> 5678.910112664:dfs-test-file main Q:Reg/w0: regular consumer finished, iret=0, szlog 1 sz phys 2
> 5678.910112667:dfs-test-file main Q:Reg/w0: DeleteProcessedBatch: we deleted 1 objects and enqueued 0 objects
> 5678.910112702:dfs-test-file main Q:Reg/w0: doDeleteBatch: delete batch from store, new sizes: log 1, phys 1
> 5678.910112704:dfs-test-file main Q:Reg/w0: processBATCH: batch of 1 elements must be processed
> 5678.910112805:dfs-test-file main Q:Reg/w0: processBATCH: next msg 0:<11>Oct 10 12:00:00 fuse_dfs: ERROR /src/main/native/fuse-dfs/fuse_impls_write.c:57 User trying to random access write to a file 62 != 0 for /dfs-fuse-data/test/2015/10/10/12_00_dfs-appliance_test.log
>
> The error would instead read "write to a file 123 != 61" if the first log
> (61 chars) had gone in successfully before the directory was listed and
> the second log (62 chars) was sent.
>
> I hope these debug traces and the details for reproducing the issue help
> identify the root cause. Although the numbers at the start of each line
> (####.#########) were made up while transcribing, the rest of the debug
> content is the complete chain of events.
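> For reference, the dynaFile setup in this second test was shaped like the
> following (a sketch reconstructed from the path in the debug output; the
> template name and drop-in file name are illustrative, not my literal
> config):
>
>     cat > /etc/rsyslog.d/dfs-test.conf <<'EOF'
>     # date/time-partitioned tree under the dfs-fuse mount
>     template(name="dfsDynaPath" type="string"
>              string="/dfs-fuse-data/test/%$YEAR%/%$MONTH%/%$DAY%/%$HOUR%_00_dfs-appliance_test.log")
>     action(type="omfile" dynaFile="dfsDynaPath")
>     EOF
>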
sounds strongly like a bug in the file system driver to me. As I said, we
simply do open - write - close. Or am I overlooking something here?

Rainer

> Thanks for all the help!
>
> On Tue, Oct 6, 2015 at 4:04 AM, Rainer Gerhards <rgerha...@hq.adiscon.com> wrote:
>
>> 2015-10-06 8:26 GMT+02:00 David Lang <da...@lang.hm>:
>> > On Tue, 6 Oct 2015, Rainer Gerhards wrote:
>> >
>> >> 2015-10-06 7:27 GMT+02:00 David Lang <da...@lang.hm>:
>> >>>
>> >>> On Tue, 6 Oct 2015, Rainer Gerhards wrote:
>> >>>
>> >>>> 2015-10-06 0:09 GMT+02:00 Kendall Green <kendallar...@gmail.com>:
>> >>>>>
>> >>>>> I've set asyncWriting="off", and the only difference is that with
>> >>>>> the 1000-message test all messages were delivered on HUP and/or
>> >>>>> service restart, but apparently not on flushInterval or when the
>> >>>>> buffer filled. I'm currently running another test with 100k
>> >>>>> messages, iterating at about 10 logs/sec, and watching the output
>> >>>>> for the number of lines. The output file is created on the dfs-fuse
>> >>>>> mount upon receipt of the first log, but 0 lines are written.
>> >>>>>
>> >>>>> This current test will hopefully show how many logs get processed
>> >>>>> without intervention over the next ~3 hours; if still no logs
>> >>>>> appear, I will see how many are written upon a HUP signal.
>> >>>>>
>> >>>>> I'm unsure how asyncWriting="off" impacts the rest of the config,
>> >>>>> and whether the current config has disabled compression or not,
>> >>>>> with no zip setting.
>> >>>>>
>> >>>>> This note is from the omfile asyncWriting doc:
>> >>>>> *Note that in order to enable FlushInterval, AsyncWriting must be
>> >>>>> set to "on". Otherwise, the flush interval will be ignored. Also
>> >>>>> note that when FlushOnTXEnd is "on" but AsyncWriting is off, output
>> >>>>> will only be written when the buffer is full. This may take several
>> >>>>> hours, or even require a rsyslog shutdown. However, a buffer flush
>> >>>>> can be forced in that case by sending rsyslogd a HUP signal.*
>> >>>>>
>> >>>>> Please help clarify how these parameters correspond and what
>> >>>>> settings can achieve expedited log transport, avoiding long delays
>> >>>>> waiting for buffering or flush intervals. This goal is why I have
>> >>>>> chosen low settings for IOBufferSize and flushInterval.
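>> >>>>> For reference, the combination I read the doc note as requiring
>> >>>>> for flushInterval to take effect would look roughly like this (a
>> >>>>> sketch; the drop-in file name and values are illustrative of the
>> >>>>> "low settings" above):
>> >>>>>
>> >>>>>     cat > /etc/rsyslog.d/dfs-omfile.conf <<'EOF'
>> >>>>>     # asyncWriting="on" is what makes flushInterval meaningful;
>> >>>>>     # a small ioBufferSize means the buffer also fills quickly
>> >>>>>     action(type="omfile" file="/dfs-fuse-data/test/dfs-appliance_test.log"
>> >>>>>            asyncWriting="on" flushInterval="1"
>> >>>>>            ioBufferSize="4k" flushOnTXEnd="on")
>> >>>>>     EOF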
>> >>>>
>> >>>> Without async writing, everything is written synchronously as it
>> >>>> arrives (more precisely, at commit of the current batch), i.e. "as
>> >>>> quickly as possible". Thus, these params do not exist because they
>> >>>> have no meaning. Note that with a very large batch and a small
>> >>>> buffer, partial records may be written.
>> >>>
>> >>> although without the checkpoint parameter the writes are to the OS,
>> >>> and they get flushed to disk at a later point.
>> >>
>> >> you mean "sync", right?
>> >
>> > I may be misunderstanding, but I thought the checkpoint (or checkpoint
>> > interval) parameter controlled how frequently you did the sync (so
>> > turning sync on didn't make it happen for every message).
>>
>> That checkpoint interval is for queues (which share the same util
>> object), not for omfile. There, what you say is right. For omfile (if
>> not set to async writing), the sync happens at the end of each batch,
>> which usually means after each write, because usually a full batch goes
>> into the write buffer. If the batch has more data than the write buffer
>> fits, all buffer writes are done and then the flush.
>>
>> I've also updated the doc to be more precise:
>>
>> https://github.com/rsyslog/rsyslog-doc/commit/b2bb5d93787f9662eb41499c86f881d2bd649dce
>>
>> Rainer

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T
LIKE THAT.