Thank you for the clarifications in this and other rsyslog message threads.

After realizing what causes rsyslog to attempt random read operation which
fails, the issue became reproducible and was able to isolate the debug
output which will transpose for the results from separate tests. The
problem persists regardless how configured, with queue and output regular
file, or without queues, also tested with output dynaFile.

When start rsyslog, logger writes files to omfile output file or dynafile
to dfs-fuse mount, there are no issues until another command or program
access the same path which interrupts some sort lock on the filesystem.

If rsyslog is started, and then simply list the directory contents at the
same hierarchy level as the omfile output, run "ls /dfs-fuse-data/test"
 - even though there's no files yet, it will now generate the error when
logging which is part of the dfs-fuse library that does validation is using
sequential access.

1234.567891012:dfs-test-file queue:Reg/w0:  omfile: write to stream,
pData->pStrm 0x7f1b1234567a0, lenBuf 61, strt data Oct    9  12:00:00
 testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 1/2

1234.567891024:dfs-test-file queue:Reg/w0:  strm 0x7f1b1234567a0: file 44(
dfs-appliance_test.log) flush, buflen 61
1234.567891048:dfs-test-file queue:Reg/w0:  strmPhysWrite, stream
0x0a1b1234567a0,  len 61
1234.567892537:dfs-test-file queue:Reg/w0:  log file (44) write error 95:
Operation not supported
1234.567892559:dfs-test-file queue:Reg/w0:  Action 64 transitioned to
state: rdy
1234.567892570:dfs-test-file queue:Reg/w0:  actionCommit,  in retry loop,
 iRet 0
1234.567892581:dfs-test-file queue:Reg/w0:  regular consumer finished,
 iret=0, szlog 0 sz phys 1
1234.567892592:dfs-test-file queue:Reg/w0:  DeleteProcessedBatch: we
deleted 1 objects and enqueued 0 objects
1234.567892603:dfs-test-file queue:Reg/w0:  doDeleteBatch:  delete batch
from store, new sizes: log 0, phys 0
1234.567892606:dfs-test-file queue:Reg/w0:  regular consumer finished,
 iret=4,  szlog 0 sz phys 0
1234.567892627:dfs-test-file queue:Reg/w0:  dfs-test-file queue:Reg/w0:
worker IDLE, waiting for work.

The next log to parse after this write error, is the message which appears
in /var/log/messages:
<11>Oct    9  12:00:00 fuse_dfs: ERROR
/src/main/native/fuse-dfs/fuse_impls_write.c:57 User trying to random
access write to a file 61 != 0 for /dfs-fuse-data/test/dfs-
appliance_test.log

If there were a lot of logs attempting to write afterward, depending on the
rate, there would be an error for each message that was attempted to write,
and failed, then is deleted instead.

After it has failed, a service rsyslog restart, or a HUP the system will
flush logs that congested into the queue before the point of write failing,
but any log after the error is generated is not written and deleted as
described by debug output. When there is a high volume of logs sent in, it
appears to write without issue until introducing the error by wc the file
and see the char len, which matches the exact point of the dfsTell
validation in the error message *61 in example above.

As you can imagine, it is a problem to try and avoid running any command on
directory rsyslog omfile is outputting to without breaking the stream
process.
DynaFile was tested, to see if the entire path is affected or only the top
level which files are actively being written. Unfortunately, any directory
rsyslog wrote to from the point of startup appears to have persistence even
though the template is separating the tree structure based on date and time
as workaround.

5678.910111111:dfs-test-file main Q:Reg/w0:  omfile: file to log to: /
dfs-fuse-data/test/2015/10/10/12_00_dfs-appliance_test.log
5678.910111213:dfs-test-file queue:Reg/w0:  omfile: write to stream,
pData->pStrm 0x7f2c7654321b1, lenBuf 62, strt data Oct    10  12:00:00
 testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 2/2

5678.910111314:dfs-test-file main Q:Reg/w0:  strm 0x7f2c7654321b1: file
44(12_00_dfs-appliance_test.log) flush, buflen 62
5678.910111415:dfs-test-file main Q:Reg/w0:  strmPhysWrite,
stream 0x7f2c7654321b1,  len 62
5678.910111516:dfs-test-file main Q:Reg/w0:  log file (44) write error 95:
Operation not supported
5678.910111617:dfs-test-file main Q:Reg/w0:  Action 60 transitioned to
state: rdy
5678.910111718:dfs-test-file main Q:Reg/w0:  actionCommit,  in retry loop,
 iRet 0
5678.910111819:dfs-test-file main Q:Reg/w0:  actionCommitAll: action 62,
state 1, nbr to commit 0 isTransactional 1
5678.910111920:dfs-test-file main Q:Reg/w0:  doTransaction: have
commitTransaction IF, using that, pWrkrInfo, 0x7f2c775432160
5678.910112021:dfs-test-file main Q:Reg/w0: entering
actionCallCommitTransaction(), state: itx, actionNbr, 62, nMsgs 1
5678.910112122:dfs-test-file main Q:Reg/w0: omfile: write to stream,
pData->pStrm 0x7f2c7654345c1, len 62, strt data  Oct    10  12:00:00
 testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 2/2
5678.910112223:dfs-test-file main Q:Reg/w0:  strm 0x7f2c7654345c1: file
42(messages) flush, buflen 62
5678.910112324:dfs-test-file main Q:Reg/w0:  strmPhysWrite, stream
0x7f2c7654345c1, len 62
5678.910112326:dfs-test-file main Q:Reg/w0:  strm 0x7f2c7654345c1: file 42
write wrote 62 bytes
5678.910112328:dfs-test-file main Q:Reg/w0:  Action 62 transition to state:
rdy
5678.910112625:dfs-test-file main Q:Reg/w0:  Action 62 transition to state:
itx
5678.910112626:dfs-test-file main Q:Reg/w0:  Action 62 transition to state:
rdy
5678.910112628:dfs-test-file main Q:Reg/w0:  actionCommit,  in retry loop,
 iRet 0
5678.910112629:dfs-test-file main Q:Reg/w0:  actionCommitAll:  action 66,
state 1, nbr to commit 0 isTransactional 1
5678.910112654:dfs-test-file main Q:Reg/w0:  processBATCH:  batch of 1
elements has been processed.
5678.910112664:dfs-test-file main Q:Reg/w0:  regular consumer finished,
 iret=0, szlog 1 sz phys 2
5678.910112667:dfs-test-file main Q:Reg/w0:  DeleteProcessedBatch: we
deleted 1 objects and enqueued 0 objects
5678.910112702:dfs-test-file main Q:Reg/w0:  doDeleteBatch:  delete batch
from store, new sizes: log 1, phys 1
5678.910112704:dfs-test-file main Q:Reg/w0:  processBATCH:  batch of 1
elements must be processed
5678.910112805:dfs-test-file main Q:Reg/w0:  processBATCH:  next msg
0:<11>Oct    10  12:00:00  fuse_dfs: ERROR
/src/main/native/fuse-dfs/fuse_impls_write.c:57 User trying to random
access write to a file 62 != 0 for /dfs-fuse-data/test/2015/10/10/12_00_dfs-
appliance_test.log

The error would show 'write to file 123 != 61', if sent the first log (61
chars) were in successfully before ls the directory and sending second
log(62 chars).

I hope these debugs and details for reproducing the issue helps identify root
cause. Although the numbers at the start of line are made up when
transposing, ####.#########, the rest of the debug content is complete
chain of events.

Thanks for all the help!




On Tue, Oct 6, 2015 at 4:04 AM, Rainer Gerhards <rgerha...@hq.adiscon.com>
wrote:

> 2015-10-06 8:26 GMT+02:00 David Lang <da...@lang.hm>:
> > On Tue, 6 Oct 2015, Rainer Gerhards wrote:
> >
> >> 2015-10-06 7:27 GMT+02:00 David Lang <da...@lang.hm>:
> >>>
> >>> On Tue, 6 Oct 2015, Rainer Gerhards wrote:
> >>>
> >>>> 2015-10-06 0:09 GMT+02:00 Kendall Green <kendallar...@gmail.com>:
> >>>>>
> >>>>>
> >>>>> I've set asyncWriting="off" and the only difference is, that with
> 1000
> >>>>> message test, all were delivered on HUP and/or service restart, but
> >>>>> apparently not on flushInterval or buffer full. I'm currently running
> >>>>> another test with 100k messages iterating about 10 logs/sec, and
> >>>>> watching
> >>>>> the output for number of lines. The output file is created on
> dfs-fuse
> >>>>> mount upon receipt of the first log, but 0 lines written.
> >>>>>
> >>>>> This current test will hopefully show how many logs get processed
> >>>>> without
> >>>>> intervention over the next ~3 hours, and if still no logs appear will
> >>>>> see
> >>>>> how many are written upon HUP signal.
> >>>>>
> >>>>> I'm unsure how the asyncWriting off impacts the rest of the config,
> and
> >>>>> if
> >>>>> the current config has disabled compression... or not with no zip.
> >>>>>
> >>>>> This note is from omfile asyncWriting doc:
> >>>>> *Note that in order to enable FlushInterval, AsyncWriting must be set
> >>>>> to
> >>>>> “on”. Otherwise, the flush interval will be ignored. Also note that
> >>>>> when
> >>>>> FlushOnTXEnd is “on” but AsyncWriting is off, output will only be
> >>>>> written
> >>>>> when the buffer is full. This may take several hours, or even
> require a
> >>>>> rsyslog shutdown. However, a buffer flush can be forced in that case
> by
> >>>>> sending rsyslogd a HUP signal.*
> >>>>>
> >>>>> Please help clarify how these parameters correspond and what the
> >>>>> settings
> >>>>> can achieve expedited log transport, avoiding long delays waiting for
> >>>>> buffering or flush intervals. This goal is why have chosen low
> settings
> >>>>> for
> >>>>> IOBufferSize and flushInterval.
> >>>>
> >>>>
> >>>>
> >>>> Without async writing, everything is written sync on when it arrives
> >>>> (more precisely at commit of the current batch) aka "as quickly as
> >>>> possible". Thus, these params do not exist because they have no
> >>>> meaning. Note that with a very large batch and a small buffer, partial
> >>>> records may be written.
> >>>
> >>>
> >>>
> >>> although without the checkpoint parameter the writes are to the OS and
> >>> they
> >>> get flushed to disk at a later point.
> >>
> >>
> >> you mean "sync", right?
> >
> >
> > I may be misunderstanding, but i thought the checkpoint (or checkpoint
> > interval) parameter controleld how frequently you did the sync (so
> turning
> > sync on didn't make it per every message)
>
> That checkopoint interval is for queues (which share the same util
> object), not for omfile. There, what you say is right. For omfile (if
> not set to async writing), the sync happens at the end of each batch,
> which usually means after each write because usually a full batch goes
> into the write buffer. If the batch has more data than the write
> buffer fits, all buffer writes are done and then the flush.
>
> I've also updated the doc to be more precise:
>
>
> https://github.com/rsyslog/rsyslog-doc/commit/b2bb5d93787f9662eb41499c86f881d2bd649dce
>
> Rainer
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> DON'T LIKE THAT.
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to