2015-10-16 10:02 GMT+02:00 Kendall Green <kendallar...@gmail.com>:
> Thank you for the clarifications in this and other rsyslog message threads.
>
> After realizing what causes rsyslog to attempt the random-access write that
> fails, the issue became reproducible, and I was able to isolate the debug
> output, transcribed below, from separate tests. The problem persists
> regardless of configuration: with a queue and a regular output file, without
> queues, and also with a dynaFile output.
>
> When rsyslog starts and logger messages are written through omfile (regular
> file or dynaFile) to the dfs-fuse mount, there are no issues until another
> command or program accesses the same path, which appears to break whatever
> lock or write position the filesystem maintains.
>
> If rsyslog is started and you then simply list the directory contents at the
> same hierarchy level as the omfile output, e.g. run "ls /dfs-fuse-data/test"
> (even though there are no files yet), every subsequent log write generates
> the error: the dfs-fuse library validates that writes use sequential access.
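>
> To make the trigger sequence explicit (a sketch; the logger invocation is
> approximate, the path is the one from my tests):
>
>     # rsyslog already running with omfile pointed at the dfs-fuse mount
>     ls /dfs-fuse-data/test        # directory access; no files exist yet
>
>     # any message logged after the ls now fails with the error below
>     logger -t testusr -- "-DFS_FUSE_DRIVER_TEST 1/2"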
>
> 1234.567891012:dfs-test-file queue:Reg/w0:  omfile: write to stream,
> pData->pStrm 0x7f1b1234567a0, lenBuf 61, strt data Oct    9  12:00:00
>  testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 1/2
>
> 1234.567891024:dfs-test-file queue:Reg/w0:  strm 0x7f1b1234567a0: file 44(
> dfs-appliance_test.log) flush, buflen 61
> 1234.567891048:dfs-test-file queue:Reg/w0:  strmPhysWrite, stream
> 0x7f1b1234567a0,  len 61
> 1234.567892537:dfs-test-file queue:Reg/w0:  log file (44) write error 95:
> Operation not supported
> 1234.567892559:dfs-test-file queue:Reg/w0:  Action 64 transitioned to
> state: rdy
> 1234.567892570:dfs-test-file queue:Reg/w0:  actionCommit,  in retry loop,
>  iRet 0
> 1234.567892581:dfs-test-file queue:Reg/w0:  regular consumer finished,
>  iret=0, szlog 0 sz phys 1
> 1234.567892592:dfs-test-file queue:Reg/w0:  DeleteProcessedBatch: we
> deleted 1 objects and enqueued 0 objects
> 1234.567892603:dfs-test-file queue:Reg/w0:  doDeleteBatch:  delete batch
> from store, new sizes: log 0, phys 0
> 1234.567892606:dfs-test-file queue:Reg/w0:  regular consumer finished,
>  iret=4,  szlog 0 sz phys 0
> 1234.567892627:dfs-test-file queue:Reg/w0:  dfs-test-file queue:Reg/w0:
> worker IDLE, waiting for work.
>
> The next log parsed after this write error is the message which then appears
> in /var/log/messages:
> <11>Oct    9  12:00:00 fuse_dfs: ERROR
> /src/main/native/fuse-dfs/fuse_impls_write.c:57 User trying to random
> access write to a file 61 != 0 for /dfs-fuse-data/test/dfs-
> appliance_test.log
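>
> Presumably the validation at that source line compares the write position
> against the file's current HDFS offset, roughly like this (a sketch only,
> not the actual Hadoop source; names and argument order are assumptions):
>
>     /* sketch of the sequential-write check implied by the error message;
>      * NOT the real fuse_impls_write.c code; names/order are assumed */
>     tOffset cur = hdfsTell(fs, fh);   /* libhdfs: current write position */
>     if (offset != cur) {              /* only strictly sequential appends */
>         ERROR("User trying to random access write to a file %ld != %ld for %s",
>               (long)offset, (long)cur, path);
>         return -ENOTSUP;              /* surfaces in rsyslog as errno 95 */
>     }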
>
> If many logs attempt to write afterward then, depending on the rate, there
> is one such error for every message whose write is attempted and fails; the
> message is then deleted instead of written.
>
> After the failure, a "service rsyslog restart" or a HUP will flush the logs
> that accumulated in the queue before the point of the write failure, but any
> log arriving after the error is generated is not written and is deleted, as
> described by the debug output. When a high volume of logs is sent in, it
> appears to write without issue until the error is introduced; running wc on
> the file then shows a character count that matches the exact offset in the
> dfsTell validation of the error message (61 in the example above).
>
> As you can imagine, it is a problem to have to avoid running any command at
> all on a directory omfile is writing to just to keep the stream process from
> breaking. dynaFile was tested to see whether the entire path is affected or
> only the top level where files are actively being written. Unfortunately,
> any directory rsyslog has written to since startup appears to stay affected,
> even though as a workaround the template separates the tree structure by
> date and time.
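>
> The template is approximately the following (a simplified sketch of my
> config; the property names mirror the path in the debug output):
>
>     template(name="DfsTree" type="string"
>              string="/dfs-fuse-data/test/%$year%/%$month%/%$day%/%$hour%_%$minute%_dfs-appliance_test.log")
>     action(type="omfile" dynaFile="DfsTree")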
>
> 5678.910111111:dfs-test-file main Q:Reg/w0:  omfile: file to log to: /
> dfs-fuse-data/test/2015/10/10/12_00_dfs-appliance_test.log
> 5678.910111213:dfs-test-file queue:Reg/w0:  omfile: write to stream,
> pData->pStrm 0x7f2c7654321b1, lenBuf 62, strt data Oct    10  12:00:00
>  testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 2/2
>
> 5678.910111314:dfs-test-file main Q:Reg/w0:  strm 0x7f2c7654321b1: file
> 44(12_00_dfs-appliance_test.log) flush, buflen 62
> 5678.910111415:dfs-test-file main Q:Reg/w0:  strmPhysWrite,
> stream 0x7f2c7654321b1,  len 62
> 5678.910111516:dfs-test-file main Q:Reg/w0:  log file (44) write error 95:
> Operation not supported
> 5678.910111617:dfs-test-file main Q:Reg/w0:  Action 60 transitioned to
> state: rdy
> 5678.910111718:dfs-test-file main Q:Reg/w0:  actionCommit,  in retry loop,
>  iRet 0
> 5678.910111819:dfs-test-file main Q:Reg/w0:  actionCommitAll: action 62,
> state 1, nbr to commit 0 isTransactional 1
> 5678.910111920:dfs-test-file main Q:Reg/w0:  doTransaction: have
> commitTransaction IF, using that, pWrkrInfo, 0x7f2c775432160
> 5678.910112021:dfs-test-file main Q:Reg/w0: entering
> actionCallCommitTransaction(), state: itx, actionNbr, 62, nMsgs 1
> 5678.910112122:dfs-test-file main Q:Reg/w0: omfile: write to stream,
> pData->pStrm 0x7f2c7654345c1, len 62, strt data  Oct    10  12:00:00
>  testsrv01 testusr: -DFS_FUSE_DRIVER_TEST 2/2
> 5678.910112223:dfs-test-file main Q:Reg/w0:  strm 0x7f2c7654345c1: file
> 42(messages) flush, buflen 62
> 5678.910112324:dfs-test-file main Q:Reg/w0:  strmPhysWrite, stream
> 0x7f2c7654345c1, len 62
> 5678.910112326:dfs-test-file main Q:Reg/w0:  strm 0x7f2c7654345c1: file 42
> write wrote 62 bytes
> 5678.910112328:dfs-test-file main Q:Reg/w0:  Action 62 transition to state:
> rdy
> 5678.910112625:dfs-test-file main Q:Reg/w0:  Action 62 transition to state:
> itx
> 5678.910112626:dfs-test-file main Q:Reg/w0:  Action 62 transition to state:
> rdy
> 5678.910112628:dfs-test-file main Q:Reg/w0:  actionCommit,  in retry loop,
>  iRet 0
> 5678.910112629:dfs-test-file main Q:Reg/w0:  actionCommitAll:  action 66,
> state 1, nbr to commit 0 isTransactional 1
> 5678.910112654:dfs-test-file main Q:Reg/w0:  processBATCH:  batch of 1
> elements has been processed.
> 5678.910112664:dfs-test-file main Q:Reg/w0:  regular consumer finished,
>  iret=0, szlog 1 sz phys 2
> 5678.910112667:dfs-test-file main Q:Reg/w0:  DeleteProcessedBatch: we
> deleted 1 objects and enqueued 0 objects
> 5678.910112702:dfs-test-file main Q:Reg/w0:  doDeleteBatch:  delete batch
> from store, new sizes: log 1, phys 1
> 5678.910112704:dfs-test-file main Q:Reg/w0:  processBATCH:  batch of 1
> elements must be processed
> 5678.910112805:dfs-test-file main Q:Reg/w0:  processBATCH:  next msg
> 0:<11>Oct    10  12:00:00  fuse_dfs: ERROR
> /src/main/native/fuse-dfs/fuse_impls_write.c:57 User trying to random
> access write to a file 62 != 0 for /dfs-fuse-data/test/2015/10/10/12_00_dfs-
> appliance_test.log
>
> The error would instead show 'write to a file 123 != 61' if the first log
> (61 chars) had gone in successfully before running ls on the directory and
> sending the second log (62 chars); i.e. the attempted write would end at
> offset 123 (61 + 62) while the driver's last validated position is still 61.
>
> I hope these debug traces and reproduction details help identify the root
> cause. Although the timestamps at the start of each line (####.#########)
> are made up from transcribing, the rest of the debug content is the complete
> chain of events.


This sounds strongly like a bug in the file system driver to me. As I said,
we simply do

open - write - close

Or am I overlooking something here?
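
To verify independently of rsyslog: a trivial append probe like the sketch
below (path assumed; this is not rsyslog code) should fail the same way once
"ls" is run on the directory between the two writes:

    /* minimal append probe; a sketch, not rsyslog code */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/dfs-fuse-data/test/append_probe.log";
        const char *msg  = "probe line\n";
        for (int i = 0; i < 2; i++) {
            int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
            if (fd < 0) { perror("open"); return 1; }
            if (write(fd, msg, strlen(msg)) < 0)
                perror("write");  /* expect "Operation not supported" on pass 2 */
            close(fd);
            if (i == 0) {
                puts("now run: ls /dfs-fuse-data/test  (then press Enter)");
                getchar();        /* directory access happens out-of-band */
            }
        }
        return 0;
    }

If that probe reproduces the failure, the driver, not rsyslog, rejects plain
appends after the directory access.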
Rainer
>
> Thanks for all the help!
>
>
>
>
> On Tue, Oct 6, 2015 at 4:04 AM, Rainer Gerhards <rgerha...@hq.adiscon.com>
> wrote:
>
>> 2015-10-06 8:26 GMT+02:00 David Lang <da...@lang.hm>:
>> > On Tue, 6 Oct 2015, Rainer Gerhards wrote:
>> >
>> >> 2015-10-06 7:27 GMT+02:00 David Lang <da...@lang.hm>:
>> >>>
>> >>> On Tue, 6 Oct 2015, Rainer Gerhards wrote:
>> >>>
>> >>>> 2015-10-06 0:09 GMT+02:00 Kendall Green <kendallar...@gmail.com>:
>> >>>>>
>> >>>>>
>> >>>>> I've set asyncWriting="off", and the only difference is that with the
>> >>>>> 1000-message test all messages were delivered on HUP and/or service
>> >>>>> restart, but apparently not on flushInterval or when the buffer filled.
>> >>>>> I'm currently running another test with 100k messages iterating at
>> >>>>> about 10 logs/sec, and watching the output for the number of lines. The
>> >>>>> output file is created on the dfs-fuse mount upon receipt of the first
>> >>>>> log, but 0 lines are written.
>> >>>>>
>> >>>>> This current test will hopefully show how many logs get processed
>> >>>>> without intervention over the next ~3 hours; if still no logs appear, I
>> >>>>> will see how many are written upon a HUP signal.
>> >>>>>
>> >>>>> I'm unsure how asyncWriting="off" impacts the rest of the config, and
>> >>>>> whether the current config has disabled compression... or not, with no
>> >>>>> zip setting.
>> >>>>>
>> >>>>> This note is from the omfile asyncWriting doc:
>> >>>>> *Note that in order to enable FlushInterval, AsyncWriting must be set
>> >>>>> to “on”. Otherwise, the flush interval will be ignored. Also note that
>> >>>>> when FlushOnTXEnd is “on” but AsyncWriting is off, output will only be
>> >>>>> written when the buffer is full. This may take several hours, or even
>> >>>>> require a rsyslog shutdown. However, a buffer flush can be forced in
>> >>>>> that case by sending rsyslogd a HUP signal.*
>> >>>>>
>> >>>>> Please help clarify how these parameters interact and which settings
>> >>>>> achieve expedited log transport, avoiding long delays waiting for
>> >>>>> buffering or flush intervals. This goal is why I have chosen low
>> >>>>> settings for IOBufferSize and flushInterval.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Without async writing, everything is written synchronously when it
>> >>>> arrives (more precisely, at commit of the current batch), aka "as
>> >>>> quickly as possible". Thus, these params do not exist in that mode
>> >>>> because they have no meaning. Note that with a very large batch and a
>> >>>> small buffer, partial records may be written.
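>> >>>>
>> >>>> For illustration, the two modes look roughly like this (a sketch; the
>> >>>> file name and values are only examples):
>> >>>>
>> >>>>     # async: buffer/flush parameters are honored
>> >>>>     action(type="omfile" file="/var/log/test.log"
>> >>>>            asyncWriting="on" flushInterval="1" ioBufferSize="4k")
>> >>>>
>> >>>>     # sync: buffer params have no effect; writes happen at batch commit
>> >>>>     action(type="omfile" file="/var/log/test.log"
>> >>>>            asyncWriting="off")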
>> >>>
>> >>>
>> >>>
>> >>> although without the checkpoint parameter the writes go to the OS, and
>> >>> the OS flushes them to disk at some later point.
>> >>
>> >>
>> >> you mean "sync", right?
>> >
>> >
>> > I may be misunderstanding, but I thought the checkpoint (or checkpoint
>> > interval) parameter controlled how frequently you did the sync (so turning
>> > sync on didn't make it happen for every message)
>>
>> That checkpoint interval is for queues (which share the same util object),
>> not for omfile; there, what you say is right. For omfile (if not set to
>> async writing), the sync happens at the end of each batch, which usually
>> means after each write, because usually a full batch fits into the write
>> buffer. If the batch has more data than the write buffer holds, all of the
>> buffer writes are done first and then the flush.
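>>
>> In config terms (a sketch; the file and queue names are only examples):
>>
>>     action(type="omfile" file="/var/log/test.log" sync="on"
>>            queue.type="disk" queue.filename="testq"
>>            queue.checkpointinterval="100")  # checkpoint every 100 records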
>>
>> I've also updated the doc to be more precise:
>>
>>
>> https://github.com/rsyslog/rsyslog-doc/commit/b2bb5d93787f9662eb41499c86f881d2bd649dce
>>
>> Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
