One more comment: it would be good if the error code SA_AIS_ERR_TIMEOUT could be avoided altogether. I think it can, by modifying the slave thread so that it undoes a write operation if the master thread timed out before the write finished. The slave thread can then use lseek() and ftruncate() to undo the write.
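To make the idea concrete, here is a rough sketch of what I have in mind. It is not taken from lgs_filehdl.c: the names write_record(), undo_write() and undo_demo() and the timed_out flag are made up for illustration, and I assume that the slave thread is the only writer to the file and that the flag is read under whatever mutex the two threads already share.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: cut the file back to 'start' and reposition there. */
int undo_write(int fd, off_t start)
{
    if (ftruncate(fd, start) == -1)
        return -1;
    return lseek(fd, start, SEEK_SET) == (off_t)-1 ? -1 : 0;
}

/*
 * Hypothetical helper: append one record to fd.
 * Returns 0 if the record was stored, 1 if it was discarded and the file
 * was restored to its previous length, -1 if the rollback itself failed.
 * '*timed_out' stands in for "the master thread gave up on this request";
 * real code would read it while holding the shared mutex.
 */
int write_record(int fd, const char *rec, size_t len, const bool *timed_out)
{
    off_t start = lseek(fd, 0, SEEK_END);   /* offset where this record begins */
    if (start == (off_t)-1)
        return -1;

    ssize_t written = write(fd, rec, len);
    if (written == (ssize_t)len && !*timed_out)
        return 0;                           /* complete and still wanted */

    /* Error, partial write, or the requester timed out: remove the record. */
    return undo_write(fd, start) == -1 ? -1 : 1;
}

/* Small demonstration: one record is kept, one is undone. */
off_t undo_demo(void)
{
    char path[] = "/tmp/undo_demoXXXXXX";
    int fd = mkstemp(path);
    assert(fd != -1);
    unlink(path);                           /* file lives until fd is closed */

    bool timed_out = false;
    assert(write_record(fd, "first\n", 6, &timed_out) == 0);

    timed_out = true;                       /* simulate a master thread timeout */
    assert(write_record(fd, "second\n", 7, &timed_out) == 1);

    off_t size = lseek(fd, 0, SEEK_END);    /* only "first\n" should remain */
    close(fd);
    return size;
}
```

Note that this only helps once write() eventually returns. While the file system is really hanging nothing can be undone, but when the call does come back the record the client was told had failed never ends up in the log file.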
regards,
Anders Widell

2013-08-19 10:07, Lennart Lund wrote:
> Summary: logsv: Fix hanging main thread when file i/o don't return
> Review request for Trac Ticket(s): #9
> Peer Reviewer(s): Madhurika Koppula, (Anders Widell, Hans Feldt)
> Pull request to: NA
> Affected branch(es): devel (4.4)
> Development branch: <<IF ANY GIVE THE REPO URL>>
>
> --------------------------------
> Impacted area       Impact y/n
> --------------------------------
>  Docs                    n
>  Build system            n
>  RPM/packaging           n
>  Configuration files     n
>  Startup scripts         n
>  SAF services            y
>  OpenSAF services        n
>  Core libraries          n
>  Samples                 n
>  Tests                   n
>  Other                   n
>
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
> In order to protect the log server "main thread" (MT) from hanging if a
> file operation like write, mkdir etc. does not return, all such operations
> are done in a separate "file thread" (FT).
> Functions running in the MT that need file system operations hand over
> execution to the FT when file handling has to be done. Execution is then
> given back to the MT. If a file operation does not return, the FT will
> hang, but the MT will time out the FT and resume. A timeout is handled as
> a failed file operation.
> The MT can detect if the FT is hanging, and new requests for file
> operations will then be "failed".
>
> Note:
> This review request contains all patches. Some of the "old" patches are
> concatenated (qfold):
>
>  Old patches    New patch
>  1 - 7          1
>  8              2
>  9 - 11         3
>  12             4
>  13             5
>  14             6
>  15             7
>
> Patches 5 - 7 are new for this review.
>
>
> changeset b32ee924b9716330c8d7b54f5556fc235c31fbab
> Author: Lennart Lund <lennart.l...@ericsson.com>
> Date:   Mon, 19 Aug 2013 09:12:30 +0200
>
>     logsv: Fix hanging main thread when file i/o don't return.
>     [#9] Part 1
>
>     Generic thread handling:
>     - Generic thread handling
>     - Convert functions to use threaded file handling
>     - Handling of object implementer rejects
>     - Invalidate stream fd if errno EBADF when writing log record
>     - Fix error handling for too long path (> PATH_MAX)
>     - Functions that use a handler in the file thread have got the
>       extension _h
>
> changeset ed70f6043029ad9c7ea5f55439130023acea13bc
> Author: Lennart Lund <lennart.l...@ericsson.com>
> Date:   Mon, 19 Aug 2013 09:13:22 +0200
>
>     logsv: Fix hanging main thread when file i/o don't return. [#9] part 2
>
>     - Fix review remarks and some findings from test
>     - Fix some findings found when using a code analysis tool
>     - Cleanup of TRACE and LOG
>     - Add information for contributors/maintainers about file system
>       handling in the Log-service README file
>
> changeset 1ab74048f572d3ecb843651dea627399e77a0afc
> Author: Lennart Lund <lennart.l...@ericsson.com>
> Date:   Mon, 19 Aug 2013 09:13:56 +0200
>
>     logsv: Fix hanging main thread when file i/o don't return. [#9] Part 3
>
>     - Remove unnecessary data copying in log_file_api() and
>       file_hndl_thread()
>     - Return SA_AIS_ERR_TIMEOUT if the write operation times out when a
>       log record shall be written. If the file thread is already "hanging"
>       when a write is requested, no attempt to write is made and
>       SA_AIS_ERR_TRY_AGAIN is returned as before.
>     - Try to recover the file thread by recreating it if it hangs for a
>       long time.
>     - Recover if bad file descriptor or stale NFS handle.
>
>     - Always reinitialize/reopen log files if a write operation fails,
>       timeout of file thread (hanging file system) included.
>     - Handle synchronization between nodes when log files cannot be
>       created before a switch over without using any new flag that has to
>       be checkpointed (remove "files_initialized" flag)
>     - Incorrect handling of "partial write" is fixed. See #536
>
>     - Open log files with O_NONBLOCK.
>       Answer client with AIS_ERR_TIMEOUT if EWOULDBLOCK/EAGAIN (record
>       may be partially written)
>
> changeset 9891f1e38d7c32cb5be4f929101548dc466c79b9
> Author: Lennart Lund <lennart.l...@ericsson.com>
> Date:   Mon, 19 Aug 2013 09:18:18 +0200
>
>     logsv: Fix hanging main thread when file i/o don't return. [#9] Part 4
>
>     - Make timeouts for file hdl configurable in Log service
>       configuration object
>
> changeset dd80bd737715084537c0affe959ba619d0752fba
> Author: Lennart Lund <lennart.l...@ericsson.com>
> Date:   Mon, 19 Aug 2013 09:19:03 +0200
>
>     logsv: Fix hanging main thread when file i/o don't return. [#9] Part 5
>
>     - Fix error in lgs_make_reldir_h(). Root directory can be corrupt if
>       file thread is hanging.
>
> changeset 15498547b6da54ecdecf2f6e2806d60027d291dc
> Author: Lennart Lund <lennart.l...@ericsson.com>
> Date:   Mon, 19 Aug 2013 09:20:52 +0200
>
>     logsv: Fix hanging main thread when file i/o don't return. [#9] Part 6
>
>     - Remove thread recovery handling (kill and restart thread)
>
> changeset 08a97594d0471b611baff692bd8704392a6d6f50
> Author: Lennart Lund <lennart.l...@ericsson.com>
> Date:   Mon, 19 Aug 2013 09:22:09 +0200
>
>     logsv: Fix hanging main thread when file i/o don't return.
>     [#9] Part 7
>
>     - Update saflogger to handle SA_AIS_ERR_TIMEOUT
>
>
> Added Files:
> ------------
>  README_LOGENH
>  osaf/services/saf/logsv/lgs/lgs_file.c
>  osaf/services/saf/logsv/lgs/lgs_file.h
>  osaf/services/saf/logsv/lgs/lgs_filehdl.c
>  osaf/services/saf/logsv/lgs/lgs_filehdl.h
>
>
> Removed Files:
> --------------
>  README_LOGENH
>
>
> Complete diffstat:
> ------------------
>  osaf/services/saf/logsv/README            |   23 +++
>  osaf/services/saf/logsv/lgs/Makefile.am   |    8 +-
>  osaf/services/saf/logsv/lgs/lgs.h         |    1 +
>  osaf/services/saf/logsv/lgs/lgs_cb.h      |    2 +
>  osaf/services/saf/logsv/lgs/lgs_evt.c     |    5 +-
>  osaf/services/saf/logsv/lgs/lgs_evt.h     |    4 +
>  osaf/services/saf/logsv/lgs/lgs_file.c    |  416 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  osaf/services/saf/logsv/lgs/lgs_file.h    |   71 ++++++++++
>  osaf/services/saf/logsv/lgs/lgs_filehdl.c |  612 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  osaf/services/saf/logsv/lgs/lgs_filehdl.h |  162 +++++++++++++++++++++++
>  osaf/services/saf/logsv/lgs/lgs_imm.c     |  227 ++++++++++++++++++++++++--------
>  osaf/services/saf/logsv/lgs/lgs_main.c    |   12 +-
>  osaf/services/saf/logsv/lgs/lgs_mbcsv.c   |    7 +
>  osaf/services/saf/logsv/lgs/lgs_mbcsv.h   |    3 +
>  osaf/services/saf/logsv/lgs/lgs_stream.c  |  594 +++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
>  osaf/services/saf/logsv/lgs/lgs_stream.h  |    4 +-
>  osaf/services/saf/logsv/lgs/lgs_util.c    |  446 +++++++++++++++++++++++++++++++++++++++++----------------------
>  osaf/services/saf/logsv/lgs/lgs_util.h    |   21 ++-
>  18 files changed, 2147 insertions(+), 471 deletions(-)
>
>
> Testing Commands:
> -----------------
> 1. Regression test
>> logtest
> 2. Switch over test (using alarm stream)
>> saflogger -l -s crit "alarm message 1"
>> cat repl_opensaf/saflog/saLogAlarm_SOME_DATE.log
> Printout containing "alarm message 1"
>> immadm -o 7 safSi=SC-2N,safApp=OpenSAF
>> saflogger -l -s crit "alarm message 2"
>> cat repl_opensaf/saflog/saLogAlarm_SOME_DATE.log
> Printout containing "alarm message 1" and "alarm message 2"
> 3. Redo tests after node start with simulated unavailable filesystem
>    for the log service
>    - Activate the simulated unavailable file system by uncommenting
>      the LLD_DELAY_TST define in file lgs_file.c in the log server.
>      This makes the "file thread" "hang" for some time during system
>      start.
>    - Rebuild the log server.
>    - Remove old log files in repl-opensaf/saflog/
>    - Start the cluster with the rebuilt log server.
>      Note: The repl_opensaf/saflog directory is empty after system
>      start. The .cfg and .log files for alarm, notify and system
>      that normally can be found are missing since they could not be
>      created during system start. However, files for the respective log
>      stream will be created when writing log records.
>    - Re-run tests 1 and 2
>
> Note:
> The current logtest uses a hard-coded log root path that may not work
> on some systems and that is not the default path found in the log
> service configuration object. A ticket for the logtest is written
> [#541]. A fix exists and patches can be found in a review request for
> ticket #541
>
>
> Testing, Expected Results:
> --------------------------
> 1. Regression test with no fail.
> 2. "alarm message 1" and "alarm message 2" found in the same file.
> 3.1. Regression test with no fail.
> 3.2. "alarm message 1" and "alarm message 2" found in the same file.
>
>
> Conditions of Submission:
> -------------------------
> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
>
> Arch        Built   Started   Linux distro
> -------------------------------------------
> mips        n       n
> mips64      n       n
> x86         n       n
> x86_64      n       n
> powerpc     n       n
> powerpc64   n       n
>
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank
>     entries that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your
>     headers/comments/text.
>
> ___ You have failed to put in a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>     (i.e. internal bug tracking tool IDs, product names etc)
>
> ___ You have not given any evidence of testing beyond basic build tests.
>     Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>     like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>     cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>     too much content into a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc)
>
> ___ You have giant attachments which should never have been sent;
>     Instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>     commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear
>     indication of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>     comments and change requests that were proposed in the initial review.
>
> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>
> ___ Your computer has a badly configured date and time, confusing the
>     threaded patch review.
>
> ___ Your changes affect IPC mechanism, and you don't present any results
>     for in-service upgradability test.
>
> ___ Your changes affect user manual and documentation, your patch series
>     do not contain the patch that updates the Doxygen manual.
>

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel