One more comment:

It would be good to avoid the error code SA_AIS_ERR_TIMEOUT altogether.
I think this is possible: modify the slave thread (the file thread) so that
it undoes a write operation if the master thread (the main thread) timed out
before the write finished. The slave thread can use lseek() and ftruncate()
to undo the write.

regards,
Anders Widell

2013-08-19 10:07, Lennart Lund wrote:
> Summary: logsv: Fix hanging main thread when file i/o don't return
> Review request for Trac Ticket(s): #9
> Peer Reviewer(s): Madhurika Koppula, (Anders Widell, Hans Feldt)
> Pull request to: NA
> Affected branch(es): devel (4.4)
> Development branch: <<IF ANY GIVE THE REPO URL>>
>
>
> --------------------------------
> Impacted area       Impact y/n
> --------------------------------
>   Docs                    n
>   Build system            n
>   RPM/packaging           n
>   Configuration files     n
>   Startup scripts         n
>   SAF services            y
>   OpenSAF services        n
>   Core libraries          n
>   Samples                 n
>   Tests                   n
>   Other                   n
>
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
> In order to protect the log server "main thread" (MT) from hanging if a file
> operation like write, mkdir etc. does not return, all such operations are done
> in a separate "file thread" (FT).
> Functions running in the MT that need file system operations hand over
> execution to the FT when file handling has to be done. Execution is then given
> back to the MT. If a file operation does not return, the FT will hang, but the
> MT will time out the FT and resume. A timeout is handled as a failed file
> operation. The MT can detect if the FT is hanging, and new requests for file
> operations will then be "failed".
>
> Note:
> This review request contains all patches. Some of the "old" patches are
> concatenated (qfold)
> Old patches       New patch
> 1 - 7             1
> 8                 2
> 9 - 11            3
> 12                4
> 13                5
> 14                6
> 15                7
>
> Patches 5 - 7 are new for this review.
>
>
> changeset b32ee924b9716330c8d7b54f5556fc235c31fbab
> Author:       Lennart Lund <lennart.l...@ericsson.com>
> Date: Mon, 19 Aug 2013 09:12:30 +0200
>
>       logsv: Fix hanging main thread when file i/o don't return. [#9] Part 1
>
>       Generic thread handling:
>       - Generic thread handling
>       - Convert functions to use threaded file handling
>       - Handling of object implementer rejects
>       - Invalidate stream fd if errno EBADF when writing log record
>       - Fix Error handling for too long path (> PATH_MAX)
>       - Functions that uses a handler in file thread has got extension _h
>
> changeset ed70f6043029ad9c7ea5f55439130023acea13bc
> Author:       Lennart Lund <lennart.l...@ericsson.com>
> Date: Mon, 19 Aug 2013 09:13:22 +0200
>
>       logsv: Fix hanging main thread when file i/o don't return. [#9] part 2
>
>       - Fix review remarks and some findings from test
>       - Fix some findings found when using code analyze tool
>       - Cleanup of TRACE and LOG
>       - Add information for contributors/maintainers about file system
>       handling in the Log-service README file
>
> changeset 1ab74048f572d3ecb843651dea627399e77a0afc
> Author:       Lennart Lund <lennart.l...@ericsson.com>
> Date: Mon, 19 Aug 2013 09:13:56 +0200
>
>       logsv: Fix hanging main thread when file i/o don't return. [#9] Part 3
>
>       - Remove unnecessary data copying in log_file_api() and
>       file_hndl_thread()
>       - Return SA_AIS_ERR_TIMEOUT if the write operation times out when a log
>       record shall be written. If the file thread is already "hanging" when a
>       write is requested, no attempt to write is made and SA_AIS_ERR_TRY_AGAIN
>       is returned as before.
>       - Try to recover file thread by recreating it if it hangs for a long
>       time.
>       - Recover if bad file descriptor or stale NFS handle.
>
>       - Always reinitialize/reopen log files if a write operation fails,
>       timeout of file thread (hanging file system) included.
>       - Handle synchronization between nodes when log files cannot be created
>       before a switch over without using any new flag that has to be
>       checkpointed (remove "files_initialized" flag)
>       - Incorrect handling of "partial write" is fixed. See #536
>
>       - Open log files with O_NONBLOCK. Answer client with AIS_ERR_TIMEOUT if
>       EWOULDBLOCK/EAGAIN (record may be partially written)
>
> changeset 9891f1e38d7c32cb5be4f929101548dc466c79b9
> Author:       Lennart Lund <lennart.l...@ericsson.com>
> Date: Mon, 19 Aug 2013 09:18:18 +0200
>
>       logsv: Fix hanging main thread when file i/o don't return. [#9] Part 4
>
>       - Make timeouts for file hdl configurable in Log service configuration
>       object
>
> changeset dd80bd737715084537c0affe959ba619d0752fba
> Author:       Lennart Lund <lennart.l...@ericsson.com>
> Date: Mon, 19 Aug 2013 09:19:03 +0200
>
>       logsv: Fix hanging main thread when file i/o don't return. [#9] Part 5
>
>       - Fix error in lgs_make_reldir_h(). Root directory can be corrupt if
>       file thread is hanging.
>
> changeset 15498547b6da54ecdecf2f6e2806d60027d291dc
> Author:       Lennart Lund <lennart.l...@ericsson.com>
> Date: Mon, 19 Aug 2013 09:20:52 +0200
>
>       logsv: Fix hanging main thread when file i/o don't return. [#9] Part 6
>
>       - Remove thread recovery handling (kill and restart thread)
>
> changeset 08a97594d0471b611baff692bd8704392a6d6f50
> Author:       Lennart Lund <lennart.l...@ericsson.com>
> Date: Mon, 19 Aug 2013 09:22:09 +0200
>
>       logsv: Fix hanging main thread when file i/o don't return. [#9] Part 7
>
>       - Update saflogger to handle SA_AIS_ERR_TIMEOUT
>
>
> Added Files:
> ------------
>   README_LOGENH
>   osaf/services/saf/logsv/lgs/lgs_file.c
>   osaf/services/saf/logsv/lgs/lgs_file.h
>   osaf/services/saf/logsv/lgs/lgs_filehdl.c
>   osaf/services/saf/logsv/lgs/lgs_filehdl.h
>
>
> Removed Files:
> --------------
>   README_LOGENH
>
>
> Complete diffstat:
> ------------------
>   osaf/services/saf/logsv/README            |   23 +++
>   osaf/services/saf/logsv/lgs/Makefile.am   |    8 +-
>   osaf/services/saf/logsv/lgs/lgs.h         |    1 +
>   osaf/services/saf/logsv/lgs/lgs_cb.h      |    2 +
>   osaf/services/saf/logsv/lgs/lgs_evt.c     |    5 +-
>   osaf/services/saf/logsv/lgs/lgs_evt.h     |    4 +
>   osaf/services/saf/logsv/lgs/lgs_file.c    |  416 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   osaf/services/saf/logsv/lgs/lgs_file.h    |   71 ++++++++++
>   osaf/services/saf/logsv/lgs/lgs_filehdl.c |  612 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   osaf/services/saf/logsv/lgs/lgs_filehdl.h |  162 +++++++++++++++++++++++
>   osaf/services/saf/logsv/lgs/lgs_imm.c     |  227 ++++++++++++++++++++++++--------
>   osaf/services/saf/logsv/lgs/lgs_main.c    |   12 +-
>   osaf/services/saf/logsv/lgs/lgs_mbcsv.c   |    7 +
>   osaf/services/saf/logsv/lgs/lgs_mbcsv.h   |    3 +
>   osaf/services/saf/logsv/lgs/lgs_stream.c  |  594 +++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------
>   osaf/services/saf/logsv/lgs/lgs_stream.h  |    4 +-
>   osaf/services/saf/logsv/lgs/lgs_util.c    |  446 +++++++++++++++++++++++++++++++++++++++++-----------------------
>   osaf/services/saf/logsv/lgs/lgs_util.h    |   21 ++-
>   18 files changed, 2147 insertions(+), 471 deletions(-)
>
>
> Testing Commands:
> -----------------
> 1. Regression test
>> logtest
> 2. Switch over test (using alarm stream)
>> saflogger -l -s crit "alarm message 1"
>> cat repl_opensaf/saflog/saLogAlarm_SOME_DATE.log
>   Printout containing "alarm message 1"
>> immadm -o 7 safSi=SC-2N,safApp=OpenSAF
>> saflogger -l -s crit "alarm message 2"
>> cat repl_opensaf/saflog/saLogAlarm_SOME_DATE.log
>   Printout containing "alarm message 1" and "alarm message 2"
> 3. Redo tests after node start with simulated
>     unavailable filesystem for the log service
>   - Activate the simulated unavailable file system by uncommenting
>     the LLD_DELAY_TST define in file lgs_file.c in the log server.
>     This "hangs" the "file thread" for some time during system
>     start.
>   - Rebuild the log server.
>   - Remove old log files in repl_opensaf/saflog/
>   - Start the cluster with the rebuilt log server.
>     Note: The repl_opensaf/saflog directory is empty after system
>           start. The .cfg and .log files for alarm, notify and system
>           that normally can be found are missing since they could not be
>           created during system start. However, files for the respective
>           log stream will be created when writing log records.
>   - Re-run test 1 and 2
>
> Note:
> The current logtest uses a hard-coded log root path that may not
> work on some systems and that is not the default path found in
> the log service configuration object. A ticket for the logtest
> has been written [#541]. A fix exists and patches can be found
> in a review request for ticket #541
>
>
> Testing, Expected Results:
> --------------------------
> 1.   Regression test with no fail.
> 2.   "alarm message 1" and "alarm message 2" found in the same file.
> 3.1. Regression test with no fail.
> 3.2. "alarm message 1" and "alarm message 2" found in the same file.
>
>
> Conditions of Submission:
> -------------------------
>   <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
>
> Arch      Built     Started    Linux distro
> -------------------------------------------
> mips        n          n
> mips64      n          n
> x86         n          n
> x86_64      n          n
> powerpc     n          n
> powerpc64   n          n
>
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank entries
>      that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your headers/comments/text.
>
> ___ You have failed to put in a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>      (i.e. internal bug tracking tool IDs, product names etc)
>
> ___ You have not given any evidence of testing beyond basic build tests.
>      Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>      like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>      cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>      too much content in a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc)
>
> ___ You have giant attachments which should never have been sent;
>      Instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>      commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear indication
>      of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>      comments and change requests that were proposed in the initial review.
>
> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>
> ___ Your computer has a badly configured date and time; confusing the
>      threaded patch review.
>
> ___ Your changes affect IPC mechanism, and you don't present any results
>      for in-service upgradability test.
>
> ___ Your changes affect user manual and documentation, your patch series
>      do not contain the patch that updates the Doxygen manual.
>

