Re: ARS 7.1 P6 Server -- 4 days restarting (possible memory OS 32bit issue) signal is 11

Axton Fri, 23 Sep 2011 13:06:11 -0700

Run SQL, Filter, and API logging with the logs set to create a backup.  When
the server crashes, it will copy the logs to a .bak file.  You can use these
logs to see what was happening up to the point of the crash.


To Mark's point, check that the system is set up to create a core file using
coreadm.  With this command you can tell the system whether or not to create
a core, where to create it, etc.  A signal 11 (segmentation fault) is
something that the kernel can trap and generate a core for, if the system is
configured to do so.
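
As a rough illustration of what "died with 11" means to a parent process such
as armonitor, here is a minimal Python sketch (not anything shipped with AR
System).  It starts a child that sends itself SIGSEGV and reports how the
child ended, and it reads the per-process core-size limit, which is a separate
setting from the system-wide coreadm configuration.  The function names are
made up for the example, and the child sets its own core limit to 0 so the
demo does not leave a core file behind.

```python
import resource
import signal
import subprocess
import sys

# Child program: disable core files for this demo, then raise SIGSEGV on itself.
CHILD_CODE = """
import os, resource, signal
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
signal.signal(signal.SIGSEGV, signal.SIG_DFL)
os.kill(os.getpid(), signal.SIGSEGV)
"""


def core_limit():
    """Return the (soft, hard) core-file size limit of the current process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def describe_exit(returncode):
    """Translate a subprocess return code into a short description."""
    if returncode < 0:
        # A negative return code means the child was killed by that signal.
        sig = signal.Signals(-returncode)
        return f"died with signal {sig.value} ({sig.name})"
    return f"exited with status {returncode}"


def run_segfaulting_child():
    """Run the child and return its return code (negative means a signal)."""
    return subprocess.run([sys.executable, "-c", CHILD_CODE]).returncode


if __name__ == "__main__":
    soft, hard = core_limit()
    print("core limit (soft, hard):", soft, hard)
    print("child", describe_exit(run_segfaulting_child()))
```

A soft core limit of 0 in the environment that starts the server is one
possible reason a SIGSEGV leaves no core behind, so it is worth checking
alongside coreadm.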

Analyze the core file and you can extrapolate some information, though it's
not a task for the faint of heart.
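
One low-effort first pass, following Mark's pstack suggestion quoted below, is
to save the pstack output to a text file and count frames per thread; a
runaway recursion should show up as one thread with far more frames than the
rest.  The sketch below is only that, a sketch: it assumes each thread section
starts with a header line containing "lwp# N" and that every other non-blank
line in the section is one stack frame.  The function names are hypothetical.

```python
import re

# Assumed header shape: "-----------------  lwp# 1 / thread# 1  ----------"
HEADER = re.compile(r"lwp#\s*(\d+)")


def frames_per_lwp(pstack_text):
    """Return {lwp_id: frame_count} for saved pstack output."""
    counts = {}
    current = None
    for line in pstack_text.splitlines():
        match = HEADER.search(line)
        if match:
            # A new thread section starts here.
            current = int(match.group(1))
            counts[current] = 0
        elif current is not None and line.strip():
            # Any other non-blank line inside a section counts as one frame.
            counts[current] += 1
    return counts


def deepest(counts, top=3):
    """Return the `top` (lwp_id, frame_count) pairs with the most frames."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]
```

Feed it the text of a saved pstack run and look at what deepest() returns
first; that thread's frames are the ones to read by hand.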

If you have multiple crashes, save the backup log files from each and see if
the same workflow is firing each time. If so, try to reproduce the issue
using the information available in the logs.
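
The comparison across crashes can be scripted.  Below is a minimal Python
sketch that pulls the filter names out of the tail of each saved .bak log and
intersects them.  The regular expression is an assumption about what a filter
log line looks like (a name in double quotes after the word Checking); adjust
it to match your own logs.  The function names and file names are made up for
the example.

```python
import re
from collections import deque

# Assumed filter-log line shape, e.g.:  <FLTR> ... Checking "My Filter" (500)
FILTER_NAME = re.compile(r'Checking "([^"]+)"')


def last_filters(log_path, tail_lines=2000):
    """Return the set of filter names seen in the last `tail_lines` lines."""
    with open(log_path, errors="replace") as handle:
        # deque with maxlen keeps only the final lines of the file.
        tail = deque(handle, maxlen=tail_lines)
    names = set()
    for line in tail:
        names.update(FILTER_NAME.findall(line))
    return names


def common_filters(log_paths, tail_lines=2000):
    """Return filter names present at the end of every crash log."""
    sets = [last_filters(path, tail_lines) for path in log_paths]
    return set.intersection(*sets) if sets else set()
```

For example, common_filters(["crash1.bak", "crash2.bak"]) returns only the
filters that fired shortly before both crashes, which narrows down what to
try reproducing.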

Axton

On Fri, Sep 23, 2011 at 1:23 PM, patrick zandi <remedy...@gmail.com> wrote:

> Yes: ITSM 7.1 P6 is running.
>
> I am looking to upgrade patch levels, without any promises, but I have to
> test on a dev system first...
> which is another issue in itself.
>
> Thanks guys.. I am glad you all caught one thing I did not check.. and I am
> not out of my mind.. yet!
>
>
>
> On Fri, Sep 23, 2011 at 2:07 PM, Ben Chernys <
> ben.cher...@softwaretoolhouse.com> wrote:
>
>>
>> OK.  So, it ain’t a denied malloc.  You got a bug that somehow you are
>> exposing and seemingly no one else is.  Tough one.  BMC could give you a
>> debug build, you’d generate a core, and give it to some BMC folks.  That
>> would require some very high level of support I would assume (though I have
>> seen that ages ago).
>>
>> I still think the best bet is to trace and hunt around filters to see if
>> anything unusual is the culprit.  As Mark said, make sure your system is
>> taking cores.  You have a process ending with SIGSEGV; that always ends up
>> with a core.  I’ve never seen a system where it didn’t but then I am no Unix
>> admin.
>>
>> There are other options such as plugging in stuff to log system status at
>> intervals etc.  But your best bet is definitely to turn on logging and
>> review logs.  Try to reproduce this in a safe (non-production) environment.
>>
>> BTW, the OS or armonitor doesn’t terminate the process.  This is a
>> hardware exception (a trap) that the OS catches.  ARS will have a signal
>> handler and that gets control.  That is what prints the Signal 11 trace.
>>
>> Are you running ITSM?  You didn’t say.  But code paths within the arserver
>> will vary, even with the same ARS applications, for so many reasons that the
>> number of possible paths is astronomical.
>>
>> Were there any filter / db / environment changes just before the symptoms
>> started showing up?  These are good areas to investigate as well.
>>
>> These types of problems can be solved (circumvented) quickly or can take
>> forever (i.e. never get solved).  But they are quite involved and interesting.
>>
>> Good luck!
>>
>> Ben
>>
>> *From:* Action Request System discussion list(ARSList) [mailto:
>> arslist@ARSLIST.ORG] *On Behalf Of *patrick zandi
>> *Sent:* September-23-11 15:38
>>
>> *To:* arslist@ARSLIST.ORG
>> *Subject:* Re: ARS 7.1 P6 Server -- 4 days restarting (possible memory OS
>> 32bit issue) signal is 11
>>
>> What is perplexing is that there are no core dumps.
>> We are running fine and then, boom, we get that signal, and armonitor shuts
>> arserverd down and starts a new one.
>> What I was hoping was that someone would say "I remember that, and patch
>> X, Y, Z fixed it."
>>
>> I turned on complete logging: the log stops and then the log starts again
>> (nothing after the signal and startup).  Clear.  Clean.
>>
>> I looked through all the patches and found the following, but cannot put a
>> finger on any one of them specifically (NL = not likely, ML = most
>> likely):
>>
>> SW00351599       The AR System server crashed when a filter used converted
>> values in a Set Field action.
>>
>> SW00351922       The AR System server program terminated when building a
>> userList to send notifications.
>>
>> ML-SW00356647    Too many filters executed recursively causing a stack
>> overflow, which resulted in failure of the AR System server.
>>
>> SW00328336       The AR System server crashed while saving text to a Diary
>> field on which audit was enabled.
>>
>> SW00328337       arserverd crashed while attempting to communicate with
>> the plug-in server through PluginServerCallWithRetry.
>>
>> SW00337127       A memory leak issue occurred in the AR System server.
>>
>> NL-SW00338411    The AR System server crashed during an archiving
>> process if the archive source form contained non-data fields such as Text,
>> Trim, Button, and so on.
>>
>> NL-SW00346370    The AR System server crashed while processing an error
>> handling filter.
>>
>> SW00314816       When a user performed a search on a Join form, but did
>> not have permissions to view all records returned in the result, the AR
>> System server crashed.
>>
>> NL-SW00322802    Creating an entry with a user name larger than 180
>> bytes caused the AR System server to crash when the status history was being
>> recorded and the initial status was not New.
>>
>> On Fri, Sep 23, 2011 at 5:09 AM, Ben Chernys <
>> ben.cher...@softwaretoolhouse.com> wrote:
>>
>> Hi Mark, Patrick,
>>
>> Signal 11 is SIGSEGV, which is not necessarily a malloc failure, though
>> indeed a malloc failure may lead to it.  It is not always possible to log
>> malloc failures; after all, it takes some memory to cut a log record.
>>
>> A segmentation violation is always the result of bad code (accessing
>> memory not allocated to the process or not in the process’s address space,
>> for which address 0, malloc’s return value on failure, is a candidate).
>>
>> That being said, it is possible to avoid triggering the execution path with
>> that bad code by altering filters etc., so definitely the route to go is
>> along the lines that Mark described: the core is always a wealth of info,
>> even though ARS will not have debugging compiled in ;-)  I would also turn
>> on all logging: SQL, API, and Filter on the server, unlimited in size, and
>> all pointing to the same file, until the next occurrence.  Then you will
>> have a wealth of ARS information to go through.  Generally something will
>> stand out.
>>
>> Recursive filter loops are usually trapped by the maximum filter limit,
>> though if that is set high enough the process will run out of memory before
>> hitting that limit.  If yours is high, you could try setting it lower.
>>
>> You may also want to go to a higher patch level if one is available.  I am
>> no longer that familiar with the patches available on 7.1.
>>
>> Also, I know that memory on Solaris may be restricted by the admin.  (I
>> forget the commands to determine this, but they will be easily found on the
>> web.)  ulimit, perhaps?
>>
>> Cheers
>>
>> Ben Chernys
>>
>> Senior Software Architect
>> Software Tool House Inc.
>>
>> Canada / Deutschland / Germany
>> Mobile:      +49 171 380 2329    GMT + 1 + [ DST ]
>> Email:       Ben.Chernys _AT_ 
>> softwaretoolhouse.com<ben.cher...@softwaretoolhouse.com>
>> Web:         www.softwaretoolhouse.com
>>
>> Check out Software Tool House's free Diary Editor.
>>
>> *Meta-Update,* our premium ARS Data tool, lets you automate
>> your imports, migrations, *in no time at all*, without programming,
>> without staging forms, without merge workflow.
>> http://www.softwaretoolhouse.com/
>>
>> *From:* Action Request System discussion list(ARSList) [mailto:
>> arslist@ARSLIST.ORG] *On Behalf Of *Walters, Mark
>> *Sent:* September-23-11 09:08
>> *To:* arslist@ARSLIST.ORG
>> *Subject:* Re: ARS 7.1 P6 Server -- 4 days restarting (possible memory OS
>> 32bit issue) signal is 11
>>
>> It may be memory, but I would expect to see malloc errors (ARERR 300) in
>> the arerror.log if this was the case.  The fact that you’re not seeing a
>> stack trace like this:
>>
>> Mon Sep 20 08:33:52 2010     6
>>   Timestamp: Mon Sep 20 2010 08:33:52.1865
>>   Thread Id: 4
>>   Version: 7.1.00 Patch 009 201009200800
>>   ServerName: test71
>>   Database: SQL -- Oracle
>>   Hardware: sun4u
>>   OS: SunOS 5.10
>>   RPC Id: 337
>>   RPC Call: 106 (GLXS)
>>   RPC Queue: 390600
>>   Client: User Demo from Remedy Administrator (protocol 13) at IP address 192.168.1.54
>>   Form:
>>   Logging On:
>>
>> suggests it may be a recursive filter – on Solaris this often causes a
>> crash without logging anything useful.  Check to see whether there are any
>> core files in the server/bin directory as this is another symptom of this
>> type of crash on Solaris.  If cores are enabled (check with the OS coreadm
>> command) then the server may create them even though you’re not running a
>> debug build.
>>
>> If you do have some core files then run the pstack command against them
>> (pstack core) and you will be able to see the stack of each thread within
>> the server – if it is a recursive filter causing a stack overflow then one
>> of the threads should stand out as being much bigger than the others.
>> Depending on what you see you may then need to enable FILTER/SQL logging to
>> try and capture the workflow that is causing the crash.  It’s also worth
>> checking the Filter-Max-Stack value in ar.conf – various installers set this
>> to a very high value – try reducing it back down to 50 or so and this should
>> stop most filter recursion crashes and log an error instead.
>>
>> Mark
>>
>> I work for BMC, I don’t speak for them.
>>
>> *From:* Action Request System discussion list(ARSList) [mailto:
>> arslist@ARSLIST.ORG] *On Behalf Of *patrick zandi
>> *Sent:* 22 September 2011 21:07
>> *To:* arslist@ARSLIST.ORG
>> *Subject:* ARS 7.1 P6 Server -- 4 days restarting (possible memory OS
>> 32bit issue) signal is 11
>>
>> Just a quick question: with ARS 7.1 P6 on Solaris 10, I am seeing what looks
>> like the operating system telling the AR Server to shut down about every
>> 4-6 days.  I am not positive, as there is nothing in the debug logs at all,
>> only in ARMONITOR.log, where it says:
>>
>> 2011     ARMonitor child process (pid:15277) died with 11. And the signal
>> is 11.
>>
>> ./arserverd
>>
>>
>> Can I assume signal 11 is memory?  I have seen a lot of memory issues
>> with signal 11 on the ARSList...
>>
>>
>> --
>> Patrick Zandi
>> _attend WWRUG11 www.wwrug.com ARSlist: "Where the Answers Are"_
>>
>
>
>
> --
> Patrick Zandi
>

_______________________________________________________________________________
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
attend wwrug11 www.wwrug.com ARSList: "Where the Answers Are"
