Re: Prod server down - services will not stay up - RESOLVED

Susan Palmer Fri, 13 Nov 2009 15:23:59 -0800

Thank you all for your suggestions.

It was actually an existing problem that manifested itself in a way it
hasn't in 7 years, so it was fairly unrecognizable.  Why it behaved
differently this time is unknown.


The short of it is, we have one form that, for 'bad' business reasons, has
two ranges of field ID 1.  On rare occasions over the last 7 years the next
ID has been set incorrectly by AR/the database.  That story is too long to go
into; in fact there are probably some archived emails here where I've gone
into it in depth.  Normally it manifests itself as a unique key error.  We
correct the problem (less than 1 min) and life goes on.  No impact to anyone
except the one user.
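
For anyone trying to picture the next-ID problem, here is a minimal sketch in
Python.  It assumes the next ID and the list of existing entry IDs have
already been pulled out of the database by some other means; the function
name and the sample numbers are made up for illustration and are not taken
from our system.

```python
def find_next_id_collision(next_id, existing_ids):
    """Return the first existing ID the counter will run into, or None.

    The counter hands out next_id, next_id + 1, and so on. The first
    existing ID at or above next_id is where a save would hit a
    unique key error.
    """
    at_or_above = [i for i in existing_ids if i >= next_id]
    return min(at_or_above) if at_or_above else None


# Hypothetical form with two ID ranges: 1-5 and 1000-1001.
ids = [1, 2, 3, 4, 5, 1000, 1001]
print(find_next_id_collision(3, ids))     # counter reset into the low range
print(find_next_id_collision(6, ids))     # counter below the high range
print(find_next_id_collision(1002, ids))  # counter above everything
```

If the returned ID equals the next ID, the very next save would collide.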

Today we had the double unique situation of the ID reset and it took the
full system down.  When the system came back up and the person entering the
offending record hit save again, the system went down again.  She was very
persistent and tried about 20 times.  We never did see a unique key error so
we didn't look in that direction.
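
Since the only real clue was in the truss output quoted further down, here is
a rough Python sketch for pulling the fault details out of lines like those.
It assumes the "/11:" prefix is the thread (LWP) number as truss normally
prints it, and the regular expressions are written only against the few lines
in this thread, so treat it as a starting point rather than a complete parser.

```python
import re

# Matches lines such as:
#   /11:        Incurred fault #6, FLTBOUNDS  %pc = 0xFE6A3558
FAULT_RE = re.compile(
    r"/(\d+):\s+Incurred fault #\d+, (\w+)\s+%pc = (0x[0-9A-Fa-f]+)"
)
# Matches lines such as:
#   /11:          siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C
SIGINFO_RE = re.compile(
    r"/(\d+):\s+siginfo: (\w+) (\w+) addr=(0x[0-9A-Fa-f]+)"
)


def summarize_faults(lines):
    """Collect fault and siginfo details per thread from truss output lines."""
    summary = {}
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            lwp, name, pc = m.groups()
            summary.setdefault(int(lwp), {})["fault"] = (name, pc)
            continue
        m = SIGINFO_RE.search(line)
        if m:
            lwp, sig, code, addr = m.groups()
            summary.setdefault(int(lwp), {})["siginfo"] = (sig, code, addr)
    return summary


# Lines copied from the truss excerpt in the original post.
sample = [
    r'/11:    read(54, "\0F7\0\006\0\0\0\0\01017".., 2064)    = 247',
    "/11:        Incurred fault #6, FLTBOUNDS  %pc = 0xFE6A3558",
    "/11:          siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C",
    "/11:        Received signal #11, SIGSEGV [caught]",
]
print(summarize_faults(sample))
```

SEGV_MAPERR means the address was not mapped into the process at all, which
fits Ben's description of a segmentation violation below.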

Last June we built all new servers, all new databases ... everything new.
And we still have the problem.  To add a little twist to it, today we did an
experiment on the dev server.  We've never had the issue there because the
thing we know triggers the ID issue doesn't happen in dev.  When we
simulated the problem there, guess what ... it didn't happen on dev!

Same workflow, same forms, supposedly same server builds, supposedly same
database builds.

Well we all have our little fluke things to live with.  Hopefully this fluke
will be going away next week because we've finally convinced the 'business'
to stop a bad practice and we're just waiting on a final test.  I hope
they're not just holding the proverbial 'tease' out there and then going to
yank it back.  That sounds a little harsh, but 4 hours troubleshooting today
can be harsh too!

Everyone have a great weekend!  And if you were at WWRUG with me .... it was
a great RUG and I'm already looking forward to next year!!  Great job by
Dan, Lenny, Joel and Phil ... we do appreciate the effort!

Thanks,
Susan

On Fri, Nov 13, 2009 at 2:33 PM, Darrell Reading <
darrell.reading...@wal-mart.com> wrote:

> *More specifically, check maxsiz.  Is the arserverd core dumping?  What
> size is the core?*
>
>
> *Darrell Reading Systems Engineer*
> Phone 479.204.5739
> dere...@wal-mart.com
>
> Wal-Mart Stores, Inc.
> 805 Moberly Lane, MS-0560-68
> Bentonville, AR 72716
> *Save Money. Live Better*
>
>
>  ------------------------------
>  *From:* Action Request System discussion list(ARSList) [mailto:
> arsl...@arslist.org] *On Behalf Of *Susan Palmer
> *Sent:* Friday, November 13, 2009 14:13
>
> *To:* arslist@ARSLIST.ORG
> *Subject:* Re: Prod server down - services will not stay up
>
> We've turned on both sql and api logging now to capture the next event.
>
> How would the db space affect this?  They actually just expanded it last
> night.
>
> Working with the unix guys in the office and support, just not the same
> when you're not there.
>
>
>
>
> On Fri, Nov 13, 2009 at 2:08 PM, Ben Chernys <
> ben.cher...@softwaretoolhouse.com> wrote:
>
>> These are a bit of a pain to solve.  SQL logging on startup is the key.
>> The logs are quite big but usually the last lines will be pertinent.  You
>> also need to know the database structure of the meta-data - given by the
>> database reference guide.
>>
>> I'm afraid that these types of problems are not likely to get solved
>> whilst in a hotel room as once you have the idea of where the problem lies
>> (through the log) you then need to research the meta-data itself.  The sql
>> log will simply let you know the meta-data table that was last read and not
>> which record of that table caused the server to crash.
>>
>> 7.0.1 p2 seems a little low.  Is it possible to patch the binaries?
>>
>> It is unlikely that simply allocating more database space is the problem.
>> You could also look at the temporary space and see that it was increased but
>> I would go with the logs first.
>>
>> 2 - 10 minutes will most surely be in the server's initial processing of
>> the meta-data.
>> Cheers
>> Ben
>>  ------------------------------
>>  *From:* Action Request System discussion list(ARSList) [mailto:
>> arsl...@arslist.org] *On Behalf Of *Susan Palmer
>> *Sent:* November 13, 2009 8:57 PM
>> *To:* arslist@ARSLIST.ORG
>> *Subject:* Re: Prod server down - services will not stay up
>>
>>  Thanks Ben
>>
>> We're having problems determining where the 11 is coming from.
>>
>> On Fri, Nov 13, 2009 at 1:49 PM, Ben Chernys <
>> ben.cher...@softwaretoolhouse.com> wrote:
>>
>>>
>>>  PS.  The 91 is a red herring.  It's the Sig 11 (SEGV) you need to worry
>>> about.   The 91 is another process not being able to communicate with the
>>> arserverd process.
>>>
>>> Cheers
>>> Ben
>>>
>>>  ------------------------------
>>> *From:* Ben Chernys [mailto:ben.cher...@softwaretoolhouse.com]
>>> *Sent:* November 13, 2009 8:42 PM
>>> *To:* 'arslist@ARSLIST.ORG'
>>> *Subject:* RE: Prod server down - services will not stay up
>>>
>>>   The signal 11 is bad code - simple as that.  It's a "segmentation
>>> violation" which means that the server (arserverd) attempted to read or
>>> write to an address not allocated to its virtual space.  It can also be
>>> caused by a double free or two pointers to one block which has been freed.
>>> In any event, you cannot fix this without the ARS source code which I expect
>>> you would find hard to get.
>>>
>>> That being said, the easiest way to determine (and then circumvent) these
>>> types of things is to turn on SQL logging on the server before the system
>>> starts (through the ar.conf file).  The exact settings are in the
>>> configuring ARS guide.
>>>
>>> Then, when the blow up happens, see what the server was attempting to
>>> do.  You can usually spot some possible internal database inconsistencies
>>> (in ARS meta-data) in this way and then repair them manually through SQL
>>> before the ARS start-up.
>>>
>>> Additionally, there may be patches available that address the problem.
>>>
>>> Cheers
>>> Ben Chernys
>>>
>>>
>>>
>>>  ------------------------------
>>> *From:* Action Request System discussion list(ARSList) [mailto:
>>> arsl...@arslist.org] *On Behalf Of *Susan Palmer
>>> *Sent:* November 13, 2009 8:30 PM
>>> *To:* arslist@ARSLIST.ORG
>>> *Subject:* Prod server down - services will not stay up
>>>
>>> Help !!
>>>
>>> Working with support but could use anyone else's input.  I'm at WWRUG so
>>> it's somewhat limiting.
>>>
>>> We did a truss log and when the services drop (arerror 91) we see the
>>> following:
>>> 167
>>> /11:    read(54, "\0FE\0\006\0\0\0\0\01017".., 2064)    = 254
>>> /11:    write(54, "\0A1\0\006\0\0\0\0\003 ^".., 161)    = 161
>>> /11:    read(54, "\0F7\0\006\0\0\0\0\01017".., 2064)    = 247
>>> /11:        Incurred fault #6, FLTBOUNDS  %pc = 0xFE6A3558
>>> /11:          siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C
>>> /11:        Received signal #11, SIGSEGV [caught]
>>> /11:          siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C
>>>
>>> The services do restart automatically so armonitor is doing its job.
>>> We've commented out everything from armonitor but the arserverd command.
>>>
>>> We stay up for between 2-10 minutes and then wham, we're down again.
>>> Obviously this just started this morning.
>>>
>>> unix sun solaris 10
>>> oracle 10g
>>> ars 7.0.1P2
>>>
>>> They did expand the database size last night if that has any bearing.
>>> But we can connect to the database successfully when ar is down.
>>>
>>> Nothing else helpful in arerror.log, only 91 error.
>>>
>>> I'm at the Hardrock hotel, call room 30601 if you have questions or can
>>> help!
>>>
>>> Thanks,
>>> Susan
>>>
>>>
>>>
>>>
>>>
>>> _Platinum Sponsor: rmisoluti...@verizon.net ARSlist: "Where the Answers
>>> Are"_
>>>
>>
>>
>
>
>

_______________________________________________________________________________
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
