Pawel (privately) wrote:
> Hi guys,
>
> We are trying to find out details (locate processes=jobs which utilise 
> so much memory).
>
> We did not apply "export MALLOCTYPE=buckets" to .profiles as HelpDesk 
> suggested. This option is exported only on our one test area, which is 
> rarely processed.
>
> I read http://www.redbooks.ibm.com/redbooks/pdfs/sg247463.pdf and:
...
> It seems to me that we should try what Jim suggested: "export 
> MALLOCTYPE=watson" and perhaps some MALLOCOPTIONS settings.
>   
Yep, start without options though. I see your next post, but I will 
address that in a reply to that post.
> Let's leave it for the second, because something interesting shown on 
> our LIVE area today in the very morning. Session that runs TSM produced 
> following output:
> jsh techuser ~ -->START.TSM
> START.TSM
> Phantom process started on process id 2449420
>  [2449420] Done : tSA 1
> jsh techuser ~ -->Process ID 1232928 , port 794 , hangup
>   Program source name CLEAR.TOKENS , line 26
> Recursive debugger calls - program aborting
>   
Yeah - there is some serious problem going on, but to be honest this is 
probably just out of memory again. There is very little you can do as a 
programmer once you are out of memory. Even trying to print a message 
will run out of memory unless you know you can free some. Some people 
guard against this by keeping a small allocation of memory to be freed 
in case of an abort, in the hope they can use it to get enough memory to 
recover or show a message.
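
For what it is worth, the "keep a small allocation in reserve" trick 
looks roughly like the sketch below. This is only an illustration in 
plain C, not how jBASE or T24 does anything; the names reserve_init, 
reserve_release and guarded_malloc and the 64 KB size are all made up 
for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names and size throughout; not taken from jBASE. */
#define RESERVE_BYTES (64 * 1024)

static void *emergency_reserve = NULL;

/* Call once at start-up, while memory is still plentiful.
 * Returns 1 if the reserve was obtained, 0 otherwise. */
int reserve_init(void)
{
    emergency_reserve = malloc(RESERVE_BYTES);
    return emergency_reserve != NULL;
}

/* Hand the reserve back to the allocator.
 * Returns 1 if there was a reserve to release, 0 if it was already gone. */
int reserve_release(void)
{
    if (emergency_reserve == NULL)
        return 0;
    free(emergency_reserve);
    emergency_reserve = NULL;
    return 1;
}

/* malloc wrapper: if the allocation fails, release the reserve so the
 * caller has a little room left to print a message and shut down. */
void *guarded_malloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL && reserve_release())
        fputs("out of memory: emergency reserve released\n", stderr);
    return p;
}
```

Even then there is no guarantee the freed block is enough, or that the 
allocator will hand it back in a usable form, which is why I say "in 
the hope".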

The key to this is the recursive debugger calls. Being out of memory, 
the program tried to access a memory pointer that wasn't valid (it 
should not really do this, but as you can't really do anything at this 
point, it is a moot point). There is a trap routine that sees this 
invalid access, aborts the program and enters the debugger. However, 
still being out of memory, the debugger tries to use some memory and 
accesses an invalid pointer itself, so the trap triggers again and tries 
to enter the debugger a second time, where it detects the recursion and 
just aborts as there is no other option.
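
To make the recursion detection concrete, here is a minimal sketch of 
the general pattern in C on a POSIX system. It is an assumption about 
the shape of such a trap, not the actual jBASE code; trap_enter, 
on_segv and install_trap are names invented for the example, and the 
"debugger" is reduced to an exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set once the trap has been entered; sig_atomic_t is safe to touch
 * from a signal handler. */
static volatile sig_atomic_t in_trap = 0;

/* Returns 1 on the first entry, 0 if we are already inside the trap. */
int trap_enter(void)
{
    if (in_trap)
        return 0;
    in_trap = 1;
    return 1;
}

/* SIGSEGV handler: the second fault, raised while handling the first,
 * is detected via the flag and the process simply gives up. */
static void on_segv(int sig)
{
    static const char msg[] = "Recursive trap - aborting\n";
    (void)sig;
    if (!trap_enter()) {
        (void)write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(1);
    }
    /* A real handler would try to enter a debugger here; if that
     * faults, we come straight back in and take the branch above.
     * This sketch just exits. */
    _exit(2);
}

/* Install the handler. SA_NODEFER lets a fault inside the handler be
 * delivered to the handler again instead of killing the process.
 * Returns 0 on success, -1 on failure (as sigaction does). */
int install_trap(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NODEFER;
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Note the handler only uses write and _exit, because almost nothing 
else is safe to call once you are in that state.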
> Process ID 5054700 , port 794 , hangup
>   Program source name CLEAR.TOKENS , line 27
> Recursive debugger calls - program aborting
> jBASE: Segmentation violation. Aborting
>   
This is the out of memory stuff.
> cp: ../bnk.data/int.data/DM.TEMP/DC.CARD.ISSUE.HIS.DM: No such file or 
> directory
>   
This is a bug in your application, which is not detecting that the 
program did not finish correctly and is trying to copy the results of it 
anyway.
> jBASE: Segmentation violation. Aborting
> jBASE: Attempting to free NULL pointer at 
> jediTransaction.c,1636(EB.TRANS.JBASE,
> 26)
> jsh techuser ~ -->Process ID 7737374 , port 805 , hangup
>   Program source name F.READ , line 7
>   
This is basically all the same thing. I will defer until answering your 
next message, which is where the problem is I think.
> You can ignore hangups, but I am worried about these jBASE errors 
> (Segmentation violation / Attempting to free NULL pointer at 
> jediTransaction.c,1636). This does not sound good to me. We do not know 
> which processes thrown these messages, but likely they were COB agents.
>   
It is just that you were out of memory and things are trying to clean 
up. These are the symptoms, not the cause.
> I do not know yet whether physical / swap memory ran out yesterday on 
> PROD, but it is quite unlikely (total memory of LIVE system is 2-3 times 
> bigger than on test machines).
>
> I would like to mention one fact from the past.
>
> During "start of year" (2nd January) processing we faced 1 "little" 
> problem:
> a) one of the single threaded jobs did a large transaction (over 900k of 
> changes) - we have already requested to improve this core EOY job
>   
Did that happen?
> b) then later one of the batch sessions (agent) failed with 
> SUBROUTINE_CALL_FAIL error. There was nothing wrong with our libraries - 
> called object was there and routine that failed was successfully called 
> by other COB agents. Only one COB agent noted SUBROUTINE_CALL_FAIL 
> error, which seemed to be very strange. We have raised that and CSHD 
> conclusion was: "agent run out of shared memory" 
Shared memory is not used for this. That was either their 
mis-description of the problem or you are recalling what happens on 
UniVerse ;-). Basically this can only really happen when there is no 
real memory (which, to confuse things, is called virtual :-) for the 
process to map in the subroutine, allocate the descriptor for it, and 
so on.

All your problems are likely caused by the answer to the next post you made.

> (ulimit is unlimited on 
> LIVE) so use "slibclean" periodically to reclaim memory.
>   
This is probably what they mean. There is a fundamental design issue 
with AIX and when it feels it can get rid of shared objects. Whatever 
IBM try to claim about this, they avoid the question "Hmm, then why does 
no other UNIX suffer from this issue?"

> I think now that agent which failed on 2nd January performed in previous 
> steps large transaction, means allocated large "transaction buffer" and 
> finally got SUBROUTINE_CALL_FAIL on one of the following jobs (not 
> immediately).
> That is why I suggested that "transaction buffer" may not get downsized 
> or leaks memory. I also guess that it may not be a leak, but default 
> MALLOC allocator fault. 
Yep. But of course something is causing you to need huge amounts of 
memory. I suspect that it is a bug in JQL, which we can find a 
workaround for (am I on the clock yet?) in the next email.

> I am not sure if Watson will help, but reading 
> Jim's emails we will give it try.
>   
It will definitely help generally, but it is more likely to expose the 
real problem, which, reading your next email, it seems it has :-)

Jim
