So I've been helping Fritz look into his -11/45 problem, and things have gotten to a point where I'd like to reach out for help, more eyes, etc.
I have to say, I spent almost a decade at the start of my career working on PDP-11 hardware ('new build' DMA devices, as well as fixing broken stuff), and software, and this is, I think, the most confusing and difficult problem I have _ever_ seen on one. Hence the above...

What's _particularly_ confusing and difficult is that it seems like _three_ separate, un-related things all go wrong at exactly (2 of 3) or close to (the other) the same time. And the machine now passes all the diagnostics that have been thrown at it, particularly the KT11 and RK11 diagnostics (why this is important will become clear).

So here's what we've found to date. The failure we're looking at is that an attempt to execute the 'ls' command under Unix V6 fails; it gets a memory management fault, and dumps core. AFAICT, the shell successfully forks, and its attempt to do an exec() of 'ls' sort of works (more below), but a few instructions in, we get the MM fault - but there's even more wrong when that happens (details toward the end below).

I've been looking at the core dump produced by the process, which gives me the registers at the time of the trap, the user's stack, etc - but not a copy of the binary code. The 'ls' command is a so-called 'pure text', i.e. the binary is segregated into separate, potentially shared, read-only 'segment(s)' (only 1 in this case) of the PDP-11's User mode address space, and is not included in the process dump.

(I use the term 'segment', which is actually what DEC called them in the first version of the PDP-11/45 processor handbook, because that's what they are, not pages, as pages are on most systems. I assume they changed to 'page' for marketing reasons. And please, can we hold debate about this and focus on the problem? Thanks! :-)

I do have the ability to look at the binary that it _should_ be executing, by examining the command in its file.
Also, Fritz has worked out that he can patch the MM trap vector (before trying to do the 'ls') to halt the machine when it happens, so he can read out all the KT11 registers, look at the actual program in main memory, etc.

First oddity - the problem is dependent on the location of the command in main memory! If Fritz says "sleep 360 &", to run a trivial command in the background, and _then_ says 'ls' - it works (so we know the binary of 'ls' on disk is OK)! We _think_ this is because the process executing the 'sleep' takes up a chunk of main memory, and thus changes the location of the process executing the 'ls'.

The problem is that I'm reluctant to try and change anything (e.g. to have the OS print out anything) because that will change the location of things, and we may (likely will?) not get the problem. With nothing changed, it _reliably_ fails - I've looked at two different core dumps, and all the essential data (registers, user stack etc) are identical. The KT11 registers all seem to be the same, too.

So, on to details. I'm pretty sure the command only gets a few instructions in before it blows up. Here are the process' registers, and the _entire_ contents of the user mode stack:

    R0 177770  R1 0  R2 0  R3 0  R4 34  R5 444  SP 177760  PC 010210

    060: 000000 000020 000001 177770 177774 177777 071554 000000

010210 turns out to be the first word in 'csv', which is an internal routine which PDP-11 C uses to build a stack frame - _every_ C routine starts with a "JSR R5, CSV" instruction as the first thing it does.
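For anyone who wants to re-check the stack contents by hand, here is a minimal Python sketch. It assumes the '060:' label means the dump starts at 0177760 (the SP value), and that argc and argv sit at SP+4 and SP+6; that slot assignment is one reading of the dump, not something the dump itself states.

```python
# Rough cross-check of the user stack dump (all values octal).
# Assumption: the "060:" label means the words start at 0177760, the SP value.
SP = 0o177760
words = [0o000000, 0o000020, 0o000001, 0o177770,
         0o177774, 0o177777, 0o071554, 0o000000]

# Map each word to its (even) virtual address.
stack = {SP + 2 * i: w for i, w in enumerate(words)}

def word_to_chars(w):
    """Split a PDP-11 word into two characters; the low byte comes first."""
    return chr(w & 0o377) + chr((w >> 8) & 0o377)

# Assumed slot assignment (hypothetical, see above): argc at SP+4, argv at SP+6.
argc = stack[SP + 4]
argv = stack[SP + 6]
argv0 = stack[argv]                 # first entry of the argv[] array
name = word_to_chars(stack[argv0])  # first two characters of argv[0]

for addr in sorted(stack):
    print(f"{addr:06o}: {stack[addr]:06o}")
print("argc =", argc, " argv =", f"{argv:06o}", " argv[0] starts with", repr(name))
```

Under those assumptions the pointers chain together consistently: argv points at a slot inside the dump, that slot points at the word 071554, and that word holds the two characters of the command name.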
So looking at the stack (which looks good; it contains a valid 'argc' and 'argv' that the process would be started with), and the registers, I'm pretty sure it does these starting instructions OK:

    start:  setd
            mov     sp,r0
            mov     (r0),-(sp)
            tst     (r0)+
            mov     r0,2(sp)
            jsr     pc,_main

    _main:  jsr     r5,csv

and then blows up on:

    csv:    mov     r5,r0

So it's the 8th instruction in that blows up (*): but not only is what's in memory at that location _not_ 'mov r5,r0', it also gets an MM trap that makes no sense.

(*: In user mode: if you don't have an FPP, the first one will trap, which UNIX ignores.)

Fritz has looked at the KT11 registers when the trap happens, and the PARs and PDRs all look good. The SSRs contain:

    > SSR's: 040143 000000 010210 000000

SSR2 gives the PC at the time of the fault (again 010210); SSR0 shows:

    Abort - segment (page) length error
    User mode
    Segment (Page) 1

which is the first thing that's wrong - neither the instruction that's _supposed_ to be there (shown next), nor the one that's _actually_ there, contains any reference to segment 1!

The _actual_ code it's trying to execute is:

    > 171600: 016162 004767 000224 000414 006700 006152 006702 006144

(Per UISA0, the text base is 0161400; adding the PC of 010210 gives 0171610, which is right in the middle there.)

That does not, alas, look anything _at all_ like what's _supposed_ to be there, which is:

    010200: 110024
             10400    mov r4,r0
               167    jmp 10226 (cret)
                16
    PC->     10500    mov r5,r0    (start of CSV)
             10605    mov sp,r5
             10446    mov r4,-(sp)
             10346    mov r3,-(sp)

So somehow the command (at least, this part of it - Fritz is going to check on the first few instructions, but I'm pretty sure they will be OK) has gotten read in wrong - but that's the least of our problems!

06700 is 'SXT R0', and neither that nor 'MOV R5, R0' can _possibly_ cause an MM violation - least of all one on segment 1 (this code is in segment 0)! I could see there having been an error reading in the command binary (e.g. maybe the RK11 has an issue), but WTF is happening here?
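The SSR0 decoding and the address arithmetic above can be re-checked with a short Python sketch. The SSR0 bit positions are as I understand them from the KT11 description in the PDP-11/45 handbook, so verify them there before relying on this. The PAR value 01614 is an assumption derived from the stated text base of 0161400 (base divided by 64 bytes), and all the function names are made up for illustration.

```python
# Sketch: decode SSR0 and relocate a user virtual address (values in octal).

def decode_ssr0(ssr0):
    """Pick apart the SSR0 fields relevant to an abort."""
    modes = {0: "kernel", 1: "supervisor", 2: "illegal", 3: "user"}
    return {
        "abort_non_resident": bool(ssr0 & 0o100000),  # bit 15
        "abort_length_error": bool(ssr0 & 0o040000),  # bit 14
        "abort_read_only":    bool(ssr0 & 0o020000),  # bit 13
        "mode":               modes[(ssr0 >> 5) & 0o3],  # bits 6-5
        "d_space":            bool(ssr0 & 0o20),      # bit 4
        "segment":            (ssr0 >> 1) & 0o7,      # bits 3-1
        "relocation_enabled": bool(ssr0 & 0o1),       # bit 0
    }

def segment_of(vaddr):
    """Top 3 bits of a 16-bit virtual address select the segment."""
    return (vaddr >> 13) & 0o7

def relocate(par, vaddr):
    """PAR holds the base in 64-byte blocks; the low 13 bits are the offset."""
    return (par << 6) + (vaddr & 0o17777)

# Memory contents read out at the halt, starting at physical 0171600.
DUMP_BASE = 0o171600
DUMP = [0o016162, 0o004767, 0o000224, 0o000414,
        0o006700, 0o006152, 0o006702, 0o006144]

def word_at(paddr):
    """Look up a word in the dump by physical address."""
    return DUMP[(paddr - DUMP_BASE) // 2]

fields = decode_ssr0(0o040143)
paddr = relocate(0o1614, 0o10210)   # assumed UISA0 contents, faulting PC
print(fields)
print(f"PC is in segment {segment_of(0o10210)}, physical {paddr:o}, "
      f"word there is {word_at(paddr):06o}")
```

With the values from this post, the decode matches what is listed above (length abort, User mode, segment 1), while the PC itself sits in segment 0 and relocates to the word 006700 in the dump.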
Just to make things triply confusing, R5 contains trash! The 'JSR R5, CSV' _should_ have put the old PC in R5; but that call to CSV is at 030, so R5 _should_ contain 034, not 0444.

Needless to say, this is a real head-scratcher. What's confusing the heck out of me are the three separate issues, all happening together - R5 contains junk, the spurious (?) MM trap, etc.

The bad command binary in main memory could be caused by any number of things: to get it, Unix reads file system blocks off the disk into buffers in low memory, and then writes them out to the user's memory with MTPI. So an RK11 glitch could be doing it, but also a KT11 problem, etc.

I'm having a hard time seeing a common thread here - maybe a KT11 issue? But how would that cause R5 to contain trash? That should only involve the KB11. And the JSR R5, CSV must have been executed more-or-less OK, otherwise how did it wind up at CSV?

I was wondering if some noise could be causing it - some sort of pattern sensitivity - but how is it bashing R5 _and_ causing a spurious MM trap? That's some glitch!

Most of the data above (e.g. SSR contents at trap time) has been re-checked, and Fritz is going to check the rest (e.g. actual main memory contents for the start of the code, and the user's stack - to check that the process' core dump worked OK - although given the consistent stack contents, I'm expecting those to be good).

I suggested to him that the time had come to apply the logic analyzer; I'd love to see (from the IR in the CPU) the instruction that faults, and where it came from. And also what the bus cycle is that's causing the fault; is it the instruction fetch (possibly) or something that instruction is trying to do?

Does anyone have any comments/insight that could help work out what's going on here? Or suggestions on things to look at? If so, thanks!

Noel