[osol-discuss] SCCS source code
Hi, I remember that there was once a note saying that SCCS would become OSS by the end of this year. I can no longer find any time frame for SCCS. What is the current state? Jörg -- EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin [EMAIL PROTECTED](uni) [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org
[osol-discuss] Some fun ("Corrected ECC Error") with my "new" Blade1000 from EBay...
Hi! I have some trouble with my "new" Blade1000 from EBay... the problem is that I am getting zillions of "Corrected ECC Error" messages and I cannot set "diag-switch?" to "true" (the response is always "diag-switch? = false". Ok... I've tried the usual "trick" and removed the NVRAM chip which forces the machine to run the diag stuff. The output looks like this: -- snip -- Could not read diag-switch? from NVRAM! Could not read diag-level from NVRAM! Could not read mfg-mode from NVRAM! Could not read security-mode from NVRAM! @(#)OBP 4.2.2 2001/04/26 14:59 Clearing TLBs Done Power-On Reset Executing Power On SelfTest {0} {0}@(#)POST, v4.2.2 04/26/2001 07:23 PM {0}Soft POR to the whole system {1}Soft POR to the whole system {0}* Configure I2C controller 0 {0}* Configure I2C controller 1 {0}* I2C Controller Loopback Test {0}* Read JTag IDs of all ASICs {0} BBC JTag ID: 1483203b {0} SCSIJTag ID: 15060045 {0} I chip JTag ID: d1e203b {0} RIO JTag ID: 13e5d03b {0} Schizo JTag ID: 1424c06d {0} CPMSJTag ID: 1142903b {0} CPMSJTag ID: 1142903b {0} CPMSJTag ID: 1142903b {0} CPMSJTag ID: 1142903b {0} CPMSJTag ID: 1142903b {0} CPMSJTag ID: 1142903b {0}* Read JTag ID of FCAL {0} FC-AL JTag ID: 1000a12f {0}* Probing Seeprom on DIMMs and CPU modules {0}WARNING: DIMM 1 missing {0}WARNING: DIMM 3 missing {0}WARNING: DIMM 5 missing {0}WARNING: DIMM 7 missing {0}CPU0 Sensor package temperature 20 oC {0}CPU1 Sensor package temperature 20 oC {0}WARNING: Temperature sensor on UPA0 missing {0}WARNING: Temperature sensor on UPA1 missing {0}ERROR: TEST = * Probing Seeprom on DIMMs and CPU modules TESTID = 96 {0}H/W under test = I2C/Serial Proms {0} Slave not responded {0} Cannot read socketed seeprom U2101 {0} I2C bus 0 {0} I2C address a0 {0}* Probing Seeprom on DIMMs and CPU modules FAILED {0}POST failed {0}POST_END Could not read diag-switch? from NVRAM! Could not read diag-level from NVRAM! Could not read mfg-mode from NVRAM! Could not read security-mode from NVRAM! 
@(#)OBP 4.2.2 2001/04/26 14:59 Clearing TLBs Done POST Results: Cpu 0 %o0 ...0001 %o1 .07ff.f015.06d0 %o2 ... POST Results: Cpu 1 %o0 ...0001 %o1 .07ff.f015.0730 %o2 ... Membase: ... MemSize: ..0010. Init CPU arrays Done Init E$ tags Done Setup TLB Done MMUs ON Copy Done PC = .07ff.f000.37f8 PC = ...3878 Decompressing Done Size = ..0006.e440 ttya initialized Start Reason: Initialize Machine Configuring the machine: þ Could not read diag-switch? from NVRAM! Could not read diag-level from NVRAM! Could not read mfg-mode from NVRAM! Could not read security-mode from NVRAM! @(#)OBP 4.2.2 2001/04/26 14:59 Clearing TLBs Done Loading Configuration Membase: ... MemSize: ..4000. Init CPU arrays Done Init E$ tags Done Setup TLB Done MMUs ON Block Scrubbing Done Copy Done PC = .07ff.f000.37f8 PC = ...3878 Decompressing Done Size = ..0006.e440 ttya initialized Corrected ECC Error ok Corrected ECC Error -- snip -- Erm... that sounds like "bad memory", right ? Or is there any other possible issue which may cause this problem ? Bye, Roland -- __ . . __ (o.\ \/ /.o) [EMAIL PROTECTED] \__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer /O /==\ O\ TEL +49 641 7950090 (;O/ \/ \O;) ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org
Re: [osol-discuss] Re: Re: Solaris on an Ultra 10
Andrew Pattison wrote: >>* max supported hdd size ? --> (120GB) >> >> > >That was my next question. I didn't realise the onboard controller doesn't >support UDMA though - that strikes me as a bit odd, considering it must have >shipped with UDMA hard disks. > >I would be very interested in one of your magic cards. What sort of price did >you have in mind? I've got a qfe (quad fast ethernet) card that came with the >Ultra 10 if that's any use to you. ;-) > >Cheers > >Andrew. > > You are not the one-millionth, but the FIRST marTux user ever who has publicly "admitted" to having booted it. And you even praised Xorg_for_sparc and requested a new version. Because of this, you get my IDE_cmd649_with_JP0 card for free. Contact me in private and give me your address. You only need to pay(pal) the shipping fees. I will test the board again to be sure it is 100% OK. Someone has booted marTux, I cannot believe it :-) -- Martin http://www.martux.org/RELEASES/sparcv9/ (LiveCD/DVD for sparc) http://www.martux.org/RELEASES/x86_and_x64/DVD/ (LiveDVD for x64/x86)
Re: [osol-discuss] Some fun ("Corrected ECC Error") with my "new" Blade1000 from EBay...
Roland Mainz wrote: >Hi! > > > >I have some trouble with my "new" Blade1000 from EBay... the problem is >that I am getting zillions of "Corrected ECC Error" messages and I >cannot set "diag-switch?" to "true" (the response is always >"diag-switch? = false". > >Ok... I've tried the usual "trick" and removed the NVRAM chip which >forces the machine to run the diag stuff. The output looks like this: >-- snip -- >Could not read diag-switch? from NVRAM! >Could not read diag-level from NVRAM! >Could not read mfg-mode from NVRAM! >Could not read security-mode from NVRAM! >@(#)OBP 4.2.2 2001/04/26 14:59 >Clearing TLBs Done >Power-On Reset >Executing Power On SelfTest >{0} >{0}@(#)POST, v4.2.2 04/26/2001 07:23 PM >{0}Soft POR to the whole system >{1}Soft POR to the whole system >{0}* Configure I2C controller 0 >{0}* Configure I2C controller 1 >{0}* I2C Controller Loopback Test >{0}* Read JTag IDs of all ASICs >{0} BBC JTag ID: 1483203b >{0} SCSIJTag ID: 15060045 >{0} I chip JTag ID: d1e203b >{0} RIO JTag ID: 13e5d03b >{0} Schizo JTag ID: 1424c06d >{0} CPMSJTag ID: 1142903b >{0} CPMSJTag ID: 1142903b >{0} CPMSJTag ID: 1142903b >{0} CPMSJTag ID: 1142903b >{0} CPMSJTag ID: 1142903b >{0} CPMSJTag ID: 1142903b >{0}* Read JTag ID of FCAL >{0} FC-AL JTag ID: 1000a12f >{0}* Probing Seeprom on DIMMs and CPU modules >{0}WARNING: DIMM 1 missing >{0}WARNING: DIMM 3 missing >{0}WARNING: DIMM 5 missing >{0}WARNING: DIMM 7 missing >{0}CPU0 Sensor package temperature 20 oC >{0}CPU1 Sensor package temperature 20 oC >{0}WARNING: Temperature sensor on UPA0 missing >{0}WARNING: Temperature sensor on UPA1 missing >{0}ERROR: TEST = * Probing Seeprom on DIMMs and CPU modules TESTID = 96 >{0}H/W under test = I2C/Serial Proms >{0} Slave not responded >{0} Cannot read socketed seeprom U2101 >{0} I2C bus 0 >{0} I2C address a0 >{0}* Probing Seeprom on DIMMs and CPU modules FAILED >{0}POST failed >{0}POST_END > >Could not read diag-switch? from NVRAM! >Could not read diag-level from NVRAM! 
>Could not read mfg-mode from NVRAM! >Could not read security-mode from NVRAM! >@(#)OBP 4.2.2 2001/04/26 14:59 >Clearing TLBs Done >POST Results: Cpu 0 > %o0 ...0001 > %o1 .07ff.f015.06d0 > %o2 ... >POST Results: Cpu 1 > %o0 ...0001 > %o1 .07ff.f015.0730 > %o2 ... >Membase: ... >MemSize: ..0010. >Init CPU arrays Done >Init E$ tags Done >Setup TLB Done >MMUs ON >Copy Done >PC = .07ff.f000.37f8 >PC = ...3878 >Decompressing Done >Size = ..0006.e440 >ttya initialized >Start Reason: Initialize Machine >Configuring the machine: >þ >Could not read diag-switch? from NVRAM! >Could not read diag-level from NVRAM! >Could not read mfg-mode from NVRAM! >Could not read security-mode from NVRAM! >@(#)OBP 4.2.2 2001/04/26 14:59 >Clearing TLBs Done >Loading Configuration >Membase: ... >MemSize: ..4000. >Init CPU arrays Done >Init E$ tags Done >Setup TLB Done >MMUs ON >Block Scrubbing Done >Copy Done >PC = .07ff.f000.37f8 >PC = ...3878 >Decompressing Done >Size = ..0006.e440 >ttya initialized >Corrected ECC Error >ok Corrected ECC Error >-- snip -- > >Erm... that sounds like "bad memory", right ? Or is there any other >possible issue which may cause this problem ? > > Hi Roland, it looks pretty much that way. But I would never exclude potential other causes (i.e. bad external L2 cache on the cpu module), or whatever logic on the system board. Didn't you purchase it for EUR 249,- from "bausihausi" ?? No Non-DOA warranty?? I recently purchased very cheap spare memory for any potential worst case scenario. (a single X7050A 512MB Kit consisting of 4 modules: http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&ih=002&sspagename=STRK%3AMEWN%3AIT&viewitem=&item=120039890540&rd=1&rd=1 ). It's still located at the seller. I could have it forwarded to you for testing (and finding the bad module[s]) if you wish. The sb1k/sb2k is an extremely fast and reliable system normally. A true powerhouse. Only disadvantage: Well, it is a powerhouse. 
(and costs you circa EUR 1000,- per year if you run it 24x7, damn). Best and much luck, Martin ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org
Re: [osol-discuss] Some fun ("Corrected ECC Error") with my "new" Blade1000 from EBay...
Martin Bochnig wrote: > Roland Mainz wrote: > >I have some trouble with my "new" Blade1000 from EBay... the problem is > >that I am getting zillions of "Corrected ECC Error" messages and I > >cannot set "diag-switch?" to "true" (the response is always > >"diag-switch? = false". > > > >Ok... I've tried the usual "trick" and removed the NVRAM chip which > >forces the machine to run the diag stuff. The output looks like this: > >-- snip -- > >Could not read diag-switch? from NVRAM! > >Could not read diag-level from NVRAM! > >Could not read mfg-mode from NVRAM! > >Could not read security-mode from NVRAM! > >@(#)OBP 4.2.2 2001/04/26 14:59 > >Clearing TLBs Done > >Power-On Reset > >Executing Power On SelfTest [snip] > >Block Scrubbing Done > >Copy Done > >PC = .07ff.f000.37f8 > >PC = ...3878 > >Decompressing Done > >Size = ..0006.e440 > >ttya initialized > >Corrected ECC Error > >ok Corrected ECC Error > >-- snip -- > > > >Erm... that sounds like "bad memory", right ? Or is there any other > >possible issue which may cause this problem ? [snip] > it looks pretty much that way. > But I would never exclude potential other causes (i.e. bad external L2 cache > on the cpu module), or whatever logic on the system board. > Didn't you purchase it for EUR 249,- from "bausihausi" ?? > No Non-DOA warranty?? Erm... it's from QuandElektronik (AFAIK the same as "bausihausi") but purchased directly since it contains a 2nd CPU... more or less an attempt to recover from http://mail.opensolaris.org/pipermail/ksh93-integration-discuss/2006-July/000643.html using my last money for this year... ;-/ > I recently purchased very cheap spare memory for any potential worst case > scenario. > (a single X7050A 512MB Kit consisting of 4 modules: > http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&ih=002&sspagename=STRK%3AMEWN%3AIT&viewitem=&item=120039890540&rd=1&rd=1 > ). > It's still located at the seller. > I could have it forwarded to you for testing (and finding the bad module[s]) > if you wish. 
I am not sure whether this helps... I have (per German law or something like that) 14 days to send the machine back if it doesn't work and I already wasted eight days. Unless the memory magically appears here in Gießen within the next three days I am screwed (at least I have to pay for shipping which means I am bleeding badly and still have no SPARC at home to push my projects forward beyond what we currently have). > The sb1k/sb2k is an extremely fast and reliable system normally. > A true powerhouse. > Only disadvantage: Well, it is a powerhouse. > > (and costs you circa EUR 1000,- per year if you run it 24x7, damn). Known problem. But somehow I need such a machine at home... > Best and much luck, Thanks... I need it... ;-( Bye, Roland -- __ . . __ (o.\ \/ /.o) [EMAIL PROTECTED] \__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer /O /==\ O\ TEL +49 641 7950090 (;O/ \/ \O;)
Re: [osol-discuss] Some fun ("Corrected ECC Error") with my "new" Blade1000 from EBay...
Roland Mainz wrote: > Erm... it's from QuandElektronik (AFAIK the same as "bausihausi") but > >purchased directly since it contains a 2nd CPU... more or less an >attempt to recover from >http://mail.opensolaris.org/pipermail/ksh93-integration-discuss/2006-July/000643.html >using my last money for this year... ;-/ > > I know them (U60). Make some real noise and they _have_ to collect it from you for free (they use UPS and will send you a freeway ticket, as they did for my U60's bad DVD drive). > > >>I recently purchased very cheap spare memory for any potential worst case >>scenario. >>(a single X7050A 512MB Kit consisting of 4 modules: >>http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&ih=002&sspagename=STRK%3AMEWN%3AIT&viewitem=&item=120039890540&rd=1&rd=1 >> ). >>It's still located at the seller. >>I could have it forwarded to you for testing (and finding the bad module[s]) >>if you wish. >> >> > >I am not sure whether this helps... I have (per german law or something >like that) 14 days to send the machine back if it doesn't work and I >already wasted eight days. Unless the memory magically appears here in >Gießen within the next three days I am screwed (at least I have to pay >for shipping which means I am bleeding badly and still have no SPARC at >home to push my projects forward beyond what we currently have). > > > Mhh. A pity. I can send you a cheap slow old U5. Or an almost usable U10 333MHz_2MB! If this helps? You will that way have the system before Christmas ! Let's say EUR 33- ? I could send it out Sunday (Berlin). Martin ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org
Re: [osol-discuss] SCCS source code
On Sat, 16 Dec 2006, Joerg Schilling wrote:
> I remember that there once was a note that SCCS will become OSS
> to the end of this year. I cannot find any time frame on SCCS any more.
> What is the current state?

IIRC, this was discussed recently on program-discuss. I think the source is pretty much ready for release, but there's some paperwork that needs to be done first. In other words, I think it's almost there!

--
Rich Teer, SCNA, SCSA, SCSECA, OpenSolaris CAB member
President, Rite Online Inc.
Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
[osol-discuss] Re: Re: Re: Solaris on an Ultra 10
Wow - thanks! :-) I post to this forum using the website, not the mailing list, so I don't have a note of your email address. Mine is apattison at gmail dot com if you want to email me. Thanks again! Andrew. This message posted from opensolaris.org
[osol-discuss] Cluster size reported "Install Solaris 10 Software" installing DVD 1106
In the GUI of the "Install Solaris 10 Software" installer on the 10u3 (11/06) DVD, clicking on "Entire Group plus OEM" reports a cluster size of 4000.3 MB, whereas plain "Entire Group" reports 4036.2 MB. What is the difference in the packages installed between the two? What is missing from the OEM cluster, and how can I figure this out? This message posted from opensolaris.org
[osol-discuss] SiL3112 not being detected
Hello,

Short version: I have SiL 3112 cards that are listed as expected by scanpci -v, but the 3112 does not appear at all in "prtconf -D"; only the 6112 RAID controller chipset on the same card shows up.

Long version: I have a machine with a SiL3512 in it that works fine. I needed four more SATA disks, so I purchased a 3114, which did not work (and I saw suggestions that it only worked after flashing into legacy mode). I then purchased a couple of SiL 3112 based controllers that I "knew" would work. Unfortunately they don't.

The SiL 3112 is detected fine by scanpci -v:

  pci bus 0x cardnum 0x09 function 0x00: vendor 0x1095 device 0x3112
   Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller
   CardVendor 0x1095 card 0x6112 (Silicon Image, Inc. SiI 3112 SATARaid Controller)
   STATUS 0x02b0 COMMAND 0x0007
   CLASS 0x01 0x04 0x00 REVISION 0x02
   BIST 0x00 HEADER 0x00 LATENCY 0x20 CACHE 0x08
   BASE0 0xd501 addr 0xd500 I/O
   BASE1 0xd601 addr 0xd600 I/O
   BASE2 0xd701 addr 0xd700 I/O
   BASE3 0xd801 addr 0xd800 I/O
   BASE4 0xd901 addr 0xd900 I/O
   BASE5 0xee081000 addr 0xee081000 MEM
   MAX_LAT 0x00 MIN_GNT 0x00 INT_PIN 0x01 INT_LINE 0x0b

However, prtconf -D just yields:

  pci1095,6112

The 1095,6112 seems to be the RAID controller chipset on the card, which I don't care about. The 1095,3112 is not listed anywhere at all in the prtconf output. I have tried adding an entry to /etc/driver_aliases to map pci1095,6112 to "pci-ide" as well as to "ata" - no effect.

It is interesting that the scanpci entry for the 3512 does not include a CardVendor:

  pci bus 0x cardnum 0x08 function 0x00: vendor 0x1095 device 0x3512
   Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller
   STATUS 0x02b0 COMMAND 0x0007
   CLASS 0x01 0x80 0x00 REVISION 0x01
   BIST 0x00 HEADER 0x00 LATENCY 0x20 CACHE 0x08
   BASE0 0xd001 addr 0xd000 I/O
   BASE1 0xd101 addr 0xd100 I/O
   BASE2 0xd201 addr 0xd200 I/O
   BASE3 0xd301 addr 0xd300 I/O
   BASE4 0xd401 addr 0xd400 I/O
   BASE5 0xee08 addr 0xee08 MEM
   MAX_LAT 0x00 MIN_GNT 0x00 INT_PIN 0x01 INT_LINE 0x0a
   BYTE_0 0x02 BYTE_1 0x00 BYTE_2 0x00 BYTE_3 0x00

Could it be that the 3112 is somehow "hidden", from a driver point of view, behind the 6112? I am not that familiar with how PCI works.

The prtconf -D tree for the 3512 looks like this:

  pci-ide, instance #0 (driver name: pci-ide)
      ide, instance #0 (driver name: ata)
          cmdk, instance #0 (driver name: cmdk)
      ide, instance #1 (driver name: ata)
          cmdk, instance #2 (driver name: cmdk)

Is the 6112 supposed to be in the same position as the pci-ide node for the 3512?

The card is a "VSCom" card, whatever that is. I suppose it's just re-branding or something.

Any feedback would be appreciated. I am a Solaris newbie (having only done BSD/Linux in the past), so perhaps I am missing something obvious.

--
/ Peter Schuller, InfiDyne Technologies HB
PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
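When experimenting with /etc/driver_aliases entries like the ones mentioned above, it can help to confirm which driver, if any, a given PCI name is actually bound to. The following is only a rough sketch: the helper name alias_driver is made up for illustration, and it assumes the usual file layout of one `driver "alias"` pair per line.

```shell
# alias_driver ALIAS [FILE]
# Print the driver(s) bound to ALIAS in a driver_aliases-style file.
# Hypothetical helper; assumes lines of the form:  driver "alias"
alias_driver() {
    # The alias field is stored with surrounding double quotes, so compare
    # against the quoted form of the requested name.
    awk -v want="\"$1\"" '$2 == want { print $1 }' "${2:-/etc/driver_aliases}"
}
```

For example, `alias_driver pci1095,3112` would print the bound driver name, or nothing at all if no driver claims that alias.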
Re: [osol-discuss] Trouble with "sendmail" and '"Smart" relay host' in B48 ...
Roland Mainz wrote: > I have a small problem with my "sendmail" setup in Solaris 11/B48/SPARC > and can't make heads or tails out of the problem. > In Solaris 8 we have the following configuration change: > -- snip -- > --- /etc/mail/sendmail.cf_original Wed Oct 11 02:19:29 2006 > +++ /etc/mail/sendmail.cf Wed Oct 11 02:19:57 2006 > @@ -87,7 +87,7 @@ > CP. > > # "Smart" relay host (may be null) > -DS > +DSmailout.uni-giessen.de > > > # operators that cannot be in local usernames (i.e., network > indicators) > -- snip -- > This makes sure that all emails are send to the server > "mailout.uni-giessen.de" - but in B48 sendmail tries to deliver the > emails directly to the matching hosts, bypassing > "mailout.uni-giessen.de"... > > ... does anyone know what may be wrong in this case ? I forgot to post the "solution" for this problem: I had to modify "local.mc"/"local.cf" instead of "sendmail.mc"/"sendmail.cf" to get this working. After figuring this out the setup works perfectly... :-) Bye, Roland -- __ . . __ (o.\ \/ /.o) [EMAIL PROTECTED] \__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer /O /==\ O\ TEL +49 641 7950090 (;O/ \/ \O;) ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org
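Since the fix came down to which .cf file carried the `DS` line, a quick check of each candidate file may save some head-scratching. The sketch below is an illustration only: the function name smart_host is invented here, and the location of local.cf next to sendmail.cf in /etc/mail is an assumption, not something stated in the message.

```shell
# smart_host FILE
# Print the "Smart" relay host (the value of the DS macro) defined in a
# sendmail .cf file. Prints an empty line if DS is defined but null.
# Hypothetical helper for illustration.
smart_host() {
    sed -n 's/^DS//p' "$1"
}
```

Running it as `smart_host /etc/mail/sendmail.cf` and then `smart_host /etc/mail/local.cf` (path assumed) shows at a glance which file actually names the relay.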
[osol-discuss] Unkillable Processes
I'm posting this to OS Discuss because I'm not sure where it would better be posted. If there is a more appropriate list please say so.

I've run into a sad situation several times now: processes that can't be killed. In all cases it happened to be 'lighttpd'. It sits on CPU consuming cycles but I can't see what it's doing. Even DTrace is helpless.

[nicole:/] root# dtrace -F -p 4308 -n 'pid$target:::entry,pid$target :::return { trace(timestamp); }'
dtrace: failed to grab pid 4308: unanticipated system error
[nicole:/] root# dtrace -n 'profile-1000hz /pid == $target/ { @[ustack()] = count(); }' -p 4308
dtrace: failed to grab pid 4308: unanticipated system error
[nicole:/] root# ps -ef | grep 4308
root 12640 2538 0 06:15:27 pts/8 0:00 grep 4308
webservd 4308 1 7 Dec 15 ? 65:00 /opt/patch/sbin/lighttpd -f /opt/batchblue/etc/lighttpd/lighttpd.conf
[nicole:/] root# pstack 4308
pstack: cannot examine 4308: unanticipated system error
[nicole:/] root# truss -p 4308
truss: unanticipated system error: 4308
[nicole:/] root# kill -9 4308
[nicole:/] root# ps -ef | grep 4308
webservd 4308 1 8 Dec 15 ? 65:36 /opt/patch/sbin/lighttpd -f /opt/batchblue/etc/lighttpd/lighttpd.conf

No amount of tinkering, destruction, or killage can make this process go away. I can't attach a debugger and can't force it to core dump.

When this happens in a Zone it makes matters worse. Attempting to reboot or halt the zone won't work because a process is still running inside of it. The zone then gets into a stuck state where it's not up, but it's still holding resources open.

So I have several questions and concerns here:

1) How is it possible for a process to get into an unkillable state?
2) Is there some kind of scheduler magic that can be done to just dump the process however hostile?
3) Is there some way we can protect zones from this sort of issue?
4) Why can't any of Solaris's dozens of observability tools get a glimpse into this thing?

Right now there is only one solution to these annoying problems: reboot the box and avoid using lighttpd wherever possible. I'd gladly patch lighttpd to keep this issue from happening, but without some debugging I can't be sure of what code is to blame.

I've found several bugs in the database that look similar to this in one way or another but most say to look at the comments, which sadly we can't see.

I'm open to all ideas and theories. Thanks.

benr.

This message posted from opensolaris.org
Re: [osol-discuss] Unkillable Processes
Hi Ben,

Ben Rockwood wrote:
> I'm posting this to OS Discuss because I'm not sure where it would better be posted. If there is a more appropriate list please say so.
>
> I've run into a sad situation several times now, processes that can't be killed. In all cases it happened to be 'lighttpd'. It sits on CPU consuming cycles but I can't see what its doing. Even Dtrace is helpless.

I'm sorry to say I'm familiar with the problem, but happy to say it's already been encountered and there is a solution. I don't know when it will make it in to a future build just yet, but it was a problem in sockfs with the sendfilev syscall.

The bugid is 6455727.

A workaround for now would be to modify your lighttpd.conf to have lighty use the writev syscall instead of sendfilev. Let me know if you need more info. More below.

> [nicole:/] root# dtrace -F -p 4308 -n 'pid$target:::entry,pid$target :::return { trace(timestamp); }'
> dtrace: failed to grab pid 4308: unanticipated system error
> [nicole:/] root# dtrace -n 'profile-1000hz /pid == $target/ { @[ustack()] = count(); }' -p 4308
> dtrace: failed to grab pid 4308: unanticipated system error
> [nicole:/] root# ps -ef | grep 4308
> root 12640 2538 0 06:15:27 pts/8 0:00 grep 4308
> webservd 4308 1 7 Dec 15 ? 65:00 /opt/patch/sbin/lighttpd -f /opt/batchblue/etc/lighttpd/lighttpd.conf
> [nicole:/] root# pstack 4308
> pstack: cannot examine 4308: unanticipated system error
> [nicole:/] root# truss -p 4308
> truss: unanticipated system error: 4308
> [nicole:/] root# kill -9 4308
> [nicole:/] root# ps -ef | grep 4308
> webservd 4308 1 8 Dec 15 ? 65:36 /opt/patch/sbin/lighttpd -f /opt/batchblue/etc/lighttpd/lighttpd.conf
>
> No amount of tinkering, destruction, or killage can make this process go away. I can't attach a debugger and can't force it to core dump.
>
> When this happens in a Zone it makes matters worse. Attempting to reboot or halt the zone won't work because a process is still running inside of it. The zone then gets into this stuck state where its not up, but it still holding resources open.
>
> So I have several questions and concerns here:
>
> 1) How is it possible for a process to get into an unkillable state?

Well, if something isn't handling the signal correctly.

> 2) Is there some kind of scheduler magic that can be done to just dump the process however hostile?

Not in this case to my knowledge. Others may be able to jump in on this one. All of the normal utilities (gcore, preap, etc.) assume signals are handled correctly.

> 3) Is there some way we can protect zones from this sort of issue?

Yes, by being sure the signal is correctly handled in sockfs. Like I said, we have a fix already, but it isn't putback to the OpenSolaris kernel code yet.

> 4) Why can't any of Solaris's dozens of observability tools get a glimpse into this thing?

Actually, a number of tools likely can, but they can't rely on the pid provider in this case. You could, for instance, get an idea of what the various kernel threads are and what their stacks are as one example.

> Right now there is only one solution to these annoying problems: reboot the box and avoid using lighttpd where ever possible. I'd gladly patch lighttpd to keep this issue for happening but without some debugging I can't be sure of what code is to blame.

There is a workaround that would allow you to use sendfilev() and get a bump in performance at the same time. It turns out there is an ON private flag to sendfilev that allows the syscall to be non-blocking. I patched my lighttpd and it solved that problem. Do a search for SFV_NOWAIT if you want to see the details. However, you should be warned, it's not a stable interface. It may disappear in a future release, burn your house down and eat your cat-- so use it at your own risk.

The best workaround for now is to set up your lighttpd.conf to use the writev() backend. network.backend="writev" or something like that.

> I've found several bugs in the database that look similar to this in one way or another but most say to look at the comments, which sadly we can't see.
>
> I'm open to all ideas and theories. Thanks.
>
> benr.
>
> This message posted from opensolaris.org
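To check whether a given lighttpd.conf already selects the writev backend suggested above, something along these lines could be used. Treat it as a sketch under assumptions: as far as I know the directive in lighttpd 1.4.x is spelled `server.network-backend` (the message itself only says "network.backend ... or something like that"), and the function name network_backend is made up for this example.

```shell
# network_backend FILE
# Print the value of server.network-backend set in a lighttpd config file.
# Prints nothing if the directive is absent or commented out.
# Hypothetical helper; the directive name is an assumption to verify
# against the documentation of the lighttpd version in use.
network_backend() {
    sed -n 's/^[[:space:]]*server\.network-backend[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

With the config path from the thread, `network_backend /opt/batchblue/etc/lighttpd/lighttpd.conf` printing `writev` would indicate the workaround is in place; no output would mean lighttpd is picking its default backend.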
Re: [osol-discuss] Unkillable Processes
Hey Ben,

> I'm posting this to OS Discuss because I'm not sure where it would better be
> posted. If there is a more appropriate list please say so.
>
> I've run into a sad situation several times now, processes that can't be
> killed. In all cases it happened to be 'lighttpd'. It sits on CPU consuming
> cycles but I can't see what its doing. Even Dtrace is helpless.
>
> [nicole:/] root# dtrace -F -p 4308 -n 'pid$target:::entry,pid$target
> :::return { trace(timestamp); }'
> dtrace: failed to grab pid 4308: unanticipated system error
> [nicole:/] root# dtrace -n 'profile-1000hz /pid == $target/ { @[ustack()] =
> count(); }' -p 4308
> dtrace: failed to grab pid 4308: unanticipated system error
> [nicole:/] root# ps -ef | grep 4308
> root 12640 2538 0 06:15:27 pts/8 0:00 grep 4308
> webservd 4308 1 7 Dec 15 ? 65:00 /opt/patch/sbin/lighttpd
> -f /opt/batchblue/etc/lighttpd/lighttpd.conf
> [nicole:/] root# pstack 4308
> pstack: cannot examine 4308: unanticipated system error
> [nicole:/] root# truss -p 4308
> truss: unanticipated system error: 4308
>
> [nicole:/] root# kill -9 4308
> [nicole:/] root# ps -ef | grep 4308
> webservd 4308 1 8 Dec 15 ? 65:36 /opt/patch/sbin/lighttpd
> -f /opt/batchblue/etc/lighttpd/lighttpd.conf
>
> No amount of tinkering, destruction, or killage can make this process go
> away. I can't attach a debugger and can't force it to core dump.

You are no doubt spinning in the kernel, and this is no doubt a kernel bug. To figure out what's going on, do this:

  # dtrace -n profile-1234hz'/pid == 4308/{@[stack()] = count()}'

That should tell you pretty clearly where you are and what's going on...

- Bryan

--
Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Re: [osol-discuss] Unkillable Processes
Hello Ben,

I just read all this and it's spooky that _anything_ can get stuck like this, to the point where pulling the plug is the only option. I don't think that a kernel module gets loaded by this or is needed.

If it helps at all (and I don't know if it does), we have a release from just a few days ago:

  Dec 14 12:49 lighttpd-1.4.13,REV=2006.12.14

See: http://www.blastwave.org/packages.php/lighttpd

I don't know if that will help *after* you reboot and remove whatever you have there. The real problem is that the tools to kill that process are not working here at all, and that is more frightening than anything else.

Do you see a pile of threads?

bash-3.00$ ps -ecflL -o user,pid,ppid,vsz,uid,s,lwp,rss,wchan,args | grep http
root 8225 1 156600 0 S 1 6208 30003e72b06 /opt/csw/apache/bin/httpd
nobody 8226 8225 168184 60001 S 1 18200 300050c6b54 /opt/csw/apache/bin/httpd
nobody 8227 8225 169064 60001 S 1 19240 3000735e514 /opt/csw/apache/bin/httpd
nobody 8234 8225 168112 60001 S 1 18240 3000ac0f314 /opt/csw/apache/bin/httpd
nobody 8228 8225 168208 60001 S 1 18688 30001c4db54 /opt/csw/apache/bin/httpd
nobody 8235 8225 168496 60001 S 1 18648 30004b12922 /opt/csw/apache/bin/httpd
nobody 8233 8225 168072 60001 S 1 18264 300050c7954 /opt/csw/apache/bin/httpd
nobody 8232 8225 168064 60001 S 1 18136 3000a1eebd4 /opt/csw/apache/bin/httpd
nobody 8231 8225 168376 60001 S 1 18384 3000304a162 /opt/csw/apache/bin/httpd
nobody 8230 8225 169032 60001 S 1 19224 300039ca654 /opt/csw/apache/bin/httpd
nobody 8229 8225 168072 60001 S 1 18192 300050c7454 /opt/csw/apache/bin/httpd

I'm just thinking out loud here.

Dennis
Re: [osol-discuss] Unkillable Processes
On Sat, Dec 16, 2006 at 10:42:40PM -0800, Matt Ingenthron wrote: > Hi Ben, > > Ben Rockwood wrote: > > >I'm posting this to OS Discuss because I'm not sure where it would better > >be posted. If there is a more appropriate list please say so. > > > >I've run into a sad situation several times now, processes that can't be > >killed. In all cases it happened to be 'lighttpd'. It sits on CPU > >consuming cycles but I can't see what its doing. Even Dtrace is helpless. > > > > > I'm sorry to say I'm familiar with the problem, but happy to say it's > already been encountered and there is a solution. I don't know when it > will make it in to a future build just yet, but it was a problem in > sockfs with the sendfilev syscall. > > The bugid is 6455727. And there you have it. To verify Matt's hypothesis, run the DTrace one-liner that I sent you; if it's the bug that Matt suggested (which certainly matches the symptoms you describe), you'll see stack traces like this one in the output: issig_forreal+0x5c0() cv_wait_sig+0x190() snf_segmap+0x450() sosendfile64+0x2a0() sendvec64+0xf0() sendfilev+0x178() syscall_trap32+0xcc() - Bryan -- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org
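For anyone wanting to repeat this kind of kernel-stack sampling against a different pid, the one-liner can be wrapped so that it stops by itself after a few seconds. This is only a sketch: the function name kstack_prog is invented, it assumes the aggregation is keyed on the kernel stack() as the traces above suggest, and the tick-based exit clause is an addition that was not part of the original one-liner.

```shell
# kstack_prog PID [SECONDS]
# Print a D program that samples the kernel stack of PID at 1234 Hz and
# exits after SECONDS (default 10), at which point dtrace prints the
# aggregated stack counts. Hypothetical helper for illustration.
kstack_prog() {
    printf 'profile-1234hz /pid == %s/ { @[stack()] = count(); } tick-%ss { exit(0); }\n' \
        "$1" "${2:-10}"
}

# Usage (as root), e.g. for the stuck lighttpd from this thread:
#   dtrace -n "$(kstack_prog 4308 10)"
```

The pid filter works on the profile probe because it only looks at whichever thread is on CPU when the probe fires, so it does not need to grab the target process the way the pid provider does.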
Re: [osol-discuss] Unkillable Processes
> On Sat, Dec 16, 2006 at 10:42:40PM -0800, Matt Ingenthron wrote: >> Hi Ben, >> >> Ben Rockwood wrote: >> >> >I'm posting this to OS Discuss because I'm not sure where it would better >> >be posted. If there is a more appropriate list please say so. >> > >> >I've run into a sad situation several times now, processes that can't be >> >killed. In all cases it happened to be 'lighttpd'. It sits on CPU >> >consuming cycles but I can't see what its doing. Even Dtrace is >> helpless. >> > >> > >> I'm sorry to say I'm familiar with the problem, but happy to say it's >> already been encountered and there is a solution. I don't know when it >> will make it in to a future build just yet, but it was a problem in >> sockfs with the sendfilev syscall. >> >> The bugid is 6455727. > > And there you have it. To verify Matt's hypothesis, run the DTrace > one-liner that I sent you; if it's the bug that Matt suggested (which > certainly matches the symptoms you describe), you'll see stack traces > like this one in the output: > >issig_forreal+0x5c0() >cv_wait_sig+0x190() >snf_segmap+0x450() >sosendfile64+0x2a0() >sendvec64+0xf0() >sendfilev+0x178() >syscall_trap32+0xcc() > gotta love that ... my gut was telling me that the process had left userland and gone into places un-killable. DTrace ... amazing. Dennis ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org
[osol-discuss] Re: Asterisk success
> > >is there any chance to actually include those > drivers in solaris > >express? the cards supported by it are very common. > has anyone talked > >to the developer about that? > > > Yes;l I think it's one of the reasons the "rh" driver > was renamed > "vfe"; they're being prepared for inclusion in > Solaris. > > Casper is there a schedule for this sort of thing somewhere (or a project name for the integration)? it would be nice to know when i don't have to make my own miniroot... This message posted from opensolaris.org ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org