date:20061216

[osol-discuss] SCCS source code

2006-12-16 Thread Joerg Schilling

Hi,

I remember that there once was a note that SCCS would become OSS
by the end of this year. I can no longer find any time frame for SCCS.
What is the current state?

Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
___
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org

[osol-discuss] Some fun ("Corrected ECC Error") with my "new" Blade1000 from EBay...

2006-12-16 Thread Roland Mainz


Hi!



I have some trouble with my "new" Blade1000 from EBay... the problem is
that I am getting zillions of "Corrected ECC Error" messages and I
cannot set "diag-switch?" to "true" (the response is always
"diag-switch? = false").

Ok... I've tried the usual "trick" and removed the NVRAM chip which
forces the machine to run the diag stuff. The output looks like this:
-- snip --
Could not read diag-switch? from NVRAM!  
Could not read diag-level from NVRAM!  
Could not read mfg-mode from NVRAM!  
Could not read security-mode from NVRAM! 
@(#)OBP 4.2.2 2001/04/26 14:59
Clearing TLBs Done
Power-On Reset
Executing Power On SelfTest
{0}
{0}@(#)POST, v4.2.2  04/26/2001 07:23 PM
{0}Soft POR to the whole system
{1}Soft POR to the whole system
{0}* Configure I2C controller 0
{0}* Configure I2C controller 1
{0}* I2C Controller Loopback Test
{0}* Read JTag IDs of all ASICs
{0} BBC JTag ID: 1483203b
{0} SCSIJTag ID: 15060045
{0} I chip  JTag ID: d1e203b
{0} RIO JTag ID: 13e5d03b
{0} Schizo  JTag ID: 1424c06d
{0} CPMSJTag ID: 1142903b
{0} CPMSJTag ID: 1142903b
{0} CPMSJTag ID: 1142903b
{0} CPMSJTag ID: 1142903b
{0} CPMSJTag ID: 1142903b
{0} CPMSJTag ID: 1142903b
{0}* Read JTag ID of FCAL
{0} FC-AL   JTag ID: 1000a12f
{0}* Probing Seeprom on DIMMs and CPU modules
{0}WARNING: DIMM 1 missing
{0}WARNING: DIMM 3 missing
{0}WARNING: DIMM 5 missing
{0}WARNING: DIMM 7 missing
{0}CPU0 Sensor package temperature 20 oC
{0}CPU1 Sensor package temperature 20 oC
{0}WARNING: Temperature sensor on UPA0 missing
{0}WARNING: Temperature sensor on UPA1 missing
{0}ERROR: TEST = * Probing Seeprom on DIMMs and CPU modules TESTID = 96
{0}H/W under test = I2C/Serial Proms
{0} Slave not responded
{0} Cannot read socketed seeprom U2101
{0} I2C bus 0
{0} I2C address a0
{0}* Probing Seeprom on DIMMs and CPU modules FAILED
{0}POST failed
{0}POST_END

Could not read diag-switch? from NVRAM!  
Could not read diag-level from NVRAM!  
Could not read mfg-mode from NVRAM!  
Could not read security-mode from NVRAM! 
@(#)OBP 4.2.2 2001/04/26 14:59
Clearing TLBs Done
POST Results: Cpu 0
  %o0  0000.0000.0000.0001
  %o1  0000.07ff.f015.06d0
  %o2  0000.0000.0000.0000
POST Results: Cpu 1
  %o0  0000.0000.0000.0001
  %o1  0000.07ff.f015.0730
  %o2  0000.0000.0000.0000
Membase: 0000.0000.0000.0000
MemSize: 0000.0000.0010.0000
Init CPU arrays Done
Init E$ tags Done
Setup TLB Done
MMUs ON
Copy Done
PC = 0000.07ff.f000.37f8
PC = 0000.0000.0000.3878
Decompressing Done
Size = 0000.0000.0006.e440
ttya initialized
Start Reason: Initialize Machine
Configuring the machine:
þ
Could not read diag-switch? from NVRAM!  
Could not read diag-level from NVRAM!  
Could not read mfg-mode from NVRAM!  
Could not read security-mode from NVRAM! 
@(#)OBP 4.2.2 2001/04/26 14:59
Clearing TLBs Done
Loading Configuration
Membase: 0000.0000.0000.0000
MemSize: 0000.0000.4000.0000
Init CPU arrays Done
Init E$ tags Done
Setup TLB Done
MMUs ON
Block Scrubbing Done
Copy Done
PC = 0000.07ff.f000.37f8
PC = 0000.0000.0000.3878
Decompressing Done
Size = 0000.0000.0006.e440
ttya initialized
Corrected ECC Error
ok Corrected ECC Error
-- snip --

Erm... that sounds like "bad memory", right ? Or is there any other
possible issue which may cause this problem ?
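
When reading a long POST transcript like the one above, it can help to pull out only the lines that POST itself flags. The following is a minimal sketch, assuming the "{cpu}WARNING:" / "{cpu}ERROR:" line shape seen in this log; the function name post_problems is made up for illustration and is not part of any Sun tool.

```python
import re

# Matches lines such as "{0}WARNING: DIMM 1 missing" (shape assumed from the log above).
POST_LINE = re.compile(r"^\{(\d+)\}\s*(WARNING|ERROR):\s*(.*)$")


def post_problems(log_text):
    """Return (cpu, severity, message) for every flagged POST line."""
    problems = []
    for line in log_text.splitlines():
        match = POST_LINE.match(line.strip())
        if match:
            cpu, severity, message = match.groups()
            problems.append((int(cpu), severity, message.strip()))
    return problems


if __name__ == "__main__":
    sample = (
        "{0}WARNING: DIMM 1 missing\n"
        "{0}CPU0 Sensor package temperature 20 oC\n"
        "{0}ERROR: TEST = * Probing Seeprom on DIMMs and CPU modules TESTID = 96\n"
    )
    for cpu, severity, message in post_problems(sample):
        print(cpu, severity, message)
```

Note that this only filters what POST reports; the "Corrected ECC Error" lines printed at the ok prompt carry no such prefix and are not matched.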



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [EMAIL PROTECTED]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)

Re: [osol-discuss] Re: Re: Solaris on an Ultra 10

2006-12-16 Thread Martin Bochnig

Andrew Pattison wrote:

>>* max supported hdd size ? --> (120GB)
>>
>>
>
>That was my next question. I didn't realise the onboard controller doesn't 
>support UDMA though - that strikes me as a bit odd, considering it must have 
>shipped with UDMA hard disks.
>
>I would be very interested in one of your magic cards. What sort of price did 
>you have in mind? I've got a qfe (quad fast ethernet) card that came with the 
>Ultra 10 if that's any use to you. ;-)
>
>Cheers
>
>Andrew.
>  
>

You are not the one-millionth, but the FIRST marTux user ever who
publicly "admits" to having booted it.
And you even praised Xorg_for_sparc and requested a new version.

You get my IDE_cmd649_with_JP0 card for free, because of this.
Contact me in private and give me your address.
You only need to pay(pal) the shipping fees.
I will test the board again to be sure it is 100% ok.

Someone has booted marTux, I cannot believe it :-)

--
Martin

http://www.martux.org/RELEASES/sparcv9/ (LiveCD/DVD for sparc)
http://www.martux.org/RELEASES/x86_and_x64/DVD/  (LiveDVD for x64/x86)


Re: [osol-discuss] Some fun ("Corrected ECC Error") with my "new" Blade1000 from EBay...

2006-12-16 Thread Martin Bochnig

Roland Mainz wrote:

>Hi!
>
>
>
>I have some trouble with my "new" Blade1000 from EBay... the problem is
>that I am getting zillions of "Corrected ECC Error" messages and I
>cannot set "diag-switch?" to "true" (the response is always
>"diag-switch? = false".
>
>Ok... I've tried the usual "trick" and removed the NVRAM chip which
>forces the machine to run the diag stuff. The output looks like this:
>-- snip --
>Could not read diag-switch? from NVRAM!  
>Could not read diag-level from NVRAM!  
>Could not read mfg-mode from NVRAM!  
>Could not read security-mode from NVRAM! 
>[snip]
>Corrected ECC Error
>ok Corrected ECC Error
>-- snip --
>
>Erm... that sounds like "bad memory", right ? Or is there any other
>possible issue which may cause this problem ?
>  
>

Hi Roland,

It looks pretty much that way.
But I would never rule out other potential causes (e.g. bad external L2 cache on 
the CPU module, or some other logic on the system board).
Didn't you purchase it for EUR 249,- from "bausihausi" ??
No Non-DOA warranty??

I recently purchased very cheap spare memory for any potential worst case 
scenario.
(a single X7050A 512MB Kit consisting of 4 modules: 
http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&ih=002&sspagename=STRK%3AMEWN%3AIT&viewitem=&item=120039890540&rd=1&rd=1
 ).
It's still located at the seller.
I could have it forwarded to you for testing (and finding the bad module[s]) if 
you wish.

The sb1k/sb2k is an extremely fast and reliable system normally.
A true powerhouse.
Only disadvantage: Well, it is a powerhouse.

(and costs you circa EUR 1000,- per year if you run it 24x7, damn).


Best and much luck,
Martin





Re: [osol-discuss] Some fun ("Corrected ECC Error") with my "new" Blade1000 from EBay...

2006-12-16 Thread Roland Mainz

Martin Bochnig wrote:
> Roland Mainz wrote:
> >I have some trouble with my "new" Blade1000 from EBay... the problem is
> >that I am getting zillions of "Corrected ECC Error" messages and I
> >cannot set "diag-switch?" to "true" (the response is always
> >"diag-switch? = false".
> >
> >Ok... I've tried the usual "trick" and removed the NVRAM chip which
> >forces the machine to run the diag stuff. The output looks like this:
> >-- snip --
> >Could not read diag-switch? from NVRAM!
> >Could not read diag-level from NVRAM!
> >Could not read mfg-mode from NVRAM!
> >Could not read security-mode from NVRAM!
> >@(#)OBP 4.2.2 2001/04/26 14:59
> >Clearing TLBs Done
> >Power-On Reset
> >Executing Power On SelfTest
[snip]
> >Block Scrubbing Done
> >Copy Done
> >PC = 0000.07ff.f000.37f8
> >PC = 0000.0000.0000.3878
> >Decompressing Done
> >Size = 0000.0000.0006.e440
> >ttya initialized
> >Corrected ECC Error
> >ok Corrected ECC Error
> >-- snip --
> >
> >Erm... that sounds like "bad memory", right ? Or is there any other
> >possible issue which may cause this problem ?
[snip]
> it looks pretty much that way.
> But I would never exclude potential other causes (i.e. bad external L2 cache 
> on the cpu module), or whatever logic on the system board.
> Didn't you purchase it for EUR 249,- from "bausihausi" ??
> No Non-DOA warranty??

Erm... it's from QuandElektronik (AFAIK the same as "bausihausi") but
purchased directly since it contains a 2nd CPU... more or less an
attempt to recover from
http://mail.opensolaris.org/pipermail/ksh93-integration-discuss/2006-July/000643.html
using my last money for this year... ;-/

> I recently purchased very cheap spare memory for any potential worst case 
> scenario.
> (a single X7050A 512MB Kit consisting of 4 modules: 
> http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&ih=002&sspagename=STRK%3AMEWN%3AIT&viewitem=&item=120039890540&rd=1&rd=1
>  ).
> It's still located at the seller.
> I could have it forwarded to you for testing (and finding the bad module[s]) 
> if you wish.

I am not sure whether this helps... I have (per German law or something
like that) 14 days to send the machine back if it doesn't work and I
already wasted eight days. Unless the memory magically appears here in
Gießen within the next three days I am screwed (at least I have to pay
for shipping which means I am bleeding badly and still have no SPARC at
home to push my projects forward beyond what we currently have).

> The sb1k/sb2k is an extremely fast and reliable system normally.
> A true powerhouse.
> Only disadvantage: Well, it is a powerhouse.
> 
> (and costs you circa EUR 1000,- per year if you run it 24x7, damn).

Known problem. But somehow I need such a machine at home...

> Best and much luck,

Thanks... I need it... ;-(



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [EMAIL PROTECTED]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)

Re: [osol-discuss] Some fun ("Corrected ECC Error") with my "new" Blade1000 from EBay...

2006-12-16 Thread Martin Bochnig

Roland Mainz wrote:

>Erm... it's from QuandElektronik (AFAIK the same as "bausihausi") but
>purchased directly since it contains a 2nd CPU... more or less an
>attempt to recover from
>http://mail.opensolaris.org/pipermail/ksh93-integration-discuss/2006-July/000643.html
>using my last money for this year... ;-/

I know them (U60).
Make some real noise and they _have_ to collect it from you for free
(they use UPS and will send you a prepaid return label, as they did for my
U60's bad DVD drive).

>  
>
>>I recently purchased very cheap spare memory for any potential worst case 
>>scenario.
>>(a single X7050A 512MB Kit consisting of 4 modules: 
>>http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&ih=002&sspagename=STRK%3AMEWN%3AIT&viewitem=&item=120039890540&rd=1&rd=1
>> ).
>>It's still located at the seller.
>>I could have it forwarded to you for testing (and finding the bad module[s]) 
>>if you wish.
>>
>>
>
>I am not sure whether this helps... I have (per german law or something
>like that) 14 days to send the machine back if it doesn't work and I
>already wasted eight days. Unless the memory magically appears here in
>Gießen within the next three days I am screwed (at least I have to pay
>for shipping which means I am bleeding badly and still have no SPARC at
>home to push my projects forward beyond what we currently have).
>
>  
>

Mhh.
A pity.
I can send you a cheap, slow old U5.
Or an almost usable U10 333MHz_2MB!
Would this help? That way you would have the system before Christmas!

Let's say EUR 33,-?

I could send it out Sunday (Berlin).

Martin

Re: [osol-discuss] SCCS source code

2006-12-16 Thread Rich Teer

On Sat, 16 Dec 2006, Joerg Schilling wrote:

> I remember that there once was a note that SCCS will become OSS
> to the end of this year. I cannot find any time frame on SCCS any more.
> What is the current state?

IIRC, this was discussed recently on program-discuss.  I think the
source is pretty much ready for release, but there's some paperwork
that needs to be done first.  In other words, I think it's almost
there!

-- 
Rich Teer, SCNA, SCSA, SCSECA, OpenSolaris CAB member

.  *   * . * .* .
 .   *   .   .*
President,  * .  . /\ ( .  . *
Rite Online Inc. . .  / .\   . * .
.*.  / *  \  . .
  . /*   o \ .
Voice: +1 (250) 979-1638*   '''||'''   .
URL: http://www.rite-group.com/rich **

[osol-discuss] Re: Re: Re: Solaris on an Ultra 10

2006-12-16 Thread Andrew Pattison

Wow - thanks! :-)

I post to this forum using the website, not the mailing list, so I don't have a 
note of your email address. Mine is apattison at gmail dot com if you want to 
email me.

Thanks again!

Andrew.
 
 
This message posted from opensolaris.org

[osol-discuss] Cluster size reported "Install Solaris 10 Software" installing DVD 1106

2006-12-16 Thread John Brewer

In the GUI installer ("Install Solaris 10 Software") on the Solaris 10u3 
(11/06) DVD, the reported cluster sizes look odd: clicking on "Entire Group 
plus OEM" reports 4000.3 MB, while plain "Entire Group" reports 4036.2 MB. 
What is the difference in the packages being installed between the two? 
What is missing from the OEM cluster, and how can I figure this out?
 
 

[osol-discuss] SiL3112 not being detected

2006-12-16 Thread Peter Schuller

Hello,

Short version:

I have SiL 3112 cards that are listed as expected by scanpci -v, but the 3112 
does not appear at all in "prtconf -D", only the 6112 RAID controller chipset 
on the same card shows up.

Long version:

I have a machine with a SiL3512 in it that works fine. I needed four more SATA 
disks so I purchased a 3114 which did not work (and I saw suggestions that it 
only worked after flashing into legacy mode). I then purchased a couple of 
SiL 3112 based controllers that I "knew" would work. Unfortunately they 
don't.

The SiL 3112 is detected fine by scanpci -v:

pci bus 0x0000 cardnum 0x09 function 0x00: vendor 0x1095 device 0x3112
 Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller
 CardVendor 0x1095 card 0x6112 (Silicon Image, Inc. SiI 3112 SATARaid Controller)
  STATUS    0x02b0  COMMAND 0x0007
  CLASS     0x01 0x04 0x00  REVISION 0x02
  BIST      0x00  HEADER 0x00  LATENCY 0x20  CACHE 0x08
  BASE0     0x0000d501  addr 0x0000d500  I/O
  BASE1     0x0000d601  addr 0x0000d600  I/O
  BASE2     0x0000d701  addr 0x0000d700  I/O
  BASE3     0x0000d801  addr 0x0000d800  I/O
  BASE4     0x0000d901  addr 0x0000d900  I/O
  BASE5     0xee081000  addr 0xee081000  MEM
  MAX_LAT   0x00  MIN_GNT 0x00  INT_PIN 0x01  INT_LINE 0x0b

However, prtconf -D just yields:

   pci1095,6112

The 1095,6112 seems to be the RAID controller chipset on the card, which I 
don't care about. The 1095,3112 is not listed anywhere at all in prtconf 
output.

I have tried adding an entry to /etc/driver_aliases to map pci1095,6112 
to "pci-ide" as well as "ata" - no effect.

It is interesting that the scanpci entry for the 3512 does not include a 
CardVendor:

pci bus 0x0000 cardnum 0x08 function 0x00: vendor 0x1095 device 0x3512
 Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller
  STATUS    0x02b0  COMMAND 0x0007
  CLASS     0x01 0x80 0x00  REVISION 0x01
  BIST      0x00  HEADER 0x00  LATENCY 0x20  CACHE 0x08
  BASE0     0x0000d001  addr 0x0000d000  I/O
  BASE1     0x0000d101  addr 0x0000d100  I/O
  BASE2     0x0000d201  addr 0x0000d200  I/O
  BASE3     0x0000d301  addr 0x0000d300  I/O
  BASE4     0x0000d401  addr 0x0000d400  I/O
  BASE5     0xee080000  addr 0xee080000  MEM
  MAX_LAT   0x00  MIN_GNT 0x00  INT_PIN 0x01  INT_LINE 0x0a
  BYTE_0    0x02  BYTE_1  0x00  BYTE_2  0x00  BYTE_3  0x00

Could it be that the 3112 is somehow "hidden", from a driver point of view, 
behind the 6112? I am not that familiar with how PCI works. The prtconf -D 
tree for the 3512 looks like this:

pci-ide, instance #0 (driver name: pci-ide)
    ide, instance #0 (driver name: ata)
        cmdk, instance #0 (driver name: cmdk)
    ide, instance #1 (driver name: ata)
        cmdk, instance #2 (driver name: cmdk)

Is the 6112 supposed to be in the same position as the pci-ide node for the 
3512?
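
One possible reading of the two listings, offered as an assumption rather than something confirmed in this thread: under the Open Firmware PCI binding, a device node is normally named from its subsystem vendor and subsystem IDs when a subsystem vendor ID is present, and from its vendor and device IDs otherwise. On that reading, "pci1095,6112" would be the node for the 3112 itself, named after the CardVendor/card values, and not a second chip. A minimal sketch of the naming rule (pci_node_name is a hypothetical helper, not a Solaris interface):

```python
def pci_node_name(vendor, device, subsys_vendor=0, subsys_id=0):
    """Build an Open Firmware style PCI node name.

    Assumption: subsystem IDs win when the subsystem vendor ID is
    non-zero; otherwise the plain vendor/device IDs are used.
    """
    if subsys_vendor:
        return "pci%x,%x" % (subsys_vendor, subsys_id)
    return "pci%x,%x" % (vendor, device)


if __name__ == "__main__":
    # IDs copied from the two scanpci listings in this message.
    print(pci_node_name(0x1095, 0x3112, 0x1095, 0x6112))  # card with a CardVendor line
    print(pci_node_name(0x1095, 0x3512))                  # card without a CardVendor line
```

If this is what is happening, the node exists but no driver has bound to it, which is a different problem from the device being invisible.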

The card is a "VSCom" card, whatever that is. Suppose it's just re-branding or 
something.

Any feedback would be appreciated. I am a Solaris newbie (having only done 
BSD/Linux in the past), so perhaps I am missing something obvious.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org


Re: [osol-discuss] Trouble with "sendmail" and '"Smart" relay host' in B48 ...

2006-12-16 Thread Roland Mainz

Roland Mainz wrote:
> I have a small problem with my "sendmail" setup in Solaris 11/B48/SPARC
> and can't make heads or tails out of the problem.
> In Solaris 8 we have the following configuration change:
> -- snip --
> --- /etc/mail/sendmail.cf_original  Wed Oct 11 02:19:29 2006
> +++ /etc/mail/sendmail.cf   Wed Oct 11 02:19:57 2006
> @@ -87,7 +87,7 @@
>  CP.
> 
>  # "Smart" relay host (may be null)
> -DS
> +DSmailout.uni-giessen.de
> 
> 
>  # operators that cannot be in local usernames (i.e., network indicators)
> -- snip --
> This makes sure that all emails are sent to the server
> "mailout.uni-giessen.de" - but in B48 sendmail tries to deliver the
> emails directly to the matching hosts, bypassing
> "mailout.uni-giessen.de"...
> 
> ... does anyone know what may be wrong in this case ?

I forgot to post the "solution" for this problem:
I had to modify "local.mc"/"local.cf" instead of
"sendmail.mc"/"sendmail.cf" to get this working. After figuring this out
the setup works perfectly... :-)
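
A quick way to see which relay a given .cf file really defines is to look for its "DS" line, as in the diff quoted above. The following is a minimal sketch, assuming the file text has already been read into a string; smart_host is a made-up helper name, and which .cf file sendmail consults in a given build is not something this check can tell you.

```python
def smart_host(cf_text):
    """Return the value of the DS ("Smart" relay host) macro, or None.

    An empty string means the macro is defined but null (a bare "DS").
    If the macro is defined more than once, the last definition is used.
    """
    value = None
    for line in cf_text.splitlines():
        if line.startswith("DS"):
            value = line[2:].strip()
    return value


if __name__ == "__main__":
    original = '# "Smart" relay host (may be null)\nDS\n'
    patched = '# "Smart" relay host (may be null)\nDSmailout.uni-giessen.de\n'
    print(repr(smart_host(original)))
    print(repr(smart_host(patched)))
```

Running it against both sendmail.cf and local.cf would show whether the relay was set in only one of them.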



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [EMAIL PROTECTED]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)

[osol-discuss] Unkillable Processes

2006-12-16 Thread Ben Rockwood

I'm posting this to OS Discuss because I'm not sure where it would better be 
posted.  If there is a more appropriate list please say so.

I've run into a sad situation several times now: processes that can't be 
killed.  In all cases it happened to be 'lighttpd'.  It sits on CPU consuming 
cycles but I can't see what it's doing.  Even DTrace is helpless.

[nicole:/] root# dtrace -F -p 4308 -n 'pid$target:::entry,pid$target:::return { trace(timestamp); }'
dtrace: failed to grab pid 4308: unanticipated system error
[nicole:/] root# dtrace -n 'profile-1000hz /pid == $target/ { @[ustack()] = count(); }' -p 4308
dtrace: failed to grab pid 4308: unanticipated system error
[nicole:/] root# ps -ef | grep 4308
    root 12640  2538   0 06:15:27 pts/8   0:00 grep 4308
webservd  4308     1   7   Dec 15 ?      65:00 /opt/patch/sbin/lighttpd -f /opt/batchblue/etc/lighttpd/lighttpd.conf
[nicole:/] root# pstack 4308
pstack: cannot examine 4308: unanticipated system error
[nicole:/] root# truss -p 4308
truss: unanticipated system error: 4308

[nicole:/] root# kill -9 4308
[nicole:/] root# ps -ef | grep 4308
webservd  4308     1   8   Dec 15 ?      65:36 /opt/patch/sbin/lighttpd -f /opt/batchblue/etc/lighttpd/lighttpd.conf

No amount of tinkering, destruction, or killage can make this process go away.  
I can't attach a debugger and can't force it to core dump.

When this happens in a Zone it makes matters worse.  Attempting to reboot or 
halt the zone won't work because a process is still running inside of it.  The 
zone then gets into a stuck state where it's not up, but it's still holding 
resources open.

So I have several questions and concerns here:

1) How is it possible for a process to get into an unkillable state?
2) Is there some kind of scheduler magic that can be done to just dump the 
process however hostile?
3) Is there some way we can protect zones from this sort of issue?
4) Why can't any of Solaris's dozens of observability tools get a glimpse into 
this thing?

Right now there is only one solution to these annoying problems: reboot the box 
and avoid using lighttpd wherever possible.  I'd gladly patch lighttpd to 
keep this issue from happening, but without some debugging I can't be sure 
what code is to blame.

I've found several bugs in the database that look similar to this in one way or 
another but most say to look at the comments, which sadly we can't see.

I'm open to all ideas and theories.

Thanks.

benr.
 
 

Re: [osol-discuss] Unkillable Processes

2006-12-16 Thread Matt Ingenthron


Hi Ben,

Ben Rockwood wrote:


> I'm posting this to OS Discuss because I'm not sure where it would better be 
> posted.  If there is a more appropriate list please say so.
>
> I've run into a sad situation several times now, processes that can't be 
> killed.  In all cases it happened to be 'lighttpd'.  It sits on CPU consuming 
> cycles but I can't see what its doing.  Even Dtrace is helpless.

I'm sorry to say I'm familiar with the problem, but happy to say it's 
already been encountered and there is a solution.  I don't know when it 
will make it in to a future build just yet, but it was a problem in 
sockfs with the sendfilev syscall.


The bugid is 6455727.

A workaround for now would be to modify your lighttpd.conf to have 
lighty use the writev syscall instead of sendfilev.  Let me know if you 
need more info.


More below.


> [nicole:/] root# dtrace -F -p 4308 -n 'pid$target:::entry,pid$target :::return 
> { trace(timestamp); }'
> dtrace: failed to grab pid 4308: unanticipated system error
> [nicole:/] root# dtrace -n 'profile-1000hz /pid == $target/ { @[ustack()] = 
> count(); }' -p 4308
> dtrace: failed to grab pid 4308: unanticipated system error
> [nicole:/] root# ps -ef | grep 4308
>    root 12640  2538   0 06:15:27 pts/8   0:00 grep 4308
> webservd  4308 1   7   Dec 15 ?  65:00 /opt/patch/sbin/lighttpd -f 
> /opt/batchblue/etc/lighttpd/lighttpd.conf
> [nicole:/] root# pstack 4308
> pstack: cannot examine 4308: unanticipated system error
> [nicole:/] root# truss -p 4308
> truss: unanticipated system error: 4308
>
> [nicole:/] root# kill -9 4308
> [nicole:/] root# ps -ef | grep 4308
> webservd  4308 1   8   Dec 15 ?  65:36 /opt/patch/sbin/lighttpd -f 
> /opt/batchblue/etc/lighttpd/lighttpd.conf
>
> No amount of tinkering, destruction, or killage can make this process go away.  
> I can't attach a debugger and can't force it to core dump.
>
> When this happens in a Zone it makes matters worse.  Attempting to reboot or 
> halt the zone won't work because a process is still running inside of it.  The 
> zone then gets into this stuck state where its not up, but it still holding 
> resources open.
>
> So I have several questions and concerns here:
>
> 1) How is it possible for a process to get into an unkillable state?


Well, it can happen if something (here, sockfs in the kernel) isn't handling 
the signal correctly.


> 2) Is there some kind of scheduler magic that can be done to just dump the 
> process however hostile?

Not in this case to my knowledge.  Others may be able to jump in on this 
one.  All of the normal utilities (gcore, preap, etc.) assume signals 
are handled correctly.



> 3) Is there some way we can protect zones from this sort of issue?

Yes, by being sure the signal is correctly handled in sockfs.  Like I 
said, we have a fix already, but it isn't putback to the OpenSolaris 
kernel code yet.



> 4) Why can't any of Solaris's dozens of observability tools get a glimpse into 
> this thing?

Actually, a number of tools likely can, but they can't rely on the pid 
provider in this case.  You could, for instance, get an idea of what the 
various kernel threads are and what their stacks are as one example.



> Right now there is only one solution to these annoying problems: reboot the box 
> and avoid using lighttpd where ever possible.  I'd gladly patch lighttpd to 
> keep this issue for happening but without some debugging I can't be sure of 
> what code is to blame.

There is a workaround that would allow you to use sendfilev() and get a 
bump in performance at the same time.  It turns out there is an ON 
private flag to sendfilev that allows the syscall to be non-blocking.  I 
patched my lighttpd and it solved that problem. Do a search for 
SFV_NOWAIT if you want to see the details.


However, you should be warned, it's not a stable interface.  It may 
disappear in a future release, burn your house down and eat your cat-- 
so use it at your own risk.


The best workaround for now is to set up your lighttpd.conf to use the 
writev() backend: server.network-backend = "writev".



> I've found several bugs in the database that look similar to this in one way or 
> another but most say to look at the comments, which sadly we can't see.
>
> I'm open to all ideas and theories.
>
> Thanks.
>
> benr.



Re: [osol-discuss] Unkillable Processes

2006-12-16 Thread Bryan Cantrill


Hey Ben,

> I'm posting this to OS Discuss because I'm not sure where it would better be 
> posted.  If there is a more appropriate list please say so.
> 
> I've run into a sad situation several times now, processes that can't be 
> killed.  In all cases it happened to be 'lighttpd'.  It sits on CPU consuming 
> cycles but I can't see what its doing.  Even Dtrace is helpless.
> 
> [nicole:/] root# dtrace -F -p 4308 -n 'pid$target:::entry,pid$target 
> :::return { trace(timestamp); }'
> dtrace: failed to grab pid 4308: unanticipated system error
> [nicole:/] root# dtrace -n 'profile-1000hz /pid == $target/ { @[ustack()] = 
> count(); }' -p 4308
> dtrace: failed to grab pid 4308: unanticipated system error
> [nicole:/] root# ps -ef | grep 4308
> root 12640  2538   0 06:15:27 pts/8   0:00 grep 4308
> webservd  4308 1   7   Dec 15 ?  65:00 /opt/patch/sbin/lighttpd 
> -f /opt/batchblue/etc/lighttpd/lighttpd.conf
> [nicole:/] root# pstack 4308
> pstack: cannot examine 4308: unanticipated system error
> [nicole:/] root# truss -p 4308
> truss: unanticipated system error: 4308
> 
> [nicole:/] root# kill -9 4308
> [nicole:/] root# ps -ef | grep 4308
> webservd  4308 1   8   Dec 15 ?  65:36 /opt/patch/sbin/lighttpd 
> -f /opt/batchblue/etc/lighttpd/lighttpd.conf
> 
> No amount of tinkering, destruction, or killage can make this process go 
> away.  I can't attach a debugger and can't force it to core dump.

You are no doubt spinning in the kernel, and this is no doubt a kernel bug.
To figure out what's going on, do this:

  # dtrace -n profile-1234hz'/pid == 4308/{@[stack()] = count()}'

That should tell you pretty clearly where you are and what's going on...
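
For readers new to DTrace: a profile one-liner of this kind samples the process at a fixed rate and counts how often each distinct kernel stack shows up, so the hottest stack ends up at the bottom of the output. The sketch below imitates that counting in plain Python, assuming the usual aggregation idiom @[stack()] = count(); the frame names are placeholders, not real kernel symbols, and aggregate is a made-up helper name.

```python
from collections import Counter


def aggregate(samples):
    """Count identical stacks, in the spirit of @[stack()] = count().

    Each sample is a sequence of frame names, innermost first.
    Results are sorted by ascending count, so the most frequent
    stack comes last.
    """
    counts = Counter(tuple(stack) for stack in samples)
    return sorted(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Placeholder frames only; a real run would show kernel function names.
    samples = [
        ("frame_a", "frame_b"),
        ("frame_c",),
        ("frame_a", "frame_b"),
    ]
    for stack, count in aggregate(samples):
        print(count, " <- ".join(stack))
```

A process spinning in one kernel code path would show a single stack with nearly all of the samples.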

- Bryan

--
Bryan Cantrill, Solaris Kernel Development.   http://blogs.sun.com/bmc

Re: [osol-discuss] Unkillable Processes

2006-12-16 Thread Dennis Clarke


Hello Ben

I just read all this and it's spooky.

It is spooky that _anything_ can get stuck like this, to the point where
pulling the plug is the only option.

I don't think lighttpd loads or needs a kernel module.

If it helps at all .. and I don't know if it does ... we have a release
from just a few days ago :

Dec 14 12:49  lighttpd-1.4.13,REV=2006.12.14

see : http://www.blastwave.org/packages.php/lighttpd

I don't know if that will help *after* you reboot and remove
whatever you have there.

The real problem is that the tools to kill that process are not working here
at all, and that is more frightening than anything else.

Do you see a pile of threads?

bash-3.00$ ps -ecflL -o user,pid,ppid,vsz,uid,s,lwp,rss,wchan,args | grep http
    root  8225     1 156600     0 S  1  6208  30003e72b06 /opt/csw/apache/bin/httpd
  nobody  8226  8225 168184 60001 S  1 18200  300050c6b54 /opt/csw/apache/bin/httpd
  nobody  8227  8225 169064 60001 S  1 19240  3000735e514 /opt/csw/apache/bin/httpd
  nobody  8234  8225 168112 60001 S  1 18240  3000ac0f314 /opt/csw/apache/bin/httpd
  nobody  8228  8225 168208 60001 S  1 18688  30001c4db54 /opt/csw/apache/bin/httpd
  nobody  8235  8225 168496 60001 S  1 18648  30004b12922 /opt/csw/apache/bin/httpd
  nobody  8233  8225 168072 60001 S  1 18264  300050c7954 /opt/csw/apache/bin/httpd
  nobody  8232  8225 168064 60001 S  1 18136  3000a1eebd4 /opt/csw/apache/bin/httpd
  nobody  8231  8225 168376 60001 S  1 18384  3000304a162 /opt/csw/apache/bin/httpd
  nobody  8230  8225 169032 60001 S  1 19224  300039ca654 /opt/csw/apache/bin/httpd
  nobody  8229  8225 168072 60001 S  1 18192  300050c7454 /opt/csw/apache/bin/httpd

I'm just thinking out loud here 

Dennis


Re: [osol-discuss] Unkillable Processes

2006-12-16 Thread Bryan Cantrill

On Sat, Dec 16, 2006 at 10:42:40PM -0800, Matt Ingenthron wrote:
> Hi Ben,
> 
> Ben Rockwood wrote:
> 
> >I'm posting this to OS Discuss because I'm not sure where it would better 
> >be posted.  If there is a more appropriate list please say so.
> >
> >I've run into a sad situation several times now, processes that can't be 
> >killed.  In all cases it happened to be 'lighttpd'.  It sits on CPU 
> >consuming cycles but I can't see what its doing.  Even Dtrace is helpless.
> > 
> >
> I'm sorry to say I'm familiar with the problem, but happy to say it's 
> already been encountered and there is a solution.  I don't know when it 
> will make it in to a future build just yet, but it was a problem in 
> sockfs with the sendfilev syscall.
> 
> The bugid is 6455727.

And there you have it.  To verify Matt's hypothesis, run the DTrace
one-liner that I sent you; if it's the bug that Matt suggested (which
certainly matches the symptoms you describe), you'll see stack traces
like this one in the output:

   issig_forreal+0x5c0()
   cv_wait_sig+0x190()  
   snf_segmap+0x450()   
   sosendfile64+0x2a0() 
   sendvec64+0xf0() 
   sendfilev+0x178()
   syscall_trap32+0xcc()   
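
For anyone who wants to check captured output against this trace
mechanically, here is a minimal Python sketch. It assumes the stack was
saved as plain text, one frame per line, in the form shown above. The
helper names (frame_names, matches_signature) are made up for
illustration and are not part of DTrace or any Solaris tool. A match only
means the frames look alike; it does not prove the process is hitting
bug 6455727.

```python
# Frames from the stack trace quoted above, innermost first.
# Assumption: a stack that contains these names in this order is a
# likely match for the sendfilev hang discussed in this thread.
SIGNATURE = [
    "issig_forreal",
    "cv_wait_sig",
    "snf_segmap",
    "sosendfile64",
    "sendvec64",
    "sendfilev",
]


def frame_names(stack_text):
    """Return bare function names, one per non-empty line.

    'snf_segmap+0x450()' becomes 'snf_segmap'. Leading '>' quote
    markers and an optional 'module`' prefix are dropped.
    """
    names = []
    for line in stack_text.splitlines():
        token = line.strip().lstrip(">").strip()
        if not token:
            continue
        token = token.split("`")[-1]  # drop 'module`' if present
        names.append(token.split("+")[0].rstrip("()"))
    return names


def matches_signature(stack_text, signature=SIGNATURE):
    """True if every signature frame appears in order in the stack.

    Frames need not be adjacent; other frames may sit between them.
    """
    remaining = iter(frame_names(stack_text))
    return all(any(name == want for name in remaining) for want in signature)
```

Feeding it the trace above as a string should report a match, while an
unrelated stack should not.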

- Bryan

--
Bryan Cantrill, Solaris Kernel Development.   http://blogs.sun.com/bmc

Re: [osol-discuss] Unkillable Processes

2006-12-16 Thread Dennis Clarke


> On Sat, Dec 16, 2006 at 10:42:40PM -0800, Matt Ingenthron wrote:
>> Hi Ben,
>>
>> Ben Rockwood wrote:
>>
>> >I'm posting this to OS Discuss because I'm not sure where it would better
>> >be posted.  If there is a more appropriate list please say so.
>> >
>> >I've run into a sad situation several times now, processes that can't be
>> >killed.  In all cases it happened to be 'lighttpd'.  It sits on CPU
>> >consuming cycles but I can't see what its doing.  Even Dtrace is helpless.
>> >
>> >
>> I'm sorry to say I'm familiar with the problem, but happy to say it's
>> already been encountered and there is a solution.  I don't know when it
>> will make it in to a future build just yet, but it was a problem in
>> sockfs with the sendfilev syscall.
>>
>> The bugid is 6455727.
>
> And there you have it.  To verify Matt's hypothesis, run the DTrace
> one-liner that I sent you; if it's the bug that Matt suggested (which
> certainly matches the symptoms you describe), you'll see stack traces
> like this one in the output:
>
>    issig_forreal+0x5c0()
>    cv_wait_sig+0x190()
>    snf_segmap+0x450()
>    sosendfile64+0x2a0()
>    sendvec64+0xf0()
>    sendfilev+0x178()
>    syscall_trap32+0xcc()
>

gotta love that ... my gut was telling me that the process had left userland
and gone into places un-killable.

DTrace ... amazing.

Dennis



[osol-discuss] Re: Asterisk success

2006-12-16 Thread Peter Lees

> >is there any chance to actually include those drivers in solaris
> >express? the cards supported by it are very common. has anyone talked
> >to the developer about that?
>
> Yes; I think it's one of the reasons the "rh" driver was renamed
> "vfe"; they're being prepared for inclusion in Solaris.
>
> Casper



Is there a schedule for this sort of thing somewhere (or a project name for
the integration)? It would be nice to know when I no longer have to make my
own miniroot...
