[zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
Have a peculiar problem that I haven't seen before.

When starting a system that has about 35-40 zones on it, we occasionally
see that one of the zones doesn't come up properly. You can log into the
zone, but none of the /etc/rc3.d scripts have been run.

/var/adm/messages is completely empty, and running who -r to see the
run level doesn't report anything.

# who -r
run-level Dec 1 09:17 last=

Anyone else seen anything similar? We are running Solaris 10 update 9.

Regards,
Derek
___
zones-discuss mailing list
zones-discuss@opensolaris.org

Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.

For 30-40 zones:
How much RAM does the host have? What kind of CPU, and how many?
Is everything on ZFS? What storage/HDD holds the zone roots?
Regards



--
Hung-Sheng Tsao Ph D.
Founder & Principal
HopBit GridComputing LLC
cell: 9734950840
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
System has 72GB RAM
Xeon CPU - 2 socket - 4 core - 16 thread

The zone root is on a UFS filesystem on its own drive, separate from the OS.

Derek


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Mike Gerdts
On Thu 01 Dec 2011 at 10:39AM, Derek McEachern wrote:
> Have a peculiar problem that I haven't seen before.
> 
> When starting a system that has about 35 - 40 zones on it occasionally we
> see that one of the zones doesn't come up properly. You can log into the
> zone but none of the /etc/rc3.d scripts have been run.
> 
> /var/adm/messages is completely empty and when running who -r to see the
> run level it doesn't report anything.

Take a look at the output of svcs -x.  Most likely you have a service
that svc:/milestone/multi-user-server:default depends on (directly or
indirectly) that has timed out and as such is in maintenance.  Because
the dependency is not satisfied, this milestone doesn't come up so the
rc3 scripts are not run.
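
A minimal sketch of that check, run from the global zone. The zone name in the usage note is hypothetical, and the DRY_RUN switch is an addition for illustration so the commands can be previewed on a machine without the Solaris tools.

```shell
#!/bin/sh
# Sketch: see why the multi-user-server milestone (which runs the rc3 scripts)
# may not have come online inside a zone. Zone names passed in are hypothetical.
MILESTONE="svc:/milestone/multi-user-server:default"

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_milestone() {
    zone="$1"
    run zlogin "$zone" svcs -x                 # explain services that are not running
    run zlogin "$zone" svcs -d "$MILESTONE"    # services the milestone depends on
    run zlogin "$zone" svcs -l "$MILESTONE"    # milestone state and dependency details
}
```

For example, DRY_RUN=1 check_milestone web01 prints the three zlogin commands; without DRY_RUN they are executed. Anything listed by svcs -d that is not online is a candidate for the unsatisfied dependency.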

My guess is the timeout is because so many zones are starting at once
that the disks are being thrashed.  The resulting I/O backlog slows down
the startup of services, which leads to timeouts, which lead to some
services failing to even try to start.

A Google search and a 5-second read suggest that this link may help with
adjusting the timeout of services that need a longer one:

http://www.runningunix.com/2009/01/changing-timeouts-on-smf-services/
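
The linked article is not reproduced here. As a rough sketch of the general approach: a service's start-method timeout is held in its start/timeout_seconds property, which can be changed with svccfg and applied with svcadm refresh. The FMRI and the value in the usage note are placeholders, DRY_RUN is an illustrative addition, and whether the property sits on the service or on the instance varies, so it is worth reading it back with svcprop first.

```shell
#!/bin/sh
# Sketch: raise the start-method timeout of one SMF service.
# The FMRI and the number of seconds are supplied by the caller.

# Execute a command, or only print it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

set_start_timeout() {
    fmri="$1"
    seconds="$2"
    # Reject anything that is not a plain non-negative integer.
    case "$seconds" in
        ''|*[!0-9]*) echo "timeout must be a whole number of seconds" >&2; return 1 ;;
    esac
    run svcprop -p start/timeout_seconds "$fmri"    # show the current value
    run svccfg -s "$fmri" setprop start/timeout_seconds = count: "$seconds"
    run svcadm refresh "$fmri"                      # make the new value take effect
}
```

Usage would look like set_start_timeout svc:/network/example:default 300, where the FMRI is whichever service svcs -x reports as having timed out.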

-- 
Mike Gerdts
Solaris Core OS / Zones http://blogs.oracle.com/zoneszone/


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
Thanks Mike.

The more I look at this, the more I think it is load related. svcs -x only
shows that the LP print server is not running, which I don't think has any
impact on what I'm seeing.

As for who not reporting what I would expect, I tracked that down to someone
installing the GNU tools in /usr/local/bin and then setting the default PATH
to reference those before /bin/ :-(

/bin/who -r shows the zone is at run level 3.
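
A small sketch for spotting this kind of shadowing. The helper name which_all is made up for illustration; it only walks PATH and prints every match in lookup order.

```shell
#!/bin/sh
# Sketch: list every executable called "$1" found on PATH, in lookup order.
# The first line printed is the one the shell would run for a bare command name.
which_all() {
    name="$1"
    old_ifs="$IFS"
    IFS=:
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.          # an empty PATH entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            echo "$dir/$name"
        fi
    done
    IFS="$old_ifs"
}
```

On a zone set up as described, which_all who would presumably list the copy under /usr/local/bin ahead of /bin/who. Calling /bin/who -r by absolute path sidesteps the ordering entirely.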

Looking at /var/svc/log/milestone-multi-user-server:default.log I can see
that some of the other services have most likely not completed before it
tries to run the rc scripts. It appears that the /usr filesystem hasn't yet
been mounted read/write and the appstart script is logging an error that
indicates rpc services are not completely running.

Executing legacy init script "/etc/rc3.d/S98apache".
(30)Read-only file system: httpd: could not open error log file
/usr/local/apache2/logs/error_log.
Unable to open logs
Legacy init script "/etc/rc3.d/S98apache" exited with return code 0.
Executing legacy init script "/etc/rc3.d/S99appstart".
ERROR: Unable to contact any server
Legacy init script "/etc/rc3.d/S99appstart" exited with return code 0.
[ Dec 1 09:17:13 Method "start" exited with status 0 ]

We have a process in place that only starts 3 zones at one time so we are
not doing all 40 at once but it could be that with this hardware even
trying 3 at a time is too much and we may need to drop to 2.
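
One way to throttle by readiness rather than by count alone is sketched below. This is not the existing start process, which is not shown in this thread. It assumes it runs from the global zone, and the batch size, polling interval, retry count and the ZONEADM/ZLOGIN override variables are all illustrative.

```shell
#!/bin/sh
# Sketch: boot zones one batch at a time and wait for each batch to reach the
# multi-user-server milestone before starting the next batch.
ZONEADM="${ZONEADM:-zoneadm}"
ZLOGIN="${ZLOGIN:-zlogin}"
MILESTONE="svc:/milestone/multi-user-server:default"

# Succeeds once the milestone is online in zone $1; gives up after $2 polls.
wait_for_zone() {
    w_zone="$1"; w_tries="$2"
    while [ "$w_tries" -gt 0 ]; do
        w_state=$($ZLOGIN "$w_zone" svcs -H -o state "$MILESTONE" 2>/dev/null)
        [ "$w_state" = "online" ] && return 0
        w_tries=$((w_tries - 1))
        sleep "${POLL_SECONDS:-5}"
    done
    return 1
}

# usage: boot_in_batches BATCH_SIZE zone...
boot_in_batches() {
    size="$1"; shift
    batch=""; count=0
    for zone in "$@" ""; do            # the trailing "" flushes the last partial batch
        if [ -n "$zone" ]; then
            $ZONEADM -z "$zone" boot || echo "boot failed: $zone" >&2
            batch="$batch $zone"; count=$((count + 1))
        fi
        if [ "$count" -ge "$size" ] || { [ -z "$zone" ] && [ "$count" -gt 0 ]; }; then
            for z in $batch; do
                wait_for_zone "$z" "${MAX_POLLS:-60}" || echo "not ready: $z" >&2
            done
            batch=""; count=0
        fi
    done
}
```

For example, boot_in_batches 3 zone01 zone02 zone03 zone04 boots three zones, waits for all three milestones, then boots the fourth. Note that the milestone being online only shows that the rc scripts ran, not that they succeeded.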

Derek


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Ian Collins

On 12/ 2/11 06:07 AM, Derek McEachern wrote:

> System has 72GB RAM
> xeon cpu - 2 socket - 4 core - 16 thread
>
> zoneroot is on ufs filesystem on its own drive, separate from OS.



That (UFS) is a strange choice for a recent Solaris 10 version.  You 
lose the useful zones/ZFS features such as cloning.


--
Ian.



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.

It seems that you could either
1) improve your rc script to check Apache's other dependencies
or
2) run Apache under SMF, which checks its dependencies.
My 2c
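
A minimal sketch of option 1, assuming the precondition to wait for is that the Apache log directory (the path from the error message) has become writable. The helper names wait_until and dir_writable are made up for illustration.

```shell
#!/bin/sh
# Sketch of a guard for a legacy rc script such as /etc/rc3.d/S98apache:
# poll until a precondition holds instead of failing on the first attempt.

# usage: wait_until TRIES DELAY command [args...]
wait_until() {
    wu_tries="$1"; wu_delay="$2"; shift 2
    while [ "$wu_tries" -gt 0 ]; do
        "$@" && return 0
        wu_tries=$((wu_tries - 1))
        [ "$wu_tries" -gt 0 ] && sleep "$wu_delay"
    done
    return 1
}

# True when directory $1 exists and can be written to (not mounted read-only).
dir_writable() {
    probe="$1/.rc_probe.$$"
    ( : > "$probe" ) 2>/dev/null && rm -f "$probe"
}
```

In the rc script, a line such as wait_until 30 2 dir_writable /usr/local/apache2/logs || exit 1 placed ahead of the existing start command would wait up to about a minute. Exiting non-zero on failure also makes the problem visible in the milestone log, where the scripts above were recorded as returning 0.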



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Ian Collins

On 12/ 2/11 05:39 AM, Derek McEachern wrote:

> Have a peculiar problem that I haven't seen before.
>
> When starting a system that has about 35 - 40 zones on it occasionally
> we see that one of the zones doesn't come up properly. You can log
> into the zone but none of the /etc/rc3.d scripts have been run.



The same zone, or a random one?

What happens if you halt one or more zones before rebooting?  Is there a 
threshold where the problem begins to occur?


--
Ian.



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
Random zone.

We've been testing to see if there is a threshold of trying to start too
many in parallel, but so far we don't see anything.

We saw the problem when starting 3 zones in parallel, but it was very
intermittent: in roughly 1 out of every 4 attempts at starting all 40 zones
we would see 1 failure. We ran some tests starting 10 zones in parallel and
so far have seen no errors. Our assumption was that if it were load related,
we would see problems when moving from 3 to 10 zones.

Derek


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
I agree, our script could certainly be improved with logic to check for
these failures and handle them, which we will probably end up doing.
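
A sketch of what such a check might look like from the global zone, assuming a missing process is a good enough sign that a zone's rc scripts failed. The pattern httpd is only an example, and the ZONEADM/PGREP override variables are illustrative.

```shell
#!/bin/sh
# Sketch: after all zones are booted, report zones where an expected process
# is missing. Run from the global zone.
ZONEADM="${ZONEADM:-zoneadm}"
PGREP="${PGREP:-pgrep}"

# usage: zones_missing_process PATTERN  -> prints one zone name per line
zones_missing_process() {
    pattern="$1"
    for zone in $($ZONEADM list); do
        [ "$zone" = "global" ] && continue       # skip the global zone itself
        if ! $PGREP -z "$zone" "$pattern" >/dev/null 2>&1; then
            echo "$zone"
        fi
    done
}
```

The zones printed by zones_missing_process httpd could then be handled however fits, for instance by rerunning their rc3 scripts or rebooting them.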

Derek


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Derek McEachern
We haven't made the jump to ZFS yet :-) We do lose some useful features
but haven't spent the time to port our stuff over to ZFS.


Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Ian Collins

On 12/ 2/11 10:30 AM, Derek McEachern wrote:
> On Thu, Dec 1, 2011 at 2:48 PM, Ian Collins wrote:
>
>> On 12/ 2/11 05:39 AM, Derek McEachern wrote:
>>
>>> Have a peculiar problem that I haven't seen before.
>>>
>>> When starting a system that has about 35 - 40 zones on it
>>> occasionally we see that one of the zones doesn't come up
>>> properly. You can log into the zone but none of the /etc/rc3.d
>>> scripts have been run.
>>
>> The same zone, or a random one?
>>
>> What happens if you halt one or more zones before rebooting?  Is
>> there a threshold where the problem begins to occur?
>
> Random zone.
>
> We've been testing to see if there is a threshold of trying to start
> too many in parallel but so far we don't see anything.
>
> We saw the problem trying to start 3 zones in parallel but it was very
> intermittent. Like 1 out of every 4 tries at starting all 40 zones we
> would see 1 failure. We ran some tests starting 10 zones in parallel
> and so far no errors. Our assumption was that if it was load related
> moving from 3 to 10 zones we would see problems.


I have several systems that start 10 or more zones and I've never seen 
any problems.


I agree with the comment elsewhere that you should be using SMF rather 
than rc scripts to start services.


It is also possible to create SMF services with the appropriate 
dependencies to start your zones in the correct order.


--
Ian.



Re: [zones-discuss] Zone Not Starting Properly?

2011-12-01 Thread Ian Collins

On 12/ 2/11 10:36 AM, Derek McEachern wrote:
> We haven't made the jump to zfs yet :-) We do lose some useful
> features but haven't spent the time to port our stuff over to use zfs.




Make the jump sooner rather than later or you will flounder on Solaris 11.

--
Ian.
