Thank you so much, Keith!

I have yet to check everything you advised, but I already know much more
about the SmartOS boot process than I ever did.

Best regards, Valentin Zaretsky



On Sun, Jul 20, 2014 at 2:53 AM, Keith Wesolowski <
[email protected]> wrote:

> On Sat, Jul 19, 2014 at 11:19:03PM +0300, Valentine Zaretsky via
> smartos-discuss wrote:
>
> > SmartOS hung strangely: SmartOS itself, the native VMs and the KVM VMs
> > continued responding to ping on their IPs, but nothing else worked.
> >
> > After a hardware restart I cannot log in to the system: after accepting the
> > root password it waits for something and does not show a shell prompt. The
> > VMs are not running. The network interface does come up, and ssh prints the
> > banner "SSH-2.0-Sun_SSH_1.5", but it hangs after accepting the user's
> > password, the same way the console does.
> >
> > On the client, ssh -v stops at the following:
> >
> > debug1: kex: server->client aes128-ctr hmac-md5 none
> > debug1: kex: client->server aes128-ctr hmac-md5 none
> > debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<3072<8192) sent
> > debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
> >
> >
> > When I boot with noimport=true, I'm able to log in with the default
> > password and run zpool import zones, and the pool seems to be in a
> > normal, healthy state.
>
> Most, but not all, instances like this where the system seems ok until
> you try to actually log in or do something with it are actually caused
> by problems in the disk subsystem.  These problems may be transient or
> persistent, and they may be caused by software bugs or by hardware or
> firmware issues; the latter are more common.  When you boot with
> noimport and then import, can you subsequently enable all services and
> then ssh in?  What does fmadm faulty show you?  If nothing, are there
> errors occurring that are precursors to fault diagnosis?  You can find
> that out via fmdump -e.  Is there anything in the logs?  (You'll need to
> import the pool first to read them, which is also the case with the FMA
> data.)
>
> Failing all of that, I would recommend booting with -m milestone=none.
> You should be able to log in using the *platform* default root password
> (which is not the same as the one you set at setup time).  From there,
> you should be able to set up DTrace probes to monitor the progress of
> startup, then do 'svcadm milestone all' to start all the services.  DO
> NOT LOG OUT OF THE CONSOLE!  You will need it to monitor and debug the
> problem.  If all services (except of course console-login) seem to come
> up normally, you can then use your favourite tools -- DTrace, truss,
> mdb, etc. -- to debug the sshd server when you try to log in.  You'll
> likely need to iterate a few times to narrow your search for the problem
> as your understanding improves.
>
> This is a naive brute-force approach to debugging that almost always
> yields progress of some kind, even if it's negative progress.  If you
> can't learn anything at all this way, a last-ditch option (which likely
> won't work if the problem is with the disks or HBA) is to generate an
> NMI, which will cause the system to panic and create a crash dump.  If
> you then boot and import the pool, you should be able to run savecore to
> grab the dump, which can then be analysed to better understand why
> things were hanging.  How to generate an NMI is hardware-specific, and
> most desktop or consumer-type systems don't support it.  Among those
> that do, the most common way is to issue the IPMI 'chassis power diag'
> command remotely using ipmitool.  We ship this tool, and it's widely
> available for POSIX-type OSs.  If your system doesn't have a BMC,
> or that doesn't work, consult your vendor-supplied docs.
>


