Hi,
I recently decided to go with OpenSolaris for our backup storage system.
However, after a few days (or sometimes hours) the system hangs. I upgraded to
snv_131 to use ZFS deduplication and RAIDZ2, and in the meantime I have updated
to snv_132. On both builds the hangs have started a few days after the
upgrade.
Attached is a screenshot, taken through the IPMI interface, of an open top
session at the moment the system hung. The only process actively running is an
rdiff-backup process (a Python-based backup tool that makes differential
backups). The backup scripts are started by a cron job. Sometimes a run
completes and produces a backup; other times it just hangs.
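To tell a hung rdiff-backup run apart from a run that is merely slow, the cron job could call the backup through a small wrapper that enforces a time limit and records the outcome. The following is only a sketch: the function name `run_with_timeout`, the paths and the six-hour limit in the usage comment are assumptions for illustration, not what is deployed here. If the process is stuck inside the kernel, the wrapper itself may block while waiting for the child to exit.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, elapsed_seconds).

    status is "ok" (exit code 0), "failed" (non-zero exit code)
    or "timeout" (the command exceeded timeout_s seconds).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising.
        status = "timeout"
    return status, time.monotonic() - start


# Example call from a cron-driven script (hypothetical paths and limit):
#   status, elapsed = run_with_timeout(
#       ["rdiff-backup", "/data/to/backup", "/zpool1/backups/host1"],
#       timeout_s=6 * 3600,
#   )
#   print(time.strftime("%Y-%m-%d %H:%M:%S"), status, round(elapsed), "s")
```

Logging one line per run gives a timeline that can be compared against the times the system stopped responding.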
There are also occasional SSH and SFTP sessions in which people or scripts
upload, download, or delete files. At the time of the hang shown in the
screenshot no SSH sessions were running, but hangs have also happened while SSH
sessions were open.
The backup destination is the ZFS pool zpool1. zpool status shows the following
configuration for both pools:
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c7t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: zpool1
 state: ONLINE
 scrub: scrub in progress for 0h26m, 0.01% done, 7286h37m to go
config:

        NAME          STATE     READ WRITE CKSUM
        zpool1        ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            c7t0d1    ONLINE       0     0     0
            c7t0d2    ONLINE       0     0     0
            c7t0d3    ONLINE       0     0     0
            c7t0d4    ONLINE       0     0     0
            c7t0d5    ONLINE       0     0     0
            c7t0d6    ONLINE       0     0     0
            c7t0d7    ONLINE       0     0     0
            c7t1d0    ONLINE       0     0     0
            c7t1d1    ONLINE       0     0     0
            c7t1d2    ONLINE       0     0     0
            c7t1d3    ONLINE       0     0     0
            c7t1d4    ONLINE       0     0     0
        spares
          c10t3d7     AVAIL

errors: No known data errors

zfs get all
NAME    PROPERTY              VALUE                  SOURCE
zpool1  type                  filesystem             -
zpool1  creation              Wed Jan 27 10:48 2010  -
zpool1  used                  7.95T                  -
zpool1  available             12.5T                  -
zpool1  referenced            62.1K                  -
zpool1  compressratio         1.00x                  -
zpool1  mounted               yes                    -
zpool1  quota                 none                   default
zpool1  reservation           none                   default
zpool1  recordsize            128K                   default
zpool1  mountpoint            /zpool1                default
zpool1  sharenfs              off                    default
zpool1  checksum              on                     default
zpool1  compression           off                    default
zpool1  atime                 on                     default
zpool1  devices               on                     default
zpool1  exec                  on                     default
zpool1  setuid                on                     default
zpool1  readonly              off                    default
zpool1  zoned                 off                    default
zpool1  snapdir               hidden                 default
zpool1  aclmode               groupmask              default
zpool1  aclinherit            restricted             default
zpool1  canmount              on                     default
zpool1  shareiscsi            off                    default
zpool1  xattr                 on                     default
zpool1  copies                1                      default
zpool1  version               3                      -
zpool1  utf8only              off                    -
zpool1  normalization         none                   -
zpool1  casesensitivity       sensitive              -
zpool1  vscan                 off                    default
zpool1  nbmand                off                    default
zpool1  sharesmb              off                    default
zpool1  refquota              none                   default
zpool1  refreservation        none                   default
zpool1  primarycache          all                    default
zpool1  secondarycache        all                    default
zpool1  usedbysnapshots       0                      -
zpool1  usedbydataset         62.1K                  -
zpool1  usedbychildren        7.95T                  -
zpool1  usedbyrefreservation  0                      -
zpool1  logbias               latency                default
zpool1  dedup                 on                     local
zpool1  mlslabel              none                   default

Another issue is that scrubbing takes forever. I started a scrub last week, and
I believe there is already a known bug involving scrubs and the deduplication
feature. The system also hung before I started the scrub, so I don't think the
scrub is the cause. I can't stop the scrub either: the command just hangs.
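As a rough sanity check on the scrub estimate in the zpool status output, the sketch below parses the "scrub in progress" line and computes the progress implied by the elapsed and remaining times. It assumes the remaining time is a plain linear extrapolation, which I have not verified; the function names are my own.

```python
import re

# Matches e.g. "scrub in progress for 0h26m, 0.01% done, 7286h37m to go"
SCRUB_RE = re.compile(
    r"scrub in progress for (\d+)h(\d+)m, ([\d.]+)% done, (\d+)h(\d+)m to go"
)


def parse_scrub_line(line):
    """Return (elapsed_min, percent_done, remaining_min), or None if no match."""
    m = SCRUB_RE.search(line)
    if m is None:
        return None
    eh, em, pct, rh, rm = m.groups()
    return int(eh) * 60 + int(em), float(pct), int(rh) * 60 + int(rm)


def implied_percent_done(elapsed_min, remaining_min):
    """Progress in percent implied by elapsed and remaining time,
    assuming the scrub proceeds at a constant rate."""
    return 100.0 * elapsed_min / (elapsed_min + remaining_min)


line = " scrub: scrub in progress for 0h26m, 0.01% done, 7286h37m to go"
elapsed, pct, remaining = parse_scrub_line(line)
print(elapsed, pct, remaining)
print(implied_percent_done(elapsed, remaining))
print(remaining / (60 * 24), "days to go")
```

Under that assumption the 0.01% shown is a rounded value of roughly 0.006%, and the remaining time works out to roughly 300 days.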
The hardware is:
System Configuration: Supermicro X8DT3
BIOS Configuration: American Megatrends Inc. 080015 09/24/2009
BMC Configuration: IPMI 1.5 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 2
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 1

==== Memory Device Sockets ================================

Type        Status Set Device Locator      Bank Locator
----------- ------ --- ------------------- ----------------
other       in use 0   P1-DIMM1A           BANK0
other       empty  0   P1-DIMM1B           BANK1
other       in use 0   P1-DIMM2A           BANK2
other       empty  0   P1-DIMM2B           BANK3
other       in use 0   P1-DIMM3A           BANK4
other       empty  0   P1-DIMM3B           BANK5
other       in use 0   P2-DIMM1A           BANK6
other       empty  0   P2-DIMM1B           BANK7
other       in use 0   P2-DIMM2A           BANK8
other       empty  0   P2-DIMM2B           BANK9
other       in use 0   P2-DIMM3A           BANK10
other       empty  0   P2-DIMM3B           BANK11

==== On-Board Devices =====================================

==== Upgradeable Slots ====================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
1   available PCI              PCI#1
2   in use    PCI Express      PCI-E#2
3   available PCI              PCI#3
4   in use    PCI Express      PCI#4
5   available PCI Express      PCI-E#5
6   in use    PCI Express      PCI-E#6

All twelve 2TB SATA disks in the zpool1 array are passed through on a SAS
backplane connected to an Areca 1640 controller; the spare sits on another
Areca 1640 controller. The root pool is on the same controller, on 2 Seagate
500GB disks in RAID1 (done by the controller).
I don't know what other information you would need to troubleshoot this, but I
don't think this is normal behavior, even for a developer release.
--
This message posted from opensolaris.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: solaris-crash.jpg
Type: image/jpeg
Size: 154960 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/opensolaris-help/attachments/20100215/2cdb54d8/attachment-0001.jpg>