Hi Garret,

Thanks for your help.

What do you mean by "unbundled driver"?
It's integrated in OpenSolaris.
The code was initially provided by Areca but has been integrated and modified 
since.
Please see 
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/io/scsi/adapters/arcmsr/arcmsr.c
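
If it helps to confirm which driver is actually loaded here, I can run 
something like the following and post the result (just the standard modinfo 
listing, nothing exotic):

  modinfo | grep -i arcmsr    # shows the loaded arcmsr module and its version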

I don't know whether this is the SCSI timeout you refer to, but here are the 
messages found in /var/adm/messages when we remove a drive:
Jan  8 14:29:34 nc-tanktsm arcmsr: [ID 474728 kern.notice] arcmsr1: raid volume 
was kicked out
Jan  8 14:29:34 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x1) called for target 0 lun 7
Jan  8 14:29:34 nc-tanktsm arcmsr: [ID 169690 kern.notice] arcmsr1: block 
read/write command while raidvolume missing (cmd 0a for target 0 lun 7)
Jan  8 14:29:34 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x1) called for target 0 lun 7
Jan  8 14:29:35 nc-tanktsm arcmsr: [ID 989415 kern.notice] NOTICE: arcmsr1: 
T0L7 offlined
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 169690 kern.notice] arcmsr1: block 
read/write command while raidvolume missing (cmd 0a for target 0 lun 7)
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x1) called for target 0 lun 7
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 182018 kern.warning] WARNING: arcmsr1: 
target reset not supported
Jan  8 14:29:36 nc-tanktsm last message repeated 1 time
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x0) called for target 0 lun 7
Jan  8 14:29:36 nc-tanktsm last message repeated 1 time
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 182018 kern.warning] WARNING: arcmsr1: 
target reset not supported
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x0) called for target 0 lun 7
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 169690 kern.notice] arcmsr1: block 
read/write command while raidvolume missing (cmd 0a for target 0 lun 7)
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x1) called for target 0 lun 7
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 182018 kern.warning] WARNING: arcmsr1: 
target reset not supported
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x0) called for target 0 lun 7
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 474728 kern.notice] arcmsr1: raid volume 
was kicked out

And then it goes on repeating:
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 169690 kern.notice] arcmsr1: block 
read/write command while raidvolume missing (cmd 0a for target 0 lun 7)
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x1) called for target 0 lun 7
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 182018 kern.warning] WARNING: arcmsr1: 
target reset not supported
Jan  8 14:29:36 nc-tanktsm arcmsr: [ID 832694 kern.warning] WARNING: arcmsr1: 
tran reset (level 0x0) called for target 0 lun 7

Sometimes I get:
Jan  8 14:29:46 nc-tanktsm fmd: [ID 377184 daemon.error] 
syslog-msgs-message-template

And this one too:
Jan  8 14:29:55 nc-tanktsm scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,3...@7/pci17d3,1...@0/d...@0,7 (sd8):
Jan  8 14:29:55 nc-tanktsm      SYNCHRONIZE CACHE command failed (5)

Or this one:
Jan  7 14:25:54 nc-tanktsm scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,3...@7/pci17d3,1...@0/d...@0,7 (sd8):
Jan  7 14:25:54 nc-tanktsm      i/o to invalid geometry

And then it continues to repeat the four lines shown above (roughly as many 
times as they appeared before the scsi message; I can't really tell, there 
are a lot of them).
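
Since fmd shows up in there as well, I can also gather whatever FMA recorded 
around these events and post it if that's useful. I would just use the 
standard commands:

  fmdump -eV     # raw error telemetry around the time of the removal
  fmadm faulty   # any faults FMA actually diagnosed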


As a side note, since I've read about it in some other forums: I can't see the 
disk drives in cfgadm, I can only see the controllers.
Does this mean the driver isn't well integrated with the system and/or that it 
doesn't support hotplug?
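
For what it's worth, here is what I mean (the controller names are from 
memory, so take them as examples):

  cfgadm -al    # only the controller attachment points show up, no individual disks
  format        # the individual drives are listed here, so the OS does see them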

Thanks again,
Arnaud

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of 
Garrett D'Amore
Sent: Sunday, January 10, 2010 6:20 PM
To: Arnaud Brand
Cc: [email protected]
Subject: Re: [driver-discuss] Problem with Areca 1680 SAS Controller

It does sound like a driver bug to me.  Removing the disks should cause 
SCSI timeouts, which ZFS ought to be able to cope with.  The problem is 
that if the timeouts don't happen and the jobs are left permanently 
outstanding, then you get into a hang situation.

I'm not familiar with the Areca driver -- is this an unbundled driver?

    - Garrett

Arnaud Brand wrote:
> Hi,
>
> Since no one seems to have any clues on the ZFS list and my problem 
> seems to be driver related, I'm reposting to driver-discuss in the hope that 
> someone can help.
>
> Thanks and have a nice day,
> Arnaud
>
>
> From: Arnaud Brand 
> Sent: Friday, January 8, 2010 4:55 PM
> To: [email protected]
> Cc: Arnaud Brand
> Subject: ZFS partially hangs when removing an rpool mirrored disk while having 
> some IO on another pool on another partition of the same disk
>
> Hello,
>
> Sorry for the (very) long subject but I've pinpointed the problem to this 
> exact situation.
> I know about the other threads related to hangs, but in my case there was no 
> "zfs destroy" involved, nor any compression or deduplication.
>
> To make a long story short, when 
> - a disk contains 2 partitions (p1=32GB, p2=1800GB) and 
> - p1 is used as part of a ZFS mirror of rpool and 
> - p2 is used as part of a raidz (tested raidz1 and raidz2) of tank and 
> - some serious work is underway on tank (tested write, copy, scrub),
> then physically removing the disk makes ZFS partially hang. Putting the 
> physical disk back does not help.
>
> For the long story:
>
> About the hardware:
> 1 x Intel X25-E (64GB SSD), 15x2TB SATA drives (7xWD, 8xHitachi), 2xQuadCore 
> Xeon, 12GB RAM, 2xAreca-1680 (8-port SAS controllers), Tyan S7002 mainboard.
>
> About the software/firmware:
> OpenSolaris b130 installed on the SSD drive, on the first 32 GB.
> The Areca cards are configured as JBOD and are running the latest release 
> firmware.
>
> Initial setup:
> We created a 32GB partition on all of the 2TB drives and mirrored the system 
> partition, giving us a 16-way rpool mirror.
> The rest of each 2TB drive's space was put in a second partition and used for 
> a raidz2 pool (named tank).
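> 
> For reference, the pools were put together roughly like this (commands 
> reconstructed from memory; the device names are examples only, not the 
> exact ones used here):
> 
>   zpool attach rpool c4t0d0s0 c4t0d1s0   # attach a drive's 32GB slice to the rpool mirror (repeated per drive)
>   zpool create tank raidz2 c4t0d1p2 c4t0d2p2 [...]   # raidz2 over each drive's large second partition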
>
> Problem:
> Whenever we physically removed a disk from its tray while doing some speed 
> testing on the tank pool, the system hung.
> At that time I hadn't read all the threads about ZFS hangs and couldn't 
> determine whether the system was hung or just ZFS.
> In order to pinpoint the problem, we made another setup.
>
> Second setup:
> I reduced the number of partitions in the rpool mirror down to 3 (p1 from 
> the SSD, p1 from a 2TB drive on the same controller as the SSD and p1 from a 
> 2TB drive on the other controller).
>
> Problem:
> When the system is quiet, I am able to physically remove any disk, plug it 
> back and resilver it.
> When I am putting some load on the tank pool, I can remove any disk that does 
> *not* contain the rpool mirror (I can plug it back and resilver it while the 
> load keeps running without noticeable performance impact).
>
> When I am putting some load on the tank pool, I cannot physically remove a 
> disk that also contains a mirror of the rpool without ZFS partially hanging.
> When I say partially, I mean that:
> - zpool iostat -v tank 5 freezes 
> - if I run any zpool command related to rpool, I'm stuck (zpool clear rpool 
> c4t0d7s0 for example, or zpool status rpool)
>
> I can't launch new programs, but already launched programs continue to run 
> (at least in an ssh session, since GNOME becomes more and more frozen as you 
> move from window to window).
>
> From ssh sessions:
>
> - prstat shows that only gnome-system-monitor, xorg, ssh, bash and various 
> *stat utils (prstat, fstat, iostat, mpstat) are consuming some CPU.
>
> - zpool iostat -v tank 5 is frozen (it freezes when I issue a zpool clear 
> rpool c4t0d7s0 in another session)
>
> - iostat -xn is not stuck but shows all zeroes since the very moment zpool 
> iostat froze (which is quite strange if you look at the fsstat output below). 
> NB: when I say all zeroes, I really mean it, it's not zero dot something, 
> it's zero dot zero.
>
> - mpstat shows normal activity (almost nothing since this is a test machine, 
> so only a few percent are used, but it still shows some activity and 
> refreshes correctly)
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   0    0   0  125   428  109  113    1    4    0    0   251    2   0   0  98
>   1    0   0   20    56   16   44    2    2    1    0   277   11   1   0  88
>   2    0   0  163   152   13  309    1    3    0    0  1370    4   0   0  96
>   3    0   0   19   111   41   90    0    4    0    0    80    0   0   0 100
>   4    0   0   69   192   17   66    0    3    0    0    20    0   0   0 100
>   5    0   0   10    61    7   92    0    4    0    0   167    0   0   0 100
>   6    0   0   96   191   25   74    0    4    1    0     5    0   0   0 100
>   7    0   0   16    58    6   63    0    3    1    0    59    0   0   0 100
>
> - fsstat -F 5 shows all zeroes except for the zfs line (the figures below 
> stay almost the same over time)
> new  name   name  attr  attr lookup rddir  read read  write write
> file remov  chng   get   set    ops   ops   ops bytes   ops bytes
> 0     0     0 1,25K     0  2,51K     0   803 11,0M   473 11,0M zfs
>
> - disk leds show no activity
>
> - I cannot run any other command (neither from ssh, nor from gnome)
>
> - I cannot open another ssh session (I don't even get the login prompt in 
> putty)
>
> - I can successfully ping the machine
>
> - I cannot establish a new CIFS session (the login prompt should not appear 
> since the machine is in an Active Directory domain, but when it's stuck the 
> prompt appears and I cannot authenticate; I guess it's because LDAP or 
> Kerberos or whatever can no longer be read from rpool), but an already active session 
> will stay open (last time I even managed to create a text file with a few 
> lines in it that was still there after I hard-rebooted).
>
> - after some time (an hour or so) my ssh sessions eventually got stuck too.
>
> Having worked for quite some time with OpenSolaris/ZFS (though not with so 
> many disks), I doubted the problem came from OpenSolaris, and I had already 
> opened a case with Areca's tech support, which is trying (at least that's 
> what they told me) to reproduce the problem. That was until I read about the 
> ZFS hang issues.
>
> We've found a workaround: we're going to put in one internal 2.5'' disk to 
> mirror rpool and dedicate the whole of the 2TB disks to tank, but:
> - I thought it might somehow be related to the other hang issues (and so it 
> might help developers to hear about other similar issues)
> - I would really like to rule out an OpenSolaris bug so that I can bring 
> proof to Areca that their driver has a problem and either request them to 
> correct it or request my supplier to replace the cards with working ones.
>
> I think their driver is at fault because I found a message in 
> /var/adm/messages saying "WARNING: arcmsr duplicate scsi_hba_pkt_comp(9F) on 
> same scsi_pkt(9S)" and immediately after that "WARNING: kstat rcnt == 0 when 
> exiting runq, please check". Then the system was hung. The comments in the 
> code introducing these changes tell us that drivers behaving this way could 
> panic the system (or at least that's how I understood those comments).
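> 
> For anyone who wants to check their own logs for the same pattern, something 
> like this should turn the messages up:
> 
>   grep scsi_hba_pkt_comp /var/adm/messages*
>   grep "kstat rcnt" /var/adm/messages*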
>
> I've reached my limits here. 
> If anyone has ideas on how to determine what's going on, on other information 
> I should post, or on other things to run or check, I'll be happy to try 
> them.
>
> If I can be of any help in troubleshooting the ZFS hang problems that others 
> are experiencing, I'd be happy to give a hand. 
>
> Thanks and have a nice day,
>
> Regards,
> Arnaud Brand
>
>
>
>
>
>

_______________________________________________
driver-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
