Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped
severity 629442 important
thanks

On 06/14/2011 07:21 PM, Ritesh Raj Sarraf wrote:
> Yes, that's what my suspicion is also. The per-bdi writeback mechanism
> improves this situation to a great extent, but I'm not sure whether that is
> part of the Squeeze kernel. http://lwn.net/Articles/326552/

Any update on this issue? I am lowering its severity to important, as it is a
corner case.

-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
Processed: Re: Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped
Processing commands for cont...@bugs.debian.org:

> severity 629442 important
Bug #629442 [iscsitarget] iscsitarget: ietd gives iscsi_trgt: Abort Task
errors on high disk load and iscsi connections are dropped
Severity set to 'important' from 'grave'

> thanks
Stopping processing here. Please contact me if you need assistance.

-- 
629442: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629442
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
On 06/14/2011 05:12 PM, Massimiliano Ferrero wrote:
>> Next time, when you try to test/re-create the bug, capture dstat output.
>> The default dstat output is good enough to report to us what the system
>> state was during starvation.
>
> Hello, yesterday and tonight I performed some other tests; these are the
> results:
>
> 1) It seems I am not able to reproduce the bug on a test system. The test
> system (san01) has the same processor (E5520) and amount of RAM (12 GB),
> but a smaller I/O subsystem: an 8-channel 3ware controller with an 8-disk
> RAID 5 array. The system that presents the problem (san00) has a
> 24-channel controller and a 23-disk RAID 6 array (+1 hot spare). Both
> systems are connected through the same gigabit switches.

Is the OS/kernel also the same?

> There is another hardware difference between the two environments: the
> nodes connected to san00 are high-end hardware, and their network cards
> can generate nearly 1 Gb/s of iSCSI traffic; the nodes connected to san01
> are low-end hardware, and their network cards do not exceed 300 Mb/s. So
> the system that presents the problem has a higher-performance I/O
> subsystem, and the machines generating iSCSI traffic against it can
> produce more than three times the I/O operations.

Unfortunately, dstat's default output doesn't capture VM statistics. Do you
have any idea what the VM consumption was when you saw the problem?
Assuming it is buffered I/O, your VM will soon be consumed (I see from the
dstat logs that the CPU was in I/O wait for a very long time) and then
start paging. And this is the scenario where Linux still tends to succumb.

On your faulty setup, can you try unbuffered direct I/O and see if that can
trigger the problem? I have low hopes that it will fail. To trigger
(direct) I/O, you can use the fio tool. It is already packaged in Debian.
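The direct-I/O test suggested above can be driven entirely from fio's command line; a minimal sketch, in which the target device path, block size, and run length are illustrative placeholders, not values from this report:

```shell
# Sequential writes that bypass the page cache (--direct=1), taking the
# VM/writeback path out of the picture. /dev/vg0/testlv is a hypothetical
# stand-in for one of the exported logical volumes -- point it at a
# scratch LV, as the data there will be overwritten.
fio --name=direct-write --filename=/dev/vg0/testlv \
    --rw=write --bs=512k --size=4g \
    --ioengine=libaio --direct=1 --runtime=60 --time_based
```

If the Abort Task messages do not reproduce with `--direct=1` but do with a buffered dd of similar throughput, that points at dirty-page flushing rather than raw device load.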
> At the moment I am not able to tell which of these aspects, or the sum of
> them, creates the condition for the problem.

I suspect that it's a mix of all of these.

> Unfortunately, at the moment I do not have hardware similar to the one in
> production to perform a test under the same conditions.
>
> 2) san00 presents the problem even with the deadline scheduler active on
> all logical volumes exported through iSCSI or used by the heavy-load
> operation (dd).
>
> 3) On san00 I was able to reproduce the problem under a simpler condition
> than the one I described in the first mail: just one node connected
> through iSCSI (the other node was restarting), no virtual machines running
> on the node, and the node performing one I/O-intensive operation on one of
> the LVs exported by iSCSI/LVM (an fsck on one file system). During this
> operation I launched a dd on san00, and the iSCSI connection was dropped
> after a few seconds.

I think it is the typical Linux I/O controller problem, which I believe is a
combination of the I/O scheduler and the VM subsystem.

> I am attaching 3 files: dstat output during the test, and an extract of
> /var/log/messages and /var/log/syslog. I have filtered out information for
> non-relevant services (nagios, dhcp, snmp, postfix, etc.)
> both for readability and confidentiality. ietd was running with the
> following command line:
>
>     /usr/sbin/ietd --debug=255
>
> so in the log we have debug information. The problem can be seen in syslog
> at Jun 14 01:28:53; at Jun 14 01:34:06 I turned off the node for reboot,
> and in the log there are some records regarding termination of iSCSI
> sessions. I do not see anything relevant in the ietd debug log, just a
> restart of the connections.
>
> In the dstat output, the dd operation was started around line 197 and was
> terminated at line 208 (I interrupted the operation as soon as I saw the
> problem). What I see in the dstat output is the following: for some
> seconds (about 10) dd does not generate a lot of reads and writes:
>
>     usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
>       7   2  56  35   0   0|  12M   14M|  22k   35k|   0     0 |4415   12k
>
> 12M read and 14M write, and this could be from the dd operation or the
> fsck performed through iSCSI. Then there is a burst of writes, I guess
> using the full I/O capacity of the controller and of the disks:
>
>     usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
>      35   7  35  22   0   1|8180k  325M|  38k   25k|   0     0 |6860   11k
>       2   3  59  36   0   1|3072B  541M|  20k   26k|   0     0 |5380  2747
>       3   4  64  30   0   0|5120B  473M|  21k   30k|   0     0 |4752   16k
>
> Writes of 325M, 541M, 473M, and this is exactly the moment when the
> problem arises. Could it be that the I/O operations are cached in memory
> and the problem presents itself when they are flushed to disk?

Yes, that's what my suspicion is also. The per-bdi writeback mechanism
improves this situation to a great extent, but I'm not sure whether that is
part of the Squeeze kernel. http://lwn.net/Articles/326552/

> If no pointer to a potential solution comes out of the logs, the only
> other test I can think of is upgrading to a newer kernel, but I

You can try this just to ascertain the
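The cache-then-flush suspicion above can also be probed from the VM side, by making writeback start earlier so that dirty pages drain continuously instead of in one large burst. A minimal sketch; the percentages are illustrative guesses, not values recommended anywhere in this thread:

```shell
# Start background writeback at 2% of RAM and throttle writers at 10%,
# instead of the larger kernel defaults. Run as root; the values here
# are illustrative and not tuned for this machine.
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_ratio=10
```

If the Abort Task messages stop appearing under the same dd load with lower thresholds, that would strengthen the flush-burst theory.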
What is the version of the kernel where you are seeing this problem? In my
experience, older kernels weren't good at handling huge workloads.

In case this is a starvation issue (which I suspect), there should be some
kernel messages spat out. Can you check that? Also, if possible, can you
reproduce the issue with ietd's debug mode enabled?

Ritesh

On 06/07/2011 12:04 AM, Massimiliano Ferrero wrote:
> Package: iscsitarget
> Version: 1.4.20.2-1
> Severity: grave
> Justification: renders package unusable
>
> We have a system configured as follows:
>
>     1 CPU Xeon E5520 quad core
>     12 GB RAM
>     24 SATA 1 TB disks connected to a 3ware 9650-24M controller
>
> The system (I will call it the storage server) is used, among other
> things, as an iSCSI target for a two-node Red Hat cluster (implemented
> with Debian machines): the storage server exports an iSCSI LUN that
> contains a clustered LVM volume group. The two nodes mount the iSCSI
> target, then use clvmd of Red Hat Cluster to manage concurrent access to
> the logical volumes on the clustered volume group. The nodes are used for
> virtualization with Xen, with the virtual machines stored in the shared
> volume group; this allows for live migration of the virtual machines from
> one node to the other. Each node has 32 cores and 32 GB RAM.
>
> All three machines (the storage server and the two nodes) are Debian
> based: the iSCSI target server was upgraded from lenny to squeeze one
> week ago; the two nodes are still on Debian lenny.
> Tonight and today we experienced two cases of ietd hang-up: while we were
> doing two I/O-intensive operations on the storage server we got the
> following errors:
>
> FIRST OCCURRENCE:
>
> Jun  5 23:19:19 san00 kernel: [334912.274409] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
> Jun  5 23:19:20 san00 kernel: [334912.634797] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1407374921105920 (Function Complete)
> Jun  5 23:19:28 san00 kernel: [334920.660528] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
> Jun  5 23:19:28 san00 kernel: [334920.660811] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
> Jun  5 23:19:28 san00 kernel: [334920.661105] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
> Jun  5 23:19:28 san00 kernel: [334920.661357] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1407374921105920 (Function Complete)
> Jun  5 23:19:28 san00 kernel: [334920.661570] iscsi_trgt: Logical Unit Reset (05) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
> Jun  5 23:19:58 san00 kernel: [334950.930122] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:844424984461824 (Function Complete)
> Jun  5 23:19:59 san00 kernel: [334951.602686] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:1688849914593792 (Function Complete)
> ...
> Jun  5 23:20:31 san00 kernel: [334983.635928] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
> Jun  5 23:20:41 san00 kernel: [334993.616383] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
> ...
> Jun  5 23:20:51 san00 kernel: [335003.596884] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
> Jun  5 23:21:01 san00 kernel: [335013.577396] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
> Jun  5 23:21:01 san00 kernel: [335013.577645] iscsi_trgt: Logical Unit Reset (05) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
> Jun  5 23:21:11 san00 kernel: [335023.557901] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
> Jun  5 23:21:13 san00 kernel: [335025.386242] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:2533274844725760 (Function Complete)
> ...
> Jun  5 23:24:21 san00 kernel: [335213.072102] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:2814749821436416 (Function Complete)
> Jun  5 23:24:21 san00 kernel: [335213.075205] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:3659174751568384 (Function Complete)
> Jun  5 23:24:32 san00 kernel: [335224.457895] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
> Jun  5 23:24:42 san00 kernel: [335234.437754] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
> Jun  5 23:24:42 san00 kernel: [335234.437873] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
> Jun  5 23:24:52 san00 kernel: [335244.418062] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
> Jun  5 23:24:52 san00 kernel: [335244.418174] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
> Jun  5 23:24:54
On 11/06/2011 09:32, Ritesh Raj Sarraf wrote:
> What is the version of the kernel where you are seeing this problem? In
> my experience, older kernels weren't good at handling huge workloads. In
> case this is a starvation issue (which I suspect), there should be some
> kernel messages spat out. Can you check that? Also, if possible, can you
> reproduce the issue with ietd's debug mode enabled?

Sorry, I thought I had indicated the kernel version: it is 2.6.32-5-amd64.

    Linux san00 2.6.32-5-amd64 #1 SMP Wed May 18 23:13:22 UTC 2011 x86_64 GNU/Linux

    dpkg -l 'linux-image*' | grep ii
    ii  linux-image-2.6.32-5-amd64  2.6.32-34squeeze1  Linux 2.6.32 for 64-bit PCs

Debian package version 2.6.32-34squeeze1; that should be the latest kernel
for a squeeze with security updates.

There are no other kernel messages in the problem timeframe; I have already
cut and pasted all the relevant messages.

I can try to reproduce the problem on a test machine, and I will try that
with debug enabled also; I will do more tests on the production system only
if I am not able to reproduce the problem on the test system.

Another test I want to perform is enabling the deadline scheduler for the
logical volumes exported through iSCSI; do you think this could influence
the problem?

Thank you for the support
Massimiliano

-- 
Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis, c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375  fax +39-0117768576
e-mail: m.ferr...@midhgard.it
web site: http://www.midhgard.it
On 06/11/2011 01:37 PM, Massimiliano Ferrero wrote:
> That should be the latest kernel for a squeeze with security updates.
> There are no other kernel messages in the problem timeframe; I have
> already cut and pasted all the relevant messages.
>
> I can try to reproduce the problem on a test machine, and I will try that
> with debug enabled also; I will do more tests on the production system
> only if I am not able to reproduce the problem on the test system.
>
> Another test I want to perform is enabling the deadline scheduler for the
> logical volumes exported through iSCSI; do you think this could influence
> the problem?

It should be worth a try. Try deadline and see if that helps. Lately, CFQ
has had a lot of improvements, but those might not have made it into the
squeeze kernel.

BTW, does this issue happen only on heavy load, or is it triggered also on
moderate load? I'm sure that the moment this issue occurs, the machine is
completely unresponsive.

-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
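The deadline test discussed above is switched per block device through sysfs; a sketch, in which sdb is an illustrative stand-in for whichever physical device backs the exported volumes:

```shell
# Show the current elevator (the active one is bracketed), then switch
# to deadline. Requires root; sdb is an illustrative device name.
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler
# To keep the choice across reboots, add elevator=deadline to the kernel
# command line in the bootloader configuration.
```

Note that the elevator lives on the underlying physical device's queue; device-mapper/LVM volumes layered on top do not carry their own scheduler.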
On 11/06/2011 10:41, Ritesh Raj Sarraf wrote:
> It should be worth a try. Try deadline and see if that helps. Lately, CFQ
> has had a lot of improvements, but those might not have made it into the
> squeeze kernel.
>
> BTW, does this issue happen only on heavy load, or is it triggered also
> on moderate load? I'm sure that the moment this issue occurs, the machine
> is completely unresponsive.

No, it happens only on heavy load. The storage server also performs several
functions other than iSCSI target; one of these is monitoring all the
virtual machines with nagios and munin. Munin updates run every 5 minutes
with a lot of data written to disk, and this has never been a problem. The
only situation that triggers the problem is performing an operation like a
dd or a file system check that is really I/O intensive on the storage
server.

The second time I experienced the problem I was connected with ssh to the
system during the dd, watching the log files: I saw the abort messages in
real time, and I was able to stop the dd and react to the failure (restart
cluster nodes, check virtual machines, etc.). This is just to say that it
was not necessary to restart the iSCSI target and that the system was not
unresponsive.

I will update the bug as soon as I can test it on another machine.

Best regards
Massimiliano
On 06/11/2011 05:45 PM, Massimiliano Ferrero wrote:
> No, it happens only on heavy load. The storage server also performs
> several functions other than iSCSI target; one of these is monitoring all
> the virtual machines with nagios and munin. Munin updates run every 5
> minutes with a lot of data written to disk, and this has never been a
> problem. The only situation that triggers the problem is performing an
> operation like a dd or a file system check that is really I/O intensive
> on the storage server.
>
> The second time I experienced the problem I was connected with ssh to the
> system during the dd, watching the log files: I saw the abort messages in
> real time, and I was able to stop the dd and react to the failure
> (restart cluster nodes, check virtual machines, etc.). This is just to
> say that it was not necessary to restart the iSCSI target and that the
> system was not unresponsive.
>
> I will update the bug as soon as I can test it on another machine.

I'm afraid this won't be an easy bug to root-cause, but let's give it a
shot. Next time, when you try to test/re-create the bug, capture dstat
output. The default dstat output is good enough to report to us what the
system state was during starvation.

-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
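A capture along those lines could look like the following; the sampling interval and output path are illustrative, and columns for paging, memory, and swap are requested explicitly because dstat's default column set does not include VM statistics:

```shell
# One-second samples of time, cpu, disk, net, paging, memory, and swap,
# logged to CSV alongside the terminal output; the file path is
# illustrative. Leave this running while reproducing the dd load.
dstat -t -c -d -n -g -m -s 1 --output /tmp/dstat-629442.csv
```

The CSV log can then be attached to the bug, with line numbers referenced the same way as in the earlier dstat excerpts.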