Your message dated Fri, 05 Aug 2011 17:42:54 +0530
with message-id <[email protected]>
and subject line Re: Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives "iscsi_trgt: Abort Task" errors on high disk load and iscsi connections are dropped
has caused the Debian Bug report #629442,
regarding iscsitarget: ietd gives "iscsi_trgt: Abort Task" errors on high disk load and iscsi connections are dropped
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)

-- 
629442: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629442
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: iscsitarget
Version: 1.4.20.2-1
Severity: grave
Justification: renders package unusable

We have a system configured as follows:

- 1 Xeon E5520 quad-core CPU
- 12 GB RAM
- 24 SATA 1 TB disks connected to a 3ware 9650-24M controller

The system (I will call it the storage server) is also used as an iSCSI
target for a two-node Red Hat cluster (implemented with Debian machines):
the storage server exports an iSCSI LUN that contains a clustered LVM
volume group. The two nodes mount the iSCSI target, then use clvmd from
Red Hat Cluster to manage concurrent access to the logical volumes in the
clustered volume group. The nodes are used for virtualization with Xen,
with the virtual machines stored in the shared volume group; this allows
live migration of the virtual machines from one node to the other. Each
node has 32 cores and 32 GB RAM.

All three machines (the storage server and the two nodes) are Debian
based: the iSCSI target server was upgraded from lenny to squeeze one
week ago, the two nodes are still on Debian lenny.

Tonight and today we experienced two ietd hang-ups: while we were doing
two I/O-intensive operations on the storage server we got the following
errors:

FIRST OCCURRENCE:

Jun 5 23:19:19 san00 kernel: [334912.274409] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
Jun 5 23:19:20 san00 kernel: [334912.634797] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1407374921105920 (Function Complete)
Jun 5 23:19:28 san00 kernel: [334920.660528] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
Jun 5 23:19:28 san00 kernel: [334920.660811] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
Jun 5 23:19:28 san00 kernel: [334920.661105] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
Jun 5 23:19:28 san00 kernel: [334920.661357] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1407374921105920 (Function Complete)
Jun 5 23:19:28 san00 kernel: [334920.661570] iscsi_trgt: Logical Unit Reset (05) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
Jun 5 23:19:58 san00 kernel: [334950.930122] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:844424984461824 (Function Complete)
Jun 5 23:19:59 san00 kernel: [334951.602686] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:1688849914593792 (Function Complete)
...
Jun 5 23:20:31 san00 kernel: [334983.635928] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
Jun 5 23:20:41 san00 kernel: [334993.616383] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
...
Jun 5 23:20:51 san00 kernel: [335003.596884] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
Jun 5 23:21:01 san00 kernel: [335013.577396] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
Jun 5 23:21:01 san00 kernel: [335013.577645] iscsi_trgt: Logical Unit Reset (05) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
Jun 5 23:21:11 san00 kernel: [335023.557901] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
Jun 5 23:21:13 san00 kernel: [335025.386242] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:2533274844725760 (Function Complete)
...
Jun 5 23:24:21 san00 kernel: [335213.072102] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:2814749821436416 (Function Complete)
Jun 5 23:24:21 san00 kernel: [335213.075205] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:3659174751568384 (Function Complete)
Jun 5 23:24:32 san00 kernel: [335224.457895] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:24:42 san00 kernel: [335234.437754] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:24:42 san00 kernel: [335234.437873] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:24:52 san00 kernel: [335244.418062] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:24:52 san00 kernel: [335244.418174] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:24:54 san00 kernel: [335246.525572] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete)
Jun 5 23:25:02 san00 kernel: [335254.398446] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:25:02 san00 kernel: [335254.398685] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:25:04 san00 kernel: [335256.506021] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete)
Jun 5 23:25:04 san00 kernel: [335256.506153] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete)
Jun 5 23:25:12 san00 kernel: [335264.378719] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:25:12 san00 kernel: [335264.378939] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:25:14 san00 kernel: [335266.486590] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete)
Jun 5 23:25:14 san00 kernel: [335266.486822] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete)
Jun 5 23:25:22 san00 kernel: [335274.357588] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:25:22 san00 kernel: [335274.357798] iscsi_trgt: Logical Unit Reset (05) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:25:24 san00 kernel: [335276.465205] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete)
Jun 5 23:25:24 san00 kernel: [335276.465434] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete)
Jun 5 23:25:32 san00 kernel: [335284.339163] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
Jun 5 23:25:48 san00 kernel: [335300.095545] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete)
Jun 5 23:26:02 san00 kernel: [335314.469995] iscsi_trgt: Logical Unit Reset (05) issued on tid:2 lun:0 by sid:4785074641633792 (Function Complete)
Jun 5 23:26:23 san00 kernel: [335335.475744] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:3940649728279040 (Function Complete)
Jun 5 23:26:24 san00 kernel: [335336.176770] iscsi_trgt: Abort Task (01) issued on tid:3 lun:0 by sid:4503599681700352 (Function Complete)
Jun 5 23:26:37 san00 kernel: [335349.087260] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:4222124688212480 (Function Complete)
Jun 5 23:26:37 san00 kernel: [335349.087589] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:4222124688212480 (Function Complete)
Jun 5 23:26:37 san00 kernel: [335349.087934] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:4222124688212480 (Function Complete)
Jun 5 23:26:43 san00 kernel: [335355.121186] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:4785074641633792 (Function Complete)
Jun 5 23:26:43 san00 kernel: [335355.121609] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:4785074641633792 (Function Complete)

SECOND OCCURRENCE:

Jun 6 13:31:29 san00 kernel: [385942.228071] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:6473924501897728 (Function Complete)
Jun 6 13:31:29 san00 kernel: [385942.228256] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete)
Jun 6 13:31:29 san00 kernel: [385942.228604] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete)
Jun 6 13:31:29 san00 kernel: [385942.228903] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete)
Jun 6 13:31:29 san00 kernel: [385942.229276] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete)
Jun 6 13:31:29 san00 kernel: [385942.229600] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete)
Jun 6 13:31:29 san00 kernel: [385942.229948] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete)

On the first occurrence I was fscking and resizing a file system; on the
second one I was dd-ing a 30G file onto a logical volume. These
operations were performed directly on the storage server, on a portion
of disk not exported by ietd.

I suspect the problem is related to the squeeze upgrade, because before
upgrading I never saw such an error.

We have 30+ VMs in production on the cluster, so this is highly
disruptive: every time it happens all the virtual machines hang, the
file systems get corrupted, and we have to restart the cluster, manually
fsck every file system and restart the whole thing.

I have a test cluster with a smaller disk system (8 disks connected to a
3ware controller), running Debian squeeze and the same version of
iscsitarget (1.4.20.2-1), and on that system the problem was never seen;
but there we do not have 30 virtual machines working through iSCSI.
Tomorrow I will try to reproduce the problem on the test system by
loading it with some test machines, at least to understand whether the
problem arises above a threshold of iSCSI workload combined with an
I/O-intensive operation on the storage server.

What can we look at on the production system? Could this be a bug, or a
misconfiguration on our side?
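(For illustration, a minimal sketch of the kind of local write load that
coincided with the second occurrence; the volume name vg00/scratch is
hypothetical, and oflag=direct is a standard GNU dd flag:)

    # sequential 30 GB write to a local LV that is NOT exported by ietd
    dd if=/dev/zero of=/dev/vg00/scratch bs=1M count=30720 oflag=direct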
Here is /etc/iet/ietd.conf:

Target iqn.2002-02.it.midhgard:san00.pxeroot
        IncomingUser iscsiuser iscsipassword
        OutgoingUser
        Lun 0 Path=/dev/vg00/pxeroot,Type=blockio
        InitialR2T No
        ImmediateData Yes
        MaxRecvDataSegmentLength 8192
        MaxXmitDataSegmentLength 65536
        MaxBurstLength 262144
        FirstBurstLength 65536
        MaxOutstandingR2T 8
        HeaderDigest None
        DataDigest None
        Wthreads 8
        QueuedCommands 64

Target iqn.2002-02.it.midhgard:san00.lun0
        IncomingUser iscsiuser iscsipassword
        OutgoingUser
        Lun 0 Path=/dev/vg00/lun0,Type=blockio
        InitialR2T No
        ImmediateData Yes
        MaxRecvDataSegmentLength 8192
        MaxXmitDataSegmentLength 65536
        MaxBurstLength 262144
        FirstBurstLength 65536
        MaxOutstandingR2T 8
        HeaderDigest None
        DataDigest None
        Wthreads 8
        QueuedCommands 64

Target iqn.2002-02.it.midhgard:san00.qdisk
        IncomingUser iscsiuser iscsipassword
        OutgoingUser
        Lun 0 Path=/dev/vg00/qdisk,Type=blockio
        InitialR2T No
        ImmediateData Yes
        MaxRecvDataSegmentLength 8192
        MaxXmitDataSegmentLength 65536
        MaxBurstLength 262144
        FirstBurstLength 65536
        MaxOutstandingR2T 8
        HeaderDigest None
        DataDigest None
        Wthreads 8
        QueuedCommands 64

The first target is not used anymore, the second is the one containing
the clustered VG, and the third one is used as quorum disk by Red Hat
Cluster.

Any help is appreciated!

Best regards
Massimiliano Ferrero

-- System Information:
Debian Release: 6.0.1
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages iscsitarget depends on:
ii  libc6     2.11.2-10         Embedded GNU C Library: Shared lib
ii  lsb-base  3.2-23.2squeeze1  Linux Standard Base 3.2 init scrip
ii  procps    1:3.2.8-9         /proc file system utilities

Versions of packages iscsitarget recommends:
pn  iscsitarget-module  <none>  (no description available)

Versions of packages iscsitarget suggests:
ii  iscsitarget-dkms    1.4.20.2-1  iSCSI Enterprise Target kernel mod
pn  iscsitarget-source  <none>      (no description available)

-- Configuration Files:
/etc/default/iscsitarget changed:
ISCSITARGET_ENABLE=true

/etc/iet/ietd.conf [Errno 13] Permission denied: u'/etc/iet/ietd.conf'

/etc/iet/initiators.allow changed:
ALL 127.0.0.1, 10.212.0.0/16, 10.213.0.0/16

/etc/ietd.conf [Errno 2] No such file or directory: u'/etc/ietd.conf'
/etc/initiators.allow [Errno 2] No such file or directory: u'/etc/initiators.allow'
/etc/initiators.deny [Errno 2] No such file or directory: u'/etc/initiators.deny'

-- no debconf information
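(For reference, a minimal way to watch the target state while this
happens: IET exposes its volumes and sessions through /proc, standard
in iscsitarget 1.4.x:)

    # backing stores and I/O mode per target
    cat /proc/net/iet/volume
    # active sessions and connections per target; the sid values here
    # match the ones printed in the "Abort Task" messages above
    cat /proc/net/iet/session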
--- End Message ---
--- Begin Message ---
Okay!! I'm marking this bug as done then.

On 08/05/2011 02:38 AM, Massimiliano Ferrero wrote:
> I think the issue has nothing to do with an ietd bug.
> I think that the server log, even with debug active, just showed that
> ietd was closing the connections, but the problem originates from the
> initiators: there are timeouts in the connections and these are dropped.
>
> I had designed the architecture to boot the initiators from USB keys
> (Debian over a jffs2 file system) and had the very bad idea to put swap
> on one of the iSCSI LVM volumes (one volume for each cluster node).
> I think this was one source of problems: any problem on the swap volume
> would result in processes crashing on the node.
>
Yes, this would be bad. On heavy load, when paging is triggered, and the
iSCSI volume is inaccessible, you'll run into many issues. Did you see
the kernel spit page allocation errors?

> Now we have bought disks for the nodes and the system is installed on
> the disks, and I see no more errors on the swap space.
> I think the bug can be closed, thank you for the support.
>
That explains it.

> We still have some problems with dropped connections; probably I have
> not set the iSCSI initiator timeouts correctly.
> This is what I see in the initiator syslog:
> /var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401809] connection1:0: detected conn error (1020)
> /var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401850] connection8:0: detected conn error (1020)
> /var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401879] connection2:0: detected conn error (1020)
> /var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.402096] connection4:0: detected conn error (1020)
> /var/log/syslog.5.gz:Jul 30 14:47:59 xen002 nrpe[8427]: Listening for connections on port 5666
> /var/log/syslog.5.gz:Jul 30 14:47:59 xen002 nrpe[8427]: Allowing
..snipped....
> In this example the node was then killed by the other node because it
> could not write anymore to the quorum disk (which is over iSCSI too).
>
Is there a cluster timeout set? How much is it?

> Is there any good pointer to examples of settings to be used with iSCSI
> and multipath?
> At the moment on the initiator side I have
>
> node.session.timeo.replacement_timeout = 120

With multipath, you don't really need this long a timeout value. I'd
recommend changing this to 5 seconds.

> node.conn[0].timeo.login_timeout = 15
> node.conn[0].timeo.logout_timeout = 15
> node.conn[0].timeo.noop_out_interval = 5
> node.conn[0].timeo.noop_out_timeout = 5
> node.session.err_timeo.abort_timeout = 15
> node.session.err_timeo.lu_reset_timeout = 20
>
> and in /etc/multipath.conf
>
> defaults {
>     udev_dir              /dev
>     polling_interval      10
>     selector              "round-robin 0"
>     path_grouping_policy  multibus
>     getuid_callout        "/lib/udev/scsi_id -g -u -d /dev/%n"
>     prio_callout          /bin/true
>     path_checker          directio
>     rr_min_io             100
>     rr_weight             priorities
>     failback              immediate

Your failback is immediate but your replacement_timeout is 120 seconds.
The failback will not be immediate then.

>     #no_path_retry       fail
>     user_friendly_names  no
> }
> multipaths {
>     multipath {
>         wwid   149455400000000000359...
>         alias  lun1
>     }
>     multipath {
>         wwid   1494554000000000024d1...
>         alias  lun2
>     }
> }
>
> should I enable
> no_path_retry fail
> ?

Why? Do you want the failures to propagate quickly?

As for your actual problem: do the connections drop when you do heavy
I/O?

PS: This bug will be marked closed but you can still continue discussing
it here.
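(A minimal sketch of applying the 5-second replacement timeout suggested
above, using standard open-iscsi iscsiadm syntax; the target IQN is
taken from the report, while the portal address 10.212.0.1 is
hypothetical:)

    # lower the SCSI replacement timeout for the multipathed target;
    # the record change takes effect on the next login, so log the
    # session out and back in afterwards
    iscsiadm -m node -T iqn.2002-02.it.midhgard:san00.lun0 \
             -p 10.212.0.1 -o update \
             -n node.session.timeo.replacement_timeout -v 5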
-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
--- End Message ---

