What is the version of the kernel, where you are seeing this problem? In my experience, older kernels, weren't good at handling huge workloads. In case this is a starvation issue (which I suspect), there should be some kernel messages spat. Can you check that?
Also, if possible, can you reproduce the issue with ietd's debug mode enabled? Ritesh On 06/07/2011 12:04 AM, Massimiliano Ferrero wrote: > Package: iscsitarget > Version: 1.4.20.2-1 > Severity: grave > Justification: renders package unusable > > > We have a system configured as follows: > 1 CPU Xeon E5520 quad core > 12 GB RAM > 24 SATA 1TB disks connected to a 3ware 9650-24M controller > > The system (I will call it the storage server) is used, also, as an iscsi > target for a two node red hat cluster > (implemented with Debian machines): the storage server exports an iscsi lun > that contains a clustered LVM volum group. > The two nodes mounts the iscsi target, then use clvmd of red hat cluster to > manage concurrent access to the logical > volumes on the clustered volume group. > The nodes are used for virtualization with xen with the virtual machines > stored in the shared volume group, this allows > for live migration of the virtual machines from one node to the other, each > node has 32 cores and 32 GB ram. > > All the three machines (the storage server and the two nodes) are Debian > based: > the iscsi target server was upgraded from lenny to squeeze one week ago, the > two nodes are still on Debian lenny. > > Tonight and today we experienced two cases of ietd hang-up: while we were > doing two I/O intensive operations on the storage > server we got the following errors: > > FIRST OCCURRENCE: > > Jun 5 23:19:19 san00 kernel: [334912.274409] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:562949990973952 (Function Complete) > Jun 5 23:19:20 san00 kernel: [334912.634797] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:1407374921105920 (Function Complete) > Jun 5 23:19:28 san00 kernel: [334920.660528] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:562949990973952 (Function Complete) > Jun 5 23:19:28 san00 kernel: [334920.660811] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:562949990973952 (Function Complete) > Jun 5 23:19:28 san00 kernel: [334920.661105] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:562949990973952 (Function Complete) > Jun 5 23:19:28 san00 kernel: [334920.661357] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:1407374921105920 (Function Complete) > Jun 5 23:19:28 san00 kernel: [334920.661570] iscsi_trgt: Logical Unit Reset > (05) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete) > Jun 5 23:19:58 san00 kernel: [334950.930122] iscsi_trgt: Abort Task (01) > issued on tid:3 lun:0 by sid:844424984461824 (Function Complete) > Jun 5 23:19:59 san00 kernel: [334951.602686] iscsi_trgt: Abort Task (01) > issued on tid:3 lun:0 by sid:1688849914593792 (Function Complete) > ... > Jun 5 23:20:31 san00 kernel: [334983.635928] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete) > Jun 5 23:20:41 san00 kernel: [334993.616383] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete) > ... > Jun 5 23:20:51 san00 kernel: [335003.596884] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete) > Jun 5 23:21:01 san00 kernel: [335013.577396] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete) > Jun 5 23:21:01 san00 kernel: [335013.577645] iscsi_trgt: Logical Unit Reset > (05) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete) > Jun 5 23:21:11 san00 kernel: [335023.557901] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete) > Jun 5 23:21:13 san00 kernel: [335025.386242] iscsi_trgt: Abort Task (01) > issued on tid:3 lun:0 by sid:2533274844725760 (Function Complete) > ... > Jun 5 23:24:21 san00 kernel: [335213.072102] iscsi_trgt: Abort Task (01) > issued on tid:3 lun:0 by sid:2814749821436416 (Function Complete) > Jun 5 23:24:21 san00 kernel: [335213.075205] iscsi_trgt: Abort Task (01) > issued on tid:3 lun:0 by sid:3659174751568384 (Function Complete) > Jun 5 23:24:32 san00 kernel: [335224.457895] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:24:42 san00 kernel: [335234.437754] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:24:42 san00 kernel: [335234.437873] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:24:52 san00 kernel: [335244.418062] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:24:52 san00 kernel: [335244.418174] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:24:54 san00 kernel: [335246.525572] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete) > Jun 5 23:25:02 san00 kernel: [335254.398446] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:25:02 san00 kernel: [335254.398685] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:25:04 san00 kernel: [335256.506021] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete) > Jun 5 23:25:04 san00 kernel: [335256.506153] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete) > Jun 5 23:25:12 san00 kernel: [335264.378719] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:25:12 san00 kernel: [335264.378939] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:25:14 san00 kernel: [335266.486590] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete) > Jun 5 23:25:14 san00 kernel: [335266.486822] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete) > Jun 5 23:25:22 san00 kernel: [335274.357588] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:25:22 san00 kernel: [335274.357798] iscsi_trgt: Logical Unit Reset > (05) issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:25:24 san00 kernel: [335276.465205] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete) > Jun 5 23:25:24 san00 kernel: [335276.465434] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete) > Jun 5 23:25:32 san00 kernel: [335284.339163] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete) > Jun 5 23:25:48 san00 kernel: [335300.095545] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:3377699758080512 (Function Complete) > Jun 5 23:26:02 san00 kernel: [335314.469995] iscsi_trgt: Logical Unit Reset > (05) issued on tid:2 lun:0 by sid:4785074641633792 (Function Complete) > Jun 5 23:26:23 san00 kernel: [335335.475744] iscsi_trgt: Abort Task (01) > issued on tid:3 lun:0 by sid:3940649728279040 (Function Complete) > Jun 5 23:26:24 san00 kernel: [335336.176770] iscsi_trgt: Abort Task (01) > issued on tid:3 lun:0 by sid:4503599681700352 (Function Complete) > Jun 5 23:26:37 san00 kernel: [335349.087260] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:4222124688212480 (Function Complete) > Jun 5 23:26:37 san00 kernel: [335349.087589] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:4222124688212480 (Function Complete) > Jun 5 23:26:37 san00 kernel: [335349.087934] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:4222124688212480 (Function Complete) > Jun 5 23:26:43 san00 kernel: [335355.121186] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:4785074641633792 (Function Complete) > Jun 5 23:26:43 san00 kernel: [335355.121609] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:4785074641633792 (Function Complete) > > SECOND OCCURRENCE: > > Jun 6 13:31:29 san00 kernel: [385942.228071] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:6473924501897728 (Function Complete) > Jun 6 13:31:29 san00 kernel: [385942.228256] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete) > Jun 6 13:31:29 san00 kernel: [385942.228604] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete) > Jun 6 13:31:29 san00 kernel: [385942.228903] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete) > Jun 6 13:31:29 san00 kernel: [385942.229276] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete) > Jun 6 13:31:29 san00 kernel: [385942.229600] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete) > Jun 6 13:31:29 san00 kernel: [385942.229948] iscsi_trgt: Abort Task (01) > issued on tid:2 lun:0 by sid:7318349432029696 (Function Complete) > > On the first occurrence I was fscking and resizing a file system, on the > second one I was dd-ing a 30G file on a logical volume. > These operations were performed directly on the storage server on a portion > of disk not exported by ietd. > > I suspect the problem to be related to the squeeze upgrade, because before > upgrading I never saw such an error. > We have 30+ vms in production on the cluster so this thing is highly > disruptive: every time it happens all the virtual machines hang up, > the file systems get corrupted, we have to restart the cluster, manually fsck > every file system and restart the whole thing... > > I have a test cluster with a smaller disk system (8 disks connected to a > 3ware controller), with debian squeeze installed and the same version > of iscsistarget installed (1.4.20.2-1) and on this system the problem was > never seen, but in this case we have not 30 virtual machines working > through iscsi on the system. > Tomorrow I will try to reproduce the problem on the test system by loading it > with some test machines, at least to understand if the problem > arises above a threshold of iscsi workload combined with an i/o intensive > operation on the storage server. > > What can we look at on the production system? Could this be a bug? or a > misconfiguration on our side? > > Here is /etc/iet/ietd.conf > > Target iqn.2002-02.it.midhgard:san00.pxeroot > IncomingUser iscsiuser iscsipassword > OutgoingUser > Lun 0 Path=/dev/vg00/pxeroot,Type=blockio > InitialR2T No > ImmediateData Yes > MaxRecvDataSegmentLength 8192 > MaxXmitDataSegmentLength 65536 > MaxBurstLength 262144 > FirstBurstLength 65536 > MaxOutstandingR2T 8 > HeaderDigest None > DataDigest None > Wthreads 8 > QueuedCommands 64 > > Target iqn.2002-02.it.midhgard:san00.lun0 > IncomingUser iscsiuser iscsipassword > OutgoingUser > Lun 0 Path=/dev/vg00/lun0,Type=blockio > InitialR2T No > ImmediateData Yes > MaxRecvDataSegmentLength 8192 > MaxXmitDataSegmentLength 65536 > MaxBurstLength 262144 > FirstBurstLength 65536 > MaxOutstandingR2T 8 > HeaderDigest None > DataDigest None > Wthreads 8 > QueuedCommands 64 > > Target iqn.2002-02.it.midhgard:san00.qdisk > IncomingUser iscsiuser iscsipassword > OutgoingUser > Lun 0 Path=/dev/vg00/qdisk,Type=blockio > InitialR2T No > ImmediateData Yes > MaxRecvDataSegmentLength 8192 > MaxXmitDataSegmentLength 65536 > MaxBurstLength 262144 > FirstBurstLength 65536 > MaxOutstandingR2T 8 > HeaderDigest None > DataDigest None > Wthreads 8 > QueuedCommands 64 > > > The first target is not used anymore, the second is the one containing the > clustered vg, the third one is used as quorum disk by red hat cluster. > > Any help is appreciated! > Best regards > Massimiliano Ferrero > > > -- System Information: > Debian Release: 6.0.1 > APT prefers stable > APT policy: (500, 'stable') > Architecture: amd64 (x86_64) > > Kernel: Linux 2.6.32-5-amd64 (SMP w/8 CPU cores) > Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) > Shell: /bin/sh linked to /bin/bash > > Versions of packages iscsitarget depends on: > ii libc6 2.11.2-10 Embedded GNU C Library: Shared > lib > ii lsb-base 3.2-23.2squeeze1 Linux Standard Base 3.2 init > scrip > ii procps 1:3.2.8-9 /proc file system utilities > > Versions of packages iscsitarget recommends: > pn iscsitarget-module <none> (no description available) > > Versions of packages iscsitarget suggests: > ii iscsitarget-dkms 1.4.20.2-1 iSCSI Enterprise Target kernel > mod > pn iscsitarget-source <none> (no description available) > > -- Configuration Files: > /etc/default/iscsitarget changed: > ISCSITARGET_ENABLE=true > > /etc/iet/ietd.conf [Errno 13] Permission denied: u'/etc/iet/ietd.conf' > /etc/iet/initiators.allow changed: > ALL 127.0.0.1, 10.212.0.0/16, 10.213.0.0/16 > > /etc/ietd.conf [Errno 2] No such file or directory: u'/etc/ietd.conf' > /etc/initiators.allow [Errno 2] No such file or directory: > u'/etc/initiators.allow' > /etc/initiators.deny [Errno 2] No such file or directory: > u'/etc/initiators.deny' > > -- no debconf information > > > > _______________________________________________ > Pkg-iscsi-maintainers mailing list > pkg-iscsi-maintain...@lists.alioth.debian.org > http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-iscsi-maintainers -- Ritesh Raj Sarraf | http://people.debian.org/~rrs Debian - The Universal Operating System
signature.asc
Description: OpenPGP digital signature