Re: [ewg] with the ipoib patches, debug prints spam the system log
On Thu, 2008-02-07 at 10:01 +0200, Eli Cohen wrote:
> On Thu, 2008-02-07 at 09:48 +0200, Or Gerlitz wrote:
> > Or Gerlitz wrote:
> > > You have left somehow too many... debug prints in the last patches, please clean this up. See for example how the system log looks after less than a minute when ipoib debug prints are opened: it has one original print (ib0: Send unicast ARP to 0023) and all the rest are yours.
> > >
> > > Feb 6 14:39:23 kernel: ib0: posting zlen send, wrid = 4: head = 2756, tail = 2752
> > > Feb 6 14:39:23 kernel: ib0: ipoib_ib_tx_timer_func-427: head = 2757
> >
> > Hi Eli,
> >
> > Just a reminder to remove this for RC4, using last night snapshot I still see it.
> >
> > Or.
>
> I have to look at last night build - it should have been there already.

Sorry - it will be in the next build.

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: traffic jittery, send queue full reports from mthca driver
On Thu, 2008-02-07 at 09:42 +0200, Or Gerlitz wrote:
> ib_mthca :03:00.0: SQ 000404 full (756910656 head, 756910592 tail, 64 max, 0 nreq)
> ib0: failed to post zlen send
>
> OK, Eli, taking the kernel bits from OFED-1.3-20080206-0751.tgz I don't see these prints any more.
>
> When probing out the driver in order to replace it with the drop, I have got the following:
>
> ib0: timing out; will leak address handles
> ib0: ib_dealloc_pd failed
>
> so, is it another issue or related to the room-for-zlen-in-ring-accounting fix?

I am not sure but I will look into it.
[ewg] Re: traffic jittery, send queue full reports from mthca driver
On Thu, 2008-02-07 at 09:42 +0200, Or Gerlitz wrote:
> ib_mthca :03:00.0: SQ 000404 full (756910656 head, 756910592 tail, 64 max, 0 nreq)
> ib0: failed to post zlen send
>
> OK, Eli, taking the kernel bits from OFED-1.3-20080206-0751.tgz I don't see these prints any more.
>
> When probing out the driver in order to replace it with the drop, I have got the following:
>
> ib0: timing out; will leak address handles
> ib0: ib_dealloc_pd failed
>
> so, is it another issue or related to the room-for-zlen-in-ring-accounting fix?

Or, does it happen on mthca or connectx? Does it happen when running iperf in the way you reported in bugzilla?
Re: [ewg] Re: [ofa-general] [ANNOUNCE] open iSCSI over iSER target RPM is available
> > * READ: 920 MB/sec
> > * WRITE: 850 MB/sec
>
> Not getting anything even remotely close to this. Are there more details on configuration somewhere? I followed the web page as indicated.

Are you running iSCSI over TCP or iSCSI over iSER (over InfiniBand)? Our results are with iSER.

Erez
Re: [ewg] with the ipoib patches, debug prints spam the system log
On Thu, 2008-02-07 at 09:48 +0200, Or Gerlitz wrote:
> Or Gerlitz wrote:
> > You have left somehow too many... debug prints in the last patches, please clean this up. See for example how the system log looks after less than a minute when ipoib debug prints are opened: it has one original print (ib0: Send unicast ARP to 0023) and all the rest are yours.
> >
> > Feb 6 14:39:23 kernel: ib0: posting zlen send, wrid = 4: head = 2756, tail = 2752
> > Feb 6 14:39:23 kernel: ib0: ipoib_ib_tx_timer_func-427: head = 2757
>
> Hi Eli,
>
> Just a reminder to remove this for RC4, using last night snapshot I still see it.
>
> Or.

I have to look at last night build - it should have been there already.
[ewg] MVAPICH1 1.0.0 SRPM available
New srpm for MVAPICH1 was uploaded.
Please check ~pasha/ofed_1_3/ (see latest.txt for the build number)

Bugfix for: 883, 884, 888, 887, 889, 893

--
Pavel Shamis (Pasha)
Mellanox Technologies
Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update
Eli Cohen wrote:
> I have tried to reproduce this when using ib_mthca and mlx4_ib but could not see this problem. Could you try to dig more into this and provide more details?

Please reproduce the issue on our HCAs since we do not have any ehca.
Note that Eli tried the code when using the non-SRQ path.

Tziporet
Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update
Eli Cohen wrote:
> This problem was seen on an ehca that supports SRQ. Please reply how many scatter entries does ehca support when working in SRQ mode? Also any piece of info I might need to try and mimic ehca behaviour on Mellanox devices. I will appreciate it if you can repeat the exact sequence of actions you do to reproduce this.

Hello Eli,

Ehca supports fewer than 16 s/g entries - hence the srq patch addresses that issue. The sequence of steps that I followed for the touch test:

1. On a freshly booted system, configure ib0 and assign an IP address
2. Switch to connected mode and change mtu
3. ping remote ib interface (already in CM mode)
4. modprobe -r ib_ehca

I see a series of cascading failures in /var/log/messages, starting with the issue of not being able to destroy the cq (specifically rcq)

Pradeep
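For anyone trying to repeat the four steps above on another HCA, they can be written out as a dry-run command list. This is only a sketch: the IP addresses, the MTU of 65520 and the ping count are assumptions (the message does not give them), and the helper name `touch_test_commands` is hypothetical. It prints the commands rather than running them, since the real sequence needs root and IB hardware.

```python
def touch_test_commands(local_ip, peer_ip, iface="ib0", mtu=65520, module="ib_ehca"):
    """Return the shell commands for the four steps, in order (dry run only).

    local_ip, peer_ip, mtu and the ping count are illustrative assumptions;
    only the interface name and the module name come from the message.
    """
    return [
        # Step 1: configure ib0 and assign an IP address
        f"ifconfig {iface} {local_ip} up",
        # Step 2: switch IPoIB to connected mode, then change the MTU
        f"echo connected > /sys/class/net/{iface}/mode",
        f"ifconfig {iface} mtu {mtu}",
        # Step 3: ping the remote IB interface (already in CM mode)
        f"ping -c 3 {peer_ip}",
        # Step 4: unload the driver; this is where the failures were reported
        f"modprobe -r {module}",
    ]


if __name__ == "__main__":
    for cmd in touch_test_commands("10.0.0.1", "10.0.0.2"):
        print(cmd)
```

Passing `module="ib_mthca"` would give the matching sequence for a Mellanox device, which is roughly what a comparison run on non-ehca hardware would need.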
Re: [ewg][PATCH][0/2] SRP multipath failover within 60 seconds,
On Thu, 2008-02-07 at 08:18 +0200, Vladimir Sokolovsky wrote:
> Vu Pham wrote:
> > The following patches assist SRP/dm-multipath to failover within 60 seconds (bugzilla #577) without data corruption, read/write error

[snip]

> Applied,
> kernel_patches/fixes/srp_2_disconnect_without_wait.patch
> kernel_patches/fixes/srp_3_qp_err_timer_reconnect_target.patch

Are there plans for these (and the ones they build on) to make their way to the upstream kernel?

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update
> Ehca supports fewer than 16 s/g entries - hence the srq patch addresses that issue. The sequence of steps that I followed for the touch test:
>
> 1. On a freshly booted system, configure ib0 and assign an IP address
> 2. Switch to connected mode and change mtu
> 3. ping remote ib interface (already in CM mode)
> 4. modprobe -r ib_ehca
>
> I see a series of cascading failures in /var/log/messages, starting with the issue of not being able to destroy the cq (specifically rcq)

I followed the procedure you describe with an Arbel device. I changed the code such that it will publish 12 scatter entries for the SRQ. I did not see this problem, however, so I don't know how to debug this. Could it be a problem in the ehca driver?
[ewg] Update (Re: open iSCSI over iSER target RPM ...)
Update:

[EMAIL PROTECTED] etc]# dd if=/dev/zero of=/big/local.file bs=256k count=10
10+0 records in
10+0 records out
2621440 bytes (26 GB) copied, 58.7484 seconds, 446 MB/s

Better. I rebuilt OFED 1.2.5.5. Are there specific recommended tuning guides for iSER? Backing store in this case are real disks, and we can sink/source 750 MB/s on them, so I am not worried about disk IO bottlenecks, more worried about bad config of iSCSI/iSER.

BTW: the 2TB LUN limit I asked about is still here in this code. Same machines (initiator and target) used for SRP reported correct LUN sizes. Here we are using the -868 open-iscsi initiator, and the tgt RPM announced. I would like to dig into this.

This is what I am getting in dmesg for this iSER target:

iscsi: registered transport (tcp)
iscsi: registered transport (iser)
iser: iser_connect:connecting to: 10.2.1.2, port 0xbc0c
iser: iser_cma_handler:event 0 conn 81024b9f69c0 id 810209748c00
iser: iser_cma_handler:event 2 conn 81024b9f69c0 id 810209748c00
iser: iser_create_ib_conn_res:setting conn 81024b9f69c0 cma_id 810209748c00: fmr_pool 81024bfb32c0 qp 8101cb16d600
iser: iser_cma_handler:event 9 conn 81024b9f69c0 id 810209748c00
iser: iscsi_iser_ep_poll:ib conn 81024b9f69c0 rc = 1
scsi13 : iSCSI Initiator over iSER, v.0.1
iser: iscsi_iser_conn_bind:binding iscsi conn 81021b65fa90 to iser_conn 81024b9f69c0
  Vendor: IET  Model: Controller  Rev: 0001
  Type: RAID  ANSI SCSI revision: 05
scsi 13:0:0:0: Attached scsi generic sg2 type 12
  Vendor: IET  Model: VIRTUAL-DISK  Rev: 0001
  Type: Direct-Access  ANSI SCSI revision: 05
sdc : very big device. try to use READ CAPACITY(16).
sdc : READ CAPACITY(16) failed.
sdc : status=1, message=00, host=0, driver=08
sdc : use 0x as device size
SCSI device sdc: 4294967296 512-byte hdwr sectors (2199023 MB)
sdc: Write Protect is off
sdc: Mode Sense: 79 00 00 08
SCSI device sdc: drive cache: write back
sdc : very big device. try to use READ CAPACITY(16).
sdc : READ CAPACITY(16) failed.
sdc : status=1, message=00, host=0, driver=08
sdc : use 0x as device size
SCSI device sdc: 4294967296 512-byte hdwr sectors (2199023 MB)
sdc: Write Protect is off
sdc: Mode Sense: 79 00 00 08
SCSI device sdc: drive cache: write back
 sdc: unknown partition table
sd 13:0:0:1: Attached scsi disk sdc
sd 13:0:0:1: Attached scsi generic sg3 type 0

and this is what we get in SRP

scsi6 : SRP.T10:0008F104039862A4
  Vendor: SCST_BIO  Model: vdisk0  Rev: 096
  Type: Direct-Access  ANSI SCSI revision: 04
sdc : very big device. try to use READ CAPACITY(16).
SCSI device sdc: 12693355130 512-byte hdwr sectors (6498998 MB)
sdc: Write Protect is off
sdc: Mode Sense: 6b 00 10 08
SCSI device sdc: drive cache: write back w/ FUA

This looks suspiciously like a 2^32 limit somewhere. Our exported device is

[EMAIL PROTECTED] ~]# parted /dev/sdb print
Model: Areca jrvs1 (scsi)
Disk /dev/sdb: 6500GB
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start   End     Size    File system  Flags
 1      0.00kB  6500GB  6500GB  xfs

and this is what tgtadm reports

[EMAIL PROTECTED] ~]# tgtadm --lld iscsi --op show --mode target
Target 1: iqn.2001-04.com.jr1-jackrabbit.small
    System information:
        Driver: iscsi
        Status: running
    I_T nexus information:
        I_T nexus: 4
            Initiator: iqn.1996-04.voltaire.com:01:dfaa3fd
            Connection: 0
                RDMA IP Address: 10.2.1.1
    LUN information:
        LUN: 0
            Type: controller
            SCSI ID: deadbeaf1:0
            SCSI SN: beaf10
            Size: 0
            Online: No
            Poweron/Reset: Yes
            Removable media: No
            Backing store: No backing store
        LUN: 1
            Type: disk
            SCSI ID: deadbeaf1:1
            SCSI SN: beaf11
            Size: 5T
            Online: Yes
            Poweron/Reset: No
            Removable media: No
            Backing store: /dev/sdb
    Account information:
    ACL information:
        10.2.1.1

So it looks like the LUN 1 is approximately correct (5T ???) on the target, and incorrect when the initiator asks for it. Please note that I have successfully used the full 6+TB as an iSCSI target using the SCST-iscsi code, so I do know that the initiator works correctly.

Is there a source RPM/tree for this target?

Joe Landman wrote:
> Hi Erez
>
> Erez Zilber wrote:
> > stgt (SCSI target) is an open-source framework for storage target drivers. It supports iSCSI over iSER among other storage target drivers. Voltaire added a git tree for stgt that will be added to OFED 1.4:
> >
> > http://www2.openfabrics.org/git/?p=~dorons/tgt.git;a=summary
> >
> > Until OFED 1.4 gets released, it is possible to install the stgt RPM on top of OFED 1.3. For more details about how to install and
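The "2^32 limit" guess in the message above can be checked against the sector counts in the two dmesg excerpts. The sketch below is a worked calculation, not the kernel's code. It assumes that after READ CAPACITY(16) failed the initiator fell back to the largest value the 32-bit READ CAPACITY(10) field can hold, 0xFFFFFFFF. The archived "use 0x as device size" line appears truncated, so that constant is an assumption taken from the SCSI block command set, not something shown in the log. Rounding to the nearest decimal megabyte is likewise just a choice that happens to reproduce the logged figures.

```python
SECTOR_BYTES = 512  # "512-byte hdwr sectors" in both dmesg excerpts


def capacity_mb(sectors, sector_bytes=SECTOR_BYTES):
    """Capacity in decimal megabytes, rounded to the nearest MB."""
    return (sectors * sector_bytes + 500_000) // 1_000_000


# Assumption: largest last-LBA value a 32-bit READ CAPACITY(10) reply can carry.
READ_CAPACITY_10_MAX_LBA = 0xFFFFFFFF
fallback_sectors = READ_CAPACITY_10_MAX_LBA + 1  # last LBA + 1 = 2**32 sectors

iser_sectors = 4294967296   # sector count logged for the iSER target
srp_sectors = 12693355130   # sector count logged for the same device over SRP

if __name__ == "__main__":
    print("fallback matches iSER log:", fallback_sectors == iser_sectors)
    print("iSER size in MB:", capacity_mb(iser_sectors))
    print("SRP size in MB:", capacity_mb(srp_sectors))
    print("iSER size in TiB:", iser_sectors * SECTOR_BYTES / 2**40)
```

Under these assumptions the iSER figure is exactly 2^32 sectors, or 2 TiB, which fits a failed READ CAPACITY(16) on the target side better than a wrong LUN size configured in tgt.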
Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update
This problem was seen on an ehca that supports SRQ. Please reply how many scatter entries does ehca support when working in SRQ mode? Also any piece of info I might need to try and mimic ehca behaviour on Mellanox devices. I will appreciate it if you can repeat the exact sequence of actions you do to reproduce this.

thanks.
Re: [ewg] Re: [ofa-general] [ANNOUNCE] open iSCSI over iSER target RPM is available
Erez Zilber wrote:
> > > * READ: 920 MB/sec
> > > * WRITE: 850 MB/sec
> >
> > Not getting anything even remotely close to this. Are there more details on configuration somewhere? I followed the web page as indicated.
>
> Are you running iSCSI over TCP or iSCSI over iSER (over InfiniBand)? Our results are with iSER.

I followed the instructions on the web pages that were pointed to for iSER. Are there updated pages? Is there a way to tell whether or not the RDMA path is being used?

Thanks.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
[ewg] OFED 1.3 RC4 release is available
Hi,

OFED 1.3 RC4 release is available on
http://www.openfabrics.org/builds/ofed-1.3/release/OFED-1.3-rc4.tgz
http://www.openfabrics.org/builds/ofed-1.3/release/OFED-1.3-rc3.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/

The RC5 (Gold) release is expected on February 18

Tziporet Vlad

Release information:

Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp *
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- Fedora C6: 2.6.18-8.fc6 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP1 up1: 2.6.16.53-0.16-smp
- OpenSuSE 10.3: 2.6.22-*-* *
- kernel.org: 2.6.23 and 2.6.24

* OSes that are partially tested

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED 1.3-RC3
===
* Fixed 13 Bugs (see attachment)
* MPI packages update: mvapich-1.0.0-1981.src.rpm
* Updated libraries:
    * uDAPL 2.0.6
    * libibcm 1.0.2
    * librdmacm 1.0.6
* IPoIB enhancements:
    * Non-SRQ for CM mode
    * 4K MTU support
    * Enhancements to improve small messages BW

Tasks that should be completed for RC5:
===
1. Fix critical and major bugs
2. Update all documents

bug_id,bug_severity,op_sys,assigned_to,resolution,short_short_desc
794,normal,Other,[EMAIL PROTECTED],FIXED,Kernel panic while unload driver
883,normal,RHEL 4,[EMAIL PROTECTED],FIXED,mvapich gets killed during alltoall, 32nodes
884,normal,RHEL 4,[EMAIL PROTECTED],FIXED,mvapich doesn't report non-active ports
893,blocker,Other,[EMAIL PROTECTED],FIXED,Dynamic library support is broken
892,blocker,SLES 10,[EMAIL PROTECTED],FIXED,openibd does not remove cxgb3 module
897,critical,SLES 10,[EMAIL PROTECTED],FIXED,traffic is jittery, send queue full reports from mthca
891,critical,RHEL 4,[EMAIL PROTECTED],FIXED,ib_sa panics system when enabled
878,critical,Other,[EMAIL PROTECTED],FIXED,slow failover with bonding and connected mode
887,critical,All,[EMAIL PROTECTED],FIXED,IMB benchmark stuck
577,critical,All,[EMAIL PROTECTED],FIXED,SRP multipath failover too slow (minutes, not seconds)
761,major,Other,[EMAIL PROTECTED],FIXED,Poor and jittery UDP performance at small messages
889,minor,Other,[EMAIL PROTECTED],FIXED,Intel test stuck fortran-datatype-functional-MPI_Type_contiguous_idispls
888,minor,Other,[EMAIL PROTECTED],FIXED,OSU latency benchmark (old version with iteration and message size parameter) stuck sometime
[ewg] [Fwd: Re: [ofa-general] Problem with latest OFED 1.3 build... IPoIB and iPATH]
I forgot to CC EWG on my reply to Arlin Davis.

---BeginMessage---
We can reproduce the problem here. We haven't made any ib_ipath driver changes between RC3 and RC4 so some recent patch has broken us. I'm in the process of looking at it.

On Wed, 2008-02-06 at 17:17 -0800, Arlin Davis wrote:
> I cannot ifconfig ib0 on ipath using the latest build (ofed20080206).
>
> ifup ib0
> SIOCSIFFLAGS: Invalid argument
> Failed to bring up ib0.
>
> ib0: failed to create own ah
>
> CA 'ipath0'
>     CA type: InfiniPath_QLE7140
>     Number of ports: 1
>     Firmware version:
>     Hardware version: 2
>     Node GUID: 0x001175ffd75b
>     System image GUID: 0x001175ffd75b
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 10
>         Base lid: 14
>         LMC: 0
>         SM lid: 1
>         Capability mask: 0x02010800
>         Port GUID: 0x001175ffd75b
>
> It works fine on mthca adapters. Anyone else see this problem?
>
> -arlin
>
> ___
> general mailing list
> [EMAIL PROTECTED]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
---End Message---
Re: [ewg] [Fwd: Re: [ofa-general] Problem with latest OFED 1.3 build... IPoIB and iPATH]
Hello Ralph,

What's ifconfig ib0 output?

> We can reproduce the problem here. We haven't made any ib_ipath driver changes between RC3 and RC4 so some recent patch has broken us. I'm in the process of looking at it.
>
> On Wed, 2008-02-06 at 17:17 -0800, Arlin Davis wrote:
> > I cannot ifconfig ib0 on ipath using the latest build (ofed20080206).
> >
> > ifup ib0
> > SIOCSIFFLAGS: Invalid argument
> > Failed to bring up ib0.
> >
> > ib0: failed to create own ah

int ipoib_ib_dev_open(struct net_device *dev)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        int ret;

        if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
                ipoib_warn(priv, "P_Key 0x%04x not found\n", priv->pkey);
                clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
                return -1;
        }
        set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);

        ret = create_own_ah(priv);
        if (ret) {
                priv->own_ah = NULL;
                ipoib_warn(priv, "failed to create own ah\n");
                return -1;
        }

Looks like the ipath driver returns an error from the create_own_ah() call. Are you sure there are no ipath driver changes between RC3 and RC4? On which kernel did you hit this problem? What's the kernel PAGE_SIZE?

thanks
Shirley
[ewg] Re: [Stgt-devel] Update (Re: open iSCSI over iSER target RPM ...)
On Thu, 07 Feb 2008 11:05:03 -0500 Joe Landman [EMAIL PROTECTED] wrote:
> Update:
>
> [EMAIL PROTECTED] etc]# dd if=/dev/zero of=/big/local.file bs=256k count=10
> 10+0 records in
> 10+0 records out
> 2621440 bytes (26 GB) copied, 58.7484 seconds, 446 MB/s
>
> Better. I rebuilt OFED 1.2.5.5. Are there specific recommended tuning guides for iSER? Backing store in this case are real disks, and we can sink/source 750 MB/s on them, so I am not worried about disk IO bottlenecks, more worried about bad config of iSCSI/iSER.
>
> BTW: the 2TB LUN limit I asked about is still here in this code. Same machines (initiator and target) used for SRP reported correct LUN sizes. Here we are using the -868 open-iscsi initiator, and the tgt RPM announced. I would like to dig into this.

Thanks a lot, I thought that I tested tgt with 2TB devices but seems that I didn't. I'll try to fix the problem shortly.
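As a sanity check on the quoted dd figures: dd reports throughput as total bytes over elapsed seconds, in decimal MB/s. The byte and record counts in the archived output do not add up to the 26 GB that dd itself prints, so they look truncated. The sketch below assumes a count of 100000 blocks of 256 KiB, a hypothetical value chosen only because it gives about 26 GB; the elapsed time is the one dd printed.

```python
def throughput_mb_s(total_bytes, seconds):
    """Throughput in decimal megabytes per second, the unit dd prints."""
    return total_bytes / seconds / 1_000_000


BLOCK_BYTES = 256 * 1024  # bs=256k from the dd command line
ASSUMED_COUNT = 100_000   # assumption: not recoverable from the archived text
ELAPSED_SECONDS = 58.7484  # elapsed time printed by dd

total_bytes = BLOCK_BYTES * ASSUMED_COUNT
rate = throughput_mb_s(total_bytes, ELAPSED_SECONDS)

if __name__ == "__main__":
    print(f"{total_bytes} bytes in {ELAPSED_SECONDS} s = {rate:.1f} MB/s")
```

If the assumed count is right, the result agrees with the 446 MB/s that dd printed, which is about half of the 920 MB/sec read figure quoted in the announcement thread.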
Re: [ewg] [Fwd: Re: [ofa-general] Problem with latest OFED 1.3 build... IPoIB and iPATH]
On Thu, 2008-02-07 at 08:29 -0800, Shirley Ma wrote:
> On Thu, 2008-02-07 at 18:16 -0800, Ralph Campbell wrote:
> > # cat /etc/*release
> > Red Hat Enterprise Linux Server release 5 (Tikanga)
> > # uname -r
> > 2.6.18-8.el5
> >
> > 4K PAGE_SIZE
>
> I don't have ipath driver here. Otherwise I could try them out. A couple of suggestions here, could you please try them out?
>
> 1. try this on 64K page size, like RHEL5U1 to see whether you have the same issue.

We don't have any systems with 64K page size at hand.

> 2. Can you put a debug message in ipath_create_ah() to see whether this is a memory allocation failure?

I'm working on it.

> 3. How many IB cards in your system? If you have several, just leave one ipath there to see whether you can hit this problem.

only one card with one IB port.
Re: [ewg] [Fwd: Re: [ofa-general] Problem with latest OFED 1.3 build... IPoIB and iPATH]
On Thu, 2008-02-07 at 18:16 -0800, Ralph Campbell wrote:
> # cat /etc/*release
> Red Hat Enterprise Linux Server release 5 (Tikanga)
> # uname -r
> 2.6.18-8.el5
>
> 4K PAGE_SIZE

I don't have ipath driver here. Otherwise I could try them out. A couple of suggestions here, could you please try them out?

1. try this on 64K page size, like RHEL5U1 to see whether you have the same issue.
2. Can you put a debug message in ipath_create_ah() to see whether this is a memory allocation failure?
3. How many IB cards in your system? If you have several, just leave one ipath there to see whether you can hit this problem.

Thanks
Shirley