Re: iSCSI throughput drops as link rtt increases?

2010-01-07 Thread Mike Christie

On 01/07/2010 12:57 AM, Jack Z wrote:

Hi Mike,

I use the default configuration of open-iscsi initiator and IET,
except I change the NOP interval to 500s and the
MaxRecvDataSegmentLength of IET to 262144 (default value is 8192).
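For reference, those two settings typically live here (just a sketch assuming
the stock config file locations -- double-check the parameter names against
your installed defaults):

  # open-iscsi initiator, /etc/iscsi/iscsid.conf
  node.conn[0].timeo.noop_out_interval = 500
  node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144

  # IET target, inside the Target section of /etc/ietd.conf
  MaxRecvDataSegmentLength 262144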

And the network I'm using is a straight-through cable between two
laptops. I'm not sure whether the NICs support or choose to use jumbo
frames... But as you can see in my previous post, the TCP segments of
iperf traffic were mostly as large as 65000+ bytes but over 90% of the
iSCSI ones were only 1448 bytes. So I thought that TCP did support
large segments but it seemed that iSCSI or TCP chose not to use the
large ones but went with the small ones for some reason...



If you do ifconfig and look at the MTU you can see the size. With most 
net drivers you can then run something like


ifconfig ethXYZ mtu 8192

do this on both boxes.
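To verify the larger MTU actually works end to end, a quick check (a sketch --
eth0 and the peer address are placeholders, and the NIC, driver and any switch
in between must all support it):

  ip link show eth0 | grep mtu      # or: ifconfig eth0, and look at the MTU field
  ping -M do -s 8164 10.0.0.2       # 8164 = 8192 - 20 (IP) - 8 (ICMP); -M do forbids fragmentation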
-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-is...@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.




Re: iscsi warning in 2.6.33-rc2

2010-01-07 Thread Tao Ma

It works. Thanks.

Tested-by: Tao Ma 

Mike Christie wrote:

On 12/29/2009 11:43 PM, Tao Ma wrote:

Hi all,
I met with this warning when using iscsi with the 2.6.33-rc2.

[ cut here ]
WARNING: at drivers/scsi/libiscsi_tcp.c:996






The other thing I can confirm is that I didn't meet with such a problem in
2.6.32-rc8.



Sorry for the late reply. I am just getting back.

I think this is due to the kfifo changes that got merged recently.

For the iSCSI side, if there is nothing in the r2t queue fifo at this time
it is fine. It just means that there is nothing to process, so I think
the WARN_ON should not have been added.


Could you try the attached patch? I have only compile tested it. I am 
starting up some testing for the patch and the other fifo changes that 
got merged now.






Re: iSCSI throughput drops as link rtt increases?

2010-01-07 Thread Pasi Kärkkäinen
On Thu, Jan 07, 2010 at 10:05:57AM -0800, Jack Z wrote:
> Hi Pasi,
> 
> Thanks again for your reply!
> 
> > > > Try to play and experiment with these options:
> >
> > > > -B 64k (blocksize 64k, try also 4k)
> > > > -I BD (block device, direct IO (O_DIRECT))
> > > > -K 16 (16 threads, aka 16 outstanding IOs. -K 1 should be the same as 
> > > > dd)
> >
> > > > Examples:
> >
> > > > Sequential (linear) reads using blocksize 4k and 4 simultaneous 
> > > > threads, for 60 seconds:
> > > > disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 60 -r /dev/sdX
> >
> > > > Random writes:
> >
> > > > disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 60 -w /dev/sdX
> >
> > > > 30% random reads, 70% random writes:
> > > > disktest -r -w -D30:70 -K2 -E32 -B 8k -T 60 -pR -Ibd -PA /dev/md4
> >
> > > > Hopefully that helps..
> >
> > > That did help. I tried the following combinations of -B -K and -p at
> > > 20 ms RTT and the other options were -h 30 -I BD -P T -S0:(1 GB size)
> >
> > > -B 4k/64k -K 4/64 -p l
> >
> > > It seems that when I put -p l there the performance goes down
> > > drastically...
> >
> > That's really weird.. linear/sequential (-p l) should always be faster
> > than random.
> >
> > > -B 4k -K 4/64 -p r
> >
> > > The disk throughput is similar to the one I used in the previous post
> > > "disktest -w -S0:1k -B 1024 /dev/sdb " and it's much lower than dd
> > > could get.
> >
> > like said, weird.
> 
> I'll try to repeat more of these tests that yielded weird results.
> I'll let you know if anything new comes up. :)
> 

Yep.


> 
> > > -B 64k -K 4 -p r
> >
> > > The disk throughput is higher than the last one but still not as high
> > > as dd could get.
> >
> > > -B 64k -K 64 -p r
> >
> > > The disk throughput was boosted to 8.06 MB/s and the IOPS was 129.0.
> > > At the link layer, the traffic rate was 70.536 Mbps (the TCP baseline
> > > was 96.202 Mbps). At the same time, dd ( bs=64K count=(1 GB size)) got
> > > a throughput of 6.7 MB/s and the traffic rate on the link layer was
> > > 57.749 Mbps.
> >
> > Ok.
> >
> > 129 IOPS * 64kB = 8256 kB/sec, which pretty much matches the 8 MB/sec
> > you measured.
> >
> > this still means there was only 1 outstanding IO.. and definitely not 64 
> > (-K 64).
> 
> For this part, I did not quite understand... Following your previous
> calculations,
> 
> 1 s = 1000 ms
> 1000 / 129 = 7.75 ms/IO
> 
> And the link RTT is 20 ms.
> 
> 20/7.75 = 2.58 > 2. So there should be at least 2 outstanding IOs...
> Am I correct...?
> 

That's correct. I was wrong. I was too busy when replying to you :)

> And for the 64 outstanding IOs, I'll try more experiments and see why
> that is not happening.
> 

It could be because of the IO elevator/scheduler.. see below.

> 
> > > Although not much, it was still an improvement and it was the first
> > > improvement I have ever seen since I started my experiments! Thank you
> > > very much!
> >
> > > As for
> >
> > > > Oh, also make sure you have 'oflag=direct' for dd.
> >
> > > The result was surprisingly low again... Do you think the reason might
> > > be that I was running dd on a device file (/dev/sdb), which did not
> > > have any partitions/file systems on it?
> >
> > > Thanks a lot!
> >
> > oflag=direct makes dd use O_DIRECT, aka bypass all kernel/initiator caches 
> > for writing.
> > iflag=direct would bypass all caches for reading.
> >
> > It shouldn't matter if you write or read from /dev/sda1 instead of /dev/sda.
> > As long as it's a raw block device, it shouldn't matter.
> > If you write/read to/from a filesystem, that obviously matters.
> >
> > What kind of target you are using for this benchmark?
> 
> It is the iSCSI Enterprise Target, which came with ubuntu 9.04.
> (iscsitarget (0.4.16+svn162-3ubuntu1)).
> 
> Thank you very much!
> 

Make sure you use the 'deadline' elevator on the target machine!! This is
important, since the default 'cfq' doesn't perform well with IETD.

You can either set the target machine kernel option 'elevator=deadline'
in grub.conf and reboot, or you can change the setting on the fly
like this:

echo deadline > /sys/block/sdX/queue/scheduler

do that for all the disks/devices you have in your target machine, i.e.
replace sdX with each disk.

Also if you're using fileio on IETD, change it to blockio.


One more thing: on the initiator machine you should use the 'noop'
scheduler for the iSCSI disks, so on the initiator do this for each iSCSI disk:

echo noop > /sys/block/sdX/queue/scheduler

And benchmark again after setting correct schedulers/elevators on both
the target and initiator, and the blockio mode on IETD.
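A quick way to apply all of that on the fly (a sketch only -- device names are
examples, and the sysfs paths assume the schedulers are compiled into the kernel):

  # target: deadline for every exported disk
  for q in /sys/block/sd*/queue/scheduler; do echo deadline > "$q"; done

  # initiator: noop for the iSCSI disk(s), here assumed to be sdb
  echo noop > /sys/block/sdb/queue/scheduler

  # blockio instead of fileio, set per LUN in /etc/ietd.conf on the target, e.g.:
  #   Lun 0 Path=/dev/sdb,Type=blockio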

-- Pasi





Re: iSCSI throughput drops as link rtt increases?

2010-01-07 Thread Jack Z
Hi Pasi,

Thanks again for your reply!

> > > Try to play and experiment with these options:
>
> > > -B 64k (blocksize 64k, try also 4k)
> > > -I BD (block device, direct IO (O_DIRECT))
> > > -K 16 (16 threads, aka 16 outstanding IOs. -K 1 should be the same as dd)
>
> > > Examples:
>
> > > Sequential (linear) reads using blocksize 4k and 4 simultaneous threads, 
> > > for 60 seconds:
> > > disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 60 -r /dev/sdX
>
> > > Random writes:
>
> > > disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 60 -w /dev/sdX
>
> > > 30% random reads, 70% random writes:
> > > disktest -r -w -D30:70 -K2 -E32 -B 8k -T 60 -pR -Ibd -PA /dev/md4
>
> > > Hopefully that helps..
>
> > That did help. I tried the following combinations of -B -K and -p at
> > 20 ms RTT and the other options were -h 30 -I BD -P T -S0:(1 GB size)
>
> > -B 4k/64k -K 4/64 -p l
>
> > It seems that when I put -p l there the performance goes down
> > drastically...
>
> That's really weird.. linear/sequential (-p l) should always be faster
> than random.
>
> > -B 4k -K 4/64 -p r
>
> > The disk throughput is similar to the one I used in the previous post
> > "disktest -w -S0:1k -B 1024 /dev/sdb " and it's much lower than dd
> > could get.
>
> like said, weird.

I'll try to repeat more of these tests that yielded weird results.
I'll let you know if anything new comes up. :)


> > -B 64k -K 4 -p r
>
> > The disk throughput is higher than the last one but still not as high
> > as dd could get.
>
> > -B 64k -K 64 -p r
>
> > The disk throughput was boosted to 8.06 MB/s and the IOPS was 129.0.
> > At the link layer, the traffic rate was 70.536 Mbps (the TCP baseline
> > was 96.202 Mbps). At the same time, dd ( bs=64K count=(1 GB size)) got
> > a throughput of 6.7 MB/s and the traffic rate on the link layer was
> > 57.749 Mbps.
>
> Ok.
>
> 129 IOPS * 64kB = 8256 kB/sec, which pretty much matches the 8 MB/sec
> you measured.
>
> this still means there was only 1 outstanding IO.. and definitely not 64 (-K 
> 64).

For this part, I did not quite understand... Following your previous
calculations,

1 s = 1000 ms
1000 / 129 = 7.75 ms/IO

And the link RTT is 20 ms.

20/7.75 = 2.58 > 2. So there should be at least 2 outstanding IOs...
Am I correct...?

And for the 64 outstanding IOs, I'll try more experiments and see why
that is not happening.


> > Although not much, it was still an improvement and it was the first
> > improvement I have ever seen since I started my experiments! Thank you
> > very much!
>
> > As for
>
> > > Oh, also make sure you have 'oflag=direct' for dd.
>
> > The result was surprisingly low again... Do you think the reason might
> > be that I was running dd on a device file (/dev/sdb), which did not
> > have any partitions/file systems on it?
>
> > Thanks a lot!
>
> oflag=direct makes dd use O_DIRECT, aka bypass all kernel/initiator caches 
> for writing.
> iflag=direct would bypass all caches for reading.
>
> It shouldn't matter if you write or read from /dev/sda1 instead of /dev/sda.
> As long as it's a raw block device, it shouldn't matter.
> If you write/read to/from a filesystem, that obviously matters.
>
> What kind of target you are using for this benchmark?

It is the iSCSI Enterprise Target, which came with ubuntu 9.04.
(iscsitarget (0.4.16+svn162-3ubuntu1)).

Thank you very much!

Cheers,
Jack




Re: iSCSI throughput drops as link rtt increases?

2010-01-07 Thread Jack Z
Hi Ulrich,

Thanks for your reply!

>
> > I was testing the performance of open-iscsi initiator with IET target
> > over a 100Mbps Ethernet link with emulated rtt.  What I did was to do
> > raw disk sequential write by
>
> > $ dd if=/dev/zero of=/dev/sdb bs=1024 count=1048576
>
> > , in which /dev/sdb is the iSCSI device. I also measured TCP
> > throughput using iperf with the default setup except "-n 1024M". And I
> > got the following data on iSCSI throughput and TCP throughput v.s. rtt
>
> > rtt (ms)   iSCSI throughput by dd (MB/s)   TCP throughput by iperf (Mbit/s)
> > 0.2        11.3                            94.3
> > 4          11.1                            94.3
> > 8          10.2                            94.3
> > 12         8.6                             94.2
> > 16         7.2                             94.2
> > 20         6.0                             94.1
>
> > local disk throughput by dd was 26.7 MB/s.
>
> > As shown in the table above, iSCSI throughput declined rapidly with
> > rtt increased from 0.2ms to 20ms. TCP throughput, however, only
> > dropped less than 1 percent.
>
> From what I know the (estimated) RTT (Round Trip Time) increases if a link
> problem (i.e. lost packets) was detected (if other parameters are unchanged).

As explained at the beginning of my first thread, I was doing an
experiment. And the experiment was done on two laptops over a straight-
through cable. The RTT was increased intentionally, as I was measuring
the iSCSI performance against RTT changes. The other parameters of the
link, such as packet loss etc, were not changed and no packet loss was
observed when using ping over the link.

> > Then I used Wireshark to grab the traces of iSCSI and iperf and I
> > found lots of iSCSI PDUs were divided into TCP segments of 1448 bytes
> > but with iperf TCP segments could be as large as 65000+ bytes.
>
> How would you transport such a segment unfragmented?

> > I also skimmed through the iSCSI specification, but it seemed no luck
> > there either...
>
> > I know the Ethernet MTU is 1500 byte long and that might be the reason
> > of the 1448 byte TCP segments, but iperf did get to send much larger
> > TCP segments of 65000+ bytes...
>
> over which layer 2?

As Mike suggested in his reply, this could be a jumbo frame. The
following is the data of a 65160-byte packet captured by Wireshark:

No.  Time      Source    S_Port  Destination  D_Port  Protocol  Info
266  0.137810  10.0.0.1  56099   10.0.0.2     5001    TCP       56099 > 5001 [ACK] Seq=376505 Ack=1 Win=92 Len=65160
[Packet size limited during capture]

Frame 266 (65226 bytes on wire, 58 bytes captured)
Arrival Time: Jan  4, 2010 04:44:33.711762000
[Time delta from previous captured frame: 0.000206000 seconds]
[Time delta from previous displayed frame: 0.002861000 seconds]
[Time since reference or first frame: 0.13781 seconds]
Frame Number: 266
Frame Length: 65226 bytes
Capture Length: 58 bytes
[Frame is marked: True]
[Protocols in frame: eth:ip:tcp]
[Coloring Rule Name: TCP]
[Coloring Rule String: tcp]
Ethernet II, Src: HonHaiPr_0f:35:65 , Dst: Ibm_8d:59:02
Destination: Ibm_8d:59:02
Address: Ibm_8d:59:02
 ...0     = IG bit: Individual address
(unicast)
 ..0.     = LG bit: Globally unique
address (factory default)
Source: HonHaiPr_0f:35:65
Address: HonHaiPr_0f:35:65
 ...0     = IG bit: Individual address
(unicast)
 ..0.     = LG bit: Globally unique
address (factory default)
Type: IP (0x0800)
Internet Protocol, Src: 10.0.0.1 (10.0.0.1), Dst: 10.0.0.2 (10.0.0.2)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN:
0x00)
 00.. = Differentiated Services Codepoint: Default (0x00)
 ..0. = ECN-Capable Transport (ECT): 0
 ...0 = ECN-CE: 0
Total Length: 65212
Identification: 0x8729 (34601)
Flags: 0x04 (Don't Fragment)
0... = Reserved bit: Not set
.1.. = Don't fragment: Set
..0. = More fragments: Not set
Fragment offset: 0
Time to live: 64
Protocol: TCP (0x06)
Header checksum: 0xa10f [correct]
[Good: True]
[Bad : False]
Source: 10.0.0.1 (10.0.0.1)
Destination: 10.0.0.2 (10.0.0.2)
Transmission Control Protocol, Src Port: 56099 (56099), Dst Port:
commplex-link (5001), Seq: 376505, Ack: 1, Len: 65160
Source port: 56099 (56099)
Destination port: commplex-link (5001)
[Stream index: 0]
Sequence number: 376505    (relative sequence number)
[Next sequence number: 441665    (relative sequence number)]
Acknowledgement number: 1    (relative ack number)
Header length: 32 bytes
Flags: 0x10 (ACK)
0...  = Congestion Window Reduced (CWR): Not set
.0..  = ECN-Echo: Not set
..0.  = Urgent: Not set
...1  = Acknowledgement: Set
 0... = Push: Not set
 .0.. 

Re: [PATCH] support NIC configuration in iBFT

2010-01-07 Thread Ulrich Windl
On getopt:

I always prefer to list options in an ordered way (alphabetically).

-   while ((ch = getopt_long(argc, argv, "i:t:g:a:p:d:u:w:U:W:bfvh",
+   while ((ch = getopt_long(argc, argv, "i:t:g:a:p:d:u:w:U:W:bnfvh",

On 7 Jan 2010 at 14:16, Alex Zeffertt wrote:

> Mike Christie wrote:
> > 
> > Thanks for doing this. Sorry for the late reply.
> > 
> > 
> > Just one comment on the patch. Could you move the code in the 'n' case
> > 
> > +   case 'n':
> > +   /*
> > +* Bring up NICs required by targets in iBFT
> > +* using IP addresses and routing info from iBFT.
> > +*/
> > 
> > ..
> > 
> > 
> > to some helper function, so it is not so crowded and a little easier to read?
> > 
> 
> No problem.  Please find a new patch attached.
> 
> Regards,
> 
> Alex
> 






Re: [PATCH] support NIC configuration in iBFT

2010-01-07 Thread Alex Zeffertt

Mike Christie wrote:


Thanks for doing this. Sorry for the late reply.


Just one comment on the patch. Could you move the code in the 'n' case

+   case 'n':
+   /*
+* Bring up NICs required by targets in iBFT
+* using IP addresses and routing info from iBFT.
+*/

..


to some helper function, so it is not so crowded and a little easier to read?



No problem.  Please find a new patch attached.

Regards,

Alex


iscsistart option to bring up NICs using configuration in iBFT.

For each target listed, iSCSI Boot Firmware Tables specify which NIC to use and
how it should be configured.  Until now this information has been ignored by
open-iscsi.  This patch enables iscsistart to apply the NIC configuration.

The new command "iscsistart -n" applies the NIC configuration specified in the
iBFT for each valid target.

The primary benefit of this is that it allows the initrd to extract networking
information from the iBFT rather than hard code it.  If the initrd uses the iBFT
for networking info then when this info is modified via the BIOS it is not
necessary to rebuild the initrd.

Signed-off-by

diff --git a/usr/iscsistart.c b/usr/iscsistart.c
index 8482ad5..2ee2674 100644
--- a/usr/iscsistart.c
+++ b/usr/iscsistart.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +33,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include "initiator.h"
 #include "iscsi_ipc.h"
@@ -73,6 +79,7 @@ static struct option const long_options[] = {
 	{"password_in", required_argument, NULL, 'W'},
 	{"debug", required_argument, NULL, 'd'},
 	{"fwparam_connect", no_argument, NULL, 'b'},
+	{"fwparam_network", no_argument, NULL, 'n'},
 	{"fwparam_print", no_argument, NULL, 'f'},
 	{"help", no_argument, NULL, 'h'},
 	{"version", no_argument, NULL, 'v'},
@@ -99,6 +106,7 @@ Open-iSCSI initiator.\n\
  -W, --password_in=N      set incoming password to N (optional)\n\
  -d, --debug debuglevel   print debugging information\n\
  -b, --fwparam_connect    create a session to the target\n\
+  -n, --fwparam_network    bring up the network as specified by iBFT\n\
  -f, --fwparam_print      print the iBFT to STDOUT\n\
  -h, --help               display this help and exit\n\
  -v, --version            display version and exit\n\
@@ -199,6 +207,140 @@ static int setup_session(void)
 	return rc;
 }
 
+static int setup_nics(void)
+{
+	struct boot_context *context;
+	char *iface_prev = NULL;
+	int sock;
+	int ret;
+
+	/* Create socket for making networking changes */
+	if ((sock = socket(AF_INET, SOCK_DGRAM, 0)) == -1) {
+		perror("socket(AF_INET, SOCK_DGRAM, 0)");
+		exit(1);
+	}
+			
+	/* 
+	 * For each target in iBFT bring up required NIC and use routing
+	 * to force iSCSI traffic through correct NIC
+	 */
+	list_for_each_entry(context, &targets, list) {
+	
+		/* Bring up NIC with correct address  - unless it
+		 * has already been handled (2 targets in IBFT may share one NIC)
+		 */
+		struct sockaddr_in ipaddr = { .sin_family = AF_INET };
+		struct sockaddr_in netmask = { .sin_family = AF_INET };
+		struct sockaddr_in hostmask = { .sin_family = AF_INET };
+		struct sockaddr_in gateway = { .sin_family = AF_INET };
+		struct sockaddr_in tgt_ipaddr = { .sin_family = AF_INET };
+		struct rtentry rt;
+		struct ifreq ifr;
+
+		if (!strlen(context->iface)) {
+			printf("No iface in fw entry\n");
+			ret = -1;
+			continue;
+		}
+		if (!inet_aton(context->ipaddr, &ipaddr.sin_addr)) {
+			printf("Invalid or no ipaddr in fw entry\n");
+			ret = -1;
+			continue;
+		}
+
+		if (!inet_aton(context->mask, &netmask.sin_addr)) {
+			printf("Invalid or no netmask in fw entry\n");
+			ret = -1;
+			continue;
+		}
+		inet_aton("255.255.255.255", &hostmask.sin_addr);
+
+		if (!inet_aton(context->target_ipaddr, &tgt_ipaddr.sin_addr)) {
+			printf("Invalid or no target ipaddr in fw entry\n");
+			ret = -1;
+			continue;
+		}
+
+		/* Only set IP/NM if this is a new interface */
+		if (iface_prev == NULL || strcmp(context->iface, iface_prev)) {
+	
+			/* Note: test above works because there is a maximum of two targets in the iBFT */
+			iface_prev = context->iface;
+
+			/* TODO: create vlan if strlen(context->vlan) */
+
+			/* Bring up interface */
+			memset(&ifr, 0, sizeof(ifr));
+			strncpy(ifr.ifr_name, context->iface, IFNAMSIZ);
+			ifr.ifr_flags = IFF_UP | IFF_RUNNING;
+			if (ioctl(sock, SIOCSIFFLAGS, &ifr) < 0) {
+				perror("ioctl(SIOCSIFFLAGS)");
+				ret = -1;
+				continue;
+			}
+			/* Set IP address */
+

Re: iSCSI throughput drops as link rtt increases?

2010-01-07 Thread Pasi Kärkkäinen
On Wed, Jan 06, 2010 at 11:59:37PM -0800, Jack Z wrote:
> Hi Pasi,
> 
> Thank you very much for your help. I really appreciate it!
> 
> On Jan 5, 12:58 pm, Pasi Kärkkäinen  wrote:
> > On Tue, Jan 05, 2010 at 02:05:03AM -0800, Jack Z wrote:
> >
> >
> > > > Try using some benchmarking tool that can do multiple outstanding IOs..
> > > > for example ltp disktest.
> >
> > > And I tried ltp disktest, too. But I'm not sure whether I used it
> > > right because the result was a little surprising...
> >
> > > I did
> >
> > > disktest -w -S0:1k -B 1024 /dev/sdb
> >
> > > (/dev/sdb is the iSCSI device file, no partition or file system on it)
> >
> > > And the result was:
> >
> > > | 2010/01/05-02:58:26 | START | 27293 | v1.4.2 | /dev/sdb | Start
> > > args: -w -S0:1024k -B 1024 -PA (-I b) (-N 8385867) (-K 4) (-c) (-p R)
> > > (-L 1048577) (-D 0:100) (-t 0:2m) (-o 0)
> > > | 2010/01/05-02:58:26 | INFO  | 27293 | v1.4.2 | /dev/sdb | Starting
> > > pass
> > > ^C| 2010/01/05-03:00:58 | STAT  | 27293 | v1.4.2 | /dev/sdb | Total
> > > bytes written in 85578 transfers: 87631872
> > > | 2010/01/05-03:00:58 | STAT  | 27293 | v1.4.2 | /dev/sdb | Total
> > > write throughput: 701055.0B/s (0.67MB/s), IOPS 684.6/s.
> > > | 2010/01/05-03:00:58 | STAT  | 27293 | v1.4.2 | /dev/sdb | Total
> > > Write Time: 125 seconds (0d0h2m5s)
> > > | 2010/01/05-03:00:58 | STAT  | 27293 | v1.4.2 | /dev/sdb | Total
> > > overall runtime: 152 seconds (0d0h2m32s)
> > > | 2010/01/05-03:00:58 | END   | 27293 | v1.4.2 | /dev/sdb | User
> > > Interrupt: Test Done (Passed)
> >
> > As you can see, the throughput was only 0.67MB/s, with 87631872 bytes
> > written in 85578 transfers...
> > > I also tweaked the options with "-p l" and/or "-I bd" (change seek
> > pattern to linear and/or specify IO type as block and direct IO) but
> > > no improvement happened...
> >
> > Hmm.. so it does 684 IO operations per second (IOPS), and each IO was 1k
> > in size, so it makes 684 kB/sec of throughput.
> >
> > 1000 milliseconds (1 second) divided by 684 IOPS is 1.46 milliseconds per 
> > IO..
> >
> > Are you sure you had 16ms of rtt?
> 
> Actually that was probably the output from 0.2 ms rtt instead of 16
> ms... I'm sorry for the mistake. I tried again the same command on a
> 16ms RTT, and the IOPS was mostly around 180.
> 

1000ms divided by 16ms rtt gives you 62.5 synchronous IOPS max.
So that means you had about 3 outstanding IOs running, since you
got 180 IOPS.

If I'm still following everything correctly :)
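The back-of-the-envelope math as a one-liner, for anyone repeating this
(plain arithmetic, using the numbers from this thread):

  rtt_ms=16; iops=180
  echo "scale=2; $iops / (1000 / $rtt_ms)" | bc
  # prints 2.88, i.e. roughly 3 IOs in flight on average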

> 
> > Try to play and experiment with these options:
> >
> > -B 64k  (blocksize 64k, try also 4k)
> > -I BD (block device, direct IO (O_DIRECT))
> > -K 16 (16 threads, aka 16 outstanding IOs. -K 1 should be the same as dd)
> >
> > Examples:
> >
> > Sequential (linear) reads using blocksize 4k and 4 simultaneous threads, 
> > for 60 seconds:
> > disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 60 -r /dev/sdX
> >
> > Random writes:
> >
> > disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 60 -w /dev/sdX
> >
> > 30% random reads, 70% random writes:
> > disktest -r -w -D30:70 -K2 -E32 -B 8k -T 60 -pR -Ibd -PA /dev/md4
> >
> > Hopefully that helps..
> 
> That did help. I tried the following combinations of -B -K and -p at
> 20 ms RTT and the other options were -h 30 -I BD -P T -S0:(1 GB size)
> 
> -B 4k/64k -K 4/64 -p l
> 
> It seems that when I put -p l there the performance goes down
> drastically...
> 

That's really weird.. linear/sequential (-p l) should always be faster
than random.

> -B 4k -K 4/64 -p r
> 
> The disk throughput is similar to the one I used in the previous post
> "disktest -w -S0:1k -B 1024 /dev/sdb " and it's much lower than dd
> could get.
> 

like said, weird.

> -B 64k -K 4 -p r
> 
> The disk throughput is higher than the last one but still not as high
> as dd could get.
> 
> -B 64k -K 64 -p r
> 
> The disk throughput was boosted to 8.06 MB/s and the IOPS was 129.0.
> At the link layer, the traffic rate was 70.536 Mbps (the TCP baseline
> was 96.202 Mbps). At the same time, dd ( bs=64K count=(1 GB size)) got
> a throughput of 6.7 MB/s and the traffic rate on the link layer was
> 57.749 Mbps.
> 

Ok.

129 IOPS * 64kB = 8256 kB/sec, which pretty much matches the 8 MB/sec
you measured.

this still means there was only 1 outstanding IO.. and definitely not 64 (-K 
64).

> Although not much, it was still an improvement and it was the first
> improvement I have ever seen since I started my experiments! Thank you
> very much!
> 
> As for
> 
> > Oh, also make sure you have 'oflag=direct' for dd.
> 
> The result was surprisingly low again... Do you think the reason might
> be that I was running dd on a device file (/dev/sdb), which did not
> have any partitions/file systems on it?
> 
> Thanks a lot!
> 

oflag=direct makes dd use O_DIRECT, aka bypass all kernel/initiator caches for 
writing.
iflag=direct would bypass all caches for reading.
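For example (a sketch -- block size and count are arbitrary, /dev/sdb is the
iSCSI disk from this thread, and O_DIRECT needs the block size to be a multiple
of the device sector size):

  dd if=/dev/zero of=/dev/sdb bs=64k count=16384 oflag=direct   # ~1 GB write, no page cache
  dd if=/dev/sdb of=/dev/null bs=64k count=16384 iflag=direct   # ~1 GB read, no page cache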

It shouldn't matter if you write or read from /dev/sda1 instead of /dev/sda. 
As long as it's a raw block device, i

Re: iSCSI throughput drops as link rtt increases?

2010-01-07 Thread Ulrich Windl
On 4 Jan 2010 at 6:54, Jack Z wrote:

> Hi all,
> 
> I was testing the performance of open-iscsi initiator with IET target
> over a 100Mbps Ethernet link with emulated rtt.  What I did was to do
> raw disk sequential write by
> 
> $ dd if=/dev/zero of=/dev/sdb bs=1024 count=1048576
> 
> , in which /dev/sdb is the iSCSI device. I also measured TCP
> throughput using iperf with the default setup except "-n 1024M". And I
> got the following data on iSCSI throughput and TCP throughput v.s. rtt
> 
> rtt (ms)   iSCSI throughput by dd (MB/s)   TCP throughput by iperf (Mbit/s)
> 0.2        11.3                            94.3
> 4          11.1                            94.3
> 8          10.2                            94.3
> 12         8.6                             94.2
> 16         7.2                             94.2
> 20         6.0                             94.1
> 
> local disk throughput by dd was 26.7 MB/s.
> 
> As shown in the table above, iSCSI throughput declined rapidly with
> rtt increased from 0.2ms to 20ms. TCP throughput, however, only
> dropped less than 1 percent.

From what I know the (estimated) RTT (Round Trip Time) increases if a link
problem (i.e. lost packets) was detected (if other parameters are unchanged).

> 
> Then I used Wireshark to grab the traces of iSCSI and iperf and I
> found lots of iSCSI PDUs were divided into TCP segments of 1448 bytes
> but with iperf TCP segments could be as large as 65000+ bytes.

How would you transport such a segment unfragmented?

> 
> I first thought this was because of the small default value (8192) for
> MaxRecvDataSegmentLength. So I increased that value to 262144. But in
> a later test with 16ms rtt, I found the iSCSI throughput was only
> improved by 0.7 MB/s and a lot of iSCSI PDUs were still divided into
> 1448 byte long TCP segments... So I think MaxRecvDataSegmentLength may
> not be the reason.

I think the question is how big the TCP receive window will be.


> 
> I also skimmed through the iSCSI specification, but it seemed no luck
> there either...
> 
> I know the Ethernet MTU is 1500 byte long and that might be the reason
> of the 1448 byte TCP segments, but iperf did get to send much larger
> TCP segments of 65000+ bytes...

over which layer 2?

> 
> So does anyone have any idea about this: why iSCSI is not fully
> utilizing the bandwidth on long rtt links by increasing the TCP
> segment size?

Sorry, but I think utilizing a high-delay connection works via increasing the
window size (i.e. the number of packets in flight), not the size of the segments.
Both would be valid, but due to layer 2 and layer 3 restrictions (ISO OSI talk),
only sending more packets while waiting for an answer is a valid assumption
(unless you have a dedicated single-hop line).
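As a rough sanity check (plain bandwidth-delay-product arithmetic, nothing
iSCSI-specific): a 100 Mbit/s link at 20 ms RTT needs about
100e6 / 8 * 0.020 = 250 kB of data in flight to keep the pipe full. The
relevant Linux limits can be inspected with sysctl (standard kernel knobs,
listed here only for reference):

  sysctl net.ipv4.tcp_window_scaling          # should be 1
  sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem  # min/default/max socket buffer sizes
  sysctl net.core.rmem_max net.core.wmem_max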

Regards,
Ulrich





Re: iSCSI throughput drops as link rtt increases?

2010-01-07 Thread Jack Z
Hi Pasi,

Thank you very much for your help. I really appreciate it!

On Jan 5, 12:58 pm, Pasi Kärkkäinen  wrote:

> On Tue, Jan 05, 2010 at 02:05:03AM -0800, Jack Z wrote:
> > > Try using some benchmarking tool that can do multiple outstanding IOs..
> > > for example ltp disktest.
> > And I tried ltp disktest, too. But I'm not sure whether I used it
> > right because the result was a little surprising...
> > I did
> > disktest -w -S0:1k -B 1024 /dev/sdb
> > (/dev/sdb is the iSCSI device file, no partition or file system on it)
> > And the result was:
> > | 2010/01/05-02:58:26 | START | 27293 | v1.4.2 | /dev/sdb | Start
> > args: -w -S0:1024k -B 1024 -PA (-I b) (-N 8385867) (-K 4) (-c) (-p R)
> > (-L 1048577) (-D 0:100) (-t 0:2m) (-o 0)
> > | 2010/01/05-02:58:26 | INFO  | 27293 | v1.4.2 | /dev/sdb | Starting
> > pass
> > ^C| 2010/01/05-03:00:58 | STAT  | 27293 | v1.4.2 | /dev/sdb | Total
> > bytes written in 85578 transfers: 87631872
> > | 2010/01/05-03:00:58 | STAT  | 27293 | v1.4.2 | /dev/sdb | Total
> > write throughput: 701055.0B/s (0.67MB/s), IOPS 684.6/s.
> > | 2010/01/05-03:00:58 | STAT  | 27293 | v1.4.2 | /dev/sdb | Total
> > Write Time: 125 seconds (0d0h2m5s)
> > | 2010/01/05-03:00:58 | STAT  | 27293 | v1.4.2 | /dev/sdb | Total
> > overall runtime: 152 seconds (0d0h2m32s)
> > | 2010/01/05-03:00:58 | END   | 27293 | v1.4.2 | /dev/sdb | User
> > Interrupt: Test Done (Passed)
> > As you can see, the throughput was only 0.67MB/s, with 87631872 bytes
> > written in 85578 transfers...
> > I also tweaked the options with "-p l" and/or "-I bd" (change seek
> > pattern to linear and/or specify IO type as block and direct IO) but
> > no improvement happened...
> Hmm.. so it does 684 IO operations per second (IOPS), and each IO was 1k
> in size, so it makes 684 kB/sec of throughput.
> 1000 milliseconds (1 second) divided by 684 IOPS is 1.46 milliseconds per IO..
> Are you sure you had 16ms of rtt?


Actually that was probably the output from 0.2 ms rtt instead of 16
ms... I'm sorry for the mistake. I tried again the same command on a
16ms RTT, and the IOPS was mostly around 150.

> Try to play and experiment with these options:
>
> -B 64k  (blocksize 64k, try also 4k)
> -I BD (block device, direct IO (O_DIRECT))
> -K 16 (16 threads, aka 16 outstanding IOs. -K 1 should be the same as dd)
>
> Examples:
>
> Sequential (linear) reads using blocksize 4k and 4 simultaneous threads, for 
> 60 seconds:
> disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 60 -r /dev/sdX
>
> Random writes:
>
> disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 60 -w /dev/sdX
>
> 30% random reads, 70% random writes:
> disktest -r -w -D30:70 -K2 -E32 -B 8k -T 60 -pR -Ibd -PA /dev/md4
>
> Hopefully that helps..


That did help! I tried the following combinations of -B -K and -p at
20 ms RTT and the other options were -h 30 -I BD -P T -S0:(1 GB size)

-B 4k/64k -K 4/64 -p l

When I put -p l there the performance went down
drastically...

-B 4k -K 4/64 -p r

The disk throughput was similar to the one I used in the previous post
"disktest -w -S0:1k -B 1024 /dev/sdb " and it was much lower than dd
could get.

-B 64k -K 4 -p r

The disk throughput was higher than the last one but still not as high
as dd could get.

-B 64k -K 64 -p r

The disk throughput was boosted to 8.06 MB/s and the IOPS was 129.0.
At the link layer, the traffic rate was 70.536 Mbps (the TCP baseline
was 96.202 Mbps). At the same time, dd ( bs=64K count=(1 GB size)) got
a throughput of 6.7 MB/s and the traffic rate on the link layer was
57.749 Mbps.

Although not much, it was still an improvement :) and it was the first
improvement I have ever seen since I started my experiments! Thank you
very much!

As for
> Oh, also make sure you have 'oflag=direct' for dd.

The result was surprisingly low again... Do you think the reason might
be that I was running dd on a device file (/dev/sdb), which did not
have any partitions/file systems on it?

Thanks a lot!

Jack
