Hi, 

I have been working on LIO performance for several weeks, and now I can share 
some results and issues. In this mail I would like to discuss CPU usage and 
transaction speed. There are also CPU cycles spent in wait state on the 
initiator side; I really hope to get some hints and suggestions from you! 

Summary: 
(1) In the 512-byte, single-process read case, the transaction speed is 
2.818MB/s on a 1Gb network. The busy CPU core on the initiator side spent 
over 80% of its cycles in wait, while one core on the LIO side spent 43.6% in 
sys, with no cycles in user and none in wait. I assume the bottleneck for 
this small-block, single-threaded transfer is the lock operations on the LIO 
target side. 

(2) In the 512-byte, 32-process read case, the transaction speed is 
11.259MB/s on a 1Gb network. Only one CPU core on the LIO target side is 
busy, and its load is 100% in sys, while the other cores are completely idle, 
with no workload. I assume the bottleneck for this small-block, 
multi-threaded transfer is that there is no workload balancing on the target side. 

------------------------------------------------------------------------

Here is the detailed information: 


My environment: 
Two blade servers with E5 CPUs and 32GB RAM; one runs LIO and the other is 
the initiator. 
iSCSI backstore: a RAM disk created with "modprobe brd rd_size=4200000 
max_part=1 rd_nr=1" (/dev/ram0 on the target; it appears as /dev/sdc on the 
initiator side; the iSCSI export itself is sketched just below this list). 
1Gb network. 
OS: SUSE Linux Enterprise Server on both sides, kernel version 3.12.28-4. 
Initiator: Open-iSCSI initiator 2.0-873-20.4 
LIO-utils: version 4.1-14.6 
My tools: perf, netperf, nmon, fio 
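
For completeness, this is roughly how the RAM disk is exported as an iSCSI 
LUN. I am reconstructing it with targetcli from memory, so treat it only as a 
sketch: the IQN is a placeholder, and on older targetcli versions the 
backstore type is "iblock" rather than "block". 

targetcli /backstores/block create name=ram0 dev=/dev/ram0 
targetcli /iscsi create iqn.2015-03.com.example:ramdisk 
targetcli /iscsi/iqn.2015-03.com.example:ramdisk/tpg1/luns create /backstores/block/ram0 
targetcli /iscsi/iqn.2015-03.com.example:ramdisk/tpg1/portals create 0.0.0.0 3260 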


------------------------------------------------------------------------

For case (1): 

In the 512-byte, single-process read case, the transaction speed is 
2.897MB/s on a 1Gb network. The busy CPU core on the initiator side spent 
over 80% of its cycles in wait, while one core on the LIO side spent 43.6% in 
sys, with no cycles in user and none in wait. 

I ran this test case with the following command line: 
fio -filename=/dev/sdc -direct=1 -rw=read -bs=512 -size=2G -numjobs=1 
-runtime=600 -group_reporting -name=test 

Part of the results: 
Jobs: 1 (f=1): [R(1)] [100.0% done] [2818KB/0KB/0KB /s] [5636/0/0 iops] 
[eta 00m:00s] 
test: (groupid=0, jobs=1): err= 0: pid=1258: Mon Mar 16 21:48:14 2015 
  read : io=262144KB, bw=2897.8KB/s, iops=5795, runt= 90464msec 

For comparison, I ran a netperf test with the buffer set to 512 bytes and 
512 bytes per packet and got a transaction speed of 6.5MB/s, better than LIO 
did, so I used nmon and perf to find out why. 
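
The netperf invocation was roughly one of the following (I am quoting it from 
memory; <target-ip> is a placeholder for the target's address): 

netperf -H <target-ip> -t TCP_STREAM -- -m 512    # streaming test with 512-byte sends 
netperf -H <target-ip> -t TCP_RR -- -r 512,512    # request/response test, 512 bytes each way 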
This is what nmon showed about CPU usage on the initiator side: 


nmon 14i, Hostname=INIT, Refresh=10secs, 21:30.42 - CPU Utilisation: 

CPU   User%   Sys%  Wait%   Idle 
  1     0.0    0.0    0.2   99.8 
  2     0.1    0.1    0.0   99.8 
  3     0.0    0.2    0.0   99.8 
  4     0.0    0.0    0.0  100.0 
  5     0.0    0.0    0.0  100.0 
  6     0.0    3.1    0.0   96.9 
  7     2.8   12.2   83.8    1.2 
  8     0.0    0.0    0.0  100.0 
  9     0.0    0.0    0.0  100.0 
 10     0.0    0.0    0.0  100.0 
 11     0.0    0.0    0.0  100.0 
 12     0.0    0.0    0.0  100.0 
Avg     0.2    1.1    5.8   92.8 


We can see that on the initiator side only one core is busy, which is 
expected, but that core spent 83.8% of its time in wait, which seems strange, 
while on the LIO target side the only busy core spent 43.6% in sys, with no 
cycles in user or wait. Why does the initiator wait while there are still 
free CPU cycles on the target side? I then used perf record to monitor the 
LIO target and found that locks, especially spinlocks, consumed nearly 40% of 
the CPU cycles. I assume this is why the initiator side shows wait and low 
throughput: lock operations are the bottleneck in this case (small blocks, 
single-threaded transfer). Do you have any comments on that? 
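
For reference, I collected the profile on the target roughly like this (the 
exact options may differ slightly from what I actually ran): 

perf record -a -g sleep 30    # system-wide sampling with call graphs while fio is running 
perf report                   # _raw_spin_lock and related symbols show up near the top 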

------------------------------------------------------------------------


For case (2): 
In the 512-byte, 32-process read case, the transaction speed is 11.259MB/s 
on a 1Gb network. Only one CPU core on the LIO target side is busy, and its 
load is 100% in sys, while the other cores are completely idle, with no 
workload. 

I ran the case with this command line: 
fio -filename=/dev/sdc -direct=1 -rw=read -bs=512 -size=4GB -numjobs=32 
-runtime=600 -group_reporting -name=test 

The speed is 11.259MB/s. On the LIO target side, only one CPU core is busy; 
all other cores are completely idle. It looks as if there is no 
workload-balancing scheduler, and that seems to be the bottleneck in this 
case (small blocks, multi-threaded transfer). Would it make sense to add some 
code to balance the traffic across all cores? I hope to get some hints, 
suggestions, and explanations from you experts! 
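
On my side, as a first check rather than a fix for LIO itself, I plan to look 
at which kernel threads carry the load and whether spreading the NIC 
interrupts helps. A rough sketch, assuming a single-queue NIC named eth0 (the 
interface name, IRQ number and CPU masks are placeholders): 

# see on which CPUs the iSCSI/LIO kernel threads are running 
ps -eLo pid,psr,pcpu,comm | grep -i iscsi 

# find the NIC's IRQ number, then allow it on more cores (mask f = CPUs 0-3) 
grep eth0 /proc/interrupts 
echo f > /proc/irq/<IRQ>/smp_affinity 

# alternatively, enable receive packet steering for the queue 
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus 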



Thanks a lot for taking the time to read my mail. 
Have a nice day! 
BR 
Zhu Lingshan 




