On Jun 6, 2014, at 2:45 PM, Jun Wu <[email protected]> wrote:

> The queue_depth is 32 by default, so all my previous tests were run
> with a queue_depth of 32.
> 
> The results of my tests today confirm that with a higher queue_depth,
> there are more aborts on the initiator side and corresponding
> "Exchange timer armed : 0 msecs" messages on the target side.
> I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there are no
> aborts or abnormal messages. For 3, there are 15 aborts and 15
> "0 msecs" messages. When queue_depth is increased to 6, there are
> 81 aborts and 81 "0 msecs" messages for the same fio test.
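For reference, the per-device depth can usually be adjusted at runtime
through the standard SCSI sysfs attribute, so a sweep like this does not
require reloading anything. A minimal sketch in C, assuming a
hypothetical device sdb and assuming the driver accepts runtime changes
(echoing the value from a shell works just as well):

#include <stdio.h>
#include <stdlib.h>

/*
 * Set the SCSI queue depth for one block device by writing the
 * standard sysfs attribute. Device name and depth are examples only.
 */
int main(void)
{
	const char *path = "/sys/block/sdb/device/queue_depth";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return EXIT_FAILURE;
	}
	fprintf(f, "%d\n", 2);	/* depths 1 and 2 showed no aborts above */
	fclose(f);
	return EXIT_SUCCESS;
}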

<snip>

> When the aborts occur, the "iostat -x 1" command on the target side
> starts to show 0 iops on some of the drives. Sometimes all 10 target
> drives show 0 iops.
> 
> What is a reasonable number to assume for the maximum number of
> outstanding I/Os?


You mentioned previously that fewer drives also worked better. The fact 
that both workarounds help, reducing queue depth and using fewer drives, 
makes me wonder whether there is a power issue in the system housing the 
drives. It is surprisingly common for a fully populated system to lack 
adequate power for fully active drives. Under those conditions, the drives 
can show a variety of symptoms, ranging from pausing I/O to resetting. 
When that happens, the target will certainly be overloaded, exchanges will 
time out, and aborts will follow. If things work well with one or two 
fewer drives at full queue depth, I would suspect a power issue, or 
possibly a cooling issue, rather than a software issue.

It is even possible that there is a vibration issue in the chassis, with 
the drives interfering with one another and triggering error-recovery I/O 
interruptions, which could also result in timeouts and exchange aborts.

Nab, earlier you mentioned generating a busy response. I think that is a 
good capability to have. It is a little complicated because different 
initiators have different timeouts; ideally, you would generate the busy 
response shortly before the initiator times out. Although that might 
reduce the logging in this case, I doubt that the lack of that capability 
is the root cause of the remaining issue here.
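
To make the idea concrete, here is a rough user-space sketch of the
admission decision, not the actual target code; the function name and
the threshold are invented for illustration:

#include <stdio.h>

/* SAM status codes; values match include/scsi/scsi.h */
#define SAM_STAT_GOOD		0x00
#define SAM_STAT_BUSY		0x08
#define SAM_STAT_TASK_SET_FULL	0x28

/*
 * Hypothetical admission check: if accepting another command would
 * exceed what the target can service before the initiator's timer
 * fires, answer immediately with a queue-full status rather than
 * letting the command sit until the initiator aborts it.
 */
static int admit_command(unsigned int outstanding, unsigned int limit)
{
	if (outstanding >= limit)
		return SAM_STAT_TASK_SET_FULL;	/* task set saturated */
	return SAM_STAT_GOOD;
}

int main(void)
{
	/* e.g. 10 drives at the default depth of 32 = 320 outstanding */
	printf("status: 0x%02x\n", admit_command(320, 320));
	return 0;
}

One nice property of TASK SET FULL over plain BUSY is that initiators
that track queue-full conditions, as the Linux SCSI midlayer can, will
back off their own queue depth in response.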

-- 
Mark Rustad, Networking Division, Intel Corporation
