On Mon, 2014-06-09 at 17:00 +0000, Rustad, Mark D wrote:
> On Jun 6, 2014, at 2:45 PM, Jun Wu <[email protected]> wrote:
>
> > The queue_depth is 32 by default. So all my previous tests were based
> > on 32 queue_depth.
> >
> > The result of my tests today confirms that with higher queue_depth,
> > there are more aborts on the initiator side and corresponding
> > "Exchange timer armed : 0 msecs" messages on the target side.
> > I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there is no abort
> > or any abnormal messages. For 3, there are 15 instances of aborts and
> > 15 instances of "0 msecs" messages. When queue_depth increased to 6,
> > there are 81 instances of the aborts and 81 instances of "0 msecs"
> > messages for the same fio test.
>
> <snip>
>
> > When the abort occurs, "iostat -x 1" command on the target side starts
> > to show 0 iops on some of the drives. Sometimes all the 10 target
> > drives have 0 iops.
> >
> > What is the reasonable number to assume for the maximum number of
> > outstanding IO?
>
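(If the queue_depth being varied above is the per-device SCSI queue depth on
the initiator, it can be capped at runtime through the standard sysfs
attribute /sys/block/<dev>/device/queue_depth. The small user-space sketch
below shows one way to do that; the tool name and any device path used with
it are only examples, not taken from this thread.)

/*
 * set_qd.c - minimal sketch: cap the SCSI queue depth of one initiator-side
 * device by writing its sysfs attribute.  Example invocation (path is
 * illustrative only):
 *
 *     ./set_qd /sys/block/sdb/device/queue_depth 2
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        FILE *f;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <queue_depth-sysfs-path> <depth>\n",
                        argv[0]);
                return 1;
        }

        f = fopen(argv[1], "w");
        if (!f) {
                perror("fopen");
                return 1;
        }

        /* The kernel applies the written value as the new queue depth. */
        fprintf(f, "%d\n", atoi(argv[2]));

        if (fclose(f) != 0) {
                perror("fclose");
                return 1;
        }
        return 0;
}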
As mentioned before, "It completely depends on the fabric, initiator, and
backend storage." See the related text for more details.

> You mentioned previously that fewer drives also worked better. The
> combination of workarounds, reduced queue depth or fewer drives, makes me
> wonder if there is a power issue in the system housing the drives. It is
> surprisingly common for a fully-populated system to not really have adequate
> power for fully active drives. Under those conditions, the drives can have a
> variety of symptoms ranging from pausing I/O to resetting. When that happens,
> the target will definitely be overloaded, things will time out and aborts
> will happen. If things work well with one or two fewer drives with full queue
> depth, I would really suspect a power, or possibly cooling issue rather than
> a software issue.
>
> It is even possible that there is a vibration issue in the chassis that
> results in the drives interfering with one another and causing error recovery
> related I/O interruptions that could also result in timeouts and exchange
> aborts.
>

Using ramdisks instead of the backend disks could confirm this and isolate
the problem to the backend.

> Nab, earlier you mentioned generating a busy response. I think that is a good
> thing to be able to do. It is a little complicated because different
> initiators have different timeouts. Ideally, you'd like to generate the busy
> response shortly before the initiator times out. Although that might reduce
> the logging in this case, I really suspect that not having that capability is
> not the root cause of the remaining issue here.
>

If the target sent QUEUE FULL to hosts based on its pending backlog, instead
of reacting only after data-out failures, host timeouts could be avoided;
that would be in line with the ideal scenario described above (a rough sketch
of the idea follows below).

Thanks,
Vasu
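A rough sketch of that backlog-based idea, for illustration only: the
structure, threshold, and function names below are hypothetical and are not
existing tcm_fc/LIO interfaces. The point is simply that once the
per-initiator backlog crosses a limit, new commands are answered with SAM
status TASK SET FULL, so the host backs off and retries before its command
timeout fires.

/* Illustrative backlog check a target could run when a command arrives.
 * All names here (struct backlog, QUEUE_FULL_THRESHOLD, backlog_admit)
 * are made up for this sketch, not existing tcm_fc/LIO code. */
#include <stdatomic.h>

#define SAM_STAT_GOOD           0x00    /* SAM-3 GOOD status */
#define SAM_STAT_TASK_SET_FULL  0x28    /* SAM-3 TASK SET FULL status */

#define QUEUE_FULL_THRESHOLD    64      /* assumed per-initiator limit */

struct backlog {
        atomic_int pending;     /* commands accepted but not yet completed */
};

/* Returns the SAM status to send immediately, or SAM_STAT_GOOD if the
 * command should be accepted and queued to the backend. */
int backlog_admit(struct backlog *b)
{
        if (atomic_fetch_add(&b->pending, 1) >= QUEUE_FULL_THRESHOLD) {
                /* Refuse early with TASK SET FULL so the host retries
                 * before its SCSI command timeout, rather than letting
                 * the exchange time out and get aborted. */
                atomic_fetch_sub(&b->pending, 1);
                return SAM_STAT_TASK_SET_FULL;
        }
        return SAM_STAT_GOOD;
}

/* Call when a queued command completes (status has been sent back). */
void backlog_complete(struct backlog *b)
{
        atomic_fetch_sub(&b->pending, 1);
}

In a real target the threshold would presumably be derived from the backend's
effective queue depth rather than a fixed constant.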
