All the drives are SSDs, so it shouldn't be a power issue in this case.
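
For reference, the queue_depth in these tests is the per-LUN SCSI queue
depth, which I am varying through sysfs between runs; a minimal sketch,
assuming the initiator sees one of the target LUNs as /dev/sdb (the device
name is a placeholder, not the actual name in my setup):

  # show the current queue depth for the device (32 by default here)
  cat /sys/block/sdb/device/queue_depth

  # lower it to 3 before the next fio run
  echo 3 > /sys/block/sdb/device/queue_depth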
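
The fio invocation below approximates the kind of test being run; the
workload parameters (pattern, block size, runtime) are assumptions for
illustration, not the exact job from these runs:

  # random-read job against one FCoE-attached LUN, direct I/O,
  # async engine so the per-device queue actually fills up
  fio --name=fcoe-test --filename=/dev/sdb --rw=randread \
      --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
      --runtime=60 --time_based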
On Mon, Jun 9, 2014 at 10:00 AM, Rustad, Mark D <[email protected]> wrote:
> On Jun 6, 2014, at 2:45 PM, Jun Wu <[email protected]> wrote:
>
>> The queue_depth is 32 by default, so all my previous tests were based
>> on a queue_depth of 32.
>>
>> The result of my tests today confirms that with a higher queue_depth,
>> there are more aborts on the initiator side and corresponding
>> "Exchange timer armed : 0 msecs" messages on the target side.
>> I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there are no aborts
>> or any abnormal messages. For 3, there are 15 instances of aborts and
>> 15 instances of "0 msecs" messages. When queue_depth is increased to 6,
>> there are 81 instances of the aborts and 81 instances of "0 msecs"
>> messages for the same fio test.
>
> <snip>
>
>> When the abort occurs, the "iostat -x 1" command on the target side
>> starts to show 0 iops on some of the drives. Sometimes all 10 target
>> drives show 0 iops.
>>
>> What is a reasonable number to assume for the maximum number of
>> outstanding I/O?
>
> You mentioned previously that fewer drives also worked better. The
> combination of workarounds, reduced queue depth or fewer drives, makes me
> wonder if there is a power issue in the system housing the drives. It is
> surprisingly common for a fully populated system to not really have
> adequate power for fully active drives. Under those conditions, the drives
> can have a variety of symptoms ranging from pausing I/O to resetting. When
> that happens, the target will definitely be overloaded, things will time
> out, and aborts will happen. If things work well with one or two fewer
> drives at full queue depth, I would really suspect a power, or possibly
> cooling, issue rather than a software issue.
>
> It is even possible that there is a vibration issue in the chassis that
> results in the drives interfering with one another, causing error-recovery
> related I/O interruptions that could also result in timeouts and exchange
> aborts.
>
> Nab, earlier you mentioned generating a busy response. I think that is a
> good thing to be able to do. It is a little complicated because different
> initiators have different timeouts; ideally, you'd like to generate the
> busy response shortly before the initiator times out. Although that might
> reduce the logging in this case, I really suspect that not having that
> capability is not the root cause of the remaining issue here.
>
> --
> Mark Rustad, Networking Division, Intel Corporation

_______________________________________________
fcoe-devel mailing list
[email protected]
http://lists.open-fcoe.org/mailman/listinfo/fcoe-devel
