All the drives are SSDs, so it shouldn't be a power issue in this case.
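
For reference, the queue_depth in these tests is the per-LUN SCSI queue
depth, which I am varying through sysfs between runs; a minimal sketch,
assuming the initiator sees one of the target LUNs as /dev/sdb (the device
name is a placeholder, not the actual name in my setup):

  # show the current queue depth for the device (32 by default here)
  cat /sys/block/sdb/device/queue_depth

  # lower it to 3 before the next fio run
  echo 3 > /sys/block/sdb/device/queue_depth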
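
The fio invocation below approximates the kind of test being run; the
workload parameters (pattern, block size, runtime) are assumptions for
illustration, not the exact job from these runs:

  # random-read job against one FCoE-attached LUN, direct I/O,
  # async engine so the per-device queue actually fills up
  fio --name=fcoe-test --filename=/dev/sdb --rw=randread \
      --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
      --runtime=60 --time_based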
On Mon, Jun 9, 2014 at 10:00 AM, Rustad, Mark D <[email protected]> wrote:
> On Jun 6, 2014, at 2:45 PM, Jun Wu <[email protected]> wrote:
>
>> The queue_depth is 32 by default, so all my previous tests were based
>> on a queue_depth of 32.
>>
>> The result of my tests today confirms that with a higher queue_depth,
>> there are more aborts on the initiator side and corresponding
>> "Exchange timer armed : 0 msecs" messages on the target side.
>> I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there are no aborts
>> or any abnormal messages. For 3, there are 15 instances of aborts and
>> 15 instances of "0 msecs" messages. When queue_depth is increased to 6,
>> there are 81 instances of the aborts and 81 instances of "0 msecs"
>> messages for the same fio test.
>
> <snip>
>
>> When the abort occurs, the "iostat -x 1" command on the target side
>> starts to show 0 iops on some of the drives. Sometimes all 10 target
>> drives show 0 iops.
>>
>> What is a reasonable number to assume for the maximum number of
>> outstanding I/O?
>
> You mentioned previously that fewer drives also worked better. The
> combination of workarounds, reduced queue depth or fewer drives, makes me
> wonder if there is a power issue in the system housing the drives. It is
> surprisingly common for a fully populated system to not really have
> adequate power for fully active drives. Under those conditions, the drives
> can have a variety of symptoms ranging from pausing I/O to resetting. When
> that happens, the target will definitely be overloaded, things will time
> out, and aborts will happen. If things work well with one or two fewer
> drives at full queue depth, I would really suspect a power, or possibly
> cooling, issue rather than a software issue.
>
> It is even possible that there is a vibration issue in the chassis that
> results in the drives interfering with one another, causing error-recovery
> related I/O interruptions that could also result in timeouts and exchange
> aborts.
>
> Nab, earlier you mentioned generating a busy response. I think that is a
> good thing to be able to do. It is a little complicated because different
> initiators have different timeouts; ideally, you'd like to generate the
> busy response shortly before the initiator times out. Although that might
> reduce the logging in this case, I really suspect that not having that
> capability is not the root cause of the remaining issue here.
>
> --
> Mark Rustad, Networking Division, Intel Corporation

_______________________________________________
fcoe-devel mailing list
[email protected]
http://lists.open-fcoe.org/mailman/listinfo/fcoe-devel
