On Mon, 2014-06-09 at 17:00 +0000, Rustad, Mark D wrote:
> On Jun 6, 2014, at 2:45 PM, Jun Wu <[email protected]> wrote:
>
> > The queue_depth is 32 by default. So all my previous tests were based
> > on 32 queue_depth.
> >
> > The result of my tests today confirms that with higher queue_depth,
> > there are more aborts on the initiator side and corresponding
> > "Exchange timer armed : 0 msecs" messages on the target side.
> > I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there is no abort
> > or any abnormal messages. For 3, there are 15 instances of aborts and
> > 15 instances of "0 msecs" messages. When queue_depth increased to 6,
> > there are 81 instances of the aborts and 81 instances of "0 msecs"
> > messages for the same fio test.
>
> <snip>
>
> > When the abort occurs, "iostat -x 1" command on the target side starts
> > to show 0 iops on some of the drives. Sometimes all the 10 target
> > drives have 0 iops.
> >
> > What is the reasonable number to assume for the maximum number of
> > outstanding IO?
>
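(If the queue_depth being varied above is the per-device SCSI queue depth on
the initiator, it can be capped at runtime through the standard sysfs
attribute /sys/block/<dev>/device/queue_depth. The small user-space sketch
below shows one way to do that; the tool name and any device path used with
it are only examples, not taken from this thread.)

/*
 * set_qd.c - minimal sketch: cap the SCSI queue depth of one initiator-side
 * device by writing its sysfs attribute.  Example invocation (path is
 * illustrative only):
 *
 *     ./set_qd /sys/block/sdb/device/queue_depth 2
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        FILE *f;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <queue_depth-sysfs-path> <depth>\n",
                        argv[0]);
                return 1;
        }

        f = fopen(argv[1], "w");
        if (!f) {
                perror("fopen");
                return 1;
        }

        /* The kernel applies the written value as the new queue depth. */
        fprintf(f, "%d\n", atoi(argv[2]));

        if (fclose(f) != 0) {
                perror("fclose");
                return 1;
        }
        return 0;
}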
As mentioned before, "It completely depends on the fabric, initiator, and
backend storage." See the related text for more details.

> You mentioned previously that fewer drives also worked better. The
> combination of workarounds, reduced queue depth or fewer drives, makes me
> wonder if there is a power issue in the system housing the drives. It is
> surprisingly common for a fully-populated system to not really have adequate
> power for fully active drives. Under those conditions, the drives can have a
> variety of symptoms ranging from pausing I/O to resetting. When that happens,
> the target will definitely be overloaded, things will time out and aborts
> will happen. If things work well with one or two fewer drives with full queue
> depth, I would really suspect a power, or possibly cooling issue rather than
> a software issue.
>
> It is even possible that there is a vibration issue in the chassis that
> results in the drives interfering with one another and causing error recovery
> related I/O interruptions that could also result in timeouts and exchange
> aborts.
>

Using ramdisks instead of the backend disks could confirm this and isolate
the problem to the backend.

> Nab, earlier you mentioned generating a busy response. I think that is a good
> thing to be able to do. It is a little complicated because different
> initiators have different timeouts. Ideally, you'd like to generate the busy
> response shortly before the initiator times out. Although that might reduce
> the logging in this case, I really suspect that not having that capability is
> not the root cause of the remaining issue here.
>

If the target sent QUEUE FULL to hosts based on its pending backlog, instead
of reacting only after data-out failures, host timeouts could be avoided;
that would be in line with the ideal scenario described above (a rough sketch
of the idea follows below).

Thanks,
Vasu
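A rough sketch of that backlog-based idea, for illustration only: the
structure, threshold, and function names below are hypothetical and are not
existing tcm_fc/LIO interfaces. The point is simply that once the
per-initiator backlog crosses a limit, new commands are answered with SAM
status TASK SET FULL, so the host backs off and retries before its command
timeout fires.

/* Illustrative backlog check a target could run when a command arrives.
 * All names here (struct backlog, QUEUE_FULL_THRESHOLD, backlog_admit)
 * are made up for this sketch, not existing tcm_fc/LIO code. */
#include <stdatomic.h>

#define SAM_STAT_GOOD           0x00    /* SAM-3 GOOD status */
#define SAM_STAT_TASK_SET_FULL  0x28    /* SAM-3 TASK SET FULL status */

#define QUEUE_FULL_THRESHOLD    64      /* assumed per-initiator limit */

struct backlog {
        atomic_int pending;     /* commands accepted but not yet completed */
};

/* Returns the SAM status to send immediately, or SAM_STAT_GOOD if the
 * command should be accepted and queued to the backend. */
int backlog_admit(struct backlog *b)
{
        if (atomic_fetch_add(&b->pending, 1) >= QUEUE_FULL_THRESHOLD) {
                /* Refuse early with TASK SET FULL so the host retries
                 * before its SCSI command timeout, rather than letting
                 * the exchange time out and get aborted. */
                atomic_fetch_sub(&b->pending, 1);
                return SAM_STAT_TASK_SET_FULL;
        }
        return SAM_STAT_GOOD;
}

/* Call when a queued command completes (status has been sent back). */
void backlog_complete(struct backlog *b)
{
        atomic_fetch_sub(&b->pending, 1);
}

In a real target the threshold would presumably be derived from the backend's
effective queue depth rather than a fixed constant.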
