I've got a problem I can reproduce easily enough, but I don't understand what's going wrong.

I've got a 5.3 Dom0 running three guests. One is Fedora 10, which runs from local flat files and works fine. One is Nexenta 2 (OpenSolaris-based), which runs off physical partitions and seems to work great. The third runs Fedora 11 and has for its disks iSCSI devices that are exported from Nexenta (ZFS-backed).

I have the Dom0 mapping the two iSCSI devices, one for /boot and one for /. They're showing up initially as /dev/sdc and /dev/sdd.

If I go after the iSCSI devices in Dom0, with dd, for instance, they work fine all day. I can read and write the entire devices to and from local files without error, so iSCSI seems to work properly in that regard. I'm getting about 38MB/s. I've scrubbed the disk pool and no errors were found, and a long SMART self-test passed on each of the disks.
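For anyone wanting to repeat that kind of sequential read test without dd, here is a minimal sketch. It assumes Python 3 is available (the function name `read_throughput` and its parameters are made up for illustration), and it only approximates what dd does: it reads a device or file front to back and reports the average rate. Reads served from the page cache will inflate the number.

```python
import time

def read_throughput(path, block_size=1024 * 1024, max_bytes=None):
    """Read `path` sequentially and return (bytes_read, megabytes_per_second).

    `path` can be a block device such as /dev/sdc (needs root) or a plain file.
    `max_bytes` optionally limits how much is read.
    """
    total = 0
    start = time.monotonic()
    # buffering=0 gives an unbuffered binary file, so each read() is one syscall.
    with open(path, "rb", buffering=0) as dev:
        while max_bytes is None or total < max_bytes:
            want = block_size if max_bytes is None else min(block_size, max_bytes - total)
            chunk = dev.read(want)
            if not chunk:  # end of device or file
                break
            total += len(chunk)
    elapsed = time.monotonic() - start
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, rate
```

Calling `read_throughput("/dev/sdc", max_bytes=1024**3)` would read the first GiB of the device and return the byte count and rate, and an I/O error from the device would surface as an `OSError`.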

So, I specify those devices in the Xen config for the DomU (tried both real device name and /dev/disk/by-path/ names) and the DomU boots and operates as I'd expect. Installation worked fine and typical (low volume) operations work. However, when I then try something that I assume is more disk-intensive, like running a yum update, iSCSI seems to fall over.

In the DomU, I'll see a lock-up, and then filesystem errors.  e.g.:

Installing : kernel [############################################ ] 1/33EXT3-fs error (device xvda1) in ext3_ordered_writepage: IO failure

In the Dom0, I'll see:

  sd 6:0:0:0: timing out command, waited 360s
  sd 6:0:0:0: SCSI error: return code = 0x06050000
  end_request: I/O error, dev sdc, sector 37319
  sd 7:0:0:0: timing out command, waited 360s
  sd 7:0:0:0: SCSI error: return code = 0x06000000
  end_request: I/O error, dev sdd, sector 29792137
  sd 7:0:0:0: timing out command, waited 360s
  sd 7:0:0:0: SCSI error: return code = 0x06000000
  end_request: I/O error, dev sdd, sector 29792313
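As a rough aid for reading those return codes, here is a minimal sketch that splits the 32-bit value into its four bytes. It assumes the layout used by 2.6-era Linux kernels in include/scsi/scsi.h (driver byte, host byte, message byte, status byte, from high to low) and only lists the host and driver codes I am sure of. The function name `decode_scsi_result` is made up for illustration.

```python
# Host byte values (bits 16-23), per include/scsi/scsi.h in 2.6-era kernels.
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}

# Driver byte values (bits 24-31), same header.
DRIVER_BYTE = {
    0x00: "DRIVER_OK",
    0x01: "DRIVER_BUSY",
    0x02: "DRIVER_SOFT",
    0x03: "DRIVER_MEDIA",
    0x04: "DRIVER_ERROR",
    0x05: "DRIVER_INVALID",
    0x06: "DRIVER_TIMEOUT",
    0x07: "DRIVER_HARD",
    0x08: "DRIVER_SENSE",
}

def decode_scsi_result(code):
    """Split a SCSI 'return code' into status, msg, host and driver fields."""
    host = (code >> 16) & 0xFF
    driver = (code >> 24) & 0xFF
    return {
        "status": code & 0xFF,        # SCSI status byte from the target
        "msg": (code >> 8) & 0xFF,    # message byte
        "host": HOST_BYTE.get(host, "unknown (0x%02x)" % host),
        "driver": DRIVER_BYTE.get(driver, "unknown (0x%02x)" % driver),
    }

for rc in (0x06050000, 0x06000000):
    print("0x%08x -> %s" % (rc, decode_scsi_result(rc)))
```

Under that assumed layout, 0x06050000 reads as DRIVER_TIMEOUT with host byte DID_ABORT, and 0x06000000 as DRIVER_TIMEOUT with host byte DID_OK, which matches the "timing out command" lines: the commands never completed rather than coming back with an error from the target.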

Both (all) iSCSI devices have failed. Under iostat I see activity to the iSCSI block devices, and the whole machine acts mostly I/O blocked (even the Fedora 10 DomU running on flat files will start throwing nagios into a tizzy). If I do 'service iscsi stop' everything picks right back up (though the DomU using them as its disks is obviously unhappy).

When I start iscsi again I can pick right back up (after repairing filesystems in the DomU), and I can repeat the process at will. Sometimes the disks will come back as, e.g., sdd and sde, leading me to think something still has a handle on sdc. But lsof shows nothing in Dom0.

One thing that stood out was that some of the block and sector numbers in the errors fall right on power-of-two boundaries:

 scsi 7:0:0:0: SCSI error: return code = 0x00010000
 end_request: I/O error, dev sdd, sector 32768
 Buffer I/O error on device sdd, logical block 4096
 Buffer I/O error on device sdd, logical block 4097
 Buffer I/O error on device sdd, logical block 4098
 Buffer I/O error on device sdd, logical block 4099
 Buffer I/O error on device sdd, logical block 4100
 Buffer I/O error on device sdd, logical block 4101
 Buffer I/O error on device sdd, logical block 4102
 Buffer I/O error on device sdd, logical block 4103
 Buffer I/O error on device sdd, logical block 4104
 Buffer I/O error on device sdd, logical block 4105
 scsi 7:0:0:0: rejecting I/O to dead device
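The arithmetic behind that observation can be sketched as below. The 512-byte sector size is what the kernel uses for the "sector" numbers in end_request messages. The 4096-byte size for the "logical block" numbers is an assumption, chosen because it makes sector 32768 and logical block 4096 refer to the same byte offset.

```python
SECTOR_SIZE = 512   # bytes per sector in end_request messages
BLOCK_SIZE = 4096   # ASSUMED bytes per "logical block" in the Buffer I/O errors

def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0

def sector_to_block(sector, sector_size=SECTOR_SIZE, block_size=BLOCK_SIZE):
    """Convert a sector number to the logical block that contains it."""
    return sector * sector_size // block_size

sector = 32768
offset = sector * SECTOR_SIZE
print(is_power_of_two(sector))    # True: 32768 == 2**15
print(offset // (1024 * 1024))    # 16: the error is exactly 16 MiB into sdd
print(sector_to_block(sector))    # 4096: the first failing logical block

# The sectors from the earlier timeouts are not on such boundaries.
for s in (37319, 29792137, 29792313):
    print(s, is_power_of_two(s))  # False for all three
```

So, under that block-size assumption, this particular failure starts exactly 16 MiB into the device and the ten Buffer I/O errors cover the 40 KiB that follow, while the sectors in the timeout errors are not on power-of-two boundaries.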

But as I opened with, I'm sort of at a loss as to what is actually causing the problem. Any suggestions for further troubleshooting and/or ideas about what's happening would be appreciated.

-Bill

--
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
http://www.bfccomputing.com/    Cell: 603.252.2606
Twitter, etc.: bill_mcgonigle   Page: 603.442.1833
Email, IM, VOIP: [email protected]
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf

_______________________________________________
rhelv5-list mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/rhelv5-list
