Can you post your commandline for the MSDOS 6.22 issue? NT is known to have a few problems and may be out of scope for what I can help with, but I was under the assumption that MSDOS 6.22 was well-behaved in QEMU.
Commandline and steps to reproduce the error may be helpful (any particularly kind of command, workflow, etc that helps trigger the IO errors? How big is the hard disk you are using? etc) Thanks, --John -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1745312 Title: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] Status in QEMU: New Bug description: [Headsup: This report is long-ish due to the amount of detail I've stumbled on along the way that I think is relevant to include. I can't speak as to the complexity of the actual bugs, but the size of this report should not suggest that the reproduction process is particularly headache-inducing.] Hi! I recently needed to fire up some ancient software for research purposes and got very distracted discovering and playing with old versions of Windows :). In the process I've discovered some glitches with disk I/O. I believe I've stumbled on two completely separate issues that coincidentally surfaced at the same time. It's possible that components of this report will be re-filed as more specific new bugs, but I'm not an authority on QEMU internals or how to narrow down/categorize what I've found. - The first bug only surfaces when the "isapc" machine type is used. It intermittently produces "General failure {read,writ}ing drive _" under MS-DOS 6.22, and also somehow interferes with early bootstrap of Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux) appears to make no difference whatsoever, which may help with debugging. - The second issue involves - a WinNT4 disk image - created by running through a bog-standard NT4 install inside QEMU 2.9.0 - which will now fail to boot in any version of QEMU - even version 1.0 - but which VirtualBox will boot fine - but only if I point VirtualBox at QEMU's raw disk image via a hacked-together VMDK file - if the raw image is converted to VHD(X), VirtualBox will also fail to boot the image with exactly the same error as QEMU - this state of affairs is not affected by image sparseness (which makes sense) I'm confident I've bisected the first issue. I wasn't able to bisect the second issue (as all tested versions of QEMU behaved identically), but I've figured out a working repro testcase and I believe I've managed to pin down a solid root cause. == #1: Intermittent I/O issues when `-M isapc` is used ===== These symptoms sometimes take a small amount of time and fiddling to trigger, but I AM able to consistently surface them on my machine after a short while. (I am very very interested to hear if others cannot reproduce them.) So, first of all: https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc (Jul 30 2013): the last version that works https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3 (Oct 30 2013): the first version that intermittently fails Maybe lift out and build these branches while reading. *shrug* (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW) Here are the changelists between these two revisions: https://github.com/qemu/qemu/compare/306ec6c...e689f7c (Compare direction: OLD to NEW) (Commits: 166 Files changed: 192) https://github.com/qemu/qemu/compare/e689f7c...306ec6c (Compare direction: NEW to OLD) (Commits: 30 Files changed: 22) (Someone else more familiar with Git might know why GitHub returns results for both compare directions, and/or if the 2nd link is useful information. The first link returns a lot more results than the 2nd one, at least. Does comparing new>old return deletions?) --- Now on to the symptoms. In a moment I'll describe reproduction. # MS-DOS 6.22 The first symptom I discovered was that trivial read and write operations under MS-DOS would sometimes fail: C:\>echo test > hi General failure writing drive C Abort, Retry, Fail? Anything else that exercises the disk behaves similarly: C:\>dir /s > nul General failure reading drive C Abort, Retry, Fail? (Note that the above demonstrates both write and read failures) (Also, FWIW, `dir /s` == `ls -R`) The behavior of the I/O errors is not possible to characterise as it fluctuates so much. For example something as simple as DIR can produce wildly differing results: in one run, poking around with DIR ended with DOS deciding C:\ was empty at one point; at another point in a different run C:\ mysteriously dropped 50% of its contents only to magically gain it all back moments later after some poking around in one of the subdirectories that was still visible. The time it takes to trigger these errors is also highly variable. QEMU may fall over as early as hanging forever at "Starting MS- DOS...", or I might get all the way into Windows 3.1 before it triggers (in which case Win3.1 reports vague memory errors - of all things). Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard Disk..." "Boot failed: could not read the boot disk" ... "No bootable device.", and on one occasion I even got "Non-System disk or disk error" "Replace and strike any key when ready"! # WinNT 4 Terminal Server Most of the time, NTLDR will fire up normally. But every so often... SeaBIOS (version rel-1.7.3-117-g31b8b4e- 20131206_080705-nilsson.home.kraxel.org) Booting from Hard Disk... A disk read error occurred. Insert a system diskette and restart the system. (NB. You're seeing the old SeaBIOS version included with e689f7c, which was the first buggy commit.) If NT gets past this point without erroring out (ie, it makes it to the boot menu), the rest of the system is 100% fine and there are no other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was able to enable disk compression, answer "Yes" to "Compress entire disk now?" and have the process fully complete. No hitches. This makes me vaguely recall/wonder that perhaps this could be somehow related to LBA and/or Int 13h, or something floating around near that bunch of functionality. (I'm woefully ignorant about such low-level details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has a buggy implementation of, while NT 4 (once NTOSKRNL is up and running) is able to use a different disk mode or access mechanism. I'm really interested to get some understanding of what the root issue is here, when this is fixed. (I wonder if it's a timing thing?) I've observed some unusual behavior with repeated restarts. In one case, I attempted to start NT4 multiple times, and QEMU consistently failed with "No bootable device" each time. So, I removed `-M isapc`, promptly got a boot menu, hit ^C, readded `-M isapc` - and continued to get a boot menu. Yep. I'll accept "really really big coincidence" but I do very much wonder if something else is going on here. I've observed many similar incidents. It makes me wonder whether the contents of memory or some other system state is an influence. Very probably not, but still... -- Reproduction -------------------------------------------- First of all, there was unfortunately no way for me to avoid having to post entire disk images, but I've managed to compress everything down to 174MB total download size. FWIW, WinWorld and many other sites seem to have no operational issues providing clear pointers to CD keys; I consider my distribution of my installed HDD images an extension of the apparent status quo. That being said, I've put everything on Google Drive so nobody has to headscratch about Launchpad/Canonical/etc's stance on hosting this data. So, this folder contains the disk images: https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c ("Download all" at the top-right will create a ZIP file, but FWIW downloading the individual files simultaneously would implement a rough form of download acceleration) File meta info: Compressed | | Apparent | | Actual | | | 38M -> 200M (103M) win31.img.xz 82M -> 1G (289M) wnt4ts-broken.img.xz 55M -> 350M (146M) wnt4ts-intermittent.img.xz SHA-256s: win31: 8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2 broken: a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784 (Wanted to keep the checksum lines within 80 columns) And, since I can't figure out where else in this report to put this, wnt4ts-broken.img's password is "admin" but something seems to have happened to the disk and NT doesn't actually boot properly :(, and wnt4ts-intermittent.img's password is "1234". (These were set up as test images. Now I'm _really_ glad I used simple passwords! :) ) --- I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4. # MS-DOS DOS is the simplest. It basically consists of $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable- kvm And then literally just playing around. Things to try include creating files (`echo blah > file`), repeatedly seeking across the entire FAT (`dir /s > nul` or `dir /s`), and launching Windows (`win`). win31.img is not special (as far as I can tell) and merely consists of the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC. I've basically just included the image for convenience. Generally no single "run" is immune to starting Win3.1 and then launching File Manager; if that doesn't generate an error, something is definitely up. The second best trigger is creating new files. That very very frequently produces "General Failure ...", but not always. # WinNT 4 Windows NT 4 is a bit more complicated. Because this error only occurs at presumably a single small point very early in boot, the window of opportunity for the glitch to surface within is much much narrower and thus often requires a larger number of tries. Anecdotally I've had QEMU hit the boot error at the first try/run, and after as many as 63 "successful" boots. I made a small test harness that automates the launch process. It consists of two shell scripts and requires tmux (and netcat). (*Potential epilepsy warning*: if you use a light-colored terminal background, the terminal QEMU is repeatedly invoked from will continuously flash rapidly from white to black.) One of the scripts is run inside a tmux session in one terminal, while the other script is run in its own terminal (without any tmux). I named this one `run-qemu-loop`: --8<-------------------------------------------------------- #!/bin/bash # --- qemu=/path/to/qemu-system-i386 #or, alternatively: (I used the following line myself so I #could tab-complete my way to different qemu executables) #qemu="$1" disk=/path/to/wnt4ts-intermittent.img # --- port=4444 rm -f STOP itercount itercount=0 while :; do [ -f STOP ] && break ((itercount++)) echo $itercount > itercount $qemu \ -enable-kvm -vga cirrus -curses -M isapc \ -drive file="$disk",format=raw \ -chardev socket,id=mon0,host=localhost,port=$port,server,nowait \ -mon chardev=mon0,mode=readline #point to an otherwise-unused terminal if you like (see also: `tty`) #echo "$itercount run(s)" > /dev/pts/__ done ------------------------------------------------------------ Not much logic above; this just repeatedly runs QEMU for as long as the file `STOP` does not exist in the current directory. The key "magic" bit is that QEMU is launched in -curses mode. The other key bit is that the above script is run inside tmux. Here's `tmux-ctl-loop`: --8<-------------------------------------------------------- #!/bin/bash port=4444 tmux=./tmux printf -v l '%0.0s-' {0..25} h1="$l/ buffer dump begin \\$l" h2="$l-\\ buffer dump end /-$l" while :; do while :; do echo | nc localhost $port -q0 -w1 > /dev/null && break echo 'Start qemu!' done buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)" echo "$h1" [[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )' echo "$h2" if echo "$buffer" | grep -q 'A disk read error occurred.'; then s="<Crashed after $(< itercount) runs.>" echo "$s" echo "$s" >> stats touch STOP #echo q | nc localhost $port -q0 > /dev/null exit elif echo "$buffer" | grep -q 'OS Loader V4.00'; then echo '<Booted successfully, trying again>' echo q | nc localhost $port -q0 > /dev/null else echo '<Waiting for boot>' fi done ------------------------------------------------------------ Nothing particularly amazing going on here either. While `qemu-run-loop` is running inside tmux in the first terminal, this is running in the 2nd one. The small infinite loop at the top only breaks when it can successfully ping QEMU and it knows it's running. Then, a screendump of the contents of the terminal QEMU is in is fetched from tmux, and the buffer's content is analyzed. - If NTLDR fails, the script creates `STOP` to halt qemu-run-loop, sends `q` to QEMU through netcat, and then the script exits. - If NTLDR loads successfully, the script sends `q` to QEMU and continues looping. (qemu-run-loop will not find the `STOP` file, and so restart qemu.) The scripts run very quickly, with 2-3 iterations per second on my i3 box. # Usage Save the two scripts above to the same directory as wnt4ts-intermittent.img, then: - (If port 4444 doesn't work, the value needs to be changed in both scripts.) - In the first terminal, run `tmux -S <file>`, where <file> names the socket tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`. (with `tmux=./tmux`, the command would be `tmux -S tmux`) - Still in the first terminal (and now also inside tmux), enter `./qemu-run-loop`, passing the path to qemu if you're using that approach (refer to the first few lines of the script). Don't hit enter yet. - Now, in the 2nd terminal, type `./tmux-ctl-loop` - Hit enter in both terminals. Rationale for timing of Enter key: - Running qemu-run-loop first will start QEMU, and if NTLDR starts successfully it will immediately begin counting down from 30. If NT actually starts to boot and is then hard-shut-down this /may/ affect the disk image - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until qemu-run-loop is running. - Starting both scripts at "more or less" the same time (no rush) works out well. Hopefully potential script modifications are obvious; for example - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP yourself (NB, if `STOP` is not created, when qemu finally exits it will of course promptly be relaunched) - pointing run-qemu-loop to a modified qemu binary == #2: QEMU-vs-VirtualBox image issue ====================== I was initially completely stumped by this issue, perhaps unsurprisingly so. :) wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked NTFS and installed everything (which took a while). NT setup reboots a number of times during the boot process, and IIRC those all went just fine. However, at some point, the image began to consistently bomb out with "A disk read error occurred. ...", and stubbornly refused to boot, regardless of the number of boot attempts I tried. QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build on my system) all consistently hit "disk read error occurred". I tried compiling QEMU 1.0 using clang so I could build for 32-bit on my 64-bit system (GCC 7 died with "Frame pointer required, but reserved"). The resulting qemu completely crashed if I didn't enable KVM (ie, TCG was (understandably) broken); with KVM enabled qemu didn't crash, but NTLDR halted with the same error as on 64-bit qemu. (TL;DR, no difference whatsoever.) My initial reaction at this point was to try the image on another virtualization platform. My first pick was VirtualBox. So, I followed the official instructions for pointing VirtualBox to physical disk images, except I substituted a /dev/loopN device I'd pointed to the image file via losetup. And... VirtualBox picked the image up fine and Just Worked(TM). Yay! - but not yay. What gives?! Confused, I then tried to convert the disk image to VHD format. Unfortunately, for some reason, if I try `qemu-image convert ... -O vhdx ...`, VirtualBox chokes on the result: ----- VD: error VERR_NOT_SUPPORTED opening image file '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED). Result Code: NS_ERROR_FAILURE (0x80004005) Component: MediumWrap Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda} Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945} Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001) ----- Welp. Well, a bit more digging later, and I found I could do $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd but... as soon as I pointed VirtualBox to this, it too began to choke with "A disk read error occurred". And yet, the VMDK->raw image setup worked just fine. I found I could even replace the loop device with the path of the .img file itself and that worked just fine too. At my wits' end, I followed some online instructions to learn about manual CHS configuration so I could try and get the image working in Bochs. "A disk read error occurred". I wasn't surprised. It was at this point I began to give up, but I decided to try One Last Thing(TM) before properly throwing in the towel. :) I decided to learn a bit more about how `VBoxManage internalcommands createrawvmdk` worked, and try one thing in particular: I can edit the .vmdk file, but can I point `createrawvmdk` at the .img file directly too? Turns out, yes you can. It also turns out that this promptly caused VirtualBox to bomb out. Interesting. For reference, here's the VMDK file I initially created (by pointing `createrawvmdk` at /dev/loopN) and then later edited to point straight to the .img file, with both approaches resulting in successful boot. --8<-------------------------------------------------------- # Disk DescriptorFile version=1 CID=e35b9a45 parentCID=ffffffff createType="fullDevice" # Extent description RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0 # The disk Data Base #DDB ddb.virtualHWVersion = "4" ddb.adapterType="ide" ddb.geometry.cylinders="1523" ddb.geometry.heads="16" ddb.geometry.sectors="63" ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4" ddb.uuid.parent="00000000-0000-0000-0000-000000000000" ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8" ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000" ddb.geometry.biosCylinders="761" ddb.geometry.biosHeads="32" ddb.geometry.biosSectors="63" ------------------------------------------------------------ Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly: --8<-------------------------------------------------------- ddb.geometry.cylinders="2080" ddb.geometry.heads="16" ddb.geometry.sectors="63" ------------------------------------------------------------ :D Naturally, $ qemu-system-i386 -drive file=wnt4ts- broken.img,format=raw,cyls=1523,heads=16,secs=63 -M isapc -sdl will boot happily on 2.9.0 (notwithstanding the occasional "disk read error occurred" documented above). It will also boot in 1.6.0. (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank 640x480 window and use 0% CPU if I specify `-M isapc`.) And, of course, using these CHS values in Bochs also results in successful boot as well (after setting the CPU type to pentium). Unfortunately, I have no idea what sequence of events caused the creation of the VMDK file above. No invocation of `createrawvmdk` is producing a VMDK file with the CHS settings above. I've only just begun to learn about the intricacies of CHS. Am I to understand that these values are stored amongst the first 512 bytes of the disk? If this is the case, then I wonder what changed the data, and why. I was initially only using QEMU 2.9.0, and didn't move the image to different VMs or QEMU versions. Perhaps Windows NT got confused about the disk CHS and rewrote it? == Sporadic BIOS-level boot failure ======================== I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No bootable device" (et al), even with the above manually-applied CHS settings. Commit e689f7c also presents such errors. Commit 306ec6c does not suffer from intermittent breakage of any kind: - No SeaBIOS flake-outs - No "Non-system disk or disk error" - No "A disk error has occurred" - No "General failure ..." While most of my confidence in commit 306ec6c is based on anecdotal evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level I/O stability and left this modified version running for a few minutes. --8<-------------------------------------------------------- #!/bin/bash port=4444 tmux=./tmux printf -v l '%0.0s-' {0..25} h1="$l/ buffer dump begin \\$l" h2="$l-\\ buffer dump end /-$l" while :; do while :; do echo | nc localhost $port -q0 -w1 > /dev/null && break echo 'Start qemu!' done buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)" echo "$h1" [[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )' echo "$h2" if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \ grep -q 'No bootable device' then s="<Hit error after $(< itercount) runs.>" echo "$s" echo "$s" >> stats touch STOP #echo q | nc localhost $port -q0 > /dev/null exit elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \ grep -q 'A disk read error' then echo '<Boot did not hang at BIOS, trying again>' echo q | nc localhost $port -q0 > /dev/null else echo '<Waiting for boot>' fi done ------------------------------------------------------------ For the above to work, the top of run-qemu-loop must also be modified to read something along the lines of disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 (Suggestion: modify copies of both scripts) One small terminal-flicker-headache (and a 57°C CPU) later, I was able to carefully observe just over 350 successful runs in which QEMU commit 306ec6c only ever produced a boot menu. No other hitches. ** Important: ** However, commit 306ec6c will fail to boot, ever, if the cylinders and geometry are not set to the values VirtualBox "discovered". (Of note is the fact that QEMU (2.9.0) was what initially created this image. I must admit that I don't remember what sequence of QEMU versions I fed the image to - and I maybe, possibly, didn't think to back the file up (sorry), so maybe something mangled something somewhere. But VirtualBox figured it out nonetheless!) Furthermore, feeding /dev/loopN to any QEMU version will NOT result in correct CHS discovery (and successful boot). This is what leads me to conclude that I've discovered two separate issues. == Appendix: How to build the branches ===================== It's very simple. First, `git clone https://github.com/qemu/qemu` somewhere if you don't already have a local copy. If you have an old git checkout that's from 2014 or later, you can use that old checkout instead. (If you want to test an old checkout you have, the commands below will either work perfectly or completely bomb out with no side effects.) A full checkout is a ~183MB download. Sorry. Next, create two new directories somewhere. Name them what you like, eg `qemu-working` and `qemu-broken`. Now, cd into the checkout directory, and run: $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC /path/to/qemu-working/ $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC /path/to/qemu-broken/ The paths can be relative. Now, run this in both of the new directories: $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp --disable-usb-redir --disable-guest-agent --disable-libiscsi --disable-spice --disable-smartcard-nss --disable-vhost-net --disable- docs --disable-attr --disable-cap-ng --disable-vde --disable-user --disable-bluez --disable-vnc-ws --disable-xen --disable-brlapi --enable-debug --target-list=i386-softmmu --disable-fdt $ make -j64 You can open two terminals and configure and build both simultaneously if you like. On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - make doesn't actually launch too many gcc processes. You *will* see your system load spike to ~20 though :) (NB. Do. not. use. -j64. with. the. linux. kernel.) On my system, a single build with -j64 takes only about 35 seconds. C FTW. (Although this has increased to 1min20sec for more recent builds.) Most of the configure arguments remove functionality I'll never use (in this situation) and which will only slow down the build. Once QEMU is built, run qemu-system-i386 directly from where it has been built. $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ... $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ... Again, the paths can be relative. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions