Excerpts from Bernie Innocenti's message of Sun Jun 20 00:33:50 +0000 2010:
> The journal was showing just one object, but the
> ~/.sugar/default/datastore directory contained 4-5 invisible entries.

When were these "invisible" entries written? Directly before a crash?

With the way the current data store works, it's easy for the index to
get "out of sync" with the metadata stored directly on disk (on crashes,
not during normal operation). Barring another rewrite, I can't think of
any way to prevent that from happening without slowing down to a crawl.
We already do the best we can by flushing the index every 20 changes and
60 seconds after the last change (see IndexStore._flush()).
FWIW, this is the most likely reason for your current problem.

It would probably pay off more to analyse the reasons for the crashes
(including power cycles) than trying to make the current data store more
robust against all scenarios.

If the laptops run out of power suddenly, maybe powerd could tell the
data store to go to a slower, more fail-safe mode (like the laptop-mode
script does on the lower layer). We would flush the Xapian index on
every change then, increasing the chance the index contains all entries
saved directly before the crash.
IIRC automatic shut-down on low battery has been decided against because
it would reduce the maximum run-time.

If the kids power-cycle the laptops during (more or less) normal
operation, we should check why. One thing that bugged me was the lack of
busy-feedback. On an old-style PC, I would watch the HD LED and listen to
the hard disk to know whether the system is not reacting to my input
because it's busy doing something, or whether it crashed and I need to
reset it. Sugar on XO-1[.5] lacks all useful indicators (even the busy
cursor is almost unused), so how would a child decide whether to wait or
to power-cycle?

> There was no time to analyze the problem in detail,

:(

We can't do much to improve the robustness of the current data store
without knowing exactly what caused it to break in the field.
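
To make the flush policy mentioned above (every 20 changes, or 60
seconds after the last change) a bit more concrete, here is a minimal
sketch. It is not the actual IndexStore._flush() code: the class name,
the flush_fn callback, the tick() method and the injectable clock are
all assumptions made up for illustration.

```python
import time


class FlushPolicy:
    """Decide when pending index changes get flushed (illustrative only)."""

    def __init__(self, flush_fn, threshold=20, timeout=60.0,
                 clock=time.monotonic):
        self._flush_fn = flush_fn    # hypothetical callback committing the index
        self._threshold = threshold  # flush once this many changes are pending
        self._timeout = timeout      # flush this many seconds after the last change
        self._clock = clock          # injectable so the policy can be tested
        self._pending = 0
        self._last_change = None

    def changed(self):
        """Record one change; flush at once if the threshold is reached."""
        self._pending += 1
        self._last_change = self._clock()
        if self._pending >= self._threshold:
            self.flush()

    def tick(self):
        """Call periodically; flush if changes have been idle long enough."""
        if self._pending and self._clock() - self._last_change >= self._timeout:
            self.flush()

    def flush(self):
        """Commit pending changes, if any, and reset the counters."""
        if self._pending:
            self._flush_fn()
            self._pending = 0
            self._last_change = None
```

With threshold=1 this degenerates into the flush-on-every-change
fail-safe mode suggested above for the low-power case, at the cost of
one commit per change.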
Maybe you could do a dd of mtdblock0 to a file on a USB stick the next
time? If possible from a system booted from the stick (so it's not
mounted at the time of the dump) - you could even automate it then.

>  * the corruption could be caused by flash problems. I have found
>    laptops in the field that wouldn't boot because /sbin/lvm was
>    corrupted

There's nothing the data store can do against this type of problem (it
wouldn't even be able to start up if its program code is corrupted).
This needs to be fixed at a lower layer (which would essentially boil
down to making everything redundant, thus halving the available space).

>  * we can't exclude jffs2 problems too: when it's almost full, it does
>    slow garbage collection passes on boot which kids interrupt by
>    power cycling. I wonder how robust jffs2 is in this case.

I wonder if UBIFS is better in this regard. Flash doesn't seem to last
as long as with JFFS2 [1], but maybe it handles crashes better? (I don't
know much about either JFFS2 or UBIFS myself, so I can't tell)

>  * there might be a bug in xapian. If so, we'll see this issue also
>    on the XO-1.5

Unlikely as I've never seen it happen in testing. But not impossible, of
course. Especially since I tend to run the latest upstream version which
will already have more bugs fixed.

>  * I'm skeptical it's a new issue in 0.84 or F-11: the older builds
>    had so many data loss issues that a subtler problem like this
>    could have easily gone unnoticed.

The data store has been rewritten from scratch for 0.84+. Only bugs on a
lower layer (e.g. JFFS2) would apply to both data stores.

>  * can the datastore detect index corruption in the most obvious cases?
>    If so, what would it do?

If it's corrupted badly enough Xapian will detect it and we will do a
full index rebuild. Xapian couldn't effectively guard itself against
subtle corruptions without doing checksumming of each and every data
block. That belongs on a lower layer (md, file system), not in
applications.
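
The "detect, then rebuild" behaviour described above might look roughly
like the following. This is a sketch only, not the data store's actual
code: open_index, rebuild_index and IndexCorruptError are hypothetical
stand-ins for whatever the index library and the data store really
provide.

```python
class IndexCorruptError(Exception):
    """Hypothetical stand-in for the error the index library raises."""


def open_or_rebuild(open_index, rebuild_index):
    """Open the index; on detected corruption, rebuild it and retry once.

    Both arguments are hypothetical callables supplied by the caller.
    Returns a tuple (index, rebuilt).
    """
    try:
        return open_index(), False
    except IndexCorruptError:
        # Only corruption that the library itself notices ends up here.
        # Subtle damage that still parses as valid data goes undetected,
        # which is why checksumming belongs on a lower layer.
        rebuild_index()
        return open_index(), True
```

Note that a second failure after the rebuild is deliberately not caught:
if the cause sits on a lower layer, retrying in the application would
not help.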
>  * how long does it take to rebuild the index on a busy journal?

Long enough to cause the kids to power-cycle again. It would start over
on next boot, but if the issue we try to fix with the index rebuild was
on a lower layer, what will happen on that reboot?

>  * finally, if we can't find a 100% robust solution, would it make
>    sense to add a "Reindex Journal" button somewhere?

The user can just delete the index_updated file and restart Sugar. If
the laptop crashes often enough to warrant a button, we failed badly.

Sascha

[1] http://dev.laptop.org/~wad/nand/
--
http://sascha.silbe.org/
http://www.infra-silbe.de/
_______________________________________________
Sugar-devel mailing list
Sugar-devel@lists.sugarlabs.org
http://lists.sugarlabs.org/listinfo/sugar-devel