[email protected] wrote:
This might also explain how manifests are getting corrupted in bug 6011.
They parse correctly when we download them, but subsequent re-loads
fail.  Two processes writing to the same file without any locking could
certainly cause this problem.  The manifest code doesn't use a method to
keep its tempfiles unique, so two writers could be modifying the same
file, and then rename it into place.
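For illustration, here is a minimal sketch of what a unique-tempfile write could look like. This is not the actual manifest code; the function name write_atomically and the ".manifest." prefix are made up for the example, and it assumes the temporary file and the destination are on the same filesystem.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to path via a uniquely named temp file, then rename it."""
    dirname = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file exclusively, so two concurrent writers
    # never end up sharing the same temporary file.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".manifest.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # A rename within one filesystem is atomic on POSIX: readers see
        # either the complete old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort cleanup of the temp file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

With this approach the last writer still wins, but no reader should ever see a half-written or interleaved file.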

This is an extension of similar comments Shawn made earlier this
morning on bug 11169.

http://defect.opensolaris.org/bz/show_bug.cgi?id=11169#c4

I should note that it's just a theory, but I think it's a sound one. The other problem in being able to consistently reproduce this is that the cron entry for the update-refresh script:

30 0,9,12,18,21 * * * /usr/lib/update-manager/update-refresh.sh

...means that it will only be triggered (if I remember how to read crontab correctly :P) at the half-hour mark for the hours 0,9,12,18,21. In addition, the update-refresh.sh script adds a "dither": a random delay before the update refresh is actually performed, so that not all clients access the server at the same time.
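To make the timing concrete, here is a rough sketch of when a refresh could actually start under that schedule. The hours and minute come from the crontab entry above; the maximum dither of 1800 seconds is purely an assumption for the example (I have not checked what update-refresh.sh really uses), and the function names are made up.

```python
import random
from datetime import datetime, timedelta

CRON_MINUTE = 30
CRON_HOURS = (0, 9, 12, 18, 21)
MAX_DITHER_SECONDS = 1800  # assumption, not taken from update-refresh.sh


def next_trigger(now):
    """Return the first scheduled cron time strictly after now."""
    for day_offset in (0, 1):
        day = now.date() + timedelta(days=day_offset)
        for hour in CRON_HOURS:
            candidate = datetime(day.year, day.month, day.day,
                                 hour, CRON_MINUTE)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: tomorrow 00:30 is always later")


def dithered_start(now, rng=random):
    """Return the next trigger time plus a random dither delay."""
    delay = rng.uniform(0, MAX_DITHER_SECONDS)
    return next_trigger(now) + timedelta(seconds=delay)
```

So there are only five windows a day in which the refresh can collide with another client, and the exact start within each window is random.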

Digging further, I attempted to reproduce this on an upgraded image (122 -> 124), and while I saw the pkg/catalog directory disappearance behaviour described, I was unable to get a client to crash. This isn't surprising given that only specific race condition cases would expose it.

The good news is that, as far as I can tell, the current image upgrade code should not leave the system in a bad state. That is, it attempts to do all of the conversion work first before altering the image structure, and the very last thing it does is remove the old /var/pkg/catalog directory.

That should mean that if any of the clients is interrupted before it has a chance to perform the final step of removing the /var/pkg/catalog directory, the next time a client runs, it should be able to complete the upgrade without a problem.
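The ordering described above can be sketched roughly as follows. This is not the real client code: upgrade_image and the convert callable are hypothetical, and the sketch assumes the conversion step is safe to repeat.

```python
import os
import shutil


def upgrade_image(root, convert):
    """Upgrade the image under root; return True if any work was done.

    convert(old_dir, root) is a hypothetical callable that builds the
    new-format data and must be safe to run more than once.
    """
    old_dir = os.path.join(root, "catalog")
    if not os.path.isdir(old_dir):
        # The old directory is gone, so a previous run already finished.
        return False
    # Do all of the conversion work first. If this is interrupted, the
    # old directory is still there and the next run starts over.
    convert(old_dir, root)
    # Only as the very last step, remove the old catalog directory.
    shutil.rmtree(old_dir)
    return True
```

Because the old directory's presence is the marker for "upgrade not finished", an interrupted client simply leaves the work for the next one.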

In a worst-case scenario, should the image upgrade completely and unexpectedly fail, the user always has the old boot environment to fall back to.

For now, based on a conversation with Danek, I think release noting this is sufficient. The long-term solution is to have proper image-locking mechanisms in the client, since I imagine that this is not the last time we'll have an image format change and that would solve many other issues as well.
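As a very rough idea of what image locking could look like, here is a sketch using an advisory lock on a file inside the image. None of this exists in the client today as far as this thread says; image_lock and the "lock" file name are invented for the example, and it relies on POSIX record locks via fcntl.lockf, which only exclude other processes (not other threads in the same process).

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def image_lock(image_root, blocking=True):
    """Hold an exclusive advisory lock on the image for the with-block."""
    lock_path = os.path.join(image_root, "lock")  # hypothetical file name
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        flags = fcntl.LOCK_EX
        if not blocking:
            # Raise OSError right away if another process holds the lock.
            flags |= fcntl.LOCK_NB
        fcntl.lockf(fd, flags)
        try:
            yield
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A client would then wrap anything that modifies the image, for example "with image_lock('/var/pkg'): ...", so that a cron-triggered refresh and an interactive client cannot both rewrite the image at once.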

Cheers,
--
Shawn Walker
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
