Re: [Openstack-operators] [nova] Is verification of images in the image cache necessary?

2016-05-24 Thread Matthew Booth
On Tue, May 24, 2016 at 11:06 AM, John Garbutt wrote:

> On 24 May 2016 at 10:16, Matthew Booth wrote:
> > During its periodic task, ImageCacheManager does a checksum of every image
> > in the cache. It verifies this checksum against a previously stored value,
> > or creates that value if it doesn't already exist.[1] Based on this
> > information it generates a log message if the image is corrupt, but
> > otherwise takes no action. Going by git, this has been the case since 2012.
> >
> > The commit which added it was associated with 'blueprint
> > nova-image-cache-management phase 1'. I can't find this blueprint, but I did
> > find this page:
> > https://wiki.openstack.org/wiki/Nova-image-cache-management
> > This talks about 'detecting images which are corrupt'. It doesn't explain
> > why we would want to do that, though. It also doesn't seem to have been
> > followed through in the last 4 years, suggesting that nobody's really that
> > bothered.
> >
> > I understand that corruption of bits on disks is a thing, but it's a thing
> > for more than just the image cache. I feel that this is a problem much
> > better solved at other layers, prime candidates being the block and
> > filesystem layers. There are existing robust solutions to bitrot at both of
> > these layers which would cover all aspects of data corruption, not just this
> > randomly selected slice.
>
> +1
>
> That might mean improved docs on the need to configure such a thing.
>
> > As it stands, I think this code is regularly running a pretty expensive task
> > looking for something which will very rarely happen, only to generate a log
> > message which nobody is looking for. And it could be solved better in other
> > ways. Would anybody be sad if I deleted it?
>
> For completeness, we need to deprecate it using the usual cycles:
>
> https://governance.openstack.org/reference/tags/assert_follows-standard-deprecation.html


I guess I'm arguing that it isn't a feature, and never has been: it really
doesn't do anything at all except generate a log message. Are log messages
part of the deprecation contract?

If operators are genuinely finding corrupt images to be a problem, and find
this log message useful, that would be extremely valuable to know.


> I like the idea of checking the md5 matches before each boot, as it
> mirrors the check we do after downloading from glance. It's possible
> that's very unlikely to spot anything that shouldn't already be worried
> about by something else. It may just be my love of symmetry that makes
> me like that idea?
>

It just feels arbitrary to me for a few reasons. Firstly, it's only
relevant to storage schemes which use the file in the image cache as a
backing file. In the libvirt driver, this is just the qcow2 backend. While
this is the default, most users are actually using ceph. Assuming it isn't
cloning directly from ceph-backed glance, the Rbd backend imports from
the image cache during spawn, and has nothing to do with it thereafter. So
for Rbd we'd want to check during spawn. Same for the Flat, Lvm and Ploop
backends.

Except that it's still arbitrary because we're not checking the Qcow
overlay on each boot. Or ephemeral or swap disks. Or Lvm, Flat or Rbd disks
at all. Or the operating system. And it's still expensive, and better done
by the block or filesystem layer.

I'm not personally convinced there's all that much point checking during
download either, but given that we're loading all the bits anyway that
check is essentially free. However, even if we decided we needed to defend
the system against bitrot above the block/filesystem layer (and I'm not at
all convinced of that) we'd want a coordinated design for it. Without one,
we risk implementing a bunch of disconnected/incomplete stuff that doesn't
meet anybody's needs, but burns resources anyway.
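
To illustrate why the download-time check is essentially free: the hash can
be updated on each chunk as it is written, so no second read of the data is
needed. This is only a minimal sketch and not nova's actual code; the
function name, the chunk iterable and the destination object are all
hypothetical.

```python
import hashlib
import io


def download_with_checksum(chunks, dest, algorithm="md5"):
    """Write each chunk to dest and hash it in the same pass.

    Hypothetical helper: 'chunks' is any iterable of bytes (for example a
    streamed image download) and 'dest' is any writable binary file object.
    Returns the hex digest of everything written.
    """
    hasher = hashlib.new(algorithm)
    for chunk in chunks:
        hasher.update(chunk)  # the bytes are already in memory
        dest.write(chunk)
    return hasher.hexdigest()


if __name__ == "__main__":
    buf = io.BytesIO()
    digest = download_with_checksum([b"image-", b"bytes"], buf)
    print(digest, len(buf.getvalue()))
```

By contrast, a periodic or per-boot check has to read every byte of the
cached file back from disk just to hash it, which is where the cost comes
from.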

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [nova] Is verification of images in the image cache necessary?

2016-05-24 Thread John Garbutt
On 24 May 2016 at 10:16, Matthew Booth wrote:
> During its periodic task, ImageCacheManager does a checksum of every image
> in the cache. It verifies this checksum against a previously stored value,
> or creates that value if it doesn't already exist.[1] Based on this
> information it generates a log message if the image is corrupt, but
> otherwise takes no action. Going by git, this has been the case since 2012.
>
> The commit which added it was associated with 'blueprint
> nova-image-cache-management phase 1'. I can't find this blueprint, but I did
> find this page: https://wiki.openstack.org/wiki/Nova-image-cache-management
> . This talks about 'detecting images which are corrupt'. It doesn't explain
> why we would want to do that, though. It also doesn't seem to have been
> followed through in the last 4 years, suggesting that nobody's really that
> bothered.
>
> I understand that corruption of bits on disks is a thing, but it's a thing
> for more than just the image cache. I feel that this is a problem much
> better solved at other layers, prime candidates being the block and
> filesystem layers. There are existing robust solutions to bitrot at both of
> these layers which would cover all aspects of data corruption, not just this
> randomly selected slice.

+1

That might mean improved docs on the need to configure such a thing.

> As it stands, I think this code is regularly running a pretty expensive task
> looking for something which will very rarely happen, only to generate a log
> message which nobody is looking for. And it could be solved better in other
> ways. Would anybody be sad if I deleted it?

For completeness, we need to deprecate it using the usual cycles:
https://governance.openstack.org/reference/tags/assert_follows-standard-deprecation.html

I like the idea of checking the md5 matches before each boot, as it
mirrors the check we do after downloading from glance. It's possible
that's very unlikely to spot anything that shouldn't already be worried
about by something else. It may just be my love of symmetry that makes
me like that idea?
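
For reference, the stored-checksum comparison being discussed (hash the
cached file, record the value if none exists yet, otherwise compare and log
on mismatch) can be sketched roughly as below. This is an assumption-laden
illustration, not the ImageCacheManager code: the function names and the
plain dict used as the checksum store are hypothetical.

```python
import hashlib
import logging

LOG = logging.getLogger(__name__)


def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so it never sits in memory whole."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_cached_image(path, stored, algorithm="md5"):
    """Compare a cached image against its stored checksum.

    'stored' is a hypothetical dict mapping path -> hex digest. If no value
    is recorded yet, the current digest is stored and the image is treated
    as good. Returns False only when a recorded digest does not match.
    """
    current = hash_file(path, algorithm)
    expected = stored.get(path)
    if expected is None:
        stored[path] = current
        return True
    if current != expected:
        LOG.warning("image %s is corrupt: expected %s, got %s",
                    path, expected, current)
        return False
    return True
```

Note that every call reads the whole file, whether it runs from a periodic
task or before each boot, and the sketch takes no lock on the image while
hashing.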

Thanks,
johnthetubaguy


> [1] Incidentally, there also seems to be a bug in this implementation, in
> that it doesn't hold the lock on the image itself at any point during the
> hashing process, meaning that it cannot guarantee that the image has
> finished downloading yet.
