Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-09 Thread Chris Friesen

On 06/08/2015 06:26 PM, Robert Collins wrote:

On 9 June 2015 at 07:53, Chris Friesen chris.frie...@windriver.com wrote:

 From what I understand, the iterator (in the glance-api process) normally
breaks out of the while loop once the whole file has been read and the
read() call returns an empty string.

It's not clear to me how an error in the nova process (which causes it to
stop reading the file from glance-api)  will cause glance-api to break out
of the while loop in ChunkedFile.__iter__().


AIUI the conclusion of our IRC investigation was:
  - with caching middleware, the fd is freed, just after ~4m.
  - without caching middleware, the fd is freed after ~90s.


Correct. I went back and tried it out in a hardware lab and those are the 
results I got.  Sorry for the noise (though it'd be nice to get rid of the 
4-minute delay).


I did a bit more digging and we've apparently got another similar issue reported 
where taking a snapshot fails because the filesystem fills up and the space 
never gets reclaimed even though the snapshot failed.  I'll make sure I can 
reproduce that one before going further.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-08 Thread Chris Friesen

On 06/08/2015 12:30 PM, Robert Collins wrote:

On 9 June 2015 at 03:50, Chris Friesen chris.frie...@windriver.com wrote:

On 06/07/2015 04:22 PM, Robert Collins wrote:



Hi, original reporter here.

There's no LB involved.  The issue was noticed in a test lab that is tight
on disk space.  When an instance failed to boot, the person using the lab
tried to delete some images to free up space, at which point it was noticed
that space didn't actually free up.  (For at least half an hour, exact time
unknown.)

I'm more of a nova guy, so could you elaborate a bit on the GC?  Is
something going to delete the ChunkedFile object after a certain amount of
inactivity? What triggers the GC to run?


Ok, updating my theory...  I'm not super familiar with glance's
innards, so I'm going to call out my assumptions.

The ChunkedFile object is in the Nova process. It reads from a local
file, so it's the issue - and it has a handle on the image because
glance arranged for it to read from it.


As I understand it, the ChunkedFile object is in the glance-api process.  (At 
least, it's the glance-api process that is holding the open file descriptor.)




Anyhow - to answer your question: the iterator is only referenced by
the for loop, nothing else *can* hold a reference to it (without nasty
introspection hacks) and so as long as the iterator has an appropriate
try:finally:, which the filesystem ChunkedFile one does - the file will
be closed automatically.


From what I understand, the iterator (in the glance-api process) normally 
breaks out of the while loop once the whole file has been read and the read() 
call returns an empty string.


It's not clear to me how an error in the nova process (which causes it to stop 
reading the file from glance-api)  will cause glance-api to break out of the 
while loop in ChunkedFile.__iter__().
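To make the mechanics concrete, here is a minimal sketch of a ChunkedFile-style iterator (illustrative only; the class and attribute names are mine, not the actual glance_store code). The fd is closed either when read() returns an empty string and the loop breaks, or when the generator object itself is finalized:

```python
class ChunkedFile(object):
    """Sketch of a file-backed chunk iterator with try/finally cleanup."""
    CHUNKSIZE = 65536

    def __init__(self, path):
        self.fp = open(path, 'rb')

    def __iter__(self):
        try:
            while True:
                chunk = self.fp.read(self.CHUNKSIZE)
                if chunk:
                    yield chunk
                else:
                    break  # EOF: read() returned an empty string
        finally:
            self.close()

    def close(self):
        # Idempotent close; reached from the generator's finally clause.
        if self.fp is not None:
            self.fp.close()
            self.fp = None
```

If the consumer stops iterating and simply abandons the generator, the finally clause only runs when the generator is finalized, which is what ties the fd lifetime to garbage collection on the server side.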


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-08 Thread Robert Collins
On 9 June 2015 at 07:53, Chris Friesen chris.frie...@windriver.com wrote:
 On 06/08/2015 12:30 PM, Robert Collins wrote:

 On 9 June 2015 at 03:50, Chris Friesen chris.frie...@windriver.com
 wrote:

 On 06/07/2015 04:22 PM, Robert Collins wrote:


 Hi, original reporter here.

 There's no LB involved.  The issue was noticed in a test lab that is tight
 on disk space.  When an instance failed to boot, the person using the lab
 tried to delete some images to free up space, at which point it was noticed
 that space didn't actually free up.  (For at least half an hour, exact time
 unknown.)

 I'm more of a nova guy, so could you elaborate a bit on the GC?  Is
 something going to delete the ChunkedFile object after a certain amount of
 inactivity? What triggers the GC to run?


 Ok, updating my theory...  I'm not super familiar with glance's
 innards, so I'm going to call out my assumptions.

 The ChunkedFile object is in the Nova process. It reads from a local
 file, so it's the issue - and it has a handle on the image because
 glance arranged for it to read from it.


 As I understand it, the ChunkedFile object is in the glance-api process.
 (At least, it's the glance-api process that is holding the open file
 descriptor.)


 Anyhow - to answer your question: the iterator is only referenced by
 the for loop, nothing else *can* hold a reference to it (without nasty
 introspection hacks) and so as long as the iterator has an appropriate
 try:finally:, which the filesystem ChunkedFile one does - the file will
 be closed automatically.


 From what I understand, the iterator (in the glance-api process) normally
 breaks out of the while loop once the whole file has been read and the
 read() call returns an empty string.

 It's not clear to me how an error in the nova process (which causes it to
 stop reading the file from glance-api)  will cause glance-api to break out
 of the while loop in ChunkedFile.__iter__().

AIUI the conclusion of our IRC investigation was:
 - with caching middleware, the fd is freed, just after ~4m.
 - without caching middleware, the fd is freed after ~90s.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-08 Thread Chris Friesen

On 06/07/2015 04:22 PM, Robert Collins wrote:

On 6 June 2015 at 13:08, Ian Cordasco ian.corda...@rackspace.com wrote:



So the problem is with how we use ResponseSerializer and the ChunkedFile
(https://git.openstack.org/cgit/openstack/glance/tree/glance/api/v2/image_data.py#n222).
I think the problem we'll have is that webob provides
nothing on a Response
(https://webob.readthedocs.org/en/latest/modules/webob.html#response) to
hook into so we can close the ChunkedFile.

I wonder whether, if we used the body_file attribute, webob would close the file
when the response is closed (because I'm assuming that nova/glanceclient
are closing the response with which it's downloading the data).


But the maximum leak time is a single GC run, which we don't expect to
be long, unless the server is super quiet (and if it is, the temporary
leak is less likely to be an issue, no?).

I wonder if there's actually something else going on here. E.g. a
broken LB in front of glance-api which is preventing the HTTP
connection termination from being detected, and the thread is staying
open-and-stalled. That would explain the symptoms just as well - and
is deployer specific so could also explain the (perceived) trickiness
in reproduction.


Hi, original reporter here.

There's no LB involved.  The issue was noticed in a test lab that is tight on 
disk space.  When an instance failed to boot the person using the lab tried to 
delete some images to free up space, at which point it was noticed that space 
didn't actually free up.  (For at least half an hour, exact time unknown.)


I'm more of a nova guy, so could you elaborate a bit on the GC?  Is something 
going to delete the ChunkedFile object after a certain amount of inactivity? 
What triggers the GC to run?


Thanks,
Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-07 Thread Robert Collins
On 6 June 2015 at 13:08, Ian Cordasco ian.corda...@rackspace.com wrote:


 On 6/5/15, 02:55, Flavio Percoco fla...@redhat.com wrote:

On 04/06/15 11:46 -0600, Chris Friesen wrote:
On 06/04/2015 03:01 AM, Flavio Percoco wrote:
On 03/06/15 16:46 -0600, Chris Friesen wrote:
We recently ran into an issue where nova couldn't write an image file
due to
lack of space and so just quit reading from glance.

This caused glance to be stuck with an open file descriptor, which
meant that
the image consumed space even after it was deleted.

I have a crude fix for nova at
https://review.openstack.org/#/c/188179/
which basically continues to read the image even though it can't write
it.
That seems less than ideal for large images though.

Is there a better way to do this?  Is there a way for nova to indicate
to
glance that it's no longer interested in that image and glance can
close the
file?

If I've followed this correctly, on the glance side I think the code in
question is ultimately
glance_store._drivers.filesystem.ChunkedFile.__iter__().

Actually, to be honest, I was quite confused by the email :P

Correct me if I still didn't understand what you're asking.

You ran out of space on the Nova side while downloading the image and
there's a file descriptor leak somewhere either in that lovely (sarcasm)
glance wrapper or in glanceclient.

The first part is correct, but the file descriptor is actually held by
glance-api.

Just by reading your email and glancing at your patch, I believe the bug
might be in glanceclient but I'd need to dive into this. The piece of
code you'll need to look into is[0].

glance_store is just used server side. If that's what you meant -
glance is keeping the request and the ChunkedFile around - then yes,
glance_store is the place to look into.

[0]
https://github.com/openstack/python-glanceclient/blob/master/glanceclient/v1/images.py#L152

I believe what's happening is that the ChunkedFile code opens the file
and creates the iterator.  Nova then starts iterating through the
file.

If nova (or any other user of glance) iterates all the way through the
file then the ChunkedFile code will hit the finally clause in
__iter__() and close the file descriptor.

If nova starts iterating through the file and then stops (due to
running out of room, for example), the ChunkedFile.__iter__() routine
is left with an open file descriptor.  At this point deleting the
image will not actually free up any space.

I'm not a glance guy so I could be wrong about the code.  The
externally-visible data are:
1) glance-api is holding an open file descriptor to a deleted image file
2) If I kill glance-api the disk space is freed up.
3) If I modify nova to always finish iterating through the file the
problem doesn't occur in the first place.

Gotcha, thanks for explaining. I think the problem is that there might
be a reference leak and therefore the FD is kept open. Probably the
request interruption is not getting to the driver. I've filed this
bug[0] so we can look into it.

[0] https://bugs.launchpad.net/glance-store/+bug/1462235

Flavio

--
@flaper87
Flavio Percoco

 So the problem is with how we use ResponseSerializer and the ChunkedFile
 (https://git.openstack.org/cgit/openstack/glance/tree/glance/api/v2/image_data.py#n222).
 I think the problem we'll have is that webob provides
 nothing on a Response
 (https://webob.readthedocs.org/en/latest/modules/webob.html#response) to
 hook into so we can close the ChunkedFile.

 I wonder whether, if we used the body_file attribute, webob would close the file
 when the response is closed (because I'm assuming that nova/glanceclient
 are closing the response with which it's downloading the data).

But the maximum leak time is a single GC run, which we don't expect to
be long, unless the server is super quiet (and if it is, the temporary
leak is less likely to be an issue, no?).

I wonder if there's actually something else going on here. E.g. a
broken LB in front of glance-api which is preventing the HTTP
connection termination from being detected, and the thread is staying
open-and-stalled. That would explain the symptoms just as well - and
is deployer specific so could also explain the (perceived) trickiness
in reproduction.
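For what it's worth, the single-GC-run bound can be demonstrated with a toy generator (the hypothetical reader() below stands in for the ChunkedFile iterator): once the last reference to an abandoned generator drops, finalization runs its finally clause.

```python
import gc

events = []

def reader():
    # Stand-in for a ChunkedFile-style generator holding a resource.
    try:
        while True:
            yield b'chunk'
    finally:
        events.append('closed')  # where the real code would close the fd

it = reader()
next(it)      # the consumer reads a bit...
it = None     # ...then abandons the iterator without exhausting it
gc.collect()  # at the latest, the next GC run finalizes the generator
assert events == ['closed']
```

Under CPython's reference counting the finalization happens as soon as the last reference drops; a cycle-collecting run is the worst case.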

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-05 Thread Flavio Percoco

On 04/06/15 11:46 -0600, Chris Friesen wrote:

On 06/04/2015 03:01 AM, Flavio Percoco wrote:

On 03/06/15 16:46 -0600, Chris Friesen wrote:

We recently ran into an issue where nova couldn't write an image file due to
lack of space and so just quit reading from glance.

This caused glance to be stuck with an open file descriptor, which meant that
the image consumed space even after it was deleted.

I have a crude fix for nova at https://review.openstack.org/#/c/188179/
which basically continues to read the image even though it can't write it.
That seems less than ideal for large images though.

Is there a better way to do this?  Is there a way for nova to indicate to
glance that it's no longer interested in that image and glance can close the
file?

If I've followed this correctly, on the glance side I think the code in
question is ultimately glance_store._drivers.filesystem.ChunkedFile.__iter__().


Actually, to be honest, I was quite confused by the email :P

Correct me if I still didn't understand what you're asking.

You ran out of space on the Nova side while downloading the image and
there's a file descriptor leak somewhere either in that lovely (sarcasm)
glance wrapper or in glanceclient.


The first part is correct, but the file descriptor is actually held by 
glance-api.


Just by reading your email and glancing at your patch, I believe the bug
might be in glanceclient but I'd need to dive into this. The piece of
code you'll need to look into is[0].

glance_store is just used server side. If that's what you meant -
glance is keeping the request and the ChunkedFile around - then yes,
glance_store is the place to look into.

[0]
https://github.com/openstack/python-glanceclient/blob/master/glanceclient/v1/images.py#L152


I believe what's happening is that the ChunkedFile code opens the file 
and creates the iterator.  Nova then starts iterating through the 
file.


If nova (or any other user of glance) iterates all the way through the 
file then the ChunkedFile code will hit the finally clause in 
__iter__() and close the file descriptor.


If nova starts iterating through the file and then stops (due to 
running out of room, for example), the ChunkedFile.__iter__() routine 
is left with an open file descriptor.  At this point deleting the 
image will not actually free up any space.


I'm not a glance guy so I could be wrong about the code.  The 
externally-visible data are:

1) glance-api is holding an open file descriptor to a deleted image file
2) If I kill glance-api the disk space is freed up.
3) If I modify nova to always finish iterating through the file the 
problem doesn't occur in the first place.


Gotcha, thanks for explaining. I think the problem is that there might
be a reference leak and therefore the FD is kept open. Probably the
request interruption is not getting to the driver. I've filed this
bug[0] so we can look into it.

[0] https://bugs.launchpad.net/glance-store/+bug/1462235

Flavio

--
@flaper87
Flavio Percoco




Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-04 Thread Chris Friesen

On 06/04/2015 07:16 PM, Joshua Harlow wrote:

Perhaps someone needs to use (or forgot to use) contextlib.closing?

https://docs.python.org/2/library/contextlib.html#contextlib.closing

Or contextlib2 also may be useful:

http://contextlib2.readthedocs.org/en/latest/#contextlib2.ExitStack


The complication is that the error happens in nova, but it's glance-api that 
holds the file descriptor open.


So somehow the error has to be detected in nova, then the fact that nova is no 
longer interested in the file needs to be propagated through glanceclient via an 
HTTP call into glance-api so that it knows it can close the file descriptor. 
And I don't think that an API to do this exists currently.
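Absent such an API, one client-side mitigation is the "keep reading even though it can't write" approach from the crude fix mentioned elsewhere in the thread. A rough sketch (function and variable names are illustrative, not the actual nova patch): on a write error, drain the remaining chunks so the server side reaches EOF and its finally clause closes the fd.

```python
def download_image(image_chunks, dst):
    """Write image chunks to dst; on a write error, keep draining the
    iterator so the sender reaches EOF and releases its resources."""
    try:
        for chunk in image_chunks:
            dst.write(chunk)
    except IOError:
        for _ in image_chunks:  # discard the rest of the image
            pass
        raise
```

As noted, this is less than ideal for large images, since the whole image still crosses the wire after the failure.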


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-04 Thread Joshua Harlow

Perhaps someone needs to use (or forgot to use) contextlib.closing?

https://docs.python.org/2/library/contextlib.html#contextlib.closing

Or contextlib2 also may be useful:

http://contextlib2.readthedocs.org/en/latest/#contextlib2.ExitStack
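As a sketch of that suggestion (toy Source class, not glance code): contextlib.closing guarantees close() runs even when the consumer bails out mid-iteration.

```python
from contextlib import closing

class Source(object):
    """Toy stand-in for an object holding an open resource."""
    def __init__(self):
        self.opened = True

    def chunks(self):
        for _ in range(10):
            yield b'data'

    def close(self):
        self.opened = False

src = Source()
with closing(src) as s:
    for i, chunk in enumerate(s.chunks()):
        if i == 2:
            break  # the reader aborts early...
assert not src.opened  # ...but close() still ran on exit
```

The catch in this thread, of course, is that the leaked fd lives in a different process (glance-api) from the one that detects the error.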

-Josh

Chris Friesen wrote:

On 06/04/2015 03:01 AM, Flavio Percoco wrote:

On 03/06/15 16:46 -0600, Chris Friesen wrote:

We recently ran into an issue where nova couldn't write an image file
due to
lack of space and so just quit reading from glance.

This caused glance to be stuck with an open file descriptor, which
meant that
the image consumed space even after it was deleted.

I have a crude fix for nova at
https://review.openstack.org/#/c/188179/
which basically continues to read the image even though it can't
write it.
That seems less than ideal for large images though.

Is there a better way to do this? Is there a way for nova to indicate to
glance that it's no longer interested in that image and glance can
close the
file?

If I've followed this correctly, on the glance side I think the code in
question is ultimately
glance_store._drivers.filesystem.ChunkedFile.__iter__().


Actually, to be honest, I was quite confused by the email :P

Correct me if I still didn't understand what you're asking.

You ran out of space on the Nova side while downloading the image and
there's a file descriptor leak somewhere either in that lovely (sarcasm)
glance wrapper or in glanceclient.


The first part is correct, but the file descriptor is actually held by
glance-api.


Just by reading your email and glancing at your patch, I believe the bug
might be in glanceclient but I'd need to dive into this. The piece of
code you'll need to look into is[0].

glance_store is just used server side. If that's what you meant -
glance is keeping the request and the ChunkedFile around - then yes,
glance_store is the place to look into.

[0]
https://github.com/openstack/python-glanceclient/blob/master/glanceclient/v1/images.py#L152



I believe what's happening is that the ChunkedFile code opens the file
and creates the iterator. Nova then starts iterating through the file.

If nova (or any other user of glance) iterates all the way through the
file then the ChunkedFile code will hit the finally clause in
__iter__() and close the file descriptor.

If nova starts iterating through the file and then stops (due to running
out of room, for example), the ChunkedFile.__iter__() routine is left
with an open file descriptor. At this point deleting the image will not
actually free up any space.

I'm not a glance guy so I could be wrong about the code. The
externally-visible data are:
1) glance-api is holding an open file descriptor to a deleted image file
2) If I kill glance-api the disk space is freed up.
3) If I modify nova to always finish iterating through the file the
problem doesn't occur in the first place.

Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-04 Thread Chris Friesen

On 06/04/2015 03:01 AM, Flavio Percoco wrote:

On 03/06/15 16:46 -0600, Chris Friesen wrote:

We recently ran into an issue where nova couldn't write an image file due to
lack of space and so just quit reading from glance.

This caused glance to be stuck with an open file descriptor, which meant that
the image consumed space even after it was deleted.

I have a crude fix for nova at https://review.openstack.org/#/c/188179/
which basically continues to read the image even though it can't write it.
That seems less than ideal for large images though.

Is there a better way to do this?  Is there a way for nova to indicate to
glance that it's no longer interested in that image and glance can close the
file?

If I've followed this correctly, on the glance side I think the code in
question is ultimately glance_store._drivers.filesystem.ChunkedFile.__iter__().


Actually, to be honest, I was quite confused by the email :P

Correct me if I still didn't understand what you're asking.

You ran out of space on the Nova side while downloading the image and
there's a file descriptor leak somewhere either in that lovely (sarcasm)
glance wrapper or in glanceclient.


The first part is correct, but the file descriptor is actually held by 
glance-api.


Just by reading your email and glancing at your patch, I believe the bug
might be in glanceclient but I'd need to dive into this. The piece of
code you'll need to look into is[0].

glance_store is just used server side. If that's what you meant -
glance is keeping the request and the ChunkedFile around - then yes,
glance_store is the place to look into.

[0]
https://github.com/openstack/python-glanceclient/blob/master/glanceclient/v1/images.py#L152


I believe what's happening is that the ChunkedFile code opens the file and 
creates the iterator.  Nova then starts iterating through the file.


If nova (or any other user of glance) iterates all the way through the file then 
the ChunkedFile code will hit the finally clause in __iter__() and close the 
file descriptor.


If nova starts iterating through the file and then stops (due to running out of 
room, for example), the ChunkedFile.__iter__() routine is left with an open file 
descriptor.  At this point deleting the image will not actually free up any space.


I'm not a glance guy so I could be wrong about the code.  The externally-visible 
data are:

1) glance-api is holding an open file descriptor to a deleted image file
2) If I kill glance-api the disk space is freed up.
3) If I modify nova to always finish iterating through the file the problem 
doesn't occur in the first place.


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [glance] How to deal with aborted image read?

2015-06-04 Thread Flavio Percoco

On 03/06/15 16:46 -0600, Chris Friesen wrote:
We recently ran into an issue where nova couldn't write an image file 
due to lack of space and so just quit reading from glance.


This caused glance to be stuck with an open file descriptor, which 
meant that the image consumed space even after it was deleted.


I have a crude fix for nova at 
https://review.openstack.org/#/c/188179/ which basically continues 
to read the image even though it can't write it.  That seems less than 
ideal for large images though.


Is there a better way to do this?  Is there a way for nova to indicate 
to glance that it's no longer interested in that image and glance can 
close the file?


If I've followed this correctly, on the glance side I think the code 
in question is ultimately 
glance_store._drivers.filesystem.ChunkedFile.__iter__().


Actually, to be honest, I was quite confused by the email :P

Correct me if I still didn't understand what you're asking.

You ran out of space on the Nova side while downloading the image and
there's a file descriptor leak somewhere either in that lovely (sarcasm)
glance wrapper or in glanceclient.

Just by reading your email and glancing at your patch, I believe the bug
might be in glanceclient but I'd need to dive into this. The piece of
code you'll need to look into is[0].

glance_store is just used server side. If that's what you meant -
glance is keeping the request and the ChunkedFile around - then yes,
glance_store is the place to look into.

[0] 
https://github.com/openstack/python-glanceclient/blob/master/glanceclient/v1/images.py#L152

Cheers,
Flavio

--
@flaper87
Flavio Percoco

