Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On Sun, Aug 15, 2010 at 11:08:43PM +0200, Soren Hansen wrote: Unless multiple threads are reading from the fd (?), I'm quite confident in my data. I dumped the whole monitor_response buffer right before and right after the recvfrom call along with the return value of recvfrom. I'll rerun the tests tomorrow and show you the results. I'm running this on another kernel right now and I'm not seeing the problem. I'll try again with the kernel I used a couple of days ago. I'm not sure if I pointed this out in my initial e-mail, but even if I didn't have this recvfrom problem, the check seems odd to my eye. Is the monitor really going to send a max sized datagram each time? I would have expected it to only send a datagram the size of the header + the length of the data. My hunch here seems correct though. I've applied this diff: diff --git a/src/uml/uml_driver.c b/src/uml/uml_driver.c index 9cf669f..c289ad3 100644 --- a/src/uml/uml_driver.c +++ b/src/uml/uml_driver.c @@ -729,8 +733,13 @@ static int umlMonitorCommand(const struct uml_driver *driver, do { ssize_t nbytes; addrlen = sizeof(addr); +VIR_DEBUG(res.error: %d, res.extra: %d, res.length: %d, res.data: %.*s, + res.error, res.extra, res.length, res.length, res.data); nbytes = recvfrom(priv-monitor, res, sizeof res, 0, (struct sockaddr *)addr, addrlen); +VIR_DEBUG(nbytes: %d, nbytes); +VIR_DEBUG(res.error: %d, res.extra: %d, res.length: %d, res.data: %.*s, + res.error, res.extra, res.length, res.length, res.data); if (nbytes 0) { if (errno == EAGAIN || errno == EINTR) continue; And I get this output: 11:28:14.602: debug : umlMonitorCommand:702 : Run command 'config con0' 11:28:14.602: debug : umlMonitorCommand:737 : res.error: 5, res.extra: 0, res.length: 4096, res.data: 11:28:14.602: debug : umlMonitorCommand:740 : nbytes: 28 11:28:14.602: debug : umlMonitorCommand:742 : res.error: 0, res.extra: 0, res.length: 16, res.data: pts:/dev/pts/31 11:28:14.602: debug : umlMonitorCommand:771 : Command reply is 'pts:/dev/pts/31' So the size of the response datagram isn't sizeof(res) as the check in uml_driver.c expects, but rather sizeof(res.error) + sizeof(res.extra) + sizeof(res.length) + res.length. -- Soren Hansen so...@linux2go.dk Systems Architect, The Rackspace Cloud Ubuntu Developer -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On Mon, Aug 16, 2010 at 11:37:07AM +0200, Soren Hansen wrote: I'm running this on another kernel right now and I'm not seeing the problem. I'll try again with the kernel I used a couple of days ago. Ok, found the other kernel. Same diff as in my previous e-mail, same action. These are my results: 09:41:01.134: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.134: debug : umlMonitorCommand:733 : res.error: 6, res.extra: 0, res.length: 4096, res.data: 09:41:01.134: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.134: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts 09:41:01.134: debug : umlMonitorCommand:761 : Command reply is 'pts' 09:41:01.144: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.144: debug : umlMonitorCommand:733 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts 09:41:01.144: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.144: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts 09:41:01.144: debug : umlMonitorCommand:761 : Command reply is 'pts' 09:41:01.154: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.154: debug : umlMonitorCommand:733 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts 09:41:01.155: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.155: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts 09:41:01.155: debug : umlMonitorCommand:761 : Command reply is 'pts' 09:41:01.165: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.165: debug : umlMonitorCommand:733 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts 09:41:01.169: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.169: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts 09:41:01.169: debug : umlMonitorCommand:761 : Command reply is 'pts' 09:41:01.179: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.179: debug : umlMonitorCommand:733 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts 09:41:01.179: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.179: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 15, res.data: pts:/dev/pts/7 09:41:01.179: debug : umlMonitorCommand:761 : Command reply is 'pts:/dev/pts/7' This one's a 2.6.34.1 kernel. The one where I didn't see the problem is a 2.6.32.something-or-the-other kernel. Mindboggling. -- Soren Hansen so...@linux2go.dk Systems Architect, The Rackspace Cloud Ubuntu Developer -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On 08/16/2010 03:37 AM, Soren Hansen wrote: And I get this output: 11:28:14.602: debug : umlMonitorCommand:702 : Run command 'config con0' 11:28:14.602: debug : umlMonitorCommand:737 : res.error: 5, res.extra: 0, res.length: 4096, res.data: 11:28:14.602: debug : umlMonitorCommand:740 : nbytes: 28 11:28:14.602: debug : umlMonitorCommand:742 : res.error: 0, res.extra: 0, res.length: 16, res.data: pts:/dev/pts/31 11:28:14.602: debug : umlMonitorCommand:771 : Command reply is 'pts:/dev/pts/31' So the size of the response datagram isn't sizeof(res) as the check in uml_driver.c expects, but rather sizeof(res.error) + sizeof(res.extra) + sizeof(res.length) + res.length. I agree with this analysis. In other words, the check should be more like this (two conditions - did we get enough bytes to even have a valid res.length, and did we get enough bytes to match with what res.length stated): if (nbytes offsetof(struct monitor_request, data) || nbytes res.length + offsetof(struct monitor_request, data)) incomplete reply instead of the two current but incorrect checks: if (nbytes sizeof ret) /* incorrectly true except when res.length==MONITOR_BUFLEN */ incomplete reply; if (sizeof res.data res.length) /* res.length of MONITOR_BUFLEN+1 does not trigger this check, but leads to a buffer overflow in subsequent memmove */ invalid length; But before I write such a patch, I'm going to look in more details at your other reply. -- Eric Blake ebl...@redhat.com+1-801-349-2682 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On 08/16/2010 03:54 AM, Soren Hansen wrote: On Mon, Aug 16, 2010 at 11:37:07AM +0200, Soren Hansen wrote: I'm running this on another kernel right now and I'm not seeing the problem. I'll try again with the kernel I used a couple of days ago. Ok, found the other kernel. Same diff as in my previous e-mail, same action. These are my results: 09:41:01.134: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.134: debug : umlMonitorCommand:733 : res.error: 6, res.extra: 0, res.length: 4096, res.data: 09:41:01.134: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.134: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts This one's a 2.6.34.1 kernel. The one where I didn't see the problem is a 2.6.32.something-or-the-other kernel. Mindboggling. Oh my. This really does look like a kernel bug. Can you confirm it with an strace? Have you reported this regression to the right kernel folks? I guess it would help if we could write a simpler test program to isolate whether this recvfrom bug exists in a bare minimum number of syscalls. Meanwhile, I have no idea how to work around a buggy recvfrom that doesn't return the correct number of bytes. -- Eric Blake ebl...@redhat.com+1-801-349-2682 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On Mon, Aug 16, 2010 at 10:08:37AM -0600, Eric Blake wrote: On 08/16/2010 03:54 AM, Soren Hansen wrote: On Mon, Aug 16, 2010 at 11:37:07AM +0200, Soren Hansen wrote: I'm running this on another kernel right now and I'm not seeing the problem. I'll try again with the kernel I used a couple of days ago. Ok, found the other kernel. Same diff as in my previous e-mail, same action. These are my results: 09:41:01.134: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.134: debug : umlMonitorCommand:733 : res.error: 6, res.extra: 0, res.length: 4096, res.data: 09:41:01.134: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.134: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts This one's a 2.6.34.1 kernel. The one where I didn't see the problem is a 2.6.32.something-or-the-other kernel. Mindboggling. Oh my. This really does look like a kernel bug. Can you confirm it with an strace? Have you reported this regression to the right kernel folks? I guess it would help if we could write a simpler test program to isolate whether this recvfrom bug exists in a bare minimum number of syscalls. Meanwhile, I have no idea how to work around a buggy recvfrom that doesn't return the correct number of bytes. If there is an identifiable kernel bug then that should be reported and fixed ASAP. IMHO we shouldn't try to workaround such serious brokenness as this is showing. Regards, Daniel -- |: Red Hat, Engineering, London-o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org-o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On 16-08-2010 18:08, Eric Blake wrote: On 08/16/2010 03:54 AM, Soren Hansen wrote: 09:41:01.134: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.134: debug : umlMonitorCommand:733 : res.error: 6, res.extra: 0, res.length: 4096, res.data: 09:41:01.134: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.134: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts This one's a 2.6.34.1 kernel. The one where I didn't see the problem is a 2.6.32.something-or-the-other kernel. Mindboggling. Oh my. This really does look like a kernel bug. Can you confirm it with an strace? Annoyingly, no. It doesn't happen in the primary libvirt thread, so I have to pass -f to strace, and uml doesn't let you PTRACE it, so I can't trigger the problem. Have you reported this regression to the right kernel folks? It doesn't happen in 2.6.35, so I seems isolated to that particular kernel. It's really very, very odd. I guess it would help if we could write a simpler test program to isolate whether this recvfrom bug exists in a bare minimum number of syscalls. Meanwhile, I have no idea how to work around a buggy recvfrom that doesn't return the correct number of bytes. No, you're right. Whatever we do to work around this is going to suck somehow. -- Soren Hansen Ubuntu Developer http://www.ubuntu.com/ signature.asc Description: OpenPGP digital signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On 08/16/2010 03:37 AM, Soren Hansen wrote: nbytes = recvfrom(priv-monitor, res, sizeof res, 0, (struct sockaddr *)addr, addrlen); +VIR_DEBUG(nbytes: %d, nbytes); User error, not a kernel bug. Phew. %d and ssize_t are not always compatible sizes. -- Eric Blake ebl...@redhat.com+1-801-349-2682 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On 16-08-2010 18:08, Eric Blake wrote: Ok, found the other kernel. Same diff as in my previous e-mail, same action. These are my results: 09:41:01.134: debug : umlMonitorCommand:698 : Run command 'config con0' 09:41:01.134: debug : umlMonitorCommand:733 : res.error: 6, res.extra: 0, res.length: 4096, res.data: 09:41:01.134: debug : umlMonitorCommand:736 : nbytes: 0 09:41:01.134: debug : umlMonitorCommand:738 : res.error: 0, res.extra: 0, res.length: 4, res.data: pts *blush* Ok, we can officially stop chasing this now. While working on some of the other patches I've sent, I apparantly managed to copy a bit of code from one place to another and by luck it was syntactically correct, so the compiler didn't complain. The offending *two characters* were: nbytes = recvfrom(priv-monitor, res, sizeof res, 0, - (struct sockaddr *)addr, addrlen); + (struct sockaddr *)addr, addrlen)0; if (nbytes 0) { Henceforth, nbytes was 0. Sorry about the wasted brain cycles on this. -- Soren Hansen Ubuntu Developer http://www.ubuntu.com/ signature.asc Description: OpenPGP digital signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On 16-08-2010 18:04, Eric Blake wrote: So the size of the response datagram isn't sizeof(res) as the check in uml_driver.c expects, but rather sizeof(res.error) + sizeof(res.extra) + sizeof(res.length) + res.length. I agree with this analysis. In other words, the check should be more like this (two conditions - did we get enough bytes to even have a valid res.length, and did we get enough bytes to match with what res.length stated): if (nbytes offsetof(struct monitor_request, data) || nbytes res.length + offsetof(struct monitor_request, data)) incomplete reply Yup, this looks good. But before I write such a patch, I'm going to look in more details at your other reply. Let's just forget all about that one, shall we? Please? :) -- Soren Hansen Ubuntu Developer http://www.ubuntu.com/ signature.asc Description: OpenPGP digital signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On Sat, Aug 14, 2010 at 11:11:40AM -0600, Eric Blake wrote: First of all, for me, the call to recvfrom returns 0, even though the buffer actually is populated with the response from the monitor. I'm not sure about this one. The recvfrom is called in a loop, and POSIX says that recvfrom shall return the size if a message was read, or 0 if no messages were available but the other end of the connection did an orderly shutdown. Yes, it does. That's part of why it took me so long to figure out. I really didn't expect data to come back when recvfrom returned 0 :) I think we need a bit more understanding of why you are getting a 0 return, and whether it only happens after a successful iteration of the loop (in which case the contents of res are leftover from the prior iteration). Unless multiple threads are reading from the fd (?), I'm quite confident in my data. I dumped the whole monitor_response buffer right before and right after the recvfrom call along with the return value of recvfrom. I'll rerun the tests tomorrow and show you the results. If my results are accurate and it is indeed a kernel bug (well, or eglibc or whatnot), I believe it's in the spirit of libvirt to accept that one of the other parts of the equation is doing something silly and work around it. :) I'm happy to provide more evidence, though. I'm not sure if I pointed this out in my initial e-mail, but even if I didn't have this recvfrom problem, the check seems odd to my eye. Is the monitor really going to send a max sized datagram each time? I would have expected it to only send a datagram the size of the header + the length of the data. -- Soren Hansen so...@linux2go.dk Systems Architect, The Rackspace Cloud Ubuntu Developer -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 3/3] Remove wrong check for uml monitor response size
On 08/12/2010 03:00 AM, Soren Hansen wrote: The current check for the size of the response from the uml monitor is problematic for a couple of reasons: First of all, for me, the call to recvfrom returns 0, even though the buffer actually is populated with the response from the monitor. I'm not sure about this one. The recvfrom is called in a loop, and POSIX says that recvfrom shall return the size if a message was read, or 0 if no messages were available but the other end of the connection did an orderly shutdown. I think we need a bit more understanding of why you are getting a 0 return, and whether it only happens after a successful iteration of the loop (in which case the contents of res are leftover from the prior iteration). At any rate, the only way I can get anything useful to happen here is to completely remove the check, so this patch does just that. In other words, maybe the better thing to do is add an earlier check: if (nbytes == 0) break; but I'm not sure of that idea, either, without more understanding of the root cause. -- Eric Blake ebl...@redhat.com+1-801-349-2682 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list