Re: [libvirt] [Qemu-devel] IO accounting overhaul

2014-09-05 Thread Benoît Canet
The Friday 05 Sep 2014 à 16:30:31 (+0200), Kevin Wolf wrote :
> Am 01.09.2014 um 13:41 hat Markus Armbruster geschrieben:
> > Benoît Canet  writes:
> > 
> > > The Monday 01 Sep 2014 à 11:52:00 (+0200), Markus Armbruster wrote :
> > >> Cc'ing libvirt following Stefan's lead.
> > >> 
> > >> Benoît Canet  writes:
> > >> > /* the following would compute latencies for slices of 1 second, then 
> > >> > toss the
> > >> >  * result and start a new slice. A weighted summation of the instant 
> > >> > latencies
> > >> >  * could help to implement this.
> > >> >  */
> > >> > 1s_read_average_latency
> > >> > 1s_write_average_latency
> > >> > 1s_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1
> > >> > minute slice value */
> > >> > 1m_read_average_latency
> > >> > 1m_write_average_latency
> > >> > 1m_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1 hour
> > >> > slice value */
> > >> > 1h_read_average_latency
> > >> > 1h_write_average_latency
> > >> > 1h_flush_average_latency
> > >> 
> > >> This is something like "what we added to total_FOO_time in the last
> > >> completed 1s / 1m / 1h time slice divided by the number of additions".
> > >> Just another way to accumulate the same raw data, thus no worries.
> > >> 
> > >> > /* 1 second average number of requests in flight */
> > >> > 1s_read_queue_depth
> > >> > 1s_write_queue_depth
> > >> >
> > >> > /* 1 minute average number of requests in flight */
> > >> > 1m_read_queue_depth
> > >> > 1m_write_queue_depth
> > >> >
> > >> > /* 1 hour average number of requests in flight */
> > >> > 1h_read_queue_depth
> > >> > 1h_write_queue_depth
> 

I asked for some input from a cloud provider.

> I don't think I agree with putting fixed time periods like 1 s/min/h
> into qemu. What you need there is policy and we should probably make
> it configurable.

Yes, baking policy into QEMU is bad.

> 
> Do we need accounting for multiple time periods at the same time or
> would it be enough to have one and make its duration an option?

Having multiple time periods (up to 3) at once would be cool.
A few big knobs in the configuration (one per period) would be preferable
to a configuration nightmare of one or more settings per device.
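To make that concrete, here is a minimal sketch (the type and function names
are mine, not an existing QEMU API) of a latency accumulator whose slice
duration comes from configuration instead of being hard-coded to 1 s / 1 min /
1 h:

#include <stdint.h>

/* Hypothetical sketch: accumulate latencies over a configurable slice, then
 * publish the average and start a new slice. */
typedef struct TimedAverage {
    int64_t period_ns;    /* slice length, taken from configuration */
    int64_t slice_start;  /* timestamp at which the current slice began */
    int64_t sum_ns;       /* sum of the latencies seen in this slice */
    uint64_t count;       /* number of requests seen in this slice */
    int64_t last_avg_ns;  /* average latency of the last completed slice */
} TimedAverage;

static void timed_average_account(TimedAverage *ta, int64_t now, int64_t latency_ns)
{
    if (now - ta->slice_start >= ta->period_ns) {
        ta->last_avg_ns = ta->count ? ta->sum_ns / ta->count : 0;
        ta->slice_start = now;
        ta->sum_ns = 0;
        ta->count = 0;
    }
    ta->sum_ns += latency_ns;
    ta->count++;
}

One instance per (device, operation type, period) would be enough to expose the
1s/1m/1h style values discussed above, with the periods chosen by the user.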

> 
> > > Optionally collecting the same data for each BDS of the graph.
> > 
> > If that's the case, keeping the shared infrastructure in the block layer
> > makes sense.
> > 
> > BDS member acct then holds I/O stats for the BDS.  We currently use it
> > for something else: I/O stats of the device model backed by this BDS.
> > That needs to move elsewhere.  Two places come to mind:
> > 
> > 1. BlockBackend, when it's available (I resumed working on it last week
> >for a bit).  Superficially attractive, because it's close to what we
> >have now, but then we have to deal with what to do when the backend
> >gets disconnected from its device model, then connected to another
> >one.
> > 
> > 2. The device models that actually implement I/O accounting.  Since
> >query-blockstats names a backend rather than a device model, we need
> >a BlockDevOps callback to fetch the stats.  Fetch fails when the
> >callback is null.  Lets us distinguish "no stats yet" and "device
> >model can't do stats", thus permits a QMP interface that doesn't lie.
> > 
> > Right now, I like (2) better.
> 
> So let's say I have some block device, which is attached to a guest
> device for a while, but then I detach it and continue using it in a
> different place (maybe another guest device or a block job). Should we
> really reset all counters in query-blockstats to 0?
> 
> I think as a user I would be surprised about this, because I still refer
> to it by the same name (the device_name, which will be in the BB), so
> it's the same thing for me and the total requests include everything
> that was ever issued against it.

This particular cloud provider thinks that associating the stats with the
emulated hardware is the least worrisome to manage.

So now we need to discuss this further in the community.

Re: [libvirt] [Qemu-devel] IO accounting overhaul

2014-09-05 Thread Benoît Canet
The Friday 05 Sep 2014 à 16:30:31 (+0200), Kevin Wolf wrote :
> Am 01.09.2014 um 13:41 hat Markus Armbruster geschrieben:
> > Benoît Canet  writes:
> > 
> > > The Monday 01 Sep 2014 à 11:52:00 (+0200), Markus Armbruster wrote :
> > >> Cc'ing libvirt following Stefan's lead.
> > >> 
> > >> Benoît Canet  writes:
> > >> > /* the following would compute latencies for slices of 1 second, then 
> > >> > toss the
> > >> >  * result and start a new slice. A weighted summation of the instant 
> > >> > latencies
> > >> >  * could help to implement this.
> > >> >  */
> > >> > 1s_read_average_latency
> > >> > 1s_write_average_latency
> > >> > 1s_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1
> > >> > minute slice value */
> > >> > 1m_read_average_latency
> > >> > 1m_write_average_latency
> > >> > 1m_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1 hour
> > >> > slice value */
> > >> > 1h_read_average_latency
> > >> > 1h_write_average_latency
> > >> > 1h_flush_average_latency
> > >> 
> > >> This is something like "what we added to total_FOO_time in the last
> > >> completed 1s / 1m / 1h time slice divided by the number of additions".
> > >> Just another way to accumulate the same raw data, thus no worries.
> > >> 
> > >> > /* 1 second average number of requests in flight */
> > >> > 1s_read_queue_depth
> > >> > 1s_write_queue_depth
> > >> >
> > >> > /* 1 minute average number of requests in flight */
> > >> > 1m_read_queue_depth
> > >> > 1m_write_queue_depth
> > >> >
> > >> > /* 1 hour average number of requests in flight */
> > >> > 1h_read_queue_depth
> > >> > 1h_write_queue_depth
> 
> I don't think I agree with putting fixed time periods like 1 s/min/h
> into qemu. What you need there is policy and we should probably make
> it configurable.

I agree.

> 
> Do we need accounting for multiple time periods at the same time or
> would it be enough to have one and make its duration an option?

I don't know yet.

> 
> > > Optionally collecting the same data for each BDS of the graph.
> > 
> > If that's the case, keeping the shared infrastructure in the block layer
> > makes sense.
> > 
> > BDS member acct then holds I/O stats for the BDS.  We currently use it
> > for something else: I/O stats of the device model backed by this BDS.
> > That needs to move elsewhere.  Two places come to mind:
> > 
> > 1. BlockBackend, when it's available (I resumed working on it last week
> >for a bit).  Superficially attractive, because it's close to what we
> >have now, but then we have to deal with what to do when the backend
> >gets disconnected from its device model, then connected to another
> >one.
> > 
> > 2. The device models that actually implement I/O accounting.  Since
> >query-blockstats names a backend rather than a device model, we need
> >a BlockDevOps callback to fetch the stats.  Fetch fails when the
> >callback is null.  Lets us distinguish "no stats yet" and "device
> >model can't do stats", thus permits a QMP interface that doesn't lie.
> > 
> > Right now, I like (2) better.
> 
> So let's say I have some block device, which is attached to a guest
> device for a while, but then I detach it and continue using it in a
> different place (maybe another guest device or a block job). Should we
> really reset all counters in query-blockstats to 0?
> 
> I think as a user I would be surprised about this, because I still refer
> to it by the same name (the device_name, which will be in the BB), so
> it's the same thing for me and the total requests include everything
> that was ever issued against it.
> 
> > > -API wize I think about adding
> > > bdrv_acct_invalid() and
> > > bdrv_acct_failed() and systematically issuing a bdrv_acct_start() asap.
> > 
> > Complication: partial success.  Example:
> > 
> > 1. Guest requests a read of N sectors.
> > 
> > 2. Device model calls
> >bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_

Re: [libvirt] [Qemu-devel] IO accounting overhaul

2014-09-05 Thread Benoît Canet
The Friday 05 Sep 2014 à 16:30:31 (+0200), Kevin Wolf wrote :
> Am 01.09.2014 um 13:41 hat Markus Armbruster geschrieben:
> > Benoît Canet  writes:
> > 
> > > The Monday 01 Sep 2014 à 11:52:00 (+0200), Markus Armbruster wrote :
> > >> Cc'ing libvirt following Stefan's lead.
> > >> 
> > >> Benoît Canet  writes:
> > >> > /* the following would compute latencies for slices of 1 second, then 
> > >> > toss the
> > >> >  * result and start a new slice. A weighted summation of the instant 
> > >> > latencies
> > >> >  * could help to implement this.
> > >> >  */
> > >> > 1s_read_average_latency
> > >> > 1s_write_average_latency
> > >> > 1s_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1
> > >> > minute slice value */
> > >> > 1m_read_average_latency
> > >> > 1m_write_average_latency
> > >> > 1m_flush_average_latency
> > >> >
> > >> > /* the former three numbers could be used to further compute a 1 hour
> > >> > slice value */
> > >> > 1h_read_average_latency
> > >> > 1h_write_average_latency
> > >> > 1h_flush_average_latency
> > >> 
> > >> This is something like "what we added to total_FOO_time in the last
> > >> completed 1s / 1m / 1h time slice divided by the number of additions".
> > >> Just another way to accumulate the same raw data, thus no worries.
> > >> 
> > >> > /* 1 second average number of requests in flight */
> > >> > 1s_read_queue_depth
> > >> > 1s_write_queue_depth
> > >> >
> > >> > /* 1 minute average number of requests in flight */
> > >> > 1m_read_queue_depth
> > >> > 1m_write_queue_depth
> > >> >
> > >> > /* 1 hour average number of requests in flight */
> > >> > 1h_read_queue_depth
> > >> > 1h_write_queue_depth
> 
> I don't think I agree with putting fixed time periods like 1 s/min/h
> into qemu. What you need there is policy and we should probably make
> it configurable.
> 
> Do we need accounting for multiple time periods at the same time or
> would it be enough to have one and make its duration an option?
> 
> > > Optionally collecting the same data for each BDS of the graph.
> > 
> > If that's the case, keeping the shared infrastructure in the block layer
> > makes sense.
> > 
> > BDS member acct then holds I/O stats for the BDS.  We currently use it
> > for something else: I/O stats of the device model backed by this BDS.
> > That needs to move elsewhere.  Two places come to mind:
> > 
> > 1. BlockBackend, when it's available (I resumed working on it last week
> >for a bit).  Superficially attractive, because it's close to what we
> >have now, but then we have to deal with what to do when the backend
> >gets disconnected from its device model, then connected to another
> >one.
> > 
> > 2. The device models that actually implement I/O accounting.  Since
> >query-blockstats names a backend rather than a device model, we need
> >a BlockDevOps callback to fetch the stats.  Fetch fails when the
> >callback is null.  Lets us distinguish "no stats yet" and "device
> >model can't do stats", thus permits a QMP interface that doesn't lie.
> > 
> > Right now, I like (2) better.
> 
> So let's say I have some block device, which is attached to a guest
> device for a while, but then I detach it and continue using it in a
> different place (maybe another guest device or a block job). Should we
> really reset all counters in query-blockstats to 0?
> 
> I think as a user I would be surprised about this, because I still refer
> to it by the same name (the device_name, which will be in the BB), so
> it's the same thing for me and the total requests include everything
> that was ever issued against it.

It all depends on where you are in the user food chain.
If you are a cloud end user, you are interested in the stats associated with
the hardware of your /dev/vdx, because that's what iostat allows you to see.

> 
> > > -API wize I think about adding
> > > bdrv_acct_invalid() and
> > > bdrv_acct_failed() and systematically issuing a bdrv_acct_start() asap.
> > 
> > Complication: partial success.  Example:
> > 

Re: [libvirt] [Qemu-devel] NBD TLS support in QEMU

2014-09-04 Thread Benoît Canet
The Friday 05 Sep 2014 à 00:07:04 (+0200), Wouter Verhelst wrote :
> On Thu, Sep 04, 2014 at 04:19:17PM +0200, Benoît Canet wrote:
> > The Wednesday 03 Sep 2014 à 17:44:17 (+0100), Stefan Hajnoczi wrote :
> > > Hi,
> > > QEMU offers both NBD client and server functionality.  The NBD protocol
> > > runs unencrypted, which is a problem when the client and server
> > > communicate over an untrusted network.
> > > 
> > > The particular use case that prompted this mail is storage migration in
> > > OpenStack.  The goal is to encrypt the NBD connection between source and
> > > destination hosts during storage migration.
> > 
> > I agree this would be useful.
> > 
> > > 
> > > I think we can integrate TLS into the NBD protocol as an optional flag.
> > > A quick web search does not reveal existing open source SSL/TLS NBD
> > > implementations.  I do see a VMware NBDSSL protocol but there is no
> > > specification so I guess it is proprietary.
> > > 
> > > The NBD protocol starts with a negotiation phase.  This would be the
> > > appropriate place to indicate that TLS will be used.  After client and
> > > server complete TLS setup the connection can continue as normal.
> > 
> > Pre-negotiating TLS looks like we would accidentally introduce some security 
> > hole.

I was thinking of the fallback to cleartext case.

As a regular developer I am afraid of doing something creative with
cryptography.

> 
> Can you elaborate on that? How would it be a security hole?
> 
> > Why not just use a dedicated port and let the TLS handshake happen 
> > normally?
> 
> Because STARTTLS(-like) protocols are much cleaner; no need to open two
> firewall ports. Also, when I made the request for a port number at IANA,
> I was told I wouldn't get another port for a "secure" variant -- which
> makes sense. As such, if the reference implementation is ever going to
> support TLS, it has to be in a way where it is negotiated at setup time.
> 
> SMTP can do this safely. So can LDAP. I'm sure we can come up with a
> safe way of negotiating TLS.
> 
> If you want to disallow nonencrypted communication, I'm sure it can be
> made possible to require TLS for (some of) your exports.
> 
> (my objections on userspace/kernelspace issues still stand, however)
> 
> -- 
> It is easy to love a country that is famous for chocolate and beer
> 
>   -- Barack Obama, speaking in Brussels, Belgium, 2014-03-26
> 



Re: [libvirt] [Qemu-devel] NBD TLS support in QEMU

2014-09-04 Thread Benoît Canet
The Thursday 04 Sep 2014 à 15:34:59 (+0100), Daniel P. Berrange wrote :
> On Thu, Sep 04, 2014 at 04:19:17PM +0200, Benoît Canet wrote:
> > The Wednesday 03 Sep 2014 à 17:44:17 (+0100), Stefan Hajnoczi wrote :
> > > Hi,
> > > QEMU offers both NBD client and server functionality.  The NBD protocol
> > > runs unencrypted, which is a problem when the client and server
> > > communicate over an untrusted network.
> > > 
> > > The particular use case that prompted this mail is storage migration in
> > > OpenStack.  The goal is to encrypt the NBD connection between source and
> > > destination hosts during storage migration.
> > 
> > I agree this would be useful.
> > 
> > > 
> > > I think we can integrate TLS into the NBD protocol as an optional flag.
> > > A quick web search does not reveal existing open source SSL/TLS NBD
> > > implementations.  I do see a VMware NBDSSL protocol but there is no
> > > specification so I guess it is proprietary.
> > > 
> > > The NBD protocol starts with a negotiation phase.  This would be the
> > > appropriate place to indicate that TLS will be used.  After client and
> > > server complete TLS setup the connection can continue as normal.
> > 
> > Pre-negotiating TLS looks like we would accidentally introduce some security 
> > hole.
> > Why not just use a dedicated port and let the TLS handshake happen 
> > normally?
> 
> The mgmt app (libvirt in this case) chooses an arbitrary port when
> telling QEMU to setup NBD, so we don't need to specify any alternate
> port. I'd expect that libvirt just tell QEMU to enable NBD at both
> ends, and we immediately do the TLS handshake upon opening the
> connection.  Only once TLS is established, should the NBD protocol
> start running. IOW we don't need to modify the NBD protocol at all.
> 
> If the mgmt app tells QEMU to enable TLS at one end and not the
> other, the mgmt app gets what it deserves (a failed TLS handshake).
> We certainly would not want QEMU to auto-negotiate and fallback
> to plain text in this case.

I agree.
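To make the "TLS first, NBD only afterwards" ordering concrete, here is a rough
client-side sketch using GnuTLS (the helper name and the stripped-down error
handling are my own assumptions, not QEMU code):

#include <gnutls/gnutls.h>

/* Establish TLS on an already-connected socket before a single NBD
 * negotiation byte is exchanged; no fallback to cleartext. */
static int secure_nbd_connect(int fd, gnutls_session_t *out)
{
    gnutls_certificate_credentials_t cred;
    gnutls_session_t session;

    gnutls_certificate_allocate_credentials(&cred);
    gnutls_init(&session, GNUTLS_CLIENT);
    gnutls_set_default_priority(session);
    gnutls_credentials_set(session, GNUTLS_CRD_CERTIFICATE, cred);
    gnutls_transport_set_int(session, fd);

    if (gnutls_handshake(session) < 0) {
        gnutls_deinit(session);
        return -1;                  /* handshake failed: no plain-text fallback */
    }
    *out = session;                 /* only now start the NBD protocol */
    return 0;
}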

Best regards

Benoît

> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com  -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org  -o- http://virt-manager.org :|
> |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
> 



Re: [libvirt] [Qemu-devel] NBD TLS support in QEMU

2014-09-04 Thread Benoît Canet
The Wednesday 03 Sep 2014 à 17:44:17 (+0100), Stefan Hajnoczi wrote :
> Hi,
> QEMU offers both NBD client and server functionality.  The NBD protocol
> runs unencrypted, which is a problem when the client and server
> communicate over an untrusted network.
> 
> The particular use case that prompted this mail is storage migration in
> OpenStack.  The goal is to encrypt the NBD connection between source and
> destination hosts during storage migration.

I agree this would be useful.

> 
> I think we can integrate TLS into the NBD protocol as an optional flag.
> A quick web search does not reveal existing open source SSL/TLS NBD
> implementations.  I do see a VMware NBDSSL protocol but there is no
> specification so I guess it is proprietary.
> 
> The NBD protocol starts with a negotiation phase.  This would be the
> appropriate place to indicate that TLS will be used.  After client and
> server complete TLS setup the connection can continue as normal.

Pre-negotiating TLS looks like we would accidentally introduce some security hole.
Why not just use a dedicated port and let the TLS handshake happen normally?

Best regards

Benoît
> 
> Besides QEMU, the userspace NBD tools (http://nbd.sf.net/) can also be
> extended to support TLS.  In this case the kernel needs a localhost
> socket and userspace handles TLS.
> 
> Thoughts?
> 
> Stefan




Re: [libvirt] [Qemu-devel] IO accounting overhaul

2014-09-01 Thread Benoît Canet
The Monday 01 Sep 2014 à 13:41:01 (+0200), Markus Armbruster wrote :
> Benoît Canet  writes:
> 
> > The Monday 01 Sep 2014 à 11:52:00 (+0200), Markus Armbruster wrote :
> >> Cc'ing libvirt following Stefan's lead.
> >> 
> >> Benoît Canet  writes:
> >> 
> >> > Hi,
> >> >
> >> > I collected some items of a cloud provider wishlist regarding I/O 
> >> > accounting.
> >> 
> >> Feedback from real power-users, lovely!
> >> 
> >> > In a cloud, I/O accounting can have 3 purposes: billing, helping the 
> >> > customers
> >> > and doing metrology to help the cloud provider seek hidden costs.
> >> >
> >> > I'll cover the two former topics in this mail because they are the
> >> > most important
> >> > business-wise.
> >> >
> >> > 1) preferred place to collect billing I/O accounting data:
> >> > 
> >> > For billing purposes the collected data must be as close as
> >> > possible to what the
> >> > customer would see by using iostat in his VM.
> >> 
> >> Good point.
> >> 
> >> > The first conclusion we can draw is that the choice of collecting
> >> > I/O accounting
> >> > data used for billing in the block device models is right.
> >> 
> >> Slightly rephrasing: doing I/O accounting in the block device models is
> >> right for billing.
> >> 
> >> There may be other uses for I/O accounting, with different preferences.
> >> For instance, data on how exactly guest I/O gets translated to host I/O
> >> as it flows through the nodes in the block graph could be useful.
> >
> > I think this is the third point that I named as metrology.
> > Basically it boils down to "Where are the hidden IO costs of the QEMU
> > block layer".
> 
> Understood.
> 
> >> Doesn't diminish the need for accurate billing information, of course.
> >> 
> >> > 2) what to do with occurrences of rare events:
> >> > -
> >> >
> >> > Another point is that QEMU developers agree that they don't know
> >> > which policy
> >> > to apply to some I/O accounting events.
> >> > Must QEMU discard invalid write I/Os or account them as done?
> >> > Must QEMU count a failed read I/O as done?
> >> >
> >> > When discussing this with a cloud provider the following appeared:
> >> > these decisions
> >> > are really specific to each cloud provider and QEMU should not
> >> > implement them.
> >> 
> >> Good point, consistent with the old advice to avoid baking policy into
> >> inappropriately low levels of the stack.
> >> 
> >> > The right thing to do is to add accounting counters to collect these 
> >> > events.
> >> >
> >> > Moreover, these rare events are precious troubleshooting data, so that's
> >> > an additional
> >> > reason not to toss them.
> >> 
> >> Another good point.
> >> 
> >> > 3) list of block I/O accounting metrics wished for billing and helping
> >> > the customers
> >> > ---
> >> >
> >> > Basic I/O accounting data will end up making the customers' bills.
> >> > Extra I/O accounting information would be a precious help for the
> >> > cloud provider
> >> > to implement a monitoring panel like Amazon CloudWatch.
> >> 
> >> These are the first two from your list of three purposes, i.e. the ones
> >> you promised to cover here.
> >> 
> >> > Here is the list of counters and statistics I would like to help
> >> > implement in QEMU.
> >> >
> >> > This is the most important part of the mail and the one I would like
> >> > the community
> >> > to review the most.
> >> >
> >> > Once this list is settled I would proceed to implement the required
> >> > infrastructure
> >> > in QEMU before using it in the device models.
> >> 
> >> For context, let me recap how I/O accounting works now.
> >> 
> >> The BlockDriverState abstract data type (short: BDS) can hold the
> >> following accounting data:
> >> 
> >> uint64_t nr_bytes[BDRV_MA

Re: [libvirt] [Qemu-devel] IO accounting overhaul

2014-09-01 Thread Benoît Canet
The Monday 01 Sep 2014 à 11:52:00 (+0200), Markus Armbruster wrote :
> Cc'ing libvirt following Stefan's lead.
> 
> Benoît Canet  writes:
> 
> > Hi,
> >
> > I collected some items of a cloud provider wishlist regarding I/O accounting.
> 
> Feedback from real power-users, lovely!
> 
> > In a cloud, I/O accounting can have 3 purposes: billing, helping the customers
> > and doing metrology to help the cloud provider seek hidden costs.
> >
> > I'll cover the two former topics in this mail because they are the most 
> > important
> > business-wise.
> >
> > 1) preferred place to collect billing I/O accounting data:
> > 
> > For billing purposes the collected data must be as close as possible to what 
> > the
> > customer would see by using iostat in his VM.
> 
> Good point.
> 
> > The first conclusion we can draw is that the choice of collecting I/O 
> > accounting
> > data used for billing in the block device models is right.
> 
> Slightly rephrasing: doing I/O accounting in the block device models is
> right for billing.
> 
> There may be other uses for I/O accounting, with different preferences.
> For instance, data on how exactly guest I/O gets translated to host I/O
> as it flows through the nodes in the block graph could be useful.

I think this is the third point that I named as metrology.
Basically it boils down to "Where are the hidden IO costs of the QEMU block 
layer".

> 
> Doesn't diminish the need for accurate billing information, of course.
> 
> > 2) what to do with occurrences of rare events:
> > -
> >
> > Another point is that QEMU developers agree that they don't know which 
> > policy
> > to apply to some I/O accounting events.
> > Must QEMU discard invalid write I/Os or account them as done?
> > Must QEMU count a failed read I/O as done?
> >
> > When discussing this with a cloud provider the following appeared:
> > these decisions
> > are really specific to each cloud provider and QEMU should not implement 
> > them.
> 
> Good point, consistent with the old advice to avoid baking policy into
> inappropriately low levels of the stack.
> 
> > The right thing to do is to add accounting counters to collect these events.
> >
> > Moreover, these rare events are precious troubleshooting data, so that's
> > an additional
> > reason not to toss them.
> 
> Another good point.
> 
> > 3) list of block I/O accounting metrics wished for billing and helping
> > the customers
> > ---
> >
> > Basic I/O accounting data will end up making the customers' bills.
> > Extra I/O accounting information would be a precious help for the cloud 
> > provider
> > to implement a monitoring panel like Amazon CloudWatch.
> 
> These are the first two from your list of three purposes, i.e. the ones
> you promised to cover here.
> 
> > Here is the list of counters and statistics I would like to help
> > implement in QEMU.
> >
> > This is the most important part of the mail and the one I would like
> > the community
> > to review the most.
> >
> > Once this list is settled I would proceed to implement the required
> > infrastructure
> > in QEMU before using it in the device models.
> 
> For context, let me recap how I/O accounting works now.
> 
> The BlockDriverState abstract data type (short: BDS) can hold the
> following accounting data:
> 
> uint64_t nr_bytes[BDRV_MAX_IOTYPE];
> uint64_t nr_ops[BDRV_MAX_IOTYPE];
> uint64_t total_time_ns[BDRV_MAX_IOTYPE];
> uint64_t wr_highest_sector;
> 
> where BDRV_MAX_IOTYPE enumerates read, write, flush.
> 
> wr_highest_sector is a high watermark updated by the block layer as it
> writes sectors.
> 
> The other three are *not* touched by the block layer.  Instead, the
> block layer provides a pair of functions for device models to update
> them:
> 
> void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
> int64_t bytes, enum BlockAcctType type);
> void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
> 
> bdrv_acct_start() initializes cookie for a read, write, or flush
> operation of a certain size.  The size of a flush is always zero.
> 
> bdrv_acct_done() adds the operations to the BDS's accounting data.
> total_time_ns is incremented by the time between _start() and _done().
> 
> You may call _start() without callin
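A minimal usage sketch of the pair of functions recapped above (the request
structure, the callback and the submission step are illustrative assumptions
relying on QEMU's block headers, not actual device-model code):

/* Hypothetical device-model read path built on bdrv_acct_start()/_done(). */
typedef struct MyRequest {
    BlockDriverState *bs;
    BlockAcctCookie acct;
    /* ... device-specific fields ... */
} MyRequest;

static void my_read_complete(void *opaque, int ret)
{
    MyRequest *req = opaque;

    /* Adds the bytes, the op count and the elapsed time to the BDS stats. */
    bdrv_acct_done(req->bs, &req->acct);
    /* ... signal completion to the guest ... */
}

static void my_device_read(MyRequest *req, int64_t sector, int nb_sectors)
{
    bdrv_acct_start(req->bs, &req->acct, nb_sectors * BDRV_SECTOR_SIZE,
                    BDRV_ACCT_READ);
    /* ... submit the asynchronous read; my_read_complete() runs when it
     * finishes (or fails, which is exactly where the proposed
     * bdrv_acct_failed()/bdrv_acct_invalid() calls would go) ... */
}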

Re: [libvirt] [Qemu-devel] IO accounting overhaul

2014-08-29 Thread Benoît Canet
The Friday 29 Aug 2014 à 17:04:46 (+0100), Stefan Hajnoczi wrote :
> On Thu, Aug 28, 2014 at 04:38:09PM +0200, Benoît Canet wrote:
> > I collected some items of a cloud provider wishlist regarding I/O accounting.
> > 
> > In a cloud, I/O accounting can have 3 purposes: billing, helping the customers
> > and doing metrology to help the cloud provider seek hidden costs.
> > 
> > I'll cover the two former topics in this mail because they are the most 
> > important
> > business-wise.
> > 
> > 1) preferred place to collect billing I/O accounting data:
> > 
> > For billing purposes the collected data must be as close as possible to what 
> > the
> > customer would see by using iostat in his VM.
> > 
> > The first conclusion we can draw is that the choice of collecting I/O 
> > accounting
> > data used for billing in the block device models is right.
> 
> I agree.  When statistics are collected at lower layers it becomes hard
> for the end user to understand numbers that include hidden costs for
> image formats, network protocols, etc.
> 
> > 2) what to do with occurrences of rare events:
> > -
> > 
> > Another point is that QEMU developers agree that they don't know which 
> > policy
> > to apply to some I/O accounting events.
> > Must QEMU discard invalid write I/Os or account them as done?
> > Must QEMU count a failed read I/O as done?
> > 
> > When discussing this with a cloud provider the following appeared: these 
> > decisions
> > are really specific to each cloud provider and QEMU should not implement 
> > them.
> > The right thing to do is to add accounting counters to collect these events.
> > 
> > Moreover, these rare events are precious troubleshooting data, so that's an 
> > additional
> > reason not to toss them.
> 
> Sounds good, network interface statistics also include error counters.
> 
> > 3) list of block I/O accounting metrics wished for billing and helping the 
> > customers
> > ---
> > 
> > Basic I/O accounting data will end up making the customers' bills.
> > Extra I/O accounting information would be a precious help for the cloud 
> > provider
> > to implement a monitoring panel like Amazon CloudWatch.
> 
> One thing to be aware of is that counters inside QEMU cannot be trusted.
> If a malicious guest can overwrite memory in QEMU then the counters can
> be manipulated.
> 
> For most purposes this should be okay.  Just be aware that evil guests
> could manipulate their counters if a security hole is found in QEMU.
> 
> > Here is the list of counters and statistics I would like to help implement 
> > in QEMU.
> > 
> > This is the most important part of the mail and the one I would like the 
> > community
> > to review the most.
> > 
> > Once this list is settled I would proceed to implement the required 
> > infrastructure
> > in QEMU before using it in the device models.
> > 
> > /* volume of data transferred by the I/Os */
> > read_bytes
> > write_bytes
> > 
> > /* operation count */
> > read_ios
> > write_ios
> > flush_ios
> > 
> > /* how many invalid I/Os the guest submits */
> > invalid_read_ios
> > invalid_write_ios
> > invalid_flush_ios
> > 
> > /* how many I/O errors happened */
> > read_ios_error
> > write_ios_error
> > flush_ios_error
> > 
> > /* account the time spent doing I/Os */
> > total_read_time
> > total_write_time
> > total_flush_time
> > 
> > /* since when the volume has been idle */
> > qvolume_iddleness_time
> 
> ?

s/qv/v/

It's the time the volume spent being idle.
Amazon reports it in its tools.
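A rough sketch (the struct name and grouping are mine, not an existing QEMU
structure) of how these per-device counters could be held together:

#include <stdint.h>

/* Hypothetical grouping of the proposed per-device accounting counters. */
typedef struct BlockDeviceStats {
    uint64_t read_bytes, write_bytes;                /* volume of data transferred */
    uint64_t read_ios, write_ios, flush_ios;         /* operation counts */
    uint64_t invalid_read_ios, invalid_write_ios,
             invalid_flush_ios;                      /* invalid requests from the guest */
    uint64_t read_ios_error, write_ios_error,
             flush_ios_error;                        /* failed operations */
    uint64_t total_read_time, total_write_time,
             total_flush_time;                       /* time spent doing I/O, in ns */
    int64_t  idle_since;                             /* when the volume went idle */
} BlockDeviceStats;

The latency and queue-depth figures below would then be derived from these raw
counters over whatever slice durations end up being configurable.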

> 
> > 
> > /* the following would compute latencies for slices of 1 second, then toss 
> > the
> >  * result and start a new slice. A weighted summation of the instant 
> > latencies
> >  * could help to implement this.
> >  */
> > 1s_read_average_latency
> > 1s_write_average_latency
> > 1s_flush_average_latency
> > 
> > /* the former three numbers could be used to further compute a 1 minute 
> > slice value */
> > 1m_read_average_latency
> > 1m_write_average_latency
> > 1m_flush_average_latency
> > 
> > /* the former three numbers could be used to further compute a 1 hour 
> > slice value */
> > 1h_read_average_latency
> > 1h_write_aver

Re: [libvirt] Quorum block driver libvirt support proposal

2014-05-16 Thread Benoît Canet
The Friday 16 May 2014 à 08:20:22 (-0600), Eric Blake wrote :
> On 05/16/2014 08:07 AM, Peter Krempa wrote:
> 
> >>>
> >>> It feels rather odd to have <backingStore> elements but no top level
> >>> disk images. Really these are all top level images
> >>
> >> It reflects the way QEMU does it. A single BlockDriverState holding n
> >> quorum BlockDriverState children. There is a 1-1 mapping.
> >>
> >> How would you see it ?
> > 
> > We'd rather see multiple source elements for the top level disk. Backing
> > store is a property of the source image, thus every single one of those
> > sources should have its own list (or perhaps a tree?).
> 
> I don't see how you can possibly have multiple source elements.
> Remember, part of the determination of what forms a valid <source>
> element is the type='...' attribute tied to the <disk> parent element -
> but you can't have duplicate attributes.  As I see it, a quorum HAS to
> be a special chain element with 0 sources and multiple backingStore
> children, where each backingStore then includes the type='...' attribute
> for how to interpret the <source> element of that child.

Additionally, Quorum supports taking snapshots, so we need one entity to
bind them together.

Best regards

Benoît

> 
> -- 
> Eric Blake   eblake redhat com+1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 






Re: [libvirt] Quorum block driver libvirt support proposal

2014-05-16 Thread Benoît Canet
The Friday 16 May 2014 à 09:54:43 (-0400), Daniel P. Berrange wrote :
> On Fri, May 16, 2014 at 12:33:04PM +0200, Benoît Canet wrote:
> > 
> > Hello list, 
> > 
> > 
> > 
> > I want to implement libvirt Quorum support. 
> > 
> > (https://github.com/qemu/qemu/commit/c88a1de51ab2f26a9a37ffc317249736de8c015c)
> >   
> > Quorum is a QEMU RAID-like block storage driver.
> > 
> > Data are written to n replicas, and when a read is done a comparison between 
> > the 
> > replicas read is done. If at least threshold reads are identical the read 
> > succeeds,
> > else it's an error.
> > 
> > For example a Quorum with n = 3 and threshold = 2 would be made of three 
> > QCOW2
> > backing chains used as identical replicas. threshold = 2 means that at 
> > least
> > 2 replicas must be identical when doing a read.
> >   
> > 
> > 
> > I want to make use of the new backingStore xml element to implement quorum. 
> > 
> > 
> > 
> > Proposed Quorum libvirt format: 
> > 
> > --- 
> > 
> >   [the proposed XML example was stripped in this archive copy]
> > 
> 
> It feels rather odd to have <backingStore> elements but no top level
> disk images. Really these are all top level images

It reflects the way QEMU does it. A single BlockDriverState holding n
quorum BlockDriverState children. There is a 1-1 mapping.

How would you see it?

Best regards

Benoît


> 
> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com  -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org  -o- http://virt-manager.org :|
> |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



[libvirt] Quorum block driver libvirt support proposal

2014-05-16 Thread Benoît Canet

Hello list, 

I want to implement libvirt Quorum support. 
(https://github.com/qemu/qemu/commit/c88a1de51ab2f26a9a37ffc317249736de8c015c)  
Quorum is a QEMU RAID-like block storage driver.
Data are written to n replicas, and when a read is done a comparison between the
replicas read is done. If at least threshold reads are identical the read succeeds,
else it's an error.

For example a Quorum with n = 3 and threshold = 2 would be made of three QCOW2
backing chains used as identical replicas. threshold = 2 means that at least
2 replicas must be identical when doing a read.
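To illustrate the read path, here is a rough sketch of the threshold vote on
reads (this is my simplification, not the actual QEMU Quorum code, which votes
per block and tracks per-child errors):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The same range is read from all n replicas; the result is accepted only if
 * at least `threshold` replicas returned identical data. */
static bool quorum_vote(uint8_t **replica_buf, int n, int threshold,
                        size_t len, uint8_t **winner)
{
    for (int i = 0; i < n; i++) {
        int votes = 0;

        for (int j = 0; j < n; j++) {
            if (memcmp(replica_buf[i], replica_buf[j], len) == 0) {
                votes++;
            }
        }
        if (votes >= threshold) {
            *winner = replica_buf[i];   /* enough identical copies: success */
            return true;
        }
    }
    return false;                       /* no value reached the threshold: error */
}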
  

I want to make use of the new backingStore xml element to implement quorum. 

Proposed Quorum libvirt format: 
--- 

  [the proposed XML example was stripped in this archive copy]

Implementation plan:


* Add VIR_STORAGE_TYPE_QUORUM   
  

* In src/util/virstoragefile.h change _virStorageSource to contain a
  virStorageSourcePtrPtr backingStores field (see the sketch after this list).
  I think doing it at this level allows keeping a 1-1 mapping with the QEMU
  BlockDriverState hierarchy.

* Add an int quorum_threshold field to the same structure.
  

* Add support for parsing threshold in virDomainDiskDefParseXML.
  

* Change virDomainDiskBackingStoreParse to virDomainDiskBackingStoresParse to
  parse all the backingStore elements at once and use realloc to grow the
  backingStores field.

* Modify virDomainDiskDefFormat to call virDomainDiskBackingStoreFormat in a
  loop for saving.

* Hook into qemuBuildDriveStr around line 3442 to create the quorum parameters.
  
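A hypothetical sketch of the _virStorageSource additions described in the plan
above (the plan's virStorageSourcePtrPtr is rendered here as an array of
virStorageSourcePtr; the real libvirt layout may differ):

#include <stddef.h>

typedef struct _virStorageSource virStorageSource;
typedef virStorageSource *virStorageSourcePtr;

/* Sketch only: possible additions to _virStorageSource for quorum support. */
struct _virStorageSource {
    /* ... existing fields ... */

    size_t nBackingStores;                /* number of quorum replicas */
    virStorageSourcePtr *backingStores;   /* one backing chain per replica */
    int quorum_threshold;                 /* minimum identical reads required */
};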

Do you feel that I am missing something?

Best regards

Benoît  

