Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-07 Thread Jan Beulich
>>> On 06.03.12 at 18:20, Konrad Rzeszutek Wilk  wrote:
>  -> the usage of XenbusStateInitWait? Why do we introduce that? Looks
> like a fix to something.

No, this is required to get the negotiation working (the frontend must
not try to read the new nodes until it can be certain that the backend
populated them). However, as already pointed out in an earlier reply
to Santosh, the way this is done here doesn't appear to allow for the
backend to already be in InitWait state when the frontend gets
invoked.

> -> XENBUS_MAX_RING_PAGES - why 2? Why not 4? What is the optimal
> default size for SSD usage? 16?

What do SSDs have to do with a XenBus definition? Imo it's wrong (and
unnecessary) to introduce a limit at the XenBus level at all - each driver
can do this for itself.

As to the limit for SSDs in the block interface - I don't think the number
of possibly simultaneous requests has anything to do with this. Instead,
I'd expect the request number/size/segments extension that NetBSD
apparently implements to possibly have an effect.

Jan



Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-07 Thread Konrad Rzeszutek Wilk
On Mar 7, 2012 4:33 AM, "Jan Beulich"  wrote:
>
> >>> On 06.03.12 at 18:20, Konrad Rzeszutek Wilk  wrote:
> >  -> the usage of XenbusStateInitWait? Why do we introduce that? Looks
> > like a fix to something.
>
> No, this is required to get the negotiation working (the frontend must
> not try to read the new nodes until it can be certain that the backend
> populated them). However, as already pointed out in an earlier reply
> to Santosh, the way this is done here doesn't appear to allow for the
> backend to already be in InitWait state when the frontend gets
> invoked.

OK.
>
> > -> XENBUS_MAX_RING_PAGES - why 2? Why not 4? What is the optimal
> > default size for SSD usage? 16?
>
> What do SSDs have to do with a XenBus definition? Imo it's wrong (and
> unnecessary) to introduce a limit at the XenBus level at all - each driver
> can do this for itself.

The patch should mention what the benefit of the multi-page ring is.
>
> As to the limit for SSDs in the block interface - I don't think the number
> of possibly simultaneous requests has anything to do with this. Instead,
> I'd expect the request number/size/segments extension that NetBSD
> apparently implements to possibly have an effect.

... which sounds to me like increasing the bandwidth of the protocol. That
should be mentioned somewhere in the git description.
>
> Jan
>
>

Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-14 Thread Jan Beulich
>>> On 14.03.12 at 07:32, Justin Gibbs  wrote:
> There's another problem here that I brought up during the Xen
> Hack-a-thon.  The ring macros require that the ring element count
> be a power of two.  This doesn't mean that the ring will be a power
> of 2 pages in size.  To illustrate this point, I modified the FreeBSD
> blkback driver to provide negotiated ring stats via sysctl.
> 
> Here's a connection to a Windows VM running the Citrix PV drivers:
> 
> dev.xbbd.2.max_requests: 128
> dev.xbbd.2.max_request_segments: 11
> dev.xbbd.2.max_request_size: 45056
> dev.xbbd.2.ring_elem_size: 108  <= 32bit ABI
> dev.xbbd.2.ring_pages: 4
> dev.xbbd.2.ring_elements: 128
> dev.xbbd.2.ring_waste: 2496
> 
> Over half a page is wasted when ring-page-order is 2.  I'm sure you
> can see where this is going.  :-)
> 
> Here are the limits published by our backend to the XenStore:
> 
> max-ring-pages = "113"
> max-ring-page-order = "7"
> max-requests = "256"
> max-request-segments = "129"
> max-request-size = "524288"
> 
> Because we allow so many concurrent, large requests in our product,
> the ring wastage really adds up if the front end doesn't support
> the "ring-pages" variant of the extension.  However, you only need
> a ring-page-order of 3 with this protocol to start seeing pages of
> wasted ring space.
> 
> You don't really want to negotiate "ring-pages" either.  The backends
> often need to support multiple ABIs.  I can easily construct a set
> of limits for the FreeBSD blkback driver which will cause the ring
> limits to vary by a page between the 32bit and 64bit ABIs.
> 
> With all this in mind, the backend must do a dance of rounding up,
> taking the max of the ring sizes for the different ABIs, and then
> validating the front-end published limits taking its ABI into
> account.  The front-end does some of this too.  It's way too messy
> and error-prone because we don't communicate the ring element limit
> directly.
> 
> "max-ring-element-order" anyone? :-)

Interesting observation - yes, I think deprecating both pre-existing
methods in favor of something along those lines would be desirable.
(But I'd favor not using the term "order" here as it is - at least in
Linux - usually implied to be used on pages. "max-ringent-log2"
perhaps?)

What you say also implies that all the Linux backend patches currently
floating around are flawed in the way they calculate the number
of ring entries, as this number really depends on the protocol the
frontend advertises.

Further, if you're concerned about wasting ring space (and
particularly in the context of your request number/size/segments
extension), shouldn't we bother to define pairs (or larger groups)
of struct blkif_request_segment (as currently a quarter of the space
is mere padding)? Or split grefs from {first,last}_sect altogether?
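
(To make the padding concrete - the current layout as I read io/blkif.h,
plus one purely illustrative "paired" variant; the pair struct below is a
strawman, not a concrete proposal:)

    #include <stdint.h>
    typedef uint32_t grant_ref_t;            /* as in the Xen public headers */

    /* Current layout: 6 bytes of payload, but alignment pads it to 8,
     * so a quarter of each segment slot carries no information. */
    struct blkif_request_segment {
        grant_ref_t gref;                    /* grant for the I/O buffer frame */
        uint8_t     first_sect, last_sect;   /* sector range within the frame  */
        /* 2 bytes of implicit padding here */
    };

    /* Strawman pairing: two segments in 12 bytes instead of 16. */
    struct blkif_request_segment_pair {
        grant_ref_t gref[2];
        uint8_t     first_sect[2];
        uint8_t     last_sect[2];
    };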

Finally, while looking at all this again, I stumbled across the use
of blkif_vdev_t in the ring structures: At least Linux'es blkback
completely ignores this field - {xen_,}vbd_translate() simply
overwrites what dispatch_rw_block_io() put there (and with this,
struct phys_req's dev and bdev members seem rather pointless too).
Does anyone recall what the original intention with this request field
was? Allowing I/O on multiple devices over a single ring?

Bottom line - shouldn't we define a blkif2 interface to cleanly
accommodate all the various extensions (and do away with the
protocol variations)?

Jan



Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-14 Thread Jan Beulich
>>> On 14.03.12 at 07:32, Justin Gibbs  wrote:
> There's another problem here that I brought up during the Xen
> Hack-a-thon.  The ring macros require that the ring element count
> be a power of two.  This doesn't mean that the ring will be a power
> of 2 pages in size.  To illustrate this point, I modified the FreeBSD
> blkback driver to provide negotiated ring stats via sysctl.
> 
> Here's a connection to a Windows VM running the Citrix PV drivers:
> 
> dev.xbbd.2.max_requests: 128
> dev.xbbd.2.max_request_segments: 11
> dev.xbbd.2.max_request_size: 45056
> dev.xbbd.2.ring_elem_size: 108  <= 32bit ABI
> dev.xbbd.2.ring_pages: 4
> dev.xbbd.2.ring_elements: 128
> dev.xbbd.2.ring_waste: 2496
> 
> Over half a page is wasted when ring-page-order is 2.  I'm sure you
> can see where this is going.  :-)

Having looked a little closer at how the wasted space progresses,
I find myself in the odd position that I can't explain the original (and
still active) definition of BLKIF_MAX_SEGMENTS_PER_REQUEST (11):
with ring-order zero, 0x240/0x1c0 bytes (32-bit/64-bit
respectively) are left unused. With 32 requests fitting in the ring, and with
each segment occupying 6 bytes (padded to 8), in the 64-bit variant
there's enough space for a 12th segment (32-bit would even have
space for a 13th). Am I missing anything here?
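
(A quick sketch of the arithmetic, in case anyone wants to check it -
assuming a 4k page, the 64-byte shared ring header, and per-request
header sizes consistent with the 108/112-byte element sizes quoted
elsewhere in this thread:)

    #include <stdio.h>

    int main(void)
    {
        unsigned page = 4096, hdr = 64, reqs = 32;
        unsigned seg = 8;                  /* 6-byte segment, padded to 8    */
        unsigned base64 = 24, base32 = 20; /* per-request header, 64/32-bit  */

        printf("64-bit, 11 segs: unused %#x\n",
               page - hdr - reqs * (base64 + 11 * seg));  /* 0x1c0 */
        printf("32-bit, 11 segs: unused %#x\n",
               page - hdr - reqs * (base32 + 11 * seg));  /* 0x240 */
        printf("64-bit fits 12 segs: %d\n",
               reqs * (base64 + 12 * seg) <= page - hdr); /* 1     */
        printf("32-bit fits 13 segs: %d\n",
               reqs * (base32 + 13 * seg) <= page - hdr); /* 1     */
        return 0;
    }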

Plus, all of this assumes a page size of 4k, yet ia64 has always used
16k pages IIRC.

> Here are the limits published by our backend to the XenStore:
>
> max-ring-pages = "113"
> max-ring-page-order = "7"
> max-requests = "256"
> max-request-segments = "129"
> max-request-size = "524288"

Oh, so this protocol doesn't require ring-pages (and max-ring-pages)
to be a power of two? In which case I think it is a mistake to also
advertise max-ring-page-order, as at least the (Linux) frontend code
I know of interprets this as being able to set up a ring of (using the
numbers above) 128 pages (unless, of course, your backend can deal
with this regardless of the max-ring-pages value it announces).

Jan



Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-14 Thread Wei Liu
On Mon, 2012-03-05 at 21:49 +, Santosh Jodh wrote:
> From: Santosh Jodh 
> 
> Add support for multi page ring for block devices.
> The number of pages is configurable for blkback via module parameter.
> blkback reports max-ring-page-order to blkfront via xenstore.
> blkfront reports its supported ring-page-order to blkback via xenstore.
> blkfront reports multi page ring references via ring-refNN in xenstore.
> The change allows newer blkfront to work with older blkback and
> vice-versa.
> Based on original patch by Paul Durrant.
> 
> Signed-off-by: Santosh Jodh 


Doesn't the xenbus interface change deserve a separate patch (as a
prerequisite for the block device change)? Or at least please mention the
change in the commit message?


Wei.



Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-14 Thread Justin Gibbs

On Mar 7, 2012, at 2:33 AM, Jan Beulich wrote:

> >>> On 06.03.12 at 18:20, Konrad Rzeszutek Wilk  wrote:
>> -> XENBUS_MAX_RING_PAGES - why 2? Why not 4? What is the optimal
>> default size for SSD usage? 16?
> 
> What do SSDs have to do with a XenBus definition? Imo it's wrong (and
> unnecessary) to introduce a limit at the XenBus level at all - each driver
> can do this for itself.
> 
> As to the limit for SSDs in the block interface - I don't think the number
> of possibly simultaneous requests has anything to do with this. Instead,
> I'd expect the request number/size/segments extension that NetBSD
> apparently implements to possibly have an effect.
> 
> Jan

There's another problem here that I brought up during the Xen
Hack-a-thon.  The ring macros require that the ring element count
be a power of two.  This doesn't mean that the ring will be a power
of 2 pages in size.  To illustrate this point, I modified the FreeBSD
blkback driver to provide negotiated ring stats via sysctl.

Here's a connection to a Windows VM running the Citrix PV drivers:

dev.xbbd.2.max_requests: 128
dev.xbbd.2.max_request_segments: 11
dev.xbbd.2.max_request_size: 45056
dev.xbbd.2.ring_elem_size: 108  <= 32bit ABI
dev.xbbd.2.ring_pages: 4
dev.xbbd.2.ring_elements: 128
dev.xbbd.2.ring_waste: 2496

Over half a page is wasted when ring-page-order is 2.  I'm sure you
can see where this is going.  :-)
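
(For anyone checking the math: the numbers above fall straight out of the
ring macros' round-down-to-power-of-two element count - a sketch, assuming
a 4k page and the usual 64-byte shared ring header:)

    #include <stdio.h>

    static unsigned rd_pow2(unsigned x)      /* round down to a power of two */
    {
        unsigned p = 1;
        while (p * 2 <= x)
            p *= 2;
        return p;
    }

    int main(void)
    {
        unsigned pages = 4, elem = 108, hdr = 64; /* the 32-bit ABI case above */
        unsigned space = pages * 4096 - hdr;
        unsigned elems = rd_pow2(space / elem);

        printf("ring_elements = %u\n", elems);                 /* 128  */
        printf("ring_waste    = %u\n", space - elems * elem);  /* 2496 */
        return 0;
    }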

Here are the limits published by our backend to the XenStore:

max-ring-pages = "113"
max-ring-page-order = "7"
max-requests = "256"
max-request-segments = "129"
max-request-size = "524288"

Because we allow so many concurrent, large requests in our product,
the ring wastage really adds up if the front end doesn't support
the "ring-pages" variant of the extension.  However, you only need
a ring-page-order of 3 with this protocol to start seeing pages of
wasted ring space.

You don't really want to negotiate "ring-pages" either.  The backends
often need to support multiple ABIs.  I can easily construct a set
of limits for the FreeBSD blkback driver which will cause the ring
limits to vary by a page between the 32bit and 64bit ABIs.

With all this in mind, the backend must do a dance of rounding up,
taking the max of the ring sizes for the different ABIs, and then
validating the front-end published limits taking its ABI into
account.  The front-end does some of this too.  It's way too messy
and error-prone because we don't communicate the ring element limit
directly.

"max-ring-element-order" anyone? :-)

--
Justin


Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-14 Thread Justin Gibbs
On Mar 14, 2012, at 9:34 AM, Jan Beulich wrote:

> >>> On 14.03.12 at 07:32, Justin Gibbs  wrote:
>> There's another problem here that I brought up during the Xen
>> Hack-a-thon.  The ring macros require that the ring element count
>> be a power of two.  This doesn't mean that the ring will be a power
>> of 2 pages in size.  To illustrate this point, I modified the FreeBSD
>> blkback driver to provide negotiated ring stats via sysctl.
>> 
>> Here's a connection to a Windows VM running the Citrix PV drivers:
>> 
>>dev.xbbd.2.max_requests: 128
>>dev.xbbd.2.max_request_segments: 11
>>dev.xbbd.2.max_request_size: 45056
>>dev.xbbd.2.ring_elem_size: 108  <= 32bit ABI
>>dev.xbbd.2.ring_pages: 4
>>dev.xbbd.2.ring_elements: 128
>>dev.xbbd.2.ring_waste: 2496
>> 
>> Over half a page is wasted when ring-page-order is 2.  I'm sure you
>> can see where this is going.  :-)
> 
> Having looked a little closer at how the wasted space progresses,
> I find myself in the odd position that I can't explain the original (and
> still active) definition of BLKIF_MAX_SEGMENTS_PER_REQUEST (11):
> with ring-order zero, 0x240/0x1c0 bytes (32-bit/64-bit
> respectively) are left unused. With 32 requests fitting in the ring, and with
> each segment occupying 6 bytes (padded to 8), in the 64-bit variant
> there's enough space for a 12th segment (32-bit would even have
> space for a 13th). Am I missing anything here?

I don't profess to know the real reason, but the only thing I can come up
with is a requirement/desire on some platforms for 16-byte alignment
of the request structures.  This would make the largest possible structure
112 bytes, not the 120 that would allow for more elements.

While we're talking about fixing ring data structures, can RING_IDX
be defined as a "uint32_t" instead of "unsigned int"?  The structure
padding in the ring macros assumes RING_IDX is exactly 4 bytes,
so this should be made explicit.  ILP64 machines may still be a ways
out, but the use of non-fixed-size types in places where size really
matters just isn't clean.
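
(Something as small as this would do it - sketch only; in the real headers
the fixed-width type would of course come from the Xen ABI headers rather
than <stdint.h>:)

    #include <stdint.h>

    typedef uint32_t RING_IDX;   /* instead of "unsigned int" */

    /* Build-time check that needs no OS-specific BUILD_BUG_ON macro. */
    typedef char ring_idx_is_4_bytes[sizeof(RING_IDX) == 4 ? 1 : -1];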

> 
>> Here are the limits published by our backend to the XenStore:
>> 
>>max-ring-pages = "113"
>>max-ring-page-order = "7"
>>max-requests = "256"
>>max-request-segments = "129"
>>max-request-size = "524288"
> 
> Oh, so this protocol doesn't require ring-pages (and max-ring-pages)
> to be a power of two? In which case I think it is a mistake to also
> advertise max-ring-page-order, as at least the (Linux) frontend code
> I know of interprets this as being able to set up a ring of (using the
> numbers above) 128 pages (unless, of course, your backend can deal
> with this regardless of the max-ring-pages value it announces).

The advertised max-ring-pages is sufficient to hold the maximum allowed
number of ring elements regardless of ABI.  This is then rounded up to the
next power of 2 pages to get the max-ring-page-order.  When the front-end
negotiates, the backend just verifies that the maximum number of ring
elements in the specified ring size doesn't exceed the backend's limit.
Fortunately, even with this large of a ring, regardless of ABI, a given
page order computes to the same number of ring elements.  You just have
more wasted space.
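
(Roughly, the policy reads like this - illustrative only, with made-up
helper names, not the actual FreeBSD blkback code; per-entry sizes for our
extended protocol would differ from the plain 32/64-bit ones used here:)

    #define RING_HDR  64       /* shared ring header                  */
    #define ELEM_MAX  112      /* fattest per-entry size across ABIs  */
    #define PAGE_SZ   4096

    unsigned pages_for(unsigned max_elems)         /* -> max-ring-pages      */
    {
        return (RING_HDR + max_elems * ELEM_MAX + PAGE_SZ - 1) / PAGE_SZ;
    }

    unsigned order_for(unsigned max_elems)         /* -> max-ring-page-order */
    {
        unsigned order = 0;
        while ((1u << order) < pages_for(max_elems))
            order++;
        return order;
    }

    /* At connect time: the element count implied by the frontend's ring
     * size (rounded down to a power of two, as the ring macros do) must
     * not exceed the backend's limit. */
    int ring_size_ok(unsigned ring_pages, unsigned elem_size, unsigned max_elems)
    {
        unsigned avail = ring_pages * PAGE_SZ - RING_HDR;
        unsigned elems = 1;
        while (elems * 2 <= avail / elem_size)
            elems *= 2;
        return elems <= max_elems;
    }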

--
Justin


Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-14 Thread Justin T. Gibbs
On Mar 6, 2012, at 1:34 AM, Jan Beulich wrote:

> >>> On 05.03.12 at 22:49, Santosh Jodh  wrote:

…

>> +   }
>> +
>>/* Create shared ring, alloc event channel. */
>>err = setup_blkring(dev, info);
>>if (err)
>> @@ -889,12 +916,35 @@ again:
>>goto destroy_blkring;
>>}
>> 
>> -   err = xenbus_printf(xbt, dev->nodename,
>> -   "ring-ref", "%u", info->ring_ref);
>> -   if (err) {
>> -   message = "writing ring-ref";
>> -   goto abort_transaction;
>> +   if (legacy_backend) {
> 
> Why not use the simpler interface always when info->ring_order == 0?

Because, as I just found out today via a FreeBSD bug report, that's
not how XenServer works.  If the front-end publishes "ring-page-order",
the backend assumes the "ring-refNN" XenStore nodes are in effect,
even if the order is 0.
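
(So, in XenStore terms, the two layouts a frontend can publish look roughly
like this - the values, and the exact spelling of the NN suffix, are made up
for illustration:)

    ring-ref = "8"               <= legacy: no "ring-page-order" node at all

    ring-page-order = "0"        <= once this node exists, even at order 0,
    ring-ref0 = "8"                 XenServer expects the suffixed ring-refNN nodes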

I'm working on a documentation update for blkif.h now.



--
Justin


Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-15 Thread Jan Beulich
>>> On 14.03.12 at 18:01, Justin Gibbs  wrote:
> While we're talking about fixing ring data structures, can RING_IDX
> be defined as a "uint32_t" instead of "unsigned int"?  The structure
> padding in the ring macros assumes RING_IDX is exactly 4 bytes,
> so this should be made explicit.  ILP64 machines may still be a ways
> out, but the use of non-fixed-size types in places where size really
> matters just isn't clean.

Yes, if we're going to rev the interface, then any such flaws should be
corrected.

(Also shrinking the Cc list a little.)

Jan



Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-15 Thread Jan Beulich
>>> On 14.03.12 at 18:17, "Justin T. Gibbs"  wrote:
> On Mar 6, 2012, at 1:34 AM, Jan Beulich wrote:
> 
>> >>> On 05.03.12 at 22:49, Santosh Jodh  wrote:
> 
> …
> 
>>> +   }
>>> +
>>>/* Create shared ring, alloc event channel. */
>>>err = setup_blkring(dev, info);
>>>if (err)
>>> @@ -889,12 +916,35 @@ again:
>>>goto destroy_blkring;
>>>}
>>> 
>>> -   err = xenbus_printf(xbt, dev->nodename,
>>> -   "ring-ref", "%u", info->ring_ref);
>>> -   if (err) {
>>> -   message = "writing ring-ref";
>>> -   goto abort_transaction;
>>> +   if (legacy_backend) {
>> 
>> Why not use the simpler interface always when info->ring_order == 0?
> 
> Because, as I just found out today via a FreeBSD bug report, that's
> not how XenServer works.  If the front-end publishes "ring-page-order",
> the backend assumes the "ring-refNN" XenStore nodes are in effect,
> even if the order is 0.

I was certainly implying that the ring-page-order and num-ring-pages
nodes would not be written in that case.

> I'm working on a documentation update for blkif.h now.
> 
> 
> 
> --
> Justin



Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-15 Thread Ian Campbell
On Thu, 2012-03-15 at 08:03 +, Jan Beulich wrote:
> >>> On 14.03.12 at 18:01, Justin Gibbs  wrote:
> > While we're talking about fixing ring data structures, can RING_IDX
> > be defined as a "uint32_t" instead of "unsigned int"?  The structure
> > padding in the ring macros assumes RING_IDX is exactly 4 bytes,
> > so this should be made explicit.  ILP64 machines may still be a ways
> > out, but the use of non-fixed-size types in places where size really
> > matters just isn't clean.
> 
> Yes, if we're going to rev the interface, then any such flaws should be
> corrected.

There has been talk of doing something similar for netif too. IIRC the
netchannel2 work included a new generic ring scheme with support for
variable sized req/rsp elements and such.

If we are going to rev the rings then should we try and use a common
ring mechanism? I think so. If so then we could do worse than to start
from the netchannel2 ring stuff and/or concepts?

Looks like that is
http://xenbits.xen.org/ext/netchannel2/linux-2.6.18/log/075f6677a290/include/xen/interface/io/uring.h
still a bit nc2 specific though.

Ian.



Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-15 Thread Jan Beulich
>>> On 15.03.12 at 09:51, Ian Campbell  wrote:
> On Thu, 2012-03-15 at 08:03 +, Jan Beulich wrote:
>> >>> On 14.03.12 at 18:01, Justin Gibbs  wrote:
>> > While we're talking about fixing ring data structures, can RING_IDX
>> > be defined as a "uint32_t" instead of "unsigned int"?  The structure
>> > padding in the ring macros assumes RING_IDX is exactly 4 bytes,
>> > so this should be made explicit.  ILP64 machines may still be a ways
>> > out, but the use of non-fixed-size types in places where size really
>> > matters just isn't clean.
>> 
>> Yes, if we're going to rev the interface, then any such flaws should be
>> corrected.
> 
> There has been talk of doing something similar for netif too. IIRC the
> netchannel2 work included a new generic ring scheme with support for
> variable sized req/rsp elements and such.
> 
> If we are going to rev the rings then should we try and use a common
> ring mechanism? I think so. If so then we could do worse than to start
> from the netchannel2 ring stuff and/or concepts?
> 
> Looks like that is
> http://xenbits.xen.org/ext/netchannel2/linux-2.6.18/log/075f6677a290/include 
> /xen/interface/io/uring.h
> still a bit nc2 specific though.

Taking the concept (and the implementation as a starting point) would
seem like a good idea to me. Separate request and reply rings as well
as variable size entries would certainly benefit blkif too.
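
(Very roughly, the shape such a thing takes - one ring per direction and a
length in every message, so requests and responses no longer share a single
fixed-size slot. This is only a concept sketch, not the actual netchannel2
uring.h layout:)

    #include <stdint.h>

    struct vring_hdr {                 /* one per direction              */
        uint32_t prod;                 /* producer byte offset           */
        uint32_t cons;                 /* consumer byte offset           */
    };

    struct vring_msg {
        uint16_t type;                 /* message type                   */
        uint16_t len;                  /* total length, header included  */
        uint8_t  payload[];            /* variable-size body follows     */
    };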

Jan
