Re: [Xen-devel] [PATCH 0/5] Towards a restartable oxenstored

2017-04-07 Thread Jonathan Davies

On 07/04/17 15:21, Juergen Gross wrote:

Is the data format of the saved state documented somewhere? I'd
appreciate any pointer or (in case there is no such doc) a followup
patch to add the documentation.

This would help later in case C-xenstored learns restartability, too,
and someone would like to switch the xenstored type on a running
system.


That's a good idea. No, it's not currently documented as far as I'm aware.

My impression is that the existing bits of oxenstored code that produced 
this output have never been used in practice. So now would be a good 
time to review the format of the saved state before it potentially gets 
more widely used.


Thanks,
Jonathan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/5] Towards a restartable oxenstored

2017-04-07 Thread Jonathan Davies

On 07/04/17 15:15, Andrew Cooper wrote:

On 07/04/17 14:27, Jonathan Davies wrote:

With these patches, the state saved on exit contains information about
inter-domain connections and active watches, and the contents of the
store. Some internal state is not currently preserved over a restart:
  1. quota usage info
  2. partially-read and already-queued packets from rings
  3. recent transaction history

Items 1 and 2 are probably needed for this to be considered fully
functional, so fixes for them should follow. But the bugfixes in this
series already represent a worthwhile improvement.


Shouldn't quota usage be recalculated as part of parsing the stashed
database?  After all, the database is the canonical state of the world.
What would you do if the database and the stashed quota differ?


You're right. This would be recalculated when loading the saved state.

I had in mind that there were some quotas of "dynamic" things that 
aren't derived from the contents alone, but I can't spot them in the 
code so must have been imagining them.


So there's no need to dump quota state to disk.

Thanks,
Jonathan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 1/5] oxenstored: initialise logging earlier

2017-04-07 Thread Jonathan Davies
Otherwise we miss out on messages from things that try to log earlier in
the start-up procedure.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
---
 tools/ocaml/xenstored/xenstored.ml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/ocaml/xenstored/xenstored.ml 
b/tools/ocaml/xenstored/xenstored.ml
index 05ace4d..09da257 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -285,6 +285,8 @@ let _ =
 
let quit = ref false in
 
+   Logging.init_xenstored_log();
+
if cf.restart then (
DB.from_file store domains cons (Paths.xen_run_stored ^ "/db");
Event.bind_dom_exc_virq eventchn
@@ -311,7 +313,6 @@ let _ =
Sys.set_signal Sys.sigusr1 (Sys.Signal_handle (fun i -> sigusr1_handler 
store));
Sys.set_signal Sys.sigpipe Sys.Signal_ignore;
 
-   Logging.init_xenstored_log();
if cf.activate_access_log then begin
let post_rotate () = DB.to_file store cons 
(Paths.xen_run_stored ^ "/db") in
Logging.init_access_log post_rotate
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 4/5] oxenstored: improve event-channel binding logging

2017-04-07 Thread Jonathan Davies
It's useful to see a bit more detail when an inter-domain event-channel
is bound, especially over an oxenstored restart.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
---
 tools/ocaml/xenstored/domain.ml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/ocaml/xenstored/domain.ml b/tools/ocaml/xenstored/domain.ml
index eda2ea9..b0a01b0 100644
--- a/tools/ocaml/xenstored/domain.ml
+++ b/tools/ocaml/xenstored/domain.ml
@@ -72,7 +72,7 @@ let notify dom = match dom.port with
 
 let bind_interdomain dom =
dom.port <- Some (Event.bind_interdomain dom.eventchn dom.id 
dom.remote_port);
-   debug "domain %d bound port %s" dom.id (string_of_port dom.port)
+   debug "bound domain %d remote port %d to local port %s" dom.id 
dom.remote_port (string_of_port dom.port)
 
 
 let close dom =
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 2/5] oxenstored: avoid leading slash in paths in saved store state

2017-04-07 Thread Jonathan Davies
Internally, paths are represented as lists of strings, where
  * path "/" is represented by []
  * path "/local/domain/0" is represented by ["local"; "domain"; "0"]
(see comment for Store.Path.t).

However, the traversal function generated paths like
[""; "local"; "domain"; "0"]
because the name of the root node is "". Change it to generate paths
correctly.

Furthermore, the function passed to Store.dump_fct would render the node
"foo" under the path [] as "//foo". Change this to return "/foo".

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
---
 tools/ocaml/xenstored/store.ml | 8 +++-
 tools/ocaml/xenstored/xenstored.ml | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/ocaml/xenstored/store.ml b/tools/ocaml/xenstored/store.ml
index 9f619b8..13cf3b5 100644
--- a/tools/ocaml/xenstored/store.ml
+++ b/tools/ocaml/xenstored/store.ml
@@ -122,6 +122,11 @@ let of_string s =
| "" :: path when is_valid path -> path
| _ -> raise Define.Invalid_path
 
+let of_path_and_name path name =
+   match path, name with
+   | [], "" -> []
+   | _ -> path @ [name]
+
 let create path connection_path =
of_string (Utils.path_validate path connection_path)
 
@@ -343,7 +348,8 @@ let path_exists store path =
 let traversal root_node f =
let rec _traversal path node =
f path node;
-   List.iter (_traversal (path @ [ Symbol.to_string node.Node.name 
])) node.Node.children
+   let node_path = Path.of_path_and_name path (Symbol.to_string 
node.Node.name) in
+   List.iter (_traversal node_path) node.Node.children
in
_traversal [] root_node
 
diff --git a/tools/ocaml/xenstored/xenstored.ml 
b/tools/ocaml/xenstored/xenstored.ml
index 09da257..77fd9e3 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -213,7 +213,7 @@ let to_channel store cons chan =
(* dump the store *)
Store.dump_fct store (fun path node ->
let name, perms, value = Store.Node.unpack node in
-   let fullpath = (Store.Path.to_string path) ^ "/" ^ name in
+   let fullpath = Store.Path.to_string 
(Store.Path.of_path_and_name path name) in
let permstr = Perms.Node.to_string perms in
fprintf chan "store,%s,%s,%s\n" (hexify fullpath) (hexify 
permstr) (hexify value)
);
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 0/5] Towards a restartable oxenstored

2017-04-07 Thread Jonathan Davies
In order to make oxenstored restartable, we need to save internal state 
to a file on exit and restore from this file on startup.

Much of the infrastructure for making oxenstored restartable already 
existed, but a handful of bugs prevented it from working.

After these patches I can do the following:

# xenstore-write foo bar
# xenstore-read foo
bar
# xenstore-ls -f -p | sort > contents.before
# killall oxenstored
# ./oxenstored --restart
# xenstore-ls -f -p | sort > contents.after
# diff contents.before contents.after
# xenstore-read foo
bar

... and I can do similar kinds of activity in a guest across the 
restart.

Note that clients of local socket connections will get EPIPE or similar 
when oxenstored terminates. Hence these clients need to handle this
gracefully, e.g. by attempting to reconnect, if they wish to tolerate 
xenstored restarts.

With these patches, the state saved on exit contains information about 
inter-domain connections and active watches, and the contents of the
store. Some internal state is not currently preserved over a restart:
  1. quota usage info
  2. partially-read and already-queued packets from rings
  3. recent transaction history

Items 1 and 2 are probably needed for this to be considered fully 
functional, so fixes for them should follow. But the bugfixes in this 
series already represent a worthwhile improvement.

Jonathan Davies (5):
  oxenstored: initialise logging earlier
  oxenstored: avoid leading slash in paths in saved store state
  oxenstored: save remote evtchn port, not local port
  oxenstored: improve event-channel binding logging
  oxenstored: make --restart option best-effort

 tools/ocaml/xenstored/domain.ml|  4 ++--
 tools/ocaml/xenstored/store.ml |  8 +++-
 tools/ocaml/xenstored/xenstored.ml | 10 ++
 3 files changed, 15 insertions(+), 7 deletions(-)

-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 5/5] oxenstored: make --restart option best-effort

2017-04-07 Thread Jonathan Davies
Only attempt to restore from saved state if it exists.

Without this, oxenstored immediately exits with an exception if the
--restart option is provided but the state file is not present.

(The time-of-check to time-of-use race isn't a concern as oxenstored is
the only thing that should write the state file.)

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
---
 tools/ocaml/xenstored/xenstored.ml | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/ocaml/xenstored/xenstored.ml 
b/tools/ocaml/xenstored/xenstored.ml
index 77fd9e3..bb780d0 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -287,8 +287,9 @@ let _ =
 
Logging.init_xenstored_log();
 
-   if cf.restart then (
-   DB.from_file store domains cons (Paths.xen_run_stored ^ "/db");
+   let filename = Paths.xen_run_stored ^ "/db" in
+   if cf.restart && Sys.file_exists filename then (
+   DB.from_file store domains cons filename;
Event.bind_dom_exc_virq eventchn
) else (
if !Disk.enable then (
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 3/5] oxenstored: save remote evtchn port, not local port

2017-04-07 Thread Jonathan Davies
Previously, Domain.dump output the number of the local port
corresponding to each domain's event-channel. However, when oxenstored
exits, it closes /dev/xen/evtchn which causes the kernel to close the
local port (evtchn_release), so this port is no longer useful.

Instead, store the remote port. This can be used to reconnect the
event-channel by binding the original remote port to a fresh local port.

Indeed, the logic for parsing the stored state already expects a remote
port as it passes the parsed port number to Domain.make (via
Domains.create), which takes a remote port.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
---
 tools/ocaml/xenstored/domain.ml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/ocaml/xenstored/domain.ml b/tools/ocaml/xenstored/domain.ml
index 4515650..eda2ea9 100644
--- a/tools/ocaml/xenstored/domain.ml
+++ b/tools/ocaml/xenstored/domain.ml
@@ -62,7 +62,7 @@ let string_of_port = function
 | Some x -> string_of_int (Xeneventchn.to_int x)
 
 let dump d chan =
-   fprintf chan "dom,%d,%nd,%s\n" d.id d.mfn (string_of_port d.port)
+   fprintf chan "dom,%d,%nd,%d\n" d.id d.mfn d.remote_port
 
 let notify dom = match dom.port with
 | None ->
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] xl cpupool-numa-split: don't try to bring any dom0 vCPUs online

2016-08-12 Thread Jonathan Davies
Since commit a9dd01431a799b6743193a75f4f1ce2fdfdb7296, main_cpupoolnumasplit
wants to ensure that dom0 doesn't have more online vCPUs than the number of
pCPUs in a NUMA node.

However, if dom0 already has fewer online vCPUs than the number of pCPUs in a
NUMA node, this will cause some to be made online.

Furthermore, if dom0's maximum number of vCPUs is less than the number of pCPUs,
this will result in an error like the following:

libxl: error: libxl.c:5564:libxl_set_vcpuonline: Requested 24 VCPUs, 
however maxcpus is 12!: Function not implemented
error on removing vcpus for Domain-0

Instead, make main_cpupoolnumasplit only reduce the number of vCPUs dom0 has
online, and don't try to add any more.

This incurs an extra call to libxl_domain_info to find out the current number of
dom0 vCPUs. Conveniently, there is already an initialised libxl_dominfo that we
can use.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
---
 tools/libxl/xl_cmdimpl.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 7f961e3..d5df20d 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -8614,12 +8614,19 @@ int main_cpupoolnumasplit(int argc, char **argv)
 goto out;
 }
 
+if (libxl_domain_info(ctx, , 0)) {
+fprintf(stderr, "error on getting info for Domain-0\n");
+goto out;
+}
+
 n = 0;
 for (c = 0; c < n_cpus; c++) {
 if (topology[c].node == node) {
 topology[c].node = LIBXL_CPUTOPOLOGY_INVALID_ENTRY;
-libxl_bitmap_set(, n);
-n++;
+if (n < info.vcpu_online) {
+libxl_bitmap_set(, n);
+n++;
+}
 }
 }
 if (libxl_domain_info(ctx, , 0)) {
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2] oxenstored: allow compilation prior to OCaml 3.12.0

2016-03-30 Thread Jonathan Davies
Commit 363ae55c8 used an OCaml feature called record field punning. This broke
the build on compilers prior to OCaml 3.12.0.

This patch makes no semantic change but now uses backwards-compatible syntax.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
Reported-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrov...@oracle.com>

---
Changed since v1:
 * names listed in the preamble
---
 tools/ocaml/xenstored/process.ml |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index fb5fdaf..7b60376 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -484,7 +484,7 @@ let do_input store cons doms con =
if newpacket then (
let packet = Connection.pop_in con in
let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
-   let req = {Packet.tid; Packet.rid; Packet.ty; Packet.data} in
+   let req = {Packet.tid=tid; Packet.rid=rid; Packet.ty=ty; 
Packet.data=data} in
 
(* As we don't log IO, do not call an unnecessary sanitize_data 
   info "[%s] -> [%d] %s \"%s\""
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] oxenstored: allow compilation prior to OCaml 3.12.0

2016-03-30 Thread Jonathan Davies

On 30/03/2016 4:54 PM, Boris Ostrovsky wrote:
> On 03/30/2016 11:47 AM, Jonathan Davies wrote:
> > Commit 363ae55c8 used an OCaml feature called record field punning.
> > This broke the build on compilers prior to OCaml 3.12.0.
> >
> > This patch makes no semantic change but now uses backwards-compatible
> syntax.
> >
> > Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
> > Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
> 
> This should be Tested-by (and maybe Reported-by). Reviewed-by implies
> that I understood what you did. I make no such claim ;-)
> 
> -boris

My apologies!

Corrected version coming up.

Jonathan

> > ---
> >   tools/ocaml/xenstored/process.ml |2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/ocaml/xenstored/process.ml
> > b/tools/ocaml/xenstored/process.ml
> > index fb5fdaf..7b60376 100644
> > --- a/tools/ocaml/xenstored/process.ml
> > +++ b/tools/ocaml/xenstored/process.ml
> > @@ -484,7 +484,7 @@ let do_input store cons doms con =
> > if newpacket then (
> > let packet = Connection.pop_in con in
> > let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
> > -   let req = {Packet.tid; Packet.rid; Packet.ty; Packet.data} in
> > +   let req = {Packet.tid=tid; Packet.rid=rid; Packet.ty=ty;
> > +Packet.data=data} in
> >
> > (* As we don't log IO, do not call an unnecessary
> sanitize_data
> >info "[%s] -> [%d] %s \"%s\""
> 
> 
> ___
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/7] oxenstored: refactor request processing

2016-03-30 Thread Jonathan Davies
On Tue, Mar 29, 2016 at 08:41:54PM +0100, David Scott wrote:
> 
> > On 29 Mar 2016, at 17:38, Wei Liu <wei.l...@citrix.com> wrote:
> > 
> > On Tue, Mar 29, 2016 at 10:08:30AM +0100, Jonathan Davies wrote:
> >> On Thu, Mar 24, 2016 at 07:57:30PM -0400, Boris Ostrovsky wrote:
> >>> On 03/24/2016 06:49 PM, Andrew Cooper wrote:
> >>>> On 24/03/2016 22:22, Boris Ostrovsky wrote:
> >>>>> On 03/17/2016 01:51 PM, Jonathan Davies wrote:
> >>>>>> Encapsulate the request in a record that is passed from do_input to
> >>>>>> process_packet and input_handle_error.
> >>>>>> 
> >>>>>> This will be helpful when keeping track of the requests made as part
> >>>>>> of a
> >>>>>> transaction.
> >>>>>> 
> >>>>>> Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
> >>>>>> Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com>
> >>>>>> Reviewed-by: Jon Ludlam <jonathan.lud...@citrix.com>
> >>>>>> Reviewed-by: Euan Harris <euan.har...@citrix.com>
> >>>>>> ---
> >>>>>>  tools/ocaml/xenstored/process.ml |   15 ++-
> >>>>>>  1 file changed, 10 insertions(+), 5 deletions(-)
> >>>>>> 
> >>>>>> diff --git a/tools/ocaml/xenstored/process.ml
> >>>>>> b/tools/ocaml/xenstored/process.ml
> >>>>>> index 7a73669..c92bec7 100644
> >>>>>> --- a/tools/ocaml/xenstored/process.ml
> >>>>>> +++ b/tools/ocaml/xenstored/process.ml
> >>>>>> @@ -344,11 +344,11 @@ let function_of_type ty =
> >>>>>>  | Xenbus.Xb.Op.Invalid   -> reply_ack do_error
> >>>>>>  | _  -> reply_ack do_error
> >>>>>>  -let input_handle_error ~cons ~doms ~fct ~ty ~con ~t ~rid ~data =
> >>>>>> +let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
> >>>>>>  let reply_error e =
> >>>>>>  Packet.Error e in
> >>>>>>  try
> >>>>>> -fct con t doms cons data
> >>>>>> +fct con t doms cons req.Packet.data
> >>>>>>  with
> >>>>>>  | Define.Invalid_path  -> reply_error "EINVAL"
> >>>>>>  | Define.Already_exist -> reply_error "EEXIST"
> >>>>>> @@ -370,7 +370,10 @@ let input_handle_error ~cons ~doms ~fct ~ty ~con
> >>>>>> ~t ~rid ~data =
> >>>>>>  (**
> >>>>>>   * Nothrow guarantee.
> >>>>>>   *)
> >>>>>> -let process_packet ~store ~cons ~doms ~con ~tid ~rid ~ty ~data =
> >>>>>> +let process_packet ~store ~cons ~doms ~con ~req =
> >>>>>> +let ty = req.Packet.ty in
> >>>>>> +let tid = req.Packet.tid in
> >>>>>> +let rid = req.Packet.rid in
> >>>>>>  try
> >>>>>>  let fct = function_of_type ty in
> >>>>>>  let t =
> >>>>>> @@ -379,7 +382,7 @@ let process_packet ~store ~cons ~doms ~con ~tid
> >>>>>> ~rid ~ty ~data =
> >>>>>>  else
> >>>>>>  Connection.get_transaction con tid
> >>>>>>  in
> >>>>>> -let response = input_handle_error ~cons ~doms ~fct ~ty ~con
> >>>>>> ~t ~rid ~data in
> >>>>>> +let response = input_handle_error ~cons ~doms ~fct ~con ~t
> >>>>>> ~req in
> >>>>>>(* Put the response on the wire *)
> >>>>>>  send_response ty con t rid response
> >>>>>> @@ -412,11 +415,13 @@ let do_input store cons doms con =
> >>>>>>  if newpacket then (
> >>>>>>  let packet = Connection.pop_in con in
> >>>>>>  let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
> >>>>>> +let req = {Packet.tid; Packet.rid; Packet.ty; Packet.data} in
> >>>>>> +
> >>>>> I think this change breaks the build with older ocaml versions:
> >>>>> 
> >>>>> root@haswell> ocam

[Xen-devel] [PATCH] oxenstored: allow compilation prior to OCaml 3.12.0

2016-03-30 Thread Jonathan Davies
Commit 363ae55c8 used an OCaml feature called record field punning. This broke
the build on compilers prior to OCaml 3.12.0.

This patch makes no semantic change but now uses backwards-compatible syntax.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 tools/ocaml/xenstored/process.ml |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index fb5fdaf..7b60376 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -484,7 +484,7 @@ let do_input store cons doms con =
if newpacket then (
let packet = Connection.pop_in con in
let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
-   let req = {Packet.tid; Packet.rid; Packet.ty; Packet.data} in
+   let req = {Packet.tid=tid; Packet.rid=rid; Packet.ty=ty; 
Packet.data=data} in
 
(* As we don't log IO, do not call an unnecessary sanitize_data 
   info "[%s] -> [%d] %s \"%s\""
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/7] oxenstored: refactor request processing

2016-03-29 Thread Jonathan Davies
On Thu, Mar 24, 2016 at 07:57:30PM -0400, Boris Ostrovsky wrote:
> On 03/24/2016 06:49 PM, Andrew Cooper wrote:
> >On 24/03/2016 22:22, Boris Ostrovsky wrote:
> >>On 03/17/2016 01:51 PM, Jonathan Davies wrote:
> >>>Encapsulate the request in a record that is passed from do_input to
> >>>process_packet and input_handle_error.
> >>>
> >>>This will be helpful when keeping track of the requests made as part
> >>>of a
> >>>transaction.
> >>>
> >>>Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
> >>>Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com>
> >>>Reviewed-by: Jon Ludlam <jonathan.lud...@citrix.com>
> >>>Reviewed-by: Euan Harris <euan.har...@citrix.com>
> >>>---
> >>>   tools/ocaml/xenstored/process.ml |   15 ++-
> >>>   1 file changed, 10 insertions(+), 5 deletions(-)
> >>>
> >>>diff --git a/tools/ocaml/xenstored/process.ml
> >>>b/tools/ocaml/xenstored/process.ml
> >>>index 7a73669..c92bec7 100644
> >>>--- a/tools/ocaml/xenstored/process.ml
> >>>+++ b/tools/ocaml/xenstored/process.ml
> >>>@@ -344,11 +344,11 @@ let function_of_type ty =
> >>>   | Xenbus.Xb.Op.Invalid   -> reply_ack do_error
> >>>   | _  -> reply_ack do_error
> >>>   -let input_handle_error ~cons ~doms ~fct ~ty ~con ~t ~rid ~data =
> >>>+let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
> >>>   let reply_error e =
> >>>   Packet.Error e in
> >>>   try
> >>>-fct con t doms cons data
> >>>+fct con t doms cons req.Packet.data
> >>>   with
> >>>   | Define.Invalid_path  -> reply_error "EINVAL"
> >>>   | Define.Already_exist -> reply_error "EEXIST"
> >>>@@ -370,7 +370,10 @@ let input_handle_error ~cons ~doms ~fct ~ty ~con
> >>>~t ~rid ~data =
> >>>   (**
> >>>* Nothrow guarantee.
> >>>*)
> >>>-let process_packet ~store ~cons ~doms ~con ~tid ~rid ~ty ~data =
> >>>+let process_packet ~store ~cons ~doms ~con ~req =
> >>>+let ty = req.Packet.ty in
> >>>+let tid = req.Packet.tid in
> >>>+let rid = req.Packet.rid in
> >>>   try
> >>>   let fct = function_of_type ty in
> >>>   let t =
> >>>@@ -379,7 +382,7 @@ let process_packet ~store ~cons ~doms ~con ~tid
> >>>~rid ~ty ~data =
> >>>   else
> >>>   Connection.get_transaction con tid
> >>>   in
> >>>-let response = input_handle_error ~cons ~doms ~fct ~ty ~con
> >>>~t ~rid ~data in
> >>>+let response = input_handle_error ~cons ~doms ~fct ~con ~t
> >>>~req in
> >>> (* Put the response on the wire *)
> >>>   send_response ty con t rid response
> >>>@@ -412,11 +415,13 @@ let do_input store cons doms con =
> >>>   if newpacket then (
> >>>   let packet = Connection.pop_in con in
> >>>   let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
> >>>+let req = {Packet.tid; Packet.rid; Packet.ty; Packet.data} in
> >>>+
> >>I think this change breaks the build with older ocaml versions:
> >>
> >>root@haswell> ocamlopt -v
> >>The OCaml native-code compiler, version 4.00.1
> >>Standard library directory: /usr/lib64/ocaml
> >>root@haswell> ocamlopt -g -ccopt "" -dtypes -I
> >>/root/tmp/xen/tools/ocaml/xenstored/../libs/xb -I
> >>/root/tmp/xen/tools/ocaml/xenstored/../libs/mmap -I
> >>/root/tmp/xen/tools/ocaml/xenstored/../libs/xc -I
> >>/root/tmp/xen/tools/ocaml/xenstored/../libs/eventchn -cc gcc -w F
> >>-warn-error F -c -o process.cmx process.ml
> >>root@haswell>
> >>
> >>
> >>root@ovs104> ocamlopt -v
> >>The Objective Caml native-code compiler, version 3.11.2
> >>Standard library directory: /usr/lib64/ocaml
> >>root@ovs104> ocamlopt -g -ccopt "" -dtypes -I
> >>/root/tmp/xen/tools/ocaml/xenstored/../libs/xb -I
> >>/root/tmp/xen/tools/ocaml/xenstored/../libs/mmap -I
> >>/root/tmp/xen/tools/ocaml/xenstored/../libs/xc -I
> >>/root/tmp/xen/tools/ocaml/xenstored/../libs/eventchn -cc gcc -

[Xen-devel] [PATCH 6/7] oxenstored: replay transaction upon conflict

2016-03-20 Thread Jonathan Davies
The existing transaction merge algorithm keeps track of the least upper bound
(longest common prefix) of all the nodes which have been read and written, and
will re-combine two stores which have disjoint upper bounds. This works well for
small transactions but causes unnecessary conflicts for ones that span a large
subtree, such as the following ones used by the xapi toolstack:

 * VM start: creates /vm/... /vss/... /local/domain/...
   The least upper bound of this transaction is / and so all
   these transactions conflict with everything.

 * Device hotplug: creates /local/domain/0/... /local/domain/n/...
   The least upper bound of this transaction is /local/domain so
   all these transactions conflict with each other.

If the existing merge algorithm cannot merge and commit, we attempt
a /replay/ of the failed transaction against the new store.

When we replay the requests we check whether the response sent to the client is
the same as during the first attempt at the transaction. If the responses are
all the same then the transaction replay can be committed. If any differ then
the transaction replay must be aborted and the client must retry.

This algorithm uses the intuition that the transactions made by the toolstack
are designed to be for separate domains, and should fundamentally not conflict
in the sense that they don't read or write any shared keys. By replaying the
transaction on the server side we do what the client would have to do anyway,
only we can do it quickly without allowing any other requests to interfere.

Performing 300 parallel simulated VM start and shutdowns without this code:

300 parallel starts and shutdowns: 268.92

Performing 300 parallel simulated VM start and shutdowns with this code:

300 parallel starts and shutdowns: 3.80

Signed-off-by: Dave Scott <d...@recoil.org>
Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.lud...@citrix.com>
Reviewed-by: Euan Harris <euan.har...@citrix.com>
---
 tools/ocaml/xenstored/connection.ml |5 -
 tools/ocaml/xenstored/process.ml|   33 +
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/tools/ocaml/xenstored/connection.ml 
b/tools/ocaml/xenstored/connection.ml
index 0a2c481..b18336f 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -233,7 +233,10 @@ let end_transaction con tid commit =
let trans = Hashtbl.find con.transactions tid in
Hashtbl.remove con.transactions tid;
Logging.end_transaction ~tid ~con:(get_domstr con);
-   if commit then Transaction.commit ~con:(get_domstr con) trans else true
+   match commit with
+   | None -> true
+   | Some transaction_replay_f ->
+   Transaction.commit ~con:(get_domstr con) trans || 
transaction_replay_f con trans
 
 let get_transaction con tid =
Hashtbl.find con.transactions tid
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 39ae71b..6d1f551 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -281,6 +281,38 @@ let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
| (Failure "int_of_string")-> reply_error "EINVAL"
| Define.Unknown_operation -> reply_error "ENOSYS"
 
+(* Replay a stored transaction against a fresh store, check the responses are
+   all equivalent: if so, commit the transaction. Otherwise send the abort to
+   the client. *)
+let transaction_replay c t doms cons =
+   match t.Transaction.ty with
+   | Transaction.No ->
+   error "attempted to replay a non-full transaction";
+   false
+   | Transaction.Full(id, oldroot, cstore) ->
+   let tid = Connection.start_transaction c cstore in
+   let new_t = Transaction.make tid cstore in
+   let con = sprintf "r(%d):%s" id (Connection.get_domstr c) in
+   let perform_exn (request, response) =
+   let fct = function_of_type_simple_op request.Packet.ty 
in
+   let response' = input_handle_error ~cons ~doms ~fct 
~con:c ~t:new_t ~req:request in
+   if not(Packet.response_equal response response') then 
raise Transaction_again in
+   finally
+   (fun () ->
+   try
+   Logging.start_transaction ~con ~tid;
+   List.iter perform_exn 
(Transaction.get_operations t);
+   Logging.end_transaction ~con ~tid;
+
+   Transaction.commit ~con new_t
+   with e ->
+   info "transaction_replay %d caught: %s"

[Xen-devel] [PATCH 4/7] oxenstored: keep track of each transaction's operations

2016-03-19 Thread Jonathan Davies
A list of (request, response) pairs from the operations performed within the
transaction will be useful to support transaction replay.

Since this consumes memory, the number of requests per transaction must not be
left unbounded. Hence a new quota for this is introduced. This quota, configured
via the configuration key 'quota-maxrequests', limits the size of transactions
initiated by domUs.

After the maximum number of requests has been exhausted, any further requests
will result in EQUOTA errors. The client may then choose to end the transaction;
a successful commit will result in the retention of only the prior requests.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.lud...@citrix.com>
Reviewed-by: Euan Harris <euan.har...@citrix.com>
---
 tools/ocaml/xenstored/define.ml   |1 +
 tools/ocaml/xenstored/oxenstored.conf |1 +
 tools/ocaml/xenstored/process.ml  |   13 +++--
 tools/ocaml/xenstored/transaction.ml  |   21 +++--
 tools/ocaml/xenstored/xenstored.ml|1 +
 5 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index 89a6aac..d60861c 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -27,6 +27,7 @@ let default_config_dir = "/etc/xen"
 
 let maxwatch = ref (50)
 let maxtransaction = ref (20)
+let maxrequests = ref (-1)   (* maximum requests per transaction *)
 
 let domid_self = 0x7FF0
 
diff --git a/tools/ocaml/xenstored/oxenstored.conf 
b/tools/ocaml/xenstored/oxenstored.conf
index dd20eda..ac60f49 100644
--- a/tools/ocaml/xenstored/oxenstored.conf
+++ b/tools/ocaml/xenstored/oxenstored.conf
@@ -18,6 +18,7 @@ quota-maxentity = 1000
 quota-maxsize = 2048
 quota-maxwatch = 100
 quota-transaction = 10
+quota-maxrequests = 1024
 
 # Activate filed base backend
 persistent = false
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index c92bec7..758ade1 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -155,7 +155,7 @@ let do_transaction_end con t domains cons data =
if not success then
raise Transaction_again;
if commit then
-   process_watch (List.rev (Transaction.get_ops t)) cons
+   process_watch (List.rev (Transaction.get_paths t)) cons
 
 let do_introduce con t domains cons data =
if not (Connection.is_dom0 con)
@@ -303,7 +303,7 @@ let reply_ack fct con t doms cons data =
fct con t doms cons data;
Packet.Ack (fun () ->
if Transaction.get_id t = Transaction.none then
-   process_watch (Transaction.get_ops t) cons
+   process_watch (Transaction.get_paths t) cons
)
 
 let reply_data fct con t doms cons data =
@@ -384,6 +384,15 @@ let process_packet ~store ~cons ~doms ~con ~req =
in
let response = input_handle_error ~cons ~doms ~fct ~con ~t ~req 
in
 
+   let response = try
+   if tid <> Transaction.none then
+   (* Remember the request and response for this 
operation in case we need to replay the transaction *)
+   Transaction.add_operation 
~perm:(Connection.get_perm con) t req response;
+   response
+   with Quota.Limit_reached ->
+   Packet.Error "EQUOTA"
+   in
+
(* Put the response on the wire *)
send_response ty con t rid response
with exn ->
diff --git a/tools/ocaml/xenstored/transaction.ml 
b/tools/ocaml/xenstored/transaction.ml
index 77de4e8..6b37fc2 100644
--- a/tools/ocaml/xenstored/transaction.ml
+++ b/tools/ocaml/xenstored/transaction.ml
@@ -75,7 +75,8 @@ type t = {
ty: ty;
store: Store.t;
quota: Quota.t;
-   mutable ops: (Xenbus.Xb.Op.operation * Store.Path.t) list;
+   mutable paths: (Xenbus.Xb.Op.operation * Store.Path.t) list;
+   mutable operations: (Packet.request * Packet.response) list;
mutable read_lowpath: Store.Path.t option;
mutable write_lowpath: Store.Path.t option;
 }
@@ -86,16 +87,24 @@ let make id store =
ty = ty;
store = if id = none then store else Store.copy store;
quota = Quota.copy store.Store.quota;
-   ops = [];
+   paths = [];
+   operations = [];
read_lowpath = None;
write_lowpath = None;
}
 
 let get_id t = match t.ty with No -> none | Full (id, _, _) -> id
 let get_store t = t.store
-let get_ops t = t.ops
-
-let add_wop t ty path = t.ops <- (ty, path) :: t.ops
+let get_paths t = t.paths
+
+let add_wop t ty path = t.paths 

[Xen-devel] [PATCH 7/7] oxenstored: log request and response during transaction replay

2016-03-19 Thread Jonathan Davies
During a transaction replay, the replayed requests and the new responses are
logged in the same way as the original requests and the original responses.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.lud...@citrix.com>
Reviewed-by: Euan Harris <euan.har...@citrix.com>
---
 tools/ocaml/xenstored/process.ml |   24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 6d1f551..fb5fdaf 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -281,6 +281,18 @@ let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
| (Failure "int_of_string")-> reply_error "EINVAL"
| Define.Unknown_operation -> reply_error "ENOSYS"
 
+let write_access_log ~ty ~tid ~con ~data =
+   Logging.xb_op ~ty ~tid ~con data
+
+let write_answer_log ~ty ~tid ~con ~data =
+   Logging.xb_answer ~ty ~tid ~con data
+
+let write_response_log ~ty ~tid ~con ~response =
+   match response with
+   | Packet.Ack _   -> write_answer_log ~ty ~tid ~con ~data:""
+   | Packet.Reply x -> write_answer_log ~ty ~tid ~con ~data:x
+   | Packet.Error e -> write_answer_log ~ty:(Xenbus.Xb.Op.Error) ~tid ~con 
~data:e
+
 (* Replay a stored transaction against a fresh store, check the responses are
all equivalent: if so, commit the transaction. Otherwise send the abort to
the client. *)
@@ -294,8 +306,10 @@ let transaction_replay c t doms cons =
let new_t = Transaction.make tid cstore in
let con = sprintf "r(%d):%s" id (Connection.get_domstr c) in
let perform_exn (request, response) =
+   write_access_log ~ty:request.Packet.ty ~tid ~con 
~data:request.Packet.data;
let fct = function_of_type_simple_op request.Packet.ty 
in
let response' = input_handle_error ~cons ~doms ~fct 
~con:c ~t:new_t ~req:request in
+   write_response_log ~ty:request.Packet.ty ~tid ~con 
~response:response';
if not(Packet.response_equal response response') then 
raise Transaction_again in
finally
(fun () ->
@@ -451,12 +465,6 @@ let process_packet ~store ~cons ~doms ~con ~req =
error "process packet: %s" (Printexc.to_string exn);
Connection.send_error con tid rid "EIO"
 
-let write_access_log ~ty ~tid ~con ~data =
-   Logging.xb_op ~ty ~tid ~con:(Connection.get_domstr con) data
-
-let write_answer_log ~ty ~tid ~con ~data =
-   Logging.xb_answer ~ty ~tid ~con:(Connection.get_domstr con) data
-
 let do_input store cons doms con =
let newpacket =
try
@@ -483,7 +491,7 @@ let do_input store cons doms con =
 (Connection.get_domstr con) tid
 (Xenbus.Xb.Op.to_string ty) (sanitize_data data); *)
process_packet ~store ~cons ~doms ~con ~req;
-   write_access_log ~ty ~tid ~con ~data;
+   write_access_log ~ty ~tid ~con:(Connection.get_domstr con) 
~data;
Connection.incr_ops con;
)
 
@@ -496,7 +504,7 @@ let do_output store cons doms con =
   info "[%s] <- %s \"%s\""
 (Connection.get_domstr con)
 (Xenbus.Xb.Op.to_string ty) (sanitize_data 
data);*)
-   write_answer_log ~ty ~tid ~con ~data;
+   write_answer_log ~ty ~tid ~con:(Connection.get_domstr 
con) ~data;
);
try
ignore (Connection.do_output con)
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 5/7] oxenstored: move functions that process simple operations

2016-03-19 Thread Jonathan Davies
Separate the functions which process operations that can be done as part of a
transaction. Specifically, these operations are: read, write, rm, getperms,
setperms, getdomainpath, directory, mkdir.

Also split function_of_type into two functions: one for processing the simple
operations and one for processing the rest.

This will help allow replay of transactions, allowing us to invoke the functions
that process the simple operations as part of the processing of transaction_end.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.lud...@citrix.com>
Reviewed-by: Euan Harris <euan.har...@citrix.com>
---
 tools/ocaml/xenstored/process.ml |  223 +-
 1 file changed, 121 insertions(+), 102 deletions(-)

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 758ade1..39ae71b 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -119,94 +119,6 @@ let do_getperms con t domains cons data =
let perms = Transaction.getperms t (Connection.get_perm con) path in
Perms.Node.to_string perms ^ "\000"
 
-let do_watch con t domains cons data =
-   let (node, token) = 
-   match (split None '\000' data) with
-   | [node; token; ""]   -> node, token
-   | _   -> raise Invalid_Cmd_Args
-   in
-   let watch = Connections.add_watch cons con node token in
-   Packet.Ack (fun () -> Connection.fire_single_watch watch)
-
-let do_unwatch con t domains cons data =
-   let (node, token) =
-   match (split None '\000' data) with
-   | [node; token; ""]   -> node, token
-   | _   -> raise Invalid_Cmd_Args
-   in
-   Connections.del_watch cons con node token
-
-let do_transaction_start con t domains cons data =
-   if Transaction.get_id t <> Transaction.none then
-   raise Transaction_nested;
-   let store = Transaction.get_store t in
-   string_of_int (Connection.start_transaction con store) ^ "\000"
-
-let do_transaction_end con t domains cons data =
-   let commit =
-   match (split None '\000' data) with
-   | "T" :: _ -> true
-   | "F" :: _ -> false
-   | x :: _   -> raise (Invalid_argument x)
-   | _-> raise Invalid_Cmd_Args
-   in
-   let success =
-   Connection.end_transaction con (Transaction.get_id t) commit in
-   if not success then
-   raise Transaction_again;
-   if commit then
-   process_watch (List.rev (Transaction.get_paths t)) cons
-
-let do_introduce con t domains cons data =
-   if not (Connection.is_dom0 con)
-   then raise Define.Permission_denied;
-   let (domid, mfn, port) =
-   match (split None '\000' data) with
-   | domid :: mfn :: port :: _ ->
-   int_of_string domid, Nativeint.of_string mfn, 
int_of_string port
-   | _ -> raise Invalid_Cmd_Args;
-   in
-   let dom =
-   if Domains.exist domains domid then
-   Domains.find domains domid
-   else try
-   let ndom = Xenctrl.with_intf (fun xc ->
-   Domains.create xc domains domid mfn port) in
-   Connections.add_domain cons ndom;
-   Connections.fire_spec_watches cons "@introduceDomain";
-   ndom
-   with _ -> raise Invalid_Cmd_Args
-   in
-   if (Domain.get_remote_port dom) <> port || (Domain.get_mfn dom) <> mfn 
then
-   raise Domain_not_match
-
-let do_release con t domains cons data =
-   if not (Connection.is_dom0 con)
-   then raise Define.Permission_denied;
-   let domid =
-   match (split None '\000' data) with
-   | [domid;""] -> int_of_string domid
-   | _  -> raise Invalid_Cmd_Args
-   in
-   let fire_spec_watches = Domains.exist domains domid in
-   Domains.del domains domid;
-   Connections.del_domain cons domid;
-   if fire_spec_watches 
-   then Connections.fire_spec_watches cons "@releaseDomain"
-   else raise Invalid_Cmd_Args
-
-let do_resume con t domains cons data =
-   if not (Connection.is_dom0 con)
-   then raise Define.Permission_denied;
-   let domid =
-   match (split None '\000' data) with
-   | domid :: _ -> int_of_string domid
-   | _  -> raise Invalid_Cmd_Args
-   in
-   if Domains.exist domains

[Xen-devel] [PATCH 0/7] oxenstored: improve transaction conflict handling

2016-03-19 Thread Jonathan Davies
This patch series makes a substantial improvement to oxenstored's transaction
handling.

The original design of oxenstored assumed that a transaction would only ever
span a small subtree. In practice this is rarely the case and means that other
xenstore traffic overlapping with a transaction can cause it to fail with
EAGAIN. This leads to significant performance problems with toolstack operations
when there is a certain level of xenstore traffic coming from other sources.

Instead, we observe that, although the transaction may span a large subtree, it
is very unlikely to read or write the same keys as other xenstore traffic. This
observation leads to an improved transaction conflict-handling algorithm in
which the transaction is replayed to check it is serializable. This approach is
superior in performance.

The sixth patch is the main one that changes the way transaction conflicts are
handled. The first five patches prepare the way by performing some refactoring
and addition of infrastructure that allows for transactions to be replayed. The
seventh patch adds some extra logging.

Jonathan Davies (7):
  oxenstored: refactor putting response on wire
  oxenstored: remove some unused parameters
  oxenstored: refactor request processing
  oxenstored: keep track of each transaction's operations
  oxenstored: move functions that process simple operations
  oxenstored: replay transaction upon conflict
  oxenstored: log request and response during transaction replay

 tools/ocaml/xenstored/Makefile|1 +
 tools/ocaml/xenstored/connection.ml   |5 +-
 tools/ocaml/xenstored/define.ml   |1 +
 tools/ocaml/xenstored/oxenstored.conf |1 +
 tools/ocaml/xenstored/process.ml  |  344 +
 tools/ocaml/xenstored/transaction.ml  |   21 +-
 tools/ocaml/xenstored/xenstored.ml|1 +
 7 files changed, 239 insertions(+), 135 deletions(-)

-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 2/7] oxenstored: remove some unused parameters

2016-03-19 Thread Jonathan Davies
Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.lud...@citrix.com>
Reviewed-by: Euan Harris <euan.har...@citrix.com>
---
 tools/ocaml/xenstored/process.ml |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 3377966..7a73669 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -119,7 +119,7 @@ let do_getperms con t domains cons data =
let perms = Transaction.getperms t (Connection.get_perm con) path in
Perms.Node.to_string perms ^ "\000"
 
-let do_watch con t rid domains cons data =
+let do_watch con t domains cons data =
let (node, token) = 
match (split None '\000' data) with
| [node; token; ""]   -> node, token
@@ -299,25 +299,25 @@ let send_response ty con t rid response =
| Packet.Error e ->
Connection.send_error con (Transaction.get_id t) rid e
 
-let reply_ack fct ty con t rid doms cons data =
+let reply_ack fct con t doms cons data =
fct con t doms cons data;
Packet.Ack (fun () ->
if Transaction.get_id t = Transaction.none then
process_watch (Transaction.get_ops t) cons
)
 
-let reply_data fct ty con t rid doms cons data =
+let reply_data fct con t doms cons data =
let ret = fct con t doms cons data in
Packet.Reply ret
 
-let reply_data_or_ack fct ty con t rid doms cons data =
+let reply_data_or_ack fct con t doms cons data =
match fct con t doms cons data with
| Some ret -> Packet.Reply ret
| None -> Packet.Ack (fun () -> ())
 
-let reply_none fct ty con t rid doms cons data =
+let reply_none fct con t doms cons data =
(* let the function reply *)
-   fct con t rid doms cons data
+   fct con t doms cons data
 
 let function_of_type ty =
match ty with
@@ -348,7 +348,7 @@ let input_handle_error ~cons ~doms ~fct ~ty ~con ~t ~rid 
~data =
let reply_error e =
Packet.Error e in
try
-   fct ty con t rid doms cons data
+   fct con t doms cons data
with
| Define.Invalid_path  -> reply_error "EINVAL"
| Define.Already_exist -> reply_error "EEXIST"
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH 1/7] oxenstored: refactor putting response on wire

2016-03-19 Thread Jonathan Davies
Previously, the functions reply_{ack,data,data_or_ack} and input_handle_error
put the response on the wire by invoking Connection.send_{ack,reply,error}.

Instead, these functions now return a value indicating what needs to be put on
the wire, and that action is done by a send_response function called
afterwards.

This refactoring gives us a chance to store the value of the response, useful
for replaying transactions.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
Reviewed-by: Andrew Cooper <andrew.coop...@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.lud...@citrix.com>
Reviewed-by: Euan Harris <euan.har...@citrix.com>
---
 tools/ocaml/xenstored/Makefile   |1 +
 tools/ocaml/xenstored/process.ml |   34 --
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/tools/ocaml/xenstored/Makefile b/tools/ocaml/xenstored/Makefile
index 59875f7..dce9e70 100644
--- a/tools/ocaml/xenstored/Makefile
+++ b/tools/ocaml/xenstored/Makefile
@@ -36,6 +36,7 @@ OBJS = define \
stdext \
trie \
config \
+   packet \
logging \
quota \
perms \
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index e827678..3377966 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -126,8 +126,7 @@ let do_watch con t rid domains cons data =
| _   -> raise Invalid_Cmd_Args
in
let watch = Connections.add_watch cons con node token in
-   Connection.send_ack con (Transaction.get_id t) rid Xenbus.Xb.Op.Watch;
-   Connection.fire_single_watch watch
+   Packet.Ack (fun () -> Connection.fire_single_watch watch)
 
 let do_unwatch con t domains cons data =
let (node, token) =
@@ -289,20 +288,32 @@ let do_set_target con t domains cons data =
| _   -> raise Invalid_Cmd_Args
 
 (*- Generic handling of ty --*)
+let send_response ty con t rid response =
+   match response with
+   | Packet.Ack f ->
+   Connection.send_ack con (Transaction.get_id t) rid ty;
+   (* Now do any necessary follow-up actions *)
+   f ()
+   | Packet.Reply ret ->
+   Connection.send_reply con (Transaction.get_id t) rid ty ret
+   | Packet.Error e ->
+   Connection.send_error con (Transaction.get_id t) rid e
+
 let reply_ack fct ty con t rid doms cons data =
fct con t doms cons data;
-   Connection.send_ack con (Transaction.get_id t) rid ty;
-   if Transaction.get_id t = Transaction.none then
-   process_watch (Transaction.get_ops t) cons
+   Packet.Ack (fun () ->
+   if Transaction.get_id t = Transaction.none then
+   process_watch (Transaction.get_ops t) cons
+   )
 
 let reply_data fct ty con t rid doms cons data =
let ret = fct con t doms cons data in
-   Connection.send_reply con (Transaction.get_id t) rid ty ret
+   Packet.Reply ret
 
 let reply_data_or_ack fct ty con t rid doms cons data =
match fct con t doms cons data with
-   | Some ret -> Connection.send_reply con (Transaction.get_id t) 
rid ty ret
-   | None -> Connection.send_ack con (Transaction.get_id t) rid ty
+   | Some ret -> Packet.Reply ret
+   | None -> Packet.Ack (fun () -> ())
 
 let reply_none fct ty con t rid doms cons data =
(* let the function reply *)
@@ -335,7 +346,7 @@ let function_of_type ty =
 
 let input_handle_error ~cons ~doms ~fct ~ty ~con ~t ~rid ~data =
let reply_error e =
-   Connection.send_error con (Transaction.get_id t) rid e in
+   Packet.Error e in
try
fct ty con t rid doms cons data
with
@@ -368,7 +379,10 @@ let process_packet ~store ~cons ~doms ~con ~tid ~rid ~ty 
~data =
else
Connection.get_transaction con tid
in
-   input_handle_error ~cons ~doms ~fct ~ty ~con ~t ~rid ~data;
+   let response = input_handle_error ~cons ~doms ~fct ~ty ~con ~t 
~rid ~data in
+
+   (* Put the response on the wire *)
+   send_response ty con t rid response
with exn ->
error "process packet: %s" (Printexc.to_string exn);
Connection.send_error con tid rid "EIO"
-- 
1.7.10.4


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/7] oxenstored: improve transaction conflict handling

2016-03-18 Thread Jonathan Davies
On Fri, Mar 18, 2016 at 10:33:35AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Mar 17, 2016 at 05:51:08PM +0000, Jonathan Davies wrote:
> > This patch series makes a substantial improvement to oxenstored's 
> > transaction
> > handling.
> > 
> > The original design of oxenstored assumed that a transaction would only ever
> > span a small subtree. In practice this is rarely the case and means that 
> > other
> > xenstore traffic overlapping with a transaction can cause it to fail with
> > EAGAIN. This leads to significant performance problems with toolstack 
> > operations
> > when there is a certain level of xenstore traffic coming from other sources.
> > 
> > Instead, we observe that, although the transaction may span a large 
> > subtree, it
> > is very unlikely to read or write the same keys as other xenstore traffic. 
> > This
> > observation leads to an improved transaction conflict-handling algorithm in
> > which the transaction is replayed to check it is serializable. This 
> > approach is
> > superior in performance.
> > 
> > The sixth patch is the main one that changes the way transaction conflicts 
> > are
> > handled. The first five patches prepare the way by performing some 
> > refactoring
> > and addition of infrastructure that allows for transactions to be replayed. 
> > The
> > seventh patch adds some extra logging.
> 
> All the patches have quite the Reviewed-by list already so I would think
> these can go in now?
> 
> Thought oddly I didn't see any discussion on xen-devel for this?

Indeed; we did a lot of testing internally before these patches were posted.

Jonathan

> > Jonathan Davies (7):
> >   oxenstored: refactor putting response on wire
> >   oxenstored: remove some unused parameters
> >   oxenstored: refactor request processing
> >   oxenstored: keep track of each transaction's operations
> >   oxenstored: move functions that process simple operations
> >   oxenstored: replay transaction upon conflict
> >   oxenstored: log request and response during transaction replay
> > 
> >  tools/ocaml/xenstored/Makefile|1 +
> >  tools/ocaml/xenstored/connection.ml   |5 +-
> >  tools/ocaml/xenstored/define.ml   |1 +
> >  tools/ocaml/xenstored/oxenstored.conf |1 +
> >  tools/ocaml/xenstored/process.ml  |  344 
> > +
> >  tools/ocaml/xenstored/transaction.ml  |   21 +-
> >  tools/ocaml/xenstored/xenstored.ml|1 +
> >  7 files changed, 239 insertions(+), 135 deletions(-)
> > 
> > -- 
> > 1.7.10.4
> > 
> > 
> > ___
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH] oxenstored: Quota.merge: don't assume domain already exists

2015-11-11 Thread Jonathan Davies
In Quota.merge, we merge two quota hashtables, orig_quota and mod_quota, putting
the results into dest_quota. These hashtables map domids to the number of
entries currently owned by that domain.

When mod_quota contains an entry for a domid that was not present in orig_quota
(or dest_quota), the call to get_entry caused Quota.merge to raise a Not_found
exception. This propagates back to the client as an ENOENT error, which is not
an appropriate return value from some operations, such as transaction_end.

This situation can arise when a transaction that introduces a domain (hence
calling Quota.add_entry) needs to be coalesced due to concurrent xenstore
activity.

This patch handles the merge in the case where mod_quota contains an entry not
present in orig_quota (or in dest_quota) by treating that hashtable as having
existing value 0.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
---
 tools/ocaml/xenstored/quota.ml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/ocaml/xenstored/quota.ml b/tools/ocaml/xenstored/quota.ml
index e6953c6..abcac91 100644
--- a/tools/ocaml/xenstored/quota.ml
+++ b/tools/ocaml/xenstored/quota.ml
@@ -83,6 +83,6 @@ let add quota diff =
Hashtbl.iter (fun id nb -> set_entry quota id (get_entry quota id + 
nb)) diff.cur
 
 let merge orig_quota mod_quota dest_quota =
- Hashtbl.iter (fun id nb -> let diff = nb - (get_entry orig_quota id) 
in
+ Hashtbl.iter (fun id nb -> let diff = nb - (try get_entry orig_quota 
id with Not_found -> 0) in
if diff <> 0 then
-   set_entry dest_quota id ((get_entry 
dest_quota id) + diff)) mod_quota.cur
+   set_entry dest_quota id ((try get_entry 
dest_quota id with Not_found -> 0) + diff)) mod_quota.cur
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-13 Thread Jonathan Davies

On 13/04/15 11:56, George Dunlap wrote:

On Thu, Apr 9, 2015 at 5:36 PM, Stefano Stabellini
stefano.stabell...@eu.citrix.com wrote:

On Thu, 9 Apr 2015, Eric Dumazet wrote:

On Thu, 2015-04-09 at 16:46 +0100, Stefano Stabellini wrote:

Hi all,

I found a performance regression when running netperf -t TCP_MAERTS from
an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the
virtual machine are 30% slower than v3.18.

Through bisection I found that the perf regression is caused by the
prensence of the following commit in the guest kernel:


commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
Author: Eric Dumazet eduma...@google.com
Date:   Sun Dec 7 12:22:18 2014 -0800

 tcp: refine TSO autosizing


[snip]


I recently discussed this issue on netdev in the following thread:

https://www.marc.info/?l=linux-netdevm=142738853820517


This commit restored original TCP Small Queue behavior, which is the
first step to fight bufferbloat.

Some network drivers are known to be problematic because of a delayed TX
completion.


[snip]


Try to tweak /proc/sys/net/ipv4/tcp_limit_output_bytes to see if it
makes a difference ?


A very big difference:

echo 262144  /proc/sys/net/ipv4/tcp_limit_output_bytes
brings us much closer to the original performance, the slowdown is just
8%

echo 1048576  /proc/sys/net/ipv4/tcp_limit_output_bytes
fills the gap entirely, same performance as before refine TSO
autosizing


What would be the next step for here?  Should I just document this as an
important performance tweaking step for Xen, or is there something else
we can do?


Is the problem perhaps that netback/netfront delays TX completion?
Would it be better to see if that can be addressed properly, so that
the original purpose of the patch (fighting bufferbloat) can be
achieved while not degrading performance for Xen?  Or at least, so
that people get decent perfomance out of the box without having to
tweak TCP parameters?


I agree; reducing the completion latency should be the ultimate goal. 
However, that won't be easy, so we need a work-around in the short term. 
I don't like the idea of relying on documenting the recommendation to 
change tcp_limit_output_bytes; too many people won't read this advice 
and will expect the out-of-the-box defaults to be reasonable.


Following Eric's pointers to where a similar problem had been 
experienced in wifi drivers, I came up with two proof-of-concept patches 
that gave a similar performance gain without any changes to sysctl 
parameters or core tcp/ip code. See 
https://www.marc.info/?l=linux-netdevm=142746161307283.


I haven't yet received any feedback from the xen-netfront maintainers 
about whether those ideas could be reasonably adopted.


Jonathan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen-netfront: transmit fully GSO-sized packets

2015-03-31 Thread Jonathan Davies

On 30/03/15 14:46, Wei Liu wrote:

On Thu, Mar 26, 2015 at 03:08:58PM +, Jonathan Davies wrote:


On 26/03/15 12:05, Eric Dumazet wrote:

On Thu, 2015-03-26 at 11:13 +, Jonathan Davies wrote:

xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
bytes each. This slight reduction in the size of packets means a slight loss in
efficiency.

Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
 XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.

The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
6c09fa09d) in determining when to split an skb into two is
 sk-sk_gso_max_size - 1 - MAX_TCP_HEADER.

So the maximum permitted size of an skb is calculated to be
 (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.

Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
Instead, there is no need to deviate from the default gso_max_size of 65536 as
this already accommodates the size of the header.

Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
skbs of up to 65160 bytes (45 segments of 1448 bytes each).

Fixes: 9ecd1a75d977 (xen-netfront: reduce gso_max_size to account for max TCP 
header)
Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
---
  drivers/net/xen-netfront.c | 2 --
  1 file changed, 2 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e9b960f..fb6e978 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1279,8 +1279,6 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
netdev-ethtool_ops = xennet_ethtool_ops;
SET_NETDEV_DEV(netdev, dev-dev);

-   netif_set_gso_max_size(netdev, XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER);
-
np-netdev = netdev;

netif_carrier_off(netdev);


Hmm, this partially reverts commit
9ecd1a75d977e2e8c48139c7d3efed183f898d94



Why xennet_change_mtu() is not changed by your patch ?


I think you are right: the mtu calculation suffers from the same problem. I
believe the value of mtu relates to the size of the whole packet, including
the header (which is why the value of dev-mtu is normally 1500 rather than
1448).

Wei, as the author of commit 9ecd1a75d977 (xen-netfront: reduce
gso_max_size to account for max TCP header), do you agree that the max mtu
formula should not need to subtract MAX_TCP_HEADER?



IIRC at the time I wrote that patch I needed to subtract MAX_TCP_HEADER
otherwise netfront would generate oversized packets then get marked as
malicious by backend.

I think your reasoning is straightforward. Probably other core driver
changes have somehow mitigated the issues I saw.

Presuming you tested this change and saw no problems, I'm of course
happy with making netfront more efficient. :-)


Okay, thanks for confirming.

I'll post a v2 including the change to xennet_change_mtu.

Jonathan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2] xen-netfront: transmit fully GSO-sized packets

2015-03-31 Thread Jonathan Davies
xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
bytes each. This slight reduction in the size of packets means a slight loss in
efficiency.

Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.

The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
6c09fa09d) in determining when to split an skb into two is
sk-sk_gso_max_size - 1 - MAX_TCP_HEADER.

So the maximum permitted size of an skb is calculated to be
(XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.

Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
Instead, there is no need to deviate from the default gso_max_size of 65536 as
this already accommodates the size of the header.

Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
skbs of up to 65160 bytes (45 segments of 1448 bytes each).

Similarly, the maximum allowable mtu does not need to subtract MAX_TCP_HEADER as
it relates to the size of the whole packet, including the header.

Fixes: 9ecd1a75d977 (xen-netfront: reduce gso_max_size to account for max TCP 
header)
Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com

---
Changes in v2:
 - Also correct max mtu in xennet_change_mtu.
---
 drivers/net/xen-netfront.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e9b960f..720aaf6 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1008,8 +1008,7 @@ err:
 
 static int xennet_change_mtu(struct net_device *dev, int mtu)
 {
-   int max = xennet_can_sg(dev) ?
-   XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER : ETH_DATA_LEN;
+   int max = xennet_can_sg(dev) ? XEN_NETIF_MAX_TX_SIZE : ETH_DATA_LEN;
 
if (mtu  max)
return -EINVAL;
@@ -1279,8 +1278,6 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
netdev-ethtool_ops = xennet_ethtool_ops;
SET_NETDEV_DEV(netdev, dev-dev);
 
-   netif_set_gso_max_size(netdev, XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER);
-
np-netdev = netdev;
 
netif_carrier_off(netdev);
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes

2015-03-27 Thread Jonathan Davies

On 26/03/15 17:14, Eric Dumazet wrote:

On Thu, 2015-03-26 at 16:46 +, Jonathan Davies wrote:

Network drivers with slow TX completion can experience poor network transmit
throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit.
The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
in-flight. This has been observed to limit transmit throughput with xen-netfront
because its TX completion can be relatively slow compared to physical NIC
drivers.

There have been several modifications to the calculation of the sk_wmem_alloc
limit in the past. Here is a brief history:

  * Since TSQ was introduced, the queue size limit was
sysctl_tcp_limit_output_bytes.

  * Commit c9eeec26 (tcp: TSQ can use a dynamic limit) made the limit
max(skb-truesize, sk-sk_pacing_rate  10).
This allows more packets in-flight according to the estimated rate.

  * Commit 98e09386 (tcp: tsq: restore minimal amount of queueing) made the
limit
max_t(unsigned int, sysctl_tcp_limit_output_bytes,
sk-sk_pacing_rate  10).
This ensures at least sysctl_tcp_limit_output_bytes in flight but allowed
more if rate estimation shows this to be worthwhile.

  * Commit 605ad7f1 (tcp: refine TSO autosizing) made the limit
min_t(u32, max(2 * skb-truesize, sk-sk_pacing_rate  10),
   sysctl_tcp_limit_output_bytes).
This meant that the limit can never exceed sysctl_tcp_limit_output_bytes,
regardless of what rate estimation suggests. It's not clear from the commit
message why this significant change was justified, changing
sysctl_tcp_limit_output_bytes from being a lower bound to an upper bound.

This patch restores the behaviour that allows the limit to grow above
sysctl_tcp_limit_output_bytes according to the rate estimation.

This has been measured to improve xen-netfront throughput from a domU to dom0
from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to
another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the
latter case, TX completion is especially slow, explaining the large improvement.
These values were measured against 4.0-rc5 using iperf -c ip -i 1 using
CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
E5-2650 v3 CPUs.

Fixes: 605ad7f184b6 (tcp: refine TSO autosizing)
Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
---
  net/ipv4/tcp_output.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..3a49af8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
 * One example is wifi aggregation (802.11 AMPDU)
 */
limit = max(2 * skb-truesize, sk-sk_pacing_rate  10);
-   limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+   limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);

if (atomic_read(sk-sk_wmem_alloc)  limit) {
set_bit(TSQ_THROTTLED, tp-tsq_flags);



That will get a NACK from me and Google TCP team of course.

Try your patch with low throughput flows, like 64kbps, GSO off...

And observe TCP timestamps and rtt estimations, critical for vegas or
any CC based on delay.

I get line rate on 40Gbps NIC using this (low) limit of 2 TSO packets.

This topic was discussed for wifi recently, and I suggested multiple
ways to solve the problem on problematic drivers.

There is a reason sysctl_tcp_limit_output_bytes exists : You can change
its value to whatever you want.

vi +652 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
 Controls TCP Small Queue limit per tcp socket.
 TCP bulk sender tends to increase packets in flight until it
 gets losses notifications. With SNDBUF autotuning, this can
 result in a large amount of packets queued in qdisc/device
 on the local machine, hurting latency of other flows, for
 typical pfifo_fast qdiscs.
 tcp_limit_output_bytes limits the number of bytes on qdisc
 or device to reduce artificial RTT/cwnd and reduce bufferbloat.
 Default: 131072

Documentation is pretty clear : This is the max value, not a min one.


Okay, thanks for your feedback. It wasn't clear to me from the commit 
message in 605ad7f1 (tcp: refine TSO autosizing) why 
tcp_limit_output_bytes changed from being a lower bound to an upper 
bound in that patch. But it's now clear from your reply that that's the 
behaviour you intend as correct.


Thanks for drawing my attention to the wifi thread, which is helpful. As 
I understand it, the only available options for fixing the performance 
regression experienced by xen-netfront are:

  1. Reduce TX completion latency, or
  2. Bypass TSQ to allow buffering of more than

[Xen-devel] [PATCH] xen-netfront: transmit fully GSO-sized packets

2015-03-26 Thread Jonathan Davies
xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
bytes each. This slight reduction in the size of packets means a slight loss in
efficiency.

Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.

The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
6c09fa09d) in determining when to split an skb into two is
sk-sk_gso_max_size - 1 - MAX_TCP_HEADER.

So the maximum permitted size of an skb is calculated to be
(XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.

Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
Instead, there is no need to deviate from the default gso_max_size of 65536 as
this already accommodates the size of the header.

Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
skbs of up to 65160 bytes (45 segments of 1448 bytes each).

Fixes: 9ecd1a75d977 (xen-netfront: reduce gso_max_size to account for max TCP 
header)
Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
---
 drivers/net/xen-netfront.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e9b960f..fb6e978 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1279,8 +1279,6 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
netdev-ethtool_ops = xennet_ethtool_ops;
SET_NETDEV_DEV(netdev, dev-dev);
 
-   netif_set_gso_max_size(netdev, XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER);
-
np-netdev = netdev;
 
netif_carrier_off(netdev);
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes

2015-03-26 Thread Jonathan Davies
Network drivers with slow TX completion can experience poor network transmit
throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit.
The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
in-flight. This has been observed to limit transmit throughput with xen-netfront
because its TX completion can be relatively slow compared to physical NIC
drivers.

There have been several modifications to the calculation of the sk_wmem_alloc
limit in the past. Here is a brief history:

 * Since TSQ was introduced, the queue size limit was
   sysctl_tcp_limit_output_bytes.

 * Commit c9eeec26 (tcp: TSQ can use a dynamic limit) made the limit
   max(skb-truesize, sk-sk_pacing_rate  10).
   This allows more packets in-flight according to the estimated rate.

 * Commit 98e09386 (tcp: tsq: restore minimal amount of queueing) made the
   limit
   max_t(unsigned int, sysctl_tcp_limit_output_bytes,
   sk-sk_pacing_rate  10).
   This ensures at least sysctl_tcp_limit_output_bytes in flight but allowed
   more if rate estimation shows this to be worthwhile.

 * Commit 605ad7f1 (tcp: refine TSO autosizing) made the limit
   min_t(u32, max(2 * skb-truesize, sk-sk_pacing_rate  10),
  sysctl_tcp_limit_output_bytes).
   This meant that the limit can never exceed sysctl_tcp_limit_output_bytes,
   regardless of what rate estimation suggests. It's not clear from the commit
   message why this significant change was justified, changing
   sysctl_tcp_limit_output_bytes from being a lower bound to an upper bound.

This patch restores the behaviour that allows the limit to grow above
sysctl_tcp_limit_output_bytes according to the rate estimation.

This has been measured to improve xen-netfront throughput from a domU to dom0
from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to
another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the
latter case, TX completion is especially slow, explaining the large improvement.
These values were measured against 4.0-rc5 using iperf -c ip -i 1 using
CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon
E5-2650 v3 CPUs.

Fixes: 605ad7f184b6 (tcp: refine TSO autosizing)
Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..3a49af8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
 * One example is wifi aggregation (802.11 AMPDU)
 */
limit = max(2 * skb-truesize, sk-sk_pacing_rate  10);
-   limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+   limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
if (atomic_read(sk-sk_wmem_alloc)  limit) {
set_bit(TSQ_THROTTLED, tp-tsq_flags);
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen-netfront: transmit fully GSO-sized packets

2015-03-26 Thread Jonathan Davies


On 26/03/15 12:05, Eric Dumazet wrote:

On Thu, 2015-03-26 at 11:13 +, Jonathan Davies wrote:

xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
bytes each. This slight reduction in the size of packets means a slight loss in
efficiency.

Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
 XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.

The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
6c09fa09d) in determining when to split an skb into two is
 sk-sk_gso_max_size - 1 - MAX_TCP_HEADER.

So the maximum permitted size of an skb is calculated to be
 (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.

Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
Instead, there is no need to deviate from the default gso_max_size of 65536 as
this already accommodates the size of the header.

Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
skbs of up to 65160 bytes (45 segments of 1448 bytes each).

Fixes: 9ecd1a75d977 (xen-netfront: reduce gso_max_size to account for max TCP 
header)
Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
---
  drivers/net/xen-netfront.c | 2 --
  1 file changed, 2 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e9b960f..fb6e978 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1279,8 +1279,6 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
netdev-ethtool_ops = xennet_ethtool_ops;
SET_NETDEV_DEV(netdev, dev-dev);

-   netif_set_gso_max_size(netdev, XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER);
-
np-netdev = netdev;

netif_carrier_off(netdev);


Hmm, this partially reverts commit
9ecd1a75d977e2e8c48139c7d3efed183f898d94



Why xennet_change_mtu() is not changed by your patch ?


I think you are right: the mtu calculation suffers from the same 
problem. I believe the value of mtu relates to the size of the whole 
packet, including the header (which is why the value of dev-mtu is 
normally 1500 rather than 1448).


Wei, as the author of commit 9ecd1a75d977 (xen-netfront: reduce 
gso_max_size to account for max TCP header), do you agree that the max 
mtu formula should not need to subtract MAX_TCP_HEADER?


Regards,
Jonathan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel