Hello guys!
I would also like to thank you for sharing your design with us in such a short
period of time!
I believe it is a great idea to have event registration via IOCTL (for
notifications / waits).
I have a few things that I would need some clarification on:
o) "Transaction based DPIF primitive": does this mean that we do Reads and
Writes here?
o) "State aware DPIF Dump commands"
"Thus, the driver does not have to maintain any dump transaction outstanding
request nor need to allocate any resources for it."
Do you mean that whenever, say, a flow dump request is issued, all flows are
given back in one reply?
o) "Event notification / NL multicast subscription"
"An event (such as port addition/deletion link up/down) are propagated from
the kernel to user mode through a subscription of a socket to a
multicast
group (nl_sock_join_mcgroup()) and a synchronous Receive
(nl_sock_recv())
for retrieving the events"
1. I understand you are not speaking of events here as API waitable /
notification events, right?
Otherwise, what would be the format of the structs read from nl_sock_recv()?
2. What would the relationship between Hyper-V ports, Hyper-V NICs, and DP
ports be?
I mean, in the sense that the DP port additions and deletions would be requests
coming from userspace to the kernel (so no notification is needed), while we
get OIDs when NICs connect and disconnect. In this sense, I see the Hyper-V NIC
connection and disconnection as something that could be implemented as API
notification events.
o) "C. Implementation work flow"
So our incremental development here would be:
1. Add a new device (alongside the existing one)
2. Implement a netlink protocol (for basic parsing attributes, etc.) for the
new device
3. Implement netlink datapath operations for this device (get and dump only)
4. further & more advanced things are to be dealt with later.
o) "One thing
though is that, nl_sock_transact_multiple() might have to be modified to the
series of nl_sock_send__() and nl_sock_recv__(), rather than doing a bunch
of
sends first and then doing the recvs. This is because Windows may not
preserve
message boundaries when we do the recv."
If I understand what you mean, I think this is an implementation detail.
Basically, for our driver, I know that for unicast messages we can do
sequential reads: we hold an 'offset' into the buffer where the next read must
begin. However, as I remember, the implementation of "write" simply overwrites
the previous buffer (of the corresponding socket). I believe it is better to
keep one write followed by one receive instead of doing all the writes and then
all the receives. However, I think we need to take into account the situation
where userspace might provide a buffer smaller than the total there is to read.
Also, I think the "dump" mechanism requires it.
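To illustrate the kind of bookkeeping I am referring to, here is a rough sketch
only (not our actual driver code; the struct and function names are made up):

#include <stdint.h>
#include <string.h>

/* Hypothetical per-socket context: the buffered reply and the offset where
 * the next read resumes. */
struct ovs_sock_ctx {
    uint8_t  *reply;      /* reply buffered for this socket */
    uint32_t  reply_len;  /* total bytes buffered */
    uint32_t  offset;     /* where the next read starts */
};

/* Copy out as much as the caller's buffer allows and advance the offset.
 * Returns the number of bytes copied; the caller reads again if its buffer
 * was smaller than what is left. */
static uint32_t
ovs_sock_read(struct ovs_sock_ctx *ctx, void *out, uint32_t out_cap)
{
    uint32_t left = ctx->reply_len - ctx->offset;
    uint32_t n = left < out_cap ? left : out_cap;

    memcpy(out, ctx->reply + ctx->offset, n);
    ctx->offset += n;
    return n;
}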
My suggestions & opinions:
o) I think we must do dumping via writes and reads. The main reason is that we
don't know the total size to read when we request, say, a flow dump.
o) I believe we shouldn't pay the netlink overhead (nlmsghdr, genlmsghdr,
attributes) when it is not needed (say, when registering a KEVENT notification),
and, if we choose not to always use the netlink protocol, we may need a way to
differentiate between netlink and non-netlink requests.
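For example, one possible (purely illustrative) way to differentiate would be
to sanity-check the leading nlmsghdr; using separate IOCTL codes would work
just as well:

#include <stdint.h>

/* Netlink message header, as defined by the netlink protocol. */
struct nlmsghdr {
    uint32_t nlmsg_len;
    uint16_t nlmsg_type;
    uint16_t nlmsg_flags;
    uint32_t nlmsg_seq;
    uint32_t nlmsg_pid;
};

/* Heuristic check: does the buffer start with a plausible netlink header? */
static int
looks_like_netlink(const void *buf, uint32_t len)
{
    const struct nlmsghdr *nlh = buf;

    return len >= sizeof *nlh
           && nlh->nlmsg_len >= sizeof *nlh
           && nlh->nlmsg_len <= len;
}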
Sam
________________________________________
From: Eitan Eliahu [[email protected]]
Sent: Wednesday, August 06, 2014 9:15 PM
To: [email protected]; Rajiv Krishnamurthy; Alin Serdean; Ben Pfaff; Kaushik
Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Alin Serdean; Ankur Sharma; Samuel
Ghinet; Linda Sun; Keith Amidon
Subject: Design notes for provisioning Netlink interface from the OVS Windows
driver (Switch extension)
Hello all,
Here is a summary of our initial design. Not all areas are covered, so we would
be glad to discuss anything listed here and any other code/features we could
leverage.
Thanks!
Eitan
A. Objectives:
[1] Create a NetLink (NL) driver interface for Windows which interoperates with
the OVS NL user mode.
[2] User mode code should be mostly cross platform with some minimal changes to
support specific Windows OS calls.
[3] The driver should not have to maintain state or resources for transactions
or dumps.
[4] Reduce the number of system calls: user mode NL code should use a device
IOCTL system call to send an NL command and to receive the associated NL reply
in the same system call, whenever possible (*).
[5] An event may be associated with an NL socket I/O request to signal the
completion of an outstanding receive operation on the socket.
(For simplicity, a single outstanding I/O request could be associated with a
socket for the signaling purpose.)
(*) We assume multiple NL transactions for the same socket can never be
interleaved.
B. Netlink operation types:
There are four types of interactions carried by processes through the NL layer:
[1] Transaction based DPIF primitives: these DPIF commands are mapped to the
nl_sock_transact() NL interface, which in turn calls nl_sock_transact_multiple().
A transaction based command creates an ad hoc socket and submits a synchronous
device I/O to the driver. The driver constructs the NL reply and copies it into
the output buffer of the IRP representing the I/O transaction.
(Provisioning of transaction based commands can be brought up and exercised
through the ovs-dpctl command in parallel to the existing DPIF device.)
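(A minimal user mode sketch of such a transaction is below; OVS_IOCTL_TRANSACT
is only a placeholder name for the control code the driver would define.)

#include <windows.h>
#include <winioctl.h>
#include <stdint.h>

/* Placeholder control code; the real value is to be defined by the driver. */
#define OVS_IOCTL_TRANSACT \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)

/* Send one NL request and receive its NL reply in a single system call. */
static BOOL
ovs_transact(HANDLE dev, void *request, uint32_t request_len,
             void *reply, uint32_t reply_cap, uint32_t *reply_len)
{
    DWORD bytes = 0;
    BOOL ok = DeviceIoControl(dev, OVS_IOCTL_TRANSACT,
                              request, request_len,
                              reply, reply_cap, &bytes, NULL);

    *reply_len = bytes;
    return ok;
}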
[2] State aware DPIF Dump commands: port and flow dumps call the following NL
interfaces:
a) nl_dump_start()
b) nl_dump_next()
c) nl_dump_done()
With the exception of nl_dump_start(), these NL primitives are based on a
synchronous IOCTL system call rather than Write/Read. Thus, the driver does not
have to maintain any outstanding dump transaction request, nor does it need to
allocate any resources for it.
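(The following user mode sketch shows how the dump can stay stateless in the
driver; OVS_IOCTL_DUMP, the continuation cookie and the completion convention
are placeholders, and the exact mapping from nl_dump_next() may differ.)

#include <windows.h>
#include <winioctl.h>
#include <stdint.h>

/* Placeholder control code. */
#define OVS_IOCTL_DUMP \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x802, METHOD_BUFFERED, FILE_ANY_ACCESS)

/* All dump state lives in user mode and is passed back on every call, so the
 * driver never keeps an outstanding dump request. */
struct ovs_dump_in {
    uint32_t target;  /* e.g. flows or ports (placeholder) */
    uint32_t cookie;  /* continuation cookie, 0 on the first call */
};

static BOOL
ovs_dump_next(HANDLE dev, struct ovs_dump_in *in,
              void *out, uint32_t out_cap, uint32_t *out_len, BOOL *done)
{
    DWORD bytes = 0;

    if (!DeviceIoControl(dev, OVS_IOCTL_DUMP, in, sizeof *in,
                         out, out_cap, &bytes, NULL)) {
        return FALSE;
    }
    *out_len = bytes;
    /* Completion could also be signalled with NLMSG_DONE; zero bytes is
     * assumed here only to keep the sketch short. The driver would return
     * the next cookie inside the reply so the caller can update in->cookie. */
    *done = (bytes == 0);
    return TRUE;
}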
[3] UpCall Port/PID/Unicast socket:
The driver maintains a per-socket queue for all packets which have no matching
flow in the flow table. The socket has a single overlapped (event) structure
which will be signalled through the completion of a pending I/O request sent by
user mode on subscription (similar to the current implementation). When
dpif_recv_wait() is called, the event associated with the pending I/O request
is passed to poll_fd_wait_event() in order to wake the thread which polls the
port queue.
dpif_recv() calls nl_sock_recv(), which in turn drains the queue maintained by
the kernel in a synchronous fashion (through the use of an ioctl system call).
The overlapped structure is rearmed when the recv_set DPIF callback function is
called.
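(A sketch of arming that pending I/O request from user mode is below;
OVS_IOCTL_SUBSCRIBE is a placeholder name, and the device handle is assumed to
be opened with FILE_FLAG_OVERLAPPED.)

#include <windows.h>
#include <winioctl.h>
#include <string.h>

/* Placeholder control code. */
#define OVS_IOCTL_SUBSCRIBE \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x803, METHOD_BUFFERED, FILE_ANY_ACCESS)

/* Pend an I/O request whose event the driver signals when a packet is queued.
 * The returned event handle is what would be handed to poll_fd_wait_event()
 * to wake the polling thread. */
static HANDLE
ovs_arm_upcall(HANDLE dev, OVERLAPPED *ov)
{
    memset(ov, 0, sizeof *ov);
    ov->hEvent = CreateEvent(NULL, FALSE, FALSE, NULL); /* auto-reset */
    if (!ov->hEvent) {
        return NULL;
    }

    /* ERROR_IO_PENDING is the expected outcome here. */
    if (!DeviceIoControl(dev, OVS_IOCTL_SUBSCRIBE, NULL, 0, NULL, 0, NULL, ov)
        && GetLastError() != ERROR_IO_PENDING) {
        CloseHandle(ov->hEvent);
        return NULL;
    }
    return ov->hEvent;
}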
[4] Event notification / NL multicast subscription:
Events (such as port addition/deletion, link up/down) are propagated from the
kernel to user mode through a subscription of a socket to a multicast group
(nl_sock_join_mcgroup()) and a synchronous Receive (nl_sock_recv()) for
retrieving the events. The driver maintains a single event queue for all
events. Similar to the UpCall mechanism, a user mode process keeps an
outstanding I/O request in the driver which is triggered whenever a new event
is generated. The event associated with the overlapped structure of the socket
is passed to poll_fd_wait_event() whenever the dpif_port_poll_wait() callback
function is called. dpif_poll() will drain the event queue through a call to
nl_sock_recv().
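(The drain side could look roughly like the sketch below; OVS_IOCTL_EVENT_RECV
is again a placeholder for the synchronous ioctl that nl_sock_recv() would map
to here.)

#include <windows.h>
#include <winioctl.h>
#include <stdint.h>

/* Placeholder control code. */
#define OVS_IOCTL_EVENT_RECV \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x804, METHOD_BUFFERED, FILE_ANY_ACCESS)

/* Wait for the subscription event, then drain the kernel event queue with
 * synchronous ioctls until it is empty. */
static void
ovs_drain_events(HANDLE dev, HANDLE event, void *buf, uint32_t cap)
{
    DWORD bytes;

    WaitForSingleObject(event, INFINITE);
    do {
        bytes = 0;
        if (!DeviceIoControl(dev, OVS_IOCTL_EVENT_RECV, NULL, 0,
                             buf, cap, &bytes, NULL)) {
            break;
        }
        /* Parse the NL event messages in buf[0..bytes) here. */
    } while (bytes > 0);
}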
C. Implementation work flow:
The driver creates a device object which provides a NetLink interface for user
mode processes. During the development phase this device is created in addition
to the existing DPIF device. (This means that the bring-up of the NL based user
mode can be done on a live kernel with resident DPs, ports and flows)
All transaction and dump based DPIF functions could be developed and brought up
while the NL device is a secondary device (ovs-dpctl show and dump XXX should
work). After the initial phase is completed (i.e. all transaction and dump based
DPIF primitives are implemented), the original device interface will be removed
and the packet and event propagation path will be brought up (driven by
vswitch.exe).
[1] Socket creation
Since the PID should be allocated on a system wide basis and be unique across
all processes, the kernel assigns the PID for a newly created socket. A new
IOCTL command, OVS_GET_PID, returns the PID to a user mode client to be
associated with the socket.
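For illustration only (the OVS_GET_PID name comes from the text above, but the
control code value and the reply layout are placeholders):

#include <windows.h>
#include <winioctl.h>
#include <stdint.h>

/* Placeholder control code for the OVS_GET_PID command described above. */
#define OVS_GET_PID \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x805, METHOD_BUFFERED, FILE_ANY_ACCESS)

/* Ask the kernel for the system-wide unique PID it assigned to this socket. */
static BOOL
ovs_get_pid(HANDLE sock, uint32_t *pid)
{
    DWORD bytes = 0;

    return DeviceIoControl(sock, OVS_GET_PID, NULL, 0,
                           pid, sizeof *pid, &bytes, NULL)
           && bytes == sizeof *pid;
}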
[2] Detailed description
nl_sock_transact_multiple() calls into a series of nl_sock_send__() and
nl_sock_recv__(). These can be implemented using ReadFile() and WriteFile(), or
an ioctl modeled on a transaction which does both the read and the write. One
thing though is that nl_sock_transact_multiple() might have to be modified into
a series of nl_sock_send__() and nl_sock_recv__() calls, rather than doing a
bunch of sends first and then doing the recvs. This is because Windows may not
preserve message boundaries when we do the recv.
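Roughly, the interleaved variant could look like the sketch below, where
send_one_request() and recv_one_reply() are hypothetical stand-ins for the
nl_sock_send__()/nl_sock_recv__() pair; only the ordering matters here:

#include <stddef.h>

struct nl_sock;          /* opaque here */
struct nl_transaction;   /* request/reply pair */

/* Hypothetical per-message helpers standing in for nl_sock_send__() and
 * nl_sock_recv__(). */
int send_one_request(struct nl_sock *, struct nl_transaction *);
int recv_one_reply(struct nl_sock *, struct nl_transaction *);

/* Send one request and read its reply before moving on to the next, so a
 * recv never has to deal with message boundaries across requests. */
static int
transact_multiple_interleaved(struct nl_sock *sock,
                              struct nl_transaction **txns, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int error = send_one_request(sock, txns[i]);
        if (!error) {
            error = recv_one_reply(sock, txns[i]);
        }
        if (error) {
            return error;
        }
    }
    return 0;
}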