Re: [DRIVER SUBMISSION] DRBD wants to go mainline

2007-07-26 Thread Evgeniy Polyakov
Hi Kyle.

On Wed, Jul 25, 2007 at 11:43:38PM -0400, Kyle Moffett ([EMAIL PROTECTED]) 
wrote:
 If you made all your sockets and inter-thread pipes nonblocking then  
 in userspace you would just epoll_wait() on the sockets and pipes and  
 be easily able to react to any IO from anywhere.
 
 In kernel space there are similar nonblocking interfaces, although it  
 would probably be easier just to use a couple threads.

There are no such interfaces in the kernel - one must either create
one's own state machine on top of the ->poll() callback, or use
sys_poll()/epoll, but that is likely not what one wants for a
high-performance in-kernel event processing engine. Having two threads
does not solve the problem - eventually one needs to send a header
before receiving data. So, the solution is either to use always-blocking
mode like NBD does, or to create one's own state machine using ->poll()
callbacks.
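
Just to illustrate what such a state machine tends to look like
(untested sketch, hypothetical names; in practice one hooks the socket
callbacks rather than ->poll() itself, and the sk_data_ready signature
here is the 2.6-era one, it has changed since):

#include <net/sock.h>
#include <linux/workqueue.h>

struct conn_state {
	struct socket *sock;
	void (*saved_data_ready)(struct sock *sk, int bytes);
	struct work_struct rx_work;	/* state machine runs here */
};

static void conn_data_ready(struct sock *sk, int bytes)
{
	struct conn_state *c = sk->sk_user_data;

	/* softirq context: no sleeping here, just kick process context */
	schedule_work(&c->rx_work);
	c->saved_data_ready(sk, bytes);
}

static void conn_hook_callbacks(struct conn_state *c)
{
	struct sock *sk = c->sock->sk;

	write_lock_bh(&sk->sk_callback_lock);
	sk->sk_user_data = c;
	c->saved_data_ready = sk->sk_data_ready;
	sk->sk_data_ready = conn_data_ready;
	write_unlock_bh(&sk->sk_callback_lock);
}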

 Cheers,
 Kyle Moffett

-- 
Evgeniy Polyakov


Re: [DRIVER SUBMISSION] DRBD wants to go mainline

2007-07-25 Thread Kyle Moffett

On Jul 25, 2007, at 22:03:37, [EMAIL PROTECTED] wrote:

On Wed, 25 Jul 2007, Satyam Sharma wrote:

On 7/25/07, Lars Ellenberg [EMAIL PROTECTED] wrote:

On Wed, Jul 25, 2007 at 04:41:53AM +0530, Satyam Sharma wrote:

[...]
But where does the send come into the picture here -- a
send won't block forever, so I don't foresee any issues
whatsoever w.r.t. the kthreads conversion for that. [ BTW, I hope
you're *not* using any signals-based interface for your kernel
thread _at all_. Kthreads disallow (ignore) all signals by
default, as they should, and you really shouldn't need to write
any logic to handle, or do certain things on seeing, a signal
in a well designed kernel thread. ]

and the sending latency is crucial to performance, while the
recv will not timeout for the next few seconds.

Again, I don't see what sending latency has to do with a
kernel_thread to kthread conversion. Or with signals, for that
matter. Anyway, as Kyle Moffett mentioned elsewhere, you could
probably look at other examples (say,
cifs_demultiplexer_thread() in fs/cifs/connect.c).


the basic problem, and what we use signals for, is:  it is  
waiting in recv, waiting for the peer to say something.  but I  
want it to stop recv, and go send something right now.


That's ... weird. Most (all?) communication between any two  
parties would follow a protocol where someone recv's stuff, does  
something with it, and sends it back ... what would you send  
right now if you didn't receive anything?


because even though you didn't receive anything, you now have  
something important to send.


remember that both sides can be sitting in receive mode. this puts  
them both in a position to respond to the other if the other has  
something to say.


Why not just have 2 threads, one for sending and one for  
receiving?  When your receiving thread gets data it takes  
appropriate locks and processes it, then releases the locks and goes  
back to waiting for packets.  Your sending thread would take  
appropriate locks, generate data to send, release locks, and transmit  
packets.  You don't have to interrupt the receive thread to send  
packets, so where's the latency problem, exactly?


If I were writing that in userspace I would have:

(A) The pool of IO-generating threads (IE: What would ordinarily be  
userspace)

(B) One or a small number of data-reception threads.
(C) One or a small number of data-transmission threads.

When you get packets to process in your network-reception thread(s),  
you queue appropriate disk IOs and any appropriate responses with  
your transmission thread(s).  You can basically just sit in a loop on  
tcp_recvmsg => demultiplex => do-stuff.  When your IO-generators actually  
make stuff to send you queue such data for disk IO, then packetize it  
and hand it off to your data-transmission threads.
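
Roughly, in (untested, simplified) userspace C, the split looks like
this -- "struct msg" and the queue are stand-ins:

#include <pthread.h>
#include <sys/socket.h>

struct msg { struct msg *next; /* payload elided */ };

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static struct msg *q_head;		/* replies waiting to be sent */

static void *recv_loop(void *arg)	/* data-reception thread */
{
	int fd = *(int *)arg;
	char buf[4096];

	while (recv(fd, buf, sizeof(buf), 0) > 0) {
		/* demultiplex, queue disk IO; under q_lock append
		 * any response to q_head, then wake the sender: */
		pthread_cond_signal(&q_cond);
	}
	return NULL;
}

static void *send_loop(void *arg)	/* data-transmission thread */
{
	for (;;) {
		struct msg *m;

		pthread_mutex_lock(&q_lock);
		while (!q_head)
			pthread_cond_wait(&q_cond, &q_lock);
		m = q_head;
		q_head = m->next;
		pthread_mutex_unlock(&q_lock);
		/* packetize and transmit m outside the lock ... */
	}
	return NULL;
}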


If you made all your sockets and inter-thread pipes nonblocking then  
in userspace you would just epoll_wait() on the sockets and pipes and  
be easily able to react to any IO from anywhere.
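
I.e. something like this (untested, error handling omitted):

#include <sys/epoll.h>
#include <fcntl.h>

static void event_loop(int sock_fd, int pipe_fd)
{
	struct epoll_event ev, events[8];
	int epfd = epoll_create(8);	/* size is only a hint */

	fcntl(sock_fd, F_SETFL, fcntl(sock_fd, F_GETFL) | O_NONBLOCK);
	fcntl(pipe_fd, F_SETFL, fcntl(pipe_fd, F_GETFL) | O_NONBLOCK);

	ev.events = EPOLLIN;
	ev.data.fd = sock_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);
	ev.data.fd = pipe_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, pipe_fd, &ev);

	for (;;) {
		int i, n = epoll_wait(epfd, events, 8, -1);

		for (i = 0; i < n; i++) {
			if (events[i].data.fd == sock_fd)
				;	/* drain the socket */
			else
				;	/* another thread poked us via the pipe */
		}
	}
}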


In kernel space there are similar nonblocking interfaces, although it  
would probably be easier just to use a couple threads.


Cheers,
Kyle Moffett



Re: [DRIVER SUBMISSION] DRBD wants to go mainline

2007-07-23 Thread Kyle Moffett
For the guys on netdev, would you please look at the tcp_recvmsg-threading  
and TCP_NAGLE_CORK issues below and give opinions on the  
best way to proceed?


One thing to remember, you don't necessarily have to merge every  
feature right away.  As long as the new code is configured off by  
default with an (EXPERIMENTAL) warning, you can start getting the  
core parts and the cleanups upstream before you have to resolve all  
the issues with low-latency, dynamic-tracing-frameworks, etc.


On Jul 23, 2007, at 09:32:02, Lars Ellenberg wrote:

On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:

+/* I don't remember why XCPU ...
+ * This is used to wake the asender,
+ * and to interrupt sending the sending task
+ * on disconnect.
+ */
+#define DRBD_SIG SIGXCPU


Don't use signals between kernel threads, use proper primitives  
like notifiers and waitqueues, which means you should also  
probably switch away from kernel_thread() to the kthread_*()  
APIs.  Also you should fix this FIXME or remove it if it no longer  
applies:-D.
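
E.g. a skeleton along these lines (names are made up, untested):

#include <linux/kthread.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(asender_wait);
static int asender_kick;		/* "something to send" flag */

static int asender_fn(void *data)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible(asender_wait,
				asender_kick || kthread_should_stop());
		asender_kick = 0;
		/* ... send whatever is pending ... */
	}
	return 0;
}

/* start: task = kthread_run(asender_fn, NULL, "drbd_asender");
 * wake:  asender_kick = 1; wake_up(&asender_wait);
 * stop:  kthread_stop(task);  -- no SIGXCPU needed */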


right.
but how do I tell a network thread in tcp_recvmsg to stop early,  
without using signals?


I'm not really a kernel-networking guy, so I can't answer this  
definitively, but I'm pretty sure the problem has been solved in many  
network filesystems and such, so I've added a netdev CC.  The way I'd  
do it in userspace is with nonblocking IO and epoll(), that way I  
don't actually have to stop or signal the thread; I can just add  
a socket to my epoll fd when I want to pay attention to it, and remove  
it from my epoll fd when I'm done with it.  I'd assume there's some  
equivalent way in kernelspace based around the struct kiocb *iocb  
and int nonblock parameters to the tcp_recvmsg() kernel function.
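
For example, something like this, assuming kernel_recvmsg() is
available in your tree (untested):

#include <linux/net.h>
#include <linux/uio.h>
#include <linux/socket.h>

/* returns bytes received, or -EAGAIN when nothing is queued */
static int try_recv(struct socket *sock, void *buf, size_t len)
{
	struct kvec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };

	return kernel_recvmsg(sock, &msg, &iov, 1, len, MSG_DONTWAIT);
}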



+/* see kernel/printk.c:printk_ratelimit
+ * macro, so it is easy to have independent rate limits at different locations
+ * initializer element not constant ... with kernel 2.4 :(
+ * so I initialize toks to something large
+ */
+#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \

Any particular reason you can't just use printk_ratelimit for this?


I want to be able to do a rate-limit per specific message/code  
fragment, without affecting other messages or execution paths.


Ok, so could you change your patch to modify __printk_ratelimit() to  
also accept a struct printk_rate data structure and make  
printk_ratelimit() call __printk_ratelimit(global_printk_rate)?
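
I.e. the interface could look roughly like this (just a sketch, not an
existing API; SMP locking omitted):

#include <linux/jiffies.h>

struct printk_rate {
	unsigned long interval;	/* window length, in jiffies   */
	unsigned long begin;	/* start of the current window */
	int burst;		/* messages allowed per window */
	int printed;		/* messages used this window   */
};

/* returns nonzero when the caller may print */
static int __printk_ratelimit_site(struct printk_rate *rs)
{
	if (time_after(jiffies, rs->begin + rs->interval)) {
		rs->begin = jiffies;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	return 0;
}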


Typically if $KERNEL_FEATURE is insufficient for your needs you  
should fix $KERNEL_FEATURE instead of duplicating a replacement in  
your driver.  This applies to basically all of the things I'm talking  
about, kernel-threads, workqueues (BTW: I believe you can make your  
own custom workqueue thread(s) instead of using the default events/*  
ones), debugging macros, fault-insertion, integer math, lock-checking,  
dynamic tracing, etc.  If you find some reason that some  
generic code won't work for you, please try to fix it first so we can  
all benefit from it.
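
E.g. a private workqueue is about this much code (hypothetical names,
2.6.20+ workqueue API, untested):

#include <linux/workqueue.h>
#include <linux/init.h>
#include <linux/errno.h>

static struct workqueue_struct *drbd_wq;

static void drbd_work_fn(struct work_struct *w)
{
	/* ... issue disk IO, build reply packets, ... */
}

static DECLARE_WORK(drbd_work, drbd_work_fn);

static int __init drbd_wq_init(void)
{
	drbd_wq = create_singlethread_workqueue("drbd_wq");
	if (!drbd_wq)
		return -ENOMEM;
	queue_work(drbd_wq, &drbd_work);  /* runs on our thread, not events/N */
	return 0;
}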


Umm, how about fixing this to actually use proper workqueues or  
something instead of this open-coded mess?


unlikely to happen right now.  but it is on our todo list...


Unfortunately problems like these need to be fixed before a mainline  
merge.  Merging duplicated code is a big no-no, and historically  
there have been problems with people who merge code and never  
properly maintain it once it's in tree.  As a result the rule is your  
code has to be easily maintainable before anybody will even  
*consider* merging it.



+/* I want the packet to fit within one page
+ * THINK maybe use a special bitmap header,
+ * including offset and compression scheme and whatnot
+ * Do not use PAGE_SIZE here! Use a architecture agnostic constant!
+ */
+#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long))


Yuck.  Definitely use PAGE_SIZE here, so at least if it's broken  
on an arch with multiple page sizes, somebody can grep for  
PAGE_SIZE to fix it.  It also means that on archs/configs with 8k  
or 64k pages you won't waste a bunch of memory.


No. This is not to allocate anything, but defines the chunk size  
with which we transmit the bitmap, when we have to.  We need to be  
able to talk from one arch (say i586) to some other (say s390, or  
sparc, or whatever).  The receiving side has a one-page buffer,  
from which it may or may not do endian-conversion.  The hardcoded  
4096 is the lowest common denominator here.


Ahhh.  Please replace the constant 4096 with:
/* This is the maximum amount of bitmap we will send per packet */
#define MAX_BITMAP_CHUNK_SIZE 4096
#define BM_PACKET_WORDS \
	((MAX_BITMAP_CHUNK_SIZE - sizeof(struct Drbd_Header))/sizeof(long))

It's more text but dramatically improves the readability by  
eliminating more magic numbers.  This is a much milder case than I've  
seen in the past, so it's not that big of a deal.




+/* Dynamic tracing framework */
guess we