Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Chase Venters
On Tuesday 17 October 2006 00:09, Johann Borck wrote:
 Regarding mukevent I'm thinking of an event-type-specific struct that is
 filled by the originating code and placed into a per-event-type ring
 buffer (which requires modification of kevent_wait).

I'd personally worry about an implementation that used a per-event-type ring 
buffer, because you're still left having to hack around starvation issues in 
user-space. It is of course possible under the current model for anyone who 
wants per-event-type ring buffers to have them - just make separate kevent 
sets.

I haven't thought this through all the way yet, but why not have variable 
length event structures and have the kernel fill in a next pointer in each 
one? This could even be used to keep backwards binary compatibility while 
adding additional fields to the structures over time, though no space would 
be wasted on modern programs. You still end up with a question of what to do 
in case of overflow, but I'm thinking the thing to do in that case might be 
to start pushing overflow events onto a linked list which can be written back 
into the ring buffer when space becomes available. The appropriate behavior 
would be to throw new events on the linked list if the linked list had any 
events, so that things are delivered in order, but write to the mapped buffer 
directly otherwise.

Deciding when to do that is tricky, and I haven't thought through the 
implications fully when I say this, but what about activating a bottom half 
when more space becomes available, and letting it drain overflowed events back 
into the mapped buffer? Or perhaps the time to do it would be in the next 
blocking wait, once the queue has emptied? 
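A rough sketch of the variable-length layout floated above (field and helper 
names are invented for illustration, not part of any actual kevent API; 
handle_event() is a placeholder):

struct ring_event_hdr {
        uint32_t type;  /* event type discriminator */
        uint32_t len;   /* total record length, header included */
        uint32_t next;  /* byte offset of the next record in the buffer, 0 = end */
        /* type-specific payload of (len - sizeof(struct ring_event_hdr)) bytes follows */
};

/* An older binary can still walk records whose payload later grew: 'next'
 * says where the following record starts, independent of any fields that
 * were appended after the binary was compiled. */
static void walk_ring(void *base, struct ring_event_hdr *first)
{
        struct ring_event_hdr *hdr;

        for (hdr = first; hdr != NULL;
             hdr = hdr->next ? (struct ring_event_hdr *)((char *)base + hdr->next) : NULL)
                handle_event(hdr->type, hdr + 1, hdr->len - sizeof(*hdr));
}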

I think it is very important to avoid any limits that cannot be adjusted on 
the fly at run-time by CAP_SYS_ADMIN or what have you. Doing it this way may 
have other problems I've ignored, but at least the big one - compile-time 
capacity limits in the year 2006 - would be largely avoided :P

Nothing real solid yet, just some electrical storms in the grey matter...

Thanks,
Chase


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 07:10:14AM +0200, Johann Borck ([EMAIL PROTECTED]) 
wrote:
 Ulrich Drepper wrote:
  Evgeniy Polyakov wrote:
  Existing design does not allow overflow.
 
  And I've pointed out a number of times that this is not practical at
  best.  There are event sources which can create events which cannot be
  coalesced into one single event as it would be required with your design.
 
  Signals are one example, specifically realtime signals.  If we do not
  want the design to be limited from the start this approach has to be
  thought over.
 
 
  So zap mmap() support completely, since it is not usable at all. We
  won't discuss it.
 
  Initial implementation did not have it.
  But I was requested to do it, and it is ready now.
  No one likes it, but no one provides an alternative implementation.
  We are stuck.
 
  We need the mapped ring buffer.  The current design (before it was
  removed) was broken but this does not mean it shouldn't be
  implemented.  We just need more time to figure out how to implement it
  correctly.
 
 Considering the 'if at all' and 'if, then how' of the ring buffer implementation,
 I'd like to throw in some ideas I had when reading the discussion and
 respective code. If I understood Ulrich Drepper right, his notion of a
 generic event handling interface is that it has to be flexible enough
 to transport additional info from origin to userspace, and to support
 queuing of events from the same origin, so that additional
 per-event-occurrence data doesn't get lost, which would happen when
 coalescing multiple events into one until delivery. From what I read, he
 says the ring buffer is broken because of insufficient space for additional
 data (mukevent) and the limited number of events that can be put into
 the ring buffer. Another argument is the missing notification of userspace about
 dropped events in case the ring buffer limit is reached. (Is that right?)

I can add such notification, but its existence _is_ the broken design.
After such a condition happens, all new events will disappear from the
mapped buffer (although they are still accessible through the usual queue).

While writing this I have come to an idea on how to improve the case of
the size of the mapped buffer - we can make it with a limited size, and when
it is full, some bit will be set in the shared area and obviously no new
events can be added there, but when the user commits some events from that
buffer (i.e. says to the kernel that the appropriate kevents can be freed or
requeued according to their flags), new ready events from the ready queue
can be copied into the mapped buffer.

It still does not solve (and I do insist that it is broken behaviour)
the case when the kernel is going to generate an infinite number of events for
one requested by userspace (as in the case of generating a new 'data_has_arrived'
event when each new byte has been received).

Userspace events are only marked as ready, they are not generated - that
is a high-performance _feature_ of the new design, not some kind of bug.
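
A rough sketch of the limited-size buffer scheme just described (structure and 
field names are invented for illustration):

struct kevent_mring {
        unsigned int full;          /* set when no free slot remains */
        unsigned int nr_events;     /* number of valid entries */
        struct mukevent event[KEVENT_RING_SIZE];
};

/* kernel side, when a kevent becomes ready */
if (ring->nr_events == KEVENT_RING_SIZE)
        ring->full = 1;             /* userspace must commit before new copies appear */
else
        ring->event[ring->nr_events++] = *mu;

/* when userspace commits N events, the kernel frees/requeues those kevents
 * according to their flags, refills the freed slots from the ready queue,
 * and clears 'full' */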

 I see no reason why kevent couldn't be modified to fit (all) these
 needs. While modifying the server-example and writing a client using
 kevent I came across the coalescing problem: there were more incoming
 connections than accept events, and I had to work around that. In this

Btw, the accept() issue is exactly the same as with the usual poll() - repeated
insertion of the same kevent will fire immediately, which requires the event
to be one-shot. One of the initial implementations contained the number of
sockets ready for accept() as one of the returned parameters, though.

 case the pure number of coalesced events would suffice, while it
 wouldn't for the example of RT-signals that Ulrich Drepper gave. So whether
 coalescing can be done at all, or is impossible, depends on the type
 of event. The same goes for additional data delivered with the events.
 There might be no panacea for all possible scenarios with one fixed
 design. Either performance suffers for 'lightweight' events which don't
 need additional data and/or where coalescing is not problematic and/or a ring
 buffer, or kevent is not usable for other types of events. Why not treat
 different things differently, and let the (kernel-)user decide?
 I don't know if I got all this right, but if so, then a ring buffer is needed
 especially for cases where coalescing is not possible and additional
 data has to be delivered for each triggered notification (so the pure
 number of events is not enough; other reasons? performance?). To me it
 doesn't make sense to have kevent fill memory and use processor time if the
 buffer is not used at all, which is the case when using kevent_getevents.
 So here are my ideas:
 Make usage of the ring buffer optional; if not required for a specific
 event-type, that might be chosen by userspace code.
 Make the limit of events in the ring buffer optional and controllable from
 userspace.

It is of course possible; the main problem is that the existing design of the
mapped buffer is not sufficient, and there are no other propositions
except that 'it 

Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 12:59:47AM -0500, Chase Venters ([EMAIL PROTECTED]) 
wrote:
 On Tuesday 17 October 2006 00:09, Johann Borck wrote:
  Regarding mukevent I'm thinking of an event-type-specific struct that is
  filled by the originating code and placed into a per-event-type ring
  buffer (which requires modification of kevent_wait).
 
 I'd personally worry about an implementation that used a per-event-type ring 
 buffer, because you're still left having to hack around starvation issues in 
 user-space. It is of course possible under the current model for anyone who 
 wants per-event-type ring buffers to have them - just make separate kevent 
 sets.
 
 I haven't thought this through all the way yet, but why not have variable 
 length event structures and have the kernel fill in a next pointer in each 
 one? This could even be used to keep backwards binary compatibility while 

Why do we want variable-size structures in the mmap ring buffer?

 adding additional fields to the structures over time, though no space would 
 be wasted on modern programs. You still end up with a question of what to do 
 in case of overflow, but I'm thinking the thing to do in that case might be 
 to start pushing overflow events onto a linked list which can be written back 
 into the ring buffer when space becomes available. The appropriate behavior 
 would be to throw new events on the linked list if the linked list had any 
 events, so that things are delivered in order, but write to the mapped buffer 
 directly otherwise.

I think in a similar way.
Kevent actually does not require such a list, since it already has a queue of
the ready events.

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Chase Venters
On Tuesday 17 October 2006 05:42, Evgeniy Polyakov wrote:
 On Tue, Oct 17, 2006 at 12:59:47AM -0500, Chase Venters 
([EMAIL PROTECTED]) wrote:
  On Tuesday 17 October 2006 00:09, Johann Borck wrote:
   Regarding mukevent I'm thinking of an event-type-specific struct that
   is filled by the originating code and placed into a per-event-type
   ring buffer (which requires modification of kevent_wait).
 
  I'd personally worry about an implementation that used a per-event-type
  ring buffer, because you're still left having to hack around starvation
  issues in user-space. It is of course possible under the current model
  for anyone who wants per-event-type ring buffers to have them - just make
  separate kevent sets.
 
  I haven't thought this through all the way yet, but why not have variable
  length event structures and have the kernel fill in a next pointer in
  each one? This could even be used to keep backwards binary compatibility
  while

 Why do we want variable-size structures in the mmap ring buffer?

Flexibility primarily. So when we all decide to add a new event type six 
months from now, or add more information to an existing one, we don't run the 
risk that the existing mukevent isn't big enough.

  adding additional fields to the structures over time, though no space
  would be wasted on modern programs. You still end up with a question of
  what to do in case of overflow, but I'm thinking the thing to do in that
  case might be to start pushing overflow events onto a linked list which
  can be written back into the ring buffer when space becomes available.
  The appropriate behavior would be to throw new events on the linked list
  if the linked list had any events, so that things are delivered in order,
  but write to the mapped buffer directly otherwise.

 I think in a similar way.
 Kevent actually does not require such a list, since it already has a queue of
 the ready events.

The current event types coalesce if there are multiple events, correct? It 
sounds like there may be other event types where coalescing multiple events 
is not the correct approach.

Thanks,
Chase


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Eric Dumazet
On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote:

 I can add such notification, but its existence _is_ the broken design.
 After such a condition happens, all new events will disappear from the
 mapped buffer (although they are still accessible through the usual queue).

 While writing this I have come to an idea on how to improve the case of
 the size of the mapped buffer - we can make it with a limited size, and when
 it is full, some bit will be set in the shared area and obviously no new
 events can be added there, but when the user commits some events from that
 buffer (i.e. says to the kernel that the appropriate kevents can be freed or
 requeued according to their flags), new ready events from the ready queue
 can be copied into the mapped buffer.

 It still does not solve (and I do insist that it is broken behaviour)
 the case when the kernel is going to generate an infinite number of events for
 one requested by userspace (as in the case of generating a new 'data_has_arrived'
 event when each new byte has been received).

Behavior is not broken. It's quite useful and works 99.% of the time.

I was trying to suggest this to you, but you missed my point.

You don't want to use a bit, but a full 32-bit sequence counter.

A program may handle XXX.XXX handles, but use only a 4096-entry ring
buffer.

The user program keeps a local copy of a special word
named 'ring_buffer_full_counter'.

Each time the kernel cannot queue an event in the ring buffer, it increases
the ring_buffer_full_counter (exported to the user app in the mmap view).

When the user application notices that the kernel
changed ring_buffer_full_counter, it does a full scan of all file
handles (preferably using poll() to get all relevant info in one syscall):

do {
        if (read_event_from_mmap()) { handle_event(fd); continue; }
        /* ring buffer is empty, check if we missed some events */
        if (unlikely(mmap->ring_buffer_full_counter != my_ring_buffer_full_counter)) {
                my_ring_buffer_full_counter = mmap->ring_buffer_full_counter;
                /* slow path */
                /* can use a big poll() for example, or just a loop without poll() */
                for_all_file_desc_do() {
                        /* check if some event/data is waiting on THIS fd */
                }
        } else {
                syscall_wait_for_one_available_kevent(queue);
        }
} while (1);

This is how a program can recover. If the ring buffer has a reasonable size, this 
kind of event should not happen very frequently. If it does (because events 
continue to fill the ring_buffer during recovery and might hit FULL again), maybe 
a smart program is able to resize the ring_buffer, and start using it after 
yet another recovery pass.
If not, we don't care, because a big poll() gives us many ready file-descriptors 
in one syscall, and maybe this is much better than kevent/epoll when XX.XXX 
events are ready.
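
The shared words this scheme relies on could look like the following sketch 
(names invented; only the counter itself is essential):

struct kevent_mmap_view {
        volatile uint32_t ring_buffer_full_counter; /* bumped on every dropped event */
        /* ... ring entries follow ... */
};

/* kernel side, on failure to queue an event into the ring */
view->ring_buffer_full_counter++;   /* consumer compares against its local copy
                                       and falls back to the slow-path scan */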

Eric


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 08:12:04AM -0500, Chase Venters ([EMAIL PROTECTED]) 
wrote:
Regarding mukevent I'm thinking of an event-type-specific struct that
is filled by the originating code and placed into a per-event-type
ring buffer (which requires modification of kevent_wait).
  
   I'd personally worry about an implementation that used a per-event-type
   ring buffer, because you're still left having to hack around starvation
   issues in user-space. It is of course possible under the current model
   for anyone who wants per-event-type ring buffers to have them - just make
   separate kevent sets.
  
   I haven't thought this through all the way yet, but why not have variable
   length event structures and have the kernel fill in a next pointer in
   each one? This could even be used to keep backwards binary compatibility
   while
 
  Why do we want variable-size structures in the mmap ring buffer?
 
 Flexibility primarily. So when we all decide to add a new event type six 
 months from now, or add more information to an existing one, we don't run the 
 risk that the existing mukevent isn't big enough.

Do we need such flexibility when we have a unique id attached to each
event? The user can store any information in his own buffers, which are indexed
by that id.
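
Roughly what that looks like from userspace (types and names invented for the
sketch): the fixed-size mukevent only needs to carry the id back, and the
application keeps everything else in its own table:

struct conn_state *conn_table[MAX_EVENTS];      /* indexed by the id we register */

conn_table[id] = conn;                          /* at kevent registration time */
/* ... later, on delivery ... */
struct conn_state *c = conn_table[ready_ev.id]; /* look our state up again */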

   adding additional fields to the structures over time, though no space
   would be wasted on modern programs. You still end up with a question of
   what to do in case of overflow, but I'm thinking the thing to do in that
   case might be to start pushing overflow events onto a linked list which
   can be written back into the ring buffer when space becomes available.
   The appropriate behavior would be to throw new events on the linked list
   if the linked list had any events, so that things are delivered in order,
   but write to the mapped buffer directly otherwise.
 
  I think in a similar way.
  Kevent actually does not require such a list, since it already has a queue of
  the ready events.
 
 The current event types coalesce if there are multiple events, correct? It 
 sounds like there may be other event types where coalescing multiple events 
 is not the correct approach.

There is no event coalescing; I think it is even incorrect to say that
something is being coalesced in kevents.

There is a 'new' (which is a well-forgotten old) approach - the user _asks_ the kernel 
about some information, and the kernel says when it is ready. The kernel does not 
say: part of the info is ready, part of the info is ready and so on, it 
just marks the user's request as ready - that means that it is possible that
there were zillions of events, each one could mark the _same_ userspace
request as ready, and exactly what the user requested is transferred back. 
Thus it is very fast and is the correct way to deal with the problem of pipes of 
different diameters.

The kernel does not generate events - only the user creates requests, which are
marked as ready.

I made that decision to remove _any_ kind of possible overflows from the
kernel side - if the user was scheduled away, or has insufficient space or
a bad mood - and to not introduce any kind of ugly priorities (a higher one 
could fill the whole pipe while a lower one could not even send a single event). 
Instead the kernel does just what it was requested to do, and it can provide 
some hints on how that process happened (for example how many sockets are 
ready for accept(), or how many bytes are in the receiving queue).

And that approach does solve the problem in the cases where it looks like
it is logical to _generate_ an event - for example the inotify case, where
a new event is _generated_ each time the requested condition happens, such as
when new files are created in a directory. There it is possible
that there will be a queue overflow (btw, a watch for each file in the kernel 
tree takes about 2gb of kernel mem) if many files were created, so
userspace must rescan the whole directory to check for missed files. So why
is it needed at all to generate info about the first two or ten files?
Instead userspace asks the kernel to notify it when the directory has changed or
some new files were created, and kernelspace will answer when the directory
has been changed or new files were created (with some hint with the number
of them).

Likely a request for generation of events in the kernel is a workaround for 
some other problems, which in the long term will hit us with new troubles -
queue length and overflows.
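
A sketch of the 'mark ready once, with a hint' model described above (field and
helper names are invented; real kevent internals differ, and locking is elided):

void kevent_mark_ready(struct kevent *k, unsigned int hint)
{
        k->hint += hint;        /* e.g. bytes received, or sockets ready for accept() */
        if (!k->ready) {        /* zillions of occurrences, but only one queue entry */
                k->ready = 1;
                list_add_tail(&k->ready_entry, &k->queue->ready_list);
                wake_up(&k->queue->wait);
        }
}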

 Thanks,
 Chase

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 03:19:36PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
 On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote:
 
  I can add such notification, but its existence _is_ the broken design.
  After such a condition happens, all new events will disappear from the
  mapped buffer (although they are still accessible through the usual queue).

  While writing this I have come to an idea on how to improve the case of
  the size of the mapped buffer - we can make it with a limited size, and when
  it is full, some bit will be set in the shared area and obviously no new
  events can be added there, but when the user commits some events from that
  buffer (i.e. says to the kernel that the appropriate kevents can be freed
  or requeued according to their flags), new ready events from the ready
  queue can be copied into the mapped buffer.

  It still does not solve (and I do insist that it is broken behaviour)
  the case when the kernel is going to generate an infinite number of events for
  one requested by userspace (as in the case of generating a new
  'data_has_arrived' event when each new byte has been received).
 
 Behavior is not broken. It's quite useful and works 99.% of the time.

 I was trying to suggest this to you, but you missed my point.

 You don't want to use a bit, but a full 32-bit sequence counter.

 A program may handle XXX.XXX handles, but use only a 4096-entry ring
 buffer.

 The user program keeps a local copy of a special word
 named 'ring_buffer_full_counter'.

 Each time the kernel cannot queue an event in the ring buffer, it increases
 the ring_buffer_full_counter (exported to the user app in the mmap view).

 When the user application notices that the kernel
 changed ring_buffer_full_counter, it does a full scan of all file
 handles (preferably using poll() to get all relevant info in one syscall):

I.e. to scan the rest of the xxx.xxx events?

 do {
         if (read_event_from_mmap()) { handle_event(fd); continue; }
         /* ring buffer is empty, check if we missed some events */
         if (unlikely(mmap->ring_buffer_full_counter != my_ring_buffer_full_counter)) {
                 my_ring_buffer_full_counter = mmap->ring_buffer_full_counter;
                 /* slow path */
                 /* can use a big poll() for example, or just a loop without poll() */
                 for_all_file_desc_do() {
                         /* check if some event/data is waiting on THIS fd */
                 }
         } else {
                 syscall_wait_for_one_available_kevent(queue);
         }
 } while (1);
 
 This is how a program can recover. If the ring buffer has a reasonable size, this 
 kind of event should not happen very frequently. If it does (because events 
 continue to fill the ring_buffer during recovery and might hit FULL again), maybe 
 a smart program is able to resize the ring_buffer, and start using it after 
 yet another recovery pass.
 If not, we don't care, because a big poll() gives us many ready file-descriptors 
 in one syscall, and maybe this is much better than kevent/epoll when XX.XXX 
 events are ready.

What about the case which I described in the other e-mail, when in case of
a full ring buffer no new events are written there, and when
userspace commits (i.e. marks as ready to be freed or requeued by the kernel) 
some events, new ones will be copied from the ready queue into the buffer?

 Eric

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Eric Dumazet
On Tuesday 17 October 2006 15:42, Evgeniy Polyakov wrote:
 On Tue, Oct 17, 2006 at 03:19:36PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
  On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote:
   I can add such notification, but its existence _is_ the broken design.
   After such a condition happens, all new events will disappear from the
   mapped buffer (although they are still accessible through the usual queue).

   While writing this I have come to an idea on how to improve the case of
   the size of the mapped buffer - we can make it with a limited size, and when
   it is full, some bit will be set in the shared area and obviously no
   new events can be added there, but when the user commits some events from
   that buffer (i.e. says to the kernel that the appropriate kevents can be freed
   or requeued according to their flags), new ready events from the ready
   queue can be copied into the mapped buffer.

   It still does not solve (and I do insist that it is broken behaviour)
   the case when the kernel is going to generate an infinite number of events for
   one requested by userspace (as in the case of generating a new
   'data_has_arrived' event when each new byte has been received).
 
  Behavior is not broken. It's quite useful and works 99.% of the time.

  I was trying to suggest this to you, but you missed my point.

  You don't want to use a bit, but a full 32-bit sequence counter.

  A program may handle XXX.XXX handles, but use only a 4096-entry ring
  buffer.

  The user program keeps a local copy of a special word
  named 'ring_buffer_full_counter'.

  Each time the kernel cannot queue an event in the ring buffer, it
  increases the ring_buffer_full_counter (exported to the user app in the
  mmap view).

  When the user application notices that the kernel
  changed ring_buffer_full_counter, it does a full scan of all file
  handles (preferably using poll() to get all relevant info in one syscall):

 I.e. to scan the rest of the xxx.xxx events?

  do {
          if (read_event_from_mmap()) { handle_event(fd); continue; }
          /* ring buffer is empty, check if we missed some events */
          if (unlikely(mmap->ring_buffer_full_counter != my_ring_buffer_full_counter)) {
                  my_ring_buffer_full_counter = mmap->ring_buffer_full_counter;
                  /* slow path */
                  /* can use a big poll() for example, or just a loop without poll() */
                  for_all_file_desc_do() {
                          /* check if some event/data is waiting on THIS fd */
                  }
          } else {
                  syscall_wait_for_one_available_kevent(queue);
          }
  } while (1);
 
  This is how a program can recover. If the ring buffer has a reasonable size,
  this kind of event should not happen very frequently. If it does (because
  events continue to fill the ring_buffer during recovery and might hit FULL
  again), maybe a smart program is able to resize the ring_buffer, and
  start using it after yet another recovery pass.
  If not, we don't care, because a big poll() gives us many ready
  file-descriptors in one syscall, and maybe this is much better than
  kevent/epoll when XX.XXX events are ready.

 What about the case which I described in the other e-mail, when in case of
 a full ring buffer no new events are written there, and when
 userspace commits (i.e. marks as ready to be freed or requeued by the kernel)
 some events, new ones will be copied from the ready queue into the buffer?

Then, the user might receive 'false events', exactly like poll()/select()/epoll() 
can do sometimes. I.e. a 'ready' indication while there is no current event 
available on a particular fd / event_source.

This should be safe, since those programs already ignore read() 
returning -EAGAIN and other similar things.
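
For instance, the classic non-blocking read pattern already absorbs a spurious
'ready' indication (standard POSIX, nothing kevent-specific):

n = read(fd, buf, sizeof(buf));
if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        continue;   /* false event: nothing actually there, go back to waiting */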

A programmer prefers to receive two 'event available' indications rather than ZERO (and 
be stuck for an infinite time). Of course, the hot path (normal cases) should return 
one 'event' only.

In other words, being ultra fast 99.99 % of the time, but being able to block 
forever once in a while, is not an option.
 
Eric



Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
  What about the case which I described in the other e-mail, when in case of
  a full ring buffer no new events are written there, and when
  userspace commits (i.e. marks as ready to be freed or requeued by the kernel)
  some events, new ones will be copied from the ready queue into the buffer?
 
 Then, the user might receive 'false events', exactly like poll()/select()/epoll() 
 can do sometimes. I.e. a 'ready' indication while there is no current event 
 available on a particular fd / event_source.

Only if the user simultaneously uses both interfaces and removes an event from the
queue when its copy was in the mapped buffer, but in that case it's the user's
problem (and if we do want, we can store a pointer/index of the ring
buffer entry, so when an event is removed from the ready queue (using 
kevent_get_events()), the appropriate entry in the ring buffer will be
updated to show that it is no longer valid).

 This should be safe, since those programs already ignore read() 
 returning -EAGAIN and other similar things.
 
 A programmer prefers to receive two 'event available' indications rather than ZERO
 (and be stuck for an infinite time). Of course, the hot path (normal cases) should
 return one 'event' only.
 
 In other words, being ultra fast 99.99 % of the time, but being able to block 
 forever once in a while, is not an option.

Have I missed something? It looks like the only problematic situation is the one
described above, when the user simultaneously uses both interfaces.

 Eric

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Eric Dumazet
On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote:
 On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
   What about the case which I described in the other e-mail, when in case of
   a full ring buffer no new events are written there, and when
   userspace commits (i.e. marks as ready to be freed or requeued by the
   kernel) some events, new ones will be copied from the ready queue into the
   buffer?
 
  Then, the user might receive 'false events', exactly like
  poll()/select()/epoll() can do sometimes. I.e. a 'ready' indication while
  there is no current event available on a particular fd / event_source.

 Only if the user simultaneously uses both interfaces and removes an event from the
 queue when its copy was in the mapped buffer, but in that case it's the user's
 problem (and if we do want, we can store a pointer/index of the ring
 buffer entry, so when an event is removed from the ready queue (using
 kevent_get_events()), the appropriate entry in the ring buffer will be
 updated to show that it is no longer valid).

  This should be safe, since those programs already ignore read()
  returning -EAGAIN and other similar things.
 
  A programmer prefers to receive two 'event available' indications rather than ZERO
  (and be stuck for an infinite time). Of course, the hot path (normal cases)
  should return one 'event' only.
 
  In other words, being ultra fast 99.99 % of the time, but being able
  to block forever once in a while, is not an option.

 Have I missed something? It looks like the only problematic situation is the one
 described above, when the user simultaneously uses both interfaces.

From my point of view, the user of the 'mmapped ring buffer' should be prepared to 
use both interfaces. Or else you are forced to presize the ring buffer to 
insane limits.

That is:
- Most of the time, we expect consuming events via the mmapped ring buffer and no 
syscalls.
- In case we notice a 'mmapped ring buffer overflow', syscalls to get/consume 
events that could not be stored in the mmapped buffer (but were queued by the kevent 
subsystem). If not stored by the kevent subsystem (memory failure?), revert to 
poll() to fetch all 'missed fds' in one go. Go back to normal mode.

- In case of an empty ring buffer (or no mmap support at all, because this app 
doesn't expect a lot of events per time unit, or because kevent doesn't have mmap 
support): be able to syscall and wait for an event.

Eric


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 04:25:00PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
 On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote:
  On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
 wrote:
What about the case which I described in the other e-mail, when in case of
a full ring buffer no new events are written there, and when
userspace commits (i.e. marks as ready to be freed or requeued by the
kernel) some events, new ones will be copied from the ready queue into the
buffer?
  
   Then, the user might receive 'false events', exactly like
   poll()/select()/epoll() can do sometimes. I.e. a 'ready' indication while
   there is no current event available on a particular fd / event_source.
 
  Only if the user simultaneously uses both interfaces and removes an event from the
  queue when its copy was in the mapped buffer, but in that case it's the user's
  problem (and if we do want, we can store a pointer/index of the ring
  buffer entry, so when an event is removed from the ready queue (using
  kevent_get_events()), the appropriate entry in the ring buffer will be
  updated to show that it is no longer valid).
 
   This should be safe, since those programs already ignore read()
   returning -EAGAIN and other similar things.
  
   A programmer prefers to receive two 'event available' indications rather than ZERO
   (and be stuck for an infinite time). Of course, the hot path (normal cases)
   should return one 'event' only.
  
   In other words, being ultra fast 99.99 % of the time, but being able to
   block forever once in a while, is not an option.
 
  Have I missed something? It looks like the only problematic situation is the one
  described above, when the user simultaneously uses both interfaces.
 
 From my point of view, the user of the 'mmapped ring buffer' should be prepared to 
 use both interfaces. Or else you are forced to presize the ring buffer to 
 insane limits.
 
 That is:
 - Most of the time, we expect consuming events via the mmapped ring buffer and no 
 syscalls.
 - In case we notice a 'mmapped ring buffer overflow', syscalls to get/consume 
 events that could not be stored in the mmapped buffer (but were queued by the kevent 
 subsystem). If not stored by the kevent subsystem (memory failure?), revert to 
 poll() to fetch all 'missed fds' in one go. Go back to normal mode.

kevent uses a smaller amount of memory than epoll() per event, so it is very
unlikely that it will be impossible to store a new event there while epoll()
would succeed. The same can be applied to poll(), which allocates the
whole table in the syscall.

 - In case of an empty ring buffer (or no mmap support at all, because this app 
 doesn't expect a lot of events per time unit, or because kevent doesn't have mmap 
 support): be able to syscall and wait for an event.

So the most complex case is when the user is going to use both interfaces,
and the steps taken when the mapped ring buffer has overflowed.
In that case the user can read and mark some events as ready in the ring
buffer (the latter is done through a special syscall), so the kevent
core will put new ready events there.
The user can also get events using the usual syscall; in that case events in the
ring buffer must be updated - and actually I implemented the mapped buffer
in a way which allows removing events from the queue - the queue is a
FIFO, and the first entry to be obtained through the syscall is _always_ the
first entry in the ring buffer.

So when the user reads an event through the syscall (no matter if we are in the
overflow case or not), the event being read is easily accessible in the ring buffer.

So I propose the following design for the ring buffer (quite simple):
kernelspace maintains two indexes - to the first and the last events in
the ring buffer (and the maximum size of the buffer, of course).
When a new event is marked as ready, some info is copied into the ring
buffer and the index of the last entry is increased.
When an event is read through the syscall, it is _guaranteed_ that that 
event will be at the position pointed to by the index of the first
element; that index is then increased (thus opening a new slot in the
buffer).
If the index of the last entry reaches (with possible wrapping) the index of the
first entry, that means that an overflow has happened. In this case no new
events can be copied into the ring buffer, so they are only placed into the
ready queue (accessible through the syscall kevent_get_events()).

When the user calls kevent_get_events() it will obtain the first element
(pointed to by the index of the first element in the ring buffer), and if there
is a ready event which is not placed into the ring buffer, it is
copied (with appropriate update of the last index and the overflow
condition).

When userspace calls kevent_wait(num), it means that userspace marks as
ready the first $num elements (counting from the index of the first element),
which thus can be removed (or requeued) and replaced by pending ready events.

Does it sound like clawing over the glass, or much better?
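
A compact sketch of those rules (names invented; the indexes are free-running
and taken modulo the buffer size):

struct kevent_ring {
        uint32_t first;   /* oldest entry; advanced by kevent_wait(num) commits */
        uint32_t last;    /* next free slot; advanced by the kernel */
        struct mukevent event[RING_SIZE];
};

/* kernel, when an event becomes ready */
if (r->last - r->first == RING_SIZE)
        ;   /* overflow: leave the event in the ready queue only */
else
        r->event[r->last++ % RING_SIZE] = *mu;

/* kevent_get_events() always returns the entry at 'first', so the syscall
 * and the ring stay in step; kevent_wait(num) commits the first num entries: */
r->first += num;   /* kernel may now free/requeue them and refill from the ready queue */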

 Eric

Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Eric Dumazet
On Tuesday 17 October 2006 17:09, Evgeniy Polyakov wrote:
 On Tue, Oct 17, 2006 at 04:25:00PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
  On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote:
   On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet
   ([EMAIL PROTECTED])
 
  wrote:
 What about the case which I described in the other e-mail, when in
 case of a full ring buffer no new events are written there, and
 when userspace commits (i.e. marks as ready to be freed or requeued
 by the kernel) some events, new ones will be copied from the ready queue
 into the buffer?
   
Then, the user might receive 'false events', exactly like
poll()/select()/epoll() can do sometimes. I.e. a 'ready' indication
while there is no current event available on a particular fd /
event_source.
  
   Only if the user simultaneously uses both interfaces and removes an event from
   the queue when its copy was in the mapped buffer, but in that case it's
   the user's problem (and if we do want, we can store a pointer/index of the
   ring buffer entry, so when an event is removed from the ready queue (using
   kevent_get_events()), the appropriate entry in the ring buffer will be
   updated to show that it is no longer valid).
  
This should be safe, since those programs already ignore read()
returning -EAGAIN and other similar things.

A programmer prefers to receive two 'event available' indications rather than
ZERO (and be stuck for an infinite time). Of course, the hot path (normal
cases) should return one 'event' only.

In other words, being ultra fast 99.99 % of the time, but being able
to block forever once in a while, is not an option.
  
   Have I missed something? It looks like the only problematic situation
   is the one described above, when the user simultaneously uses both interfaces.
 
  From my point of view, the user of the 'mmapped ring buffer' should be prepared
  to use both interfaces. Or else you are forced to presize the ring buffer
  to insane limits.
 
  That is:
  - Most of the time, we expect consuming events via the mmapped ring buffer and
  no syscalls.
  - In case we notice a 'mmapped ring buffer overflow', syscalls to
  get/consume events that could not be stored in the mmapped buffer (but were queued
  by the kevent subsystem). If not stored by the kevent subsystem (memory
  failure?), revert to poll() to fetch all 'missed fds' in one go. Go back to
  normal mode.

 kevent uses a smaller amount of memory than epoll() per event, so it is very
 unlikely that it will be impossible to store a new event there while epoll()
 would succeed. The same can be applied to poll(), which allocates the
 whole table in the syscall.

  - In case of an empty ring buffer (or no mmap support at all, because this
  app doesn't expect a lot of events per time unit, or because kevent doesn't
  have mmap support): be able to syscall and wait for an event.

 So the most complex case is when the user is going to use both interfaces,
 and the steps taken when the mapped ring buffer has overflowed.
 In that case user can either read and mark some events as ready in ring
 buffer (the latter is being done through special syscall), so kevent
 core will put there new ready events.
 User can also get events using usual syscall, in that case events in
 ring buffer must be updated - and actually I implemented mapped buffer
 in the way which allows to remove events from the queue - queue is a
 FIFO, and the first entry to be obtained through syscall is _always_ the
 first entry in the ring buffer.

 So when the user reads an event through the syscall (no matter if we are in the
 overflow case or not), the event being read is easily accessible in the ring buffer.

 So I propose following design for ring buffer (quite simple):
 kernelspace maintains two indexes - to the first and the last events in
 the ring buffer (and maximum size of the buffer of course).
 When new event is marked as ready, some info is being copied into ring
 buffer and index of the last entry is increased.
 When event is being read through syscall it is _guaranteed_ that that
 event will be at the position pointed by the index of the first
 element, that index is then increased (thus opening new slot in the
 buffer).
 If the index of the last entry reaches (with possible wrapping) the index of the
 first entry, that means that an overflow has happened. In this case no new
 events can be copied into ring buffer, so they are only placed into
 ready queue (accessible through syscall kevent_get_events()).

 When user calls kevent_get_events() it will obtain the first element
 (pointed by index of the first element in the ring buffer), and if there
 is ready event, which is not placed into the ring buffer, it is
 copied (with appropriate update of the last index and new overflow
 condition).

Well, I'm not sure it's good to do this 'move one event from ready list to slot 
X' one by one, because this event will likely be flushed out of the processor 
cache (because we will have to consume 4096 events before reaching this one). 
I think it's better to batch this kind of 'push XX events' later, XX being small 
enough not to waste CPU cache, and when the ring buffer is empty again.

Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Hans Henrik Happe
On Tuesday 17 October 2006 16:25, Eric Dumazet wrote:
 On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote:
  On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet 
([EMAIL PROTECTED]) 
 wrote:
What about the case which I described in the other e-mail, when in case
of a full ring buffer no new events are written there, and when
userspace commits (i.e. marks as ready to be freed or requeued by the
kernel) some events, new ones will be copied from the ready queue into the
buffer?
  
   Then, the user might receive 'false events', exactly like
   poll()/select()/epoll() can do sometimes. I.e. a 'ready' indication while
   there is no current event available on a particular fd / event_source.
 
  Only if the user simultaneously uses both interfaces and removes an event from the
  queue when its copy was in the mapped buffer, but in that case it's the user's
  problem (and if we do want, we can store a pointer/index of the ring
  buffer entry, so when an event is removed from the ready queue (using
  kevent_get_events()), the appropriate entry in the ring buffer will be
  updated to show that it is no longer valid).
 
   This should be safe, since those programs already ignore read()
   returning -EAGAIN and other similar things.
  
   A programmer prefers to receive two 'event available' indications rather than
   ZERO (and be stuck for an infinite time). Of course, the hot path (normal cases)
   should return one 'event' only.
  
   In other words, being ultra fast 99.99 % of the time, but being able to
   block forever once in a while, is not an option.
 
  Have I missed something? It looks like the only problematic situation is the one
  described above, when the user simultaneously uses both interfaces.
 
 From my point of view, the user of the 'mmapped ring buffer' should be prepared to 
 use both interfaces. Or else you are forced to presize the ring buffer to 
 insane limits.

I don't see why overflow couldn't be handled by a syscall telling the kernel 
that the buffer is ready for new events. As mentioned, most of the time 
overflow should not happen, and if it does the syscall should be amortized 
nicely by the number of events.

 That is:
 - Most of the time, we expect consuming events via the mmapped ring buffer and no 
 syscalls.
 - In case we notice a 'mmapped ring buffer overflow', syscalls to get/consume 
 events that could not be stored in the mmapped buffer (but were queued by the kevent 
 subsystem). If not stored by the kevent subsystem (memory failure?), revert to 
 poll() to fetch all 'missed fds' in one go. Go back to normal mode.
 
 - In case of an empty ring buffer (or no mmap support at all, because this app 
 doesn't expect a lot of events per time unit, or because kevent doesn't have mmap 
 support): be able to syscall and wait for an event.

As I see it there are two main problems with a mmapped ring buffer (correct me 
if I'm wrong):

1. Overflow.
2. Handling multiple kernel events that need only one user event, i.e. multiple 
packets arriving at the same socket. The user should only see one IN event at 
the time he is ready to handle it.

In an earlier post I suggested a scheme that solves these issues. It was based 
on the assumption that kernel and user-space share index variables and can 
read/update them atomically without much overhead. Only in cases where the 
buffer is empty or full would a system call be required.
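
A sketch of that shared-index assumption in C11 terms (names invented; the
kernel side would use its own primitives):

#include <stdatomic.h>
#include <stdint.h>

struct shared_ring {
        _Atomic uint32_t head;   /* consumer-owned: next entry to read */
        _Atomic uint32_t tail;   /* producer-owned: next entry to fill */
        struct mukevent event[RING_SIZE];
};

/* user side: consume without any syscall while entries are available */
void consume_one(struct shared_ring *ring, int ctl_fd)
{
        uint32_t h = atomic_load_explicit(&ring->head, memory_order_relaxed);

        if (h != atomic_load_explicit(&ring->tail, memory_order_acquire)) {
                process(&ring->event[h % RING_SIZE]);   /* acquire: payload is visible */
                atomic_store_explicit(&ring->head, h + 1, memory_order_release);
        } else {
                wait_for_events(ctl_fd);                /* empty: blocking syscall */
        }
}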

Hans Henrik Happe


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 05:32:28PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
  So the most complex case is when the user is going to use both interfaces,
  and the steps taken when the mapped ring buffer has overflowed.
  In that case user can either read and mark some events as ready in ring
  buffer (the latter is being done through special syscall), so kevent
  core will put there new ready events.
  User can also get events using usual syscall, in that case events in
  ring buffer must be updated - and actually I implemented mapped buffer
  in the way which allows to remove events from the queue - queue is a
  FIFO, and the first entry to be obtained through syscall is _always_ the
  first entry in the ring buffer.
 
  So when the user reads an event through the syscall (no matter if we are in the
  overflow case or not), the event being read is easily accessible in the ring buffer.
 
  So I propose following design for ring buffer (quite simple):
  kernelspace maintains two indexes - to the first and the last events in
  the ring buffer (and maximum size of the buffer of course).
  When new event is marked as ready, some info is being copied into ring
  buffer and index of the last entry is increased.
  When event is being read through syscall it is _guaranteed_ that that
  event will be at the position pointed by the index of the first
  element, that index is then increased (thus opening new slot in the
  buffer).
  If the index of the last entry reaches (with possible wrapping) the index of the
  first entry, that means that an overflow has happened. In this case no new
  events can be copied into ring buffer, so they are only placed into
  ready queue (accessible through syscall kevent_get_events()).
 
  When user calls kevent_get_events() it will obtain the first element
  (pointed by index of the first element in the ring buffer), and if there
  is ready event, which is not placed into the ring buffer, it is
  copied (with appropriate update of the last index and new overflow
  condition).
 
 Well, I'm not sure it's good to do this 'move one event from ready list to slot 
 X' one by one, because this event will likely be flushed out of the processor 
 cache (because we will have to consume 4096 events before reaching this one). 
 I think it's better to batch this kind of 'push XX events' later, XX being 
 small enough not to waste CPU cache, and when the ring buffer is empty again.

Ok, that's possible.

 The mmap buffer is good for latency and minimum synchro between the user thread and 
 the kernel producer. But once we hit an 'overflow', it is better to revert to a 
 mode feeding XX events per syscall, to be sure it fits CPU caches: the user 
 thread will do the copy from kernel memory to user memory, and this thread 
 will shortly use those events in userland.

The user can do both - either get events through the syscall, or get them from the
mapped ring buffer when it is refilled.

 BTW, maintaining coherency on the mmap buffer is expensive: once an event is 
 copied to the mmap buffer, the kernel has to issue an smp_mb() before updating the 
 index, so that a user thread won't start to consume an event with random 
 values because its CPU sees the update on the index before the updates on the data.

There will be some tricks with barriers indeed.
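
Roughly what those tricks look like (a sketch: smp_wmb() is a real kernel
barrier primitive, the ring fields are invented, and userspace needs the
matching read barrier by whatever means its platform provides):

/* producer (kernel): payload must be globally visible before the index */
ring->event[idx] = ev;      /* copy the event data first */
smp_wmb();                  /* order the payload before the index update */
ring->last = idx + 1;

/* consumer (userspace): pair it with a read barrier */
last = ring->last;
rmb();                      /* don't read the payload speculatively before the index */
ev = ring->event[mine];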

 Once the whole queue is flushed in an efficient way, we can switch to mmap mode 
 again.
 
 Eric

Ok, there is one apologist for the mmap buffer implementation, who forced me
to create the first implementation, which was dropped due to the absence of
remote mental reading abilities. 
Ulrich, does the above approach sound good to you? 
I actually do not want to reimplement something that will be
pointed to with the words 'no matter what you say, it is broken and I do not 
want it' again :).

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Eric Dumazet
On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote:

 Ok, there is one apologist for mmap buffer implementation, who forced me
 to create first implementation, which was dropped due to absence of
 remote mental reading abilities.
 Ulrich, does above approach sound good for you?
 I actually do not want to reimplement something, that will be
 pointed to with words 'no matter what you say, it is broken and I do not
 want it' again :).

In my humble opinion, you should first write a 'real application', to show how 
the mmap buffer and kevent syscalls would be used (fast path and 
slow/recovery paths). I am sure it would be easier for everybody to agree on 
the API *before* you start coding a *lot* of hard (kernel) stuff: it would 
certainly save your mental CPU cycles (and ours too :) )

This 'real application' could be the event loop of a simple HTTP server, or a 
basic 'echo all' server. Adding the bits about timer events and signals 
should be done too.
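
For illustration, such an event loop might be shaped like this -
kevent_get_events() and kevent_wait() are the syscalls named in this thread,
while everything else (helpers, signatures, semantics) is invented:

for (;;) {
        int consumed = 0;
        struct mukevent *ev;

        /* fast path: drain the mmapped ring, no syscalls */
        while ((ev = next_ring_event(ring)) != NULL) {
                echo_back(ev->id);               /* the 'echo all' work */
                consumed++;
        }
        if (consumed) {
                kevent_wait(ctl_fd, consumed);   /* commit; kernel refills the ring */
                continue;
        }
        if (ring_overflowed(ring))
                drain_ready_queue(ctl_fd);       /* slow path, via kevent_get_events() */
        else
                wait_for_events(ctl_fd);         /* block until something is ready */
}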

Eric


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
 On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote:
 
  Ok, there is one apologist for mmap buffer implementation, who forced me
  to create first implementation, which was dropped due to absence of
  remote mental reading abilities.
  Ulrich, does above approach sound good for you?
  I actually do not want to reimplement something, that will be
  pointed to with words 'no matter what you say, it is broken and I do not
  want it' again :).
 
 In my humble opinion, you should first write a 'real application', to show how
 the mmap buffer and kevent syscalls would be used (fast path and 
 slow/recovery paths). I am sure it would be easier for everybody to agree on 
 the API *before* you start coding a *lot* of hard (kernel) stuff: it would 
 certainly save your mental CPU cycles (and ours too :) )

 This 'real application' could be the event loop of a simple HTTP server, or a
 basic 'echo all' server. Adding the bits about timer events and signals 
 should be done too.

I wrote one with the previous ring buffer implementation - it used timers
and echoed when they fired; it was even described in detail in one of the 
lwn.net articles.

I'm not going to waste others' and my time implementing feature requests
without at least _some_ feedback from those who asked for them.
In case the person who originally requested some feature does not answer
and there are other opinions, only those will be taken into account, of
course.

 Eric

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Eric Dumazet
On Tuesday 17 October 2006 18:35, Evgeniy Polyakov wrote:
 On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
  On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote:
   Ok, there is one apologist for mmap buffer implementation, who forced
   me to create first implementation, which was dropped due to absence of
   remote mental reading abilities.
   Ulrich, does above approach sound good for you?
   I actually do not want to reimplement something, that will be
   pointed to with words 'no matter what you say, it is broken and I do
   not want it' again :).
 
  In my humble opinion, you should first write a 'real application', to
  show how the mmap buffer and kevent syscalls would be used (fast path and
  slow/recovery paths). I am sure it would be easier for everybody to agree
  on the API *before* you start coding a *lot* of hard (kernel) stuff : It
  would certainly save your mental CPU cycles (and ours too :) )
 
  This 'real application' could be  the event loop of a simple HTTP server,
  or a basic 'echo all' server. Adding the bits about timers events and
  signals should be done too.

 I wrote one with the previous ring buffer implementation - it used timers
 and echoed when they fired; it was even described in detail in one of the
 lwn.net articles.

 I'm not going to waste others' and my time implementing feature requests
 without at least _some_ feedback from those who asked for them.
 In case the person who originally requested some feature does not answer
 and there are other opinions, only those will be taken into account, of
 course.

I am not sure I understand what you wrote; English is not our native language.

I think many people gave you feedbacks. I feel that all feedback on this 
mailing list is constructive. Many posts/patches on this list are never 
commented at all.

Eric


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 06:45:54PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
 On Tuesday 17 October 2006 18:35, Evgeniy Polyakov wrote:
  On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
 wrote:
   On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote:
Ok, there is one apologist for mmap buffer implementation, who forced
me to create first implementation, which was dropped due to absence of
remote mental reading abilities.
Ulrich, does above approach sound good for you?
I actually do not want to reimplement something, that will be
pointed to with words 'no matter what you say, it is broken and I do
not want it' again :).
  
   In my humble opinion, you should first write a 'real application', to
   show how the mmap buffer and kevent syscalls would be used (fast path and
   slow/recovery paths). I am sure it would be easier for everybody to agree
   on the API *before* you start coding a *lot* of hard (kernel) stuff : It
   would certainly save your mental CPU cycles (and ours too :) )
  
   This 'real application' could be  the event loop of a simple HTTP server,
   or a basic 'echo all' server. Adding the bits about timers events and
   signals should be done too.
 
  I wrote one with the previous ring buffer implementation - it used timers
  and echoed when they fired; it was even described in detail in one of the
  lwn.net articles.
 
  I'm not going to waste others' and my time implementing feature requests
  without at least _some_ feedback from those who asked for them.
  In case the person who originally requested some feature does not answer
  and there are other opinions, only those will be taken into account, of
  course.
 
 I am not sure I understand what you wrote; English is not our native language.
 
 I think many people gave you feedbacks. I feel that all feedback on this 
 mailing list is constructive. Many posts/patches on this list are never 
 commented at all.

And I do greatly appreciate feedback from those people!

But I do not understand why I never got feedback on the initial design and
implementation (and then created, as far as I recall, at least 10
releases) from Ulrich, who first asked for such a feature. 
So right now I'm waiting for his opinion on that problem, even if it will 
be 'it sucks' again, but at least in that case I will not waste people's time.

Ulrich, could you please comment on the design notes sent a couple of mails
above?

 Eric

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-17 Thread Eric Dumazet

Evgeniy Polyakov wrote:

On Tue, Oct 17, 2006 at 06:45:54PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:

I am not sure I understand what you wrote; English is not our native language.

I think many people gave you feedback. I feel that all feedback on this 
mailing list is constructive. Many posts/patches on this list are never 
commented on at all.


And I do greatly appreciate feedback from those people!

But I do not understand why I never got feedback on the initial design and
implementation (and then created, as far as I recall, at least 10
releases) from Ulrich, who first asked for such a feature. 
So right now I'm waiting for his opinion on that problem, even if it will 
be 'it sucks' again, but at least in that case I will not waste people's time.


Ulrich, could you please comment on the design notes sent a couple of mails
above?



Ulrich is a very busy man. We have to live with that.

<rant_mode>
For example, I *complained* one day that each glibc fopen()/fread()/fclose() 
pass does an mmap()/munmap() to obtain a single 4KB of memory, without any 
cache mechanism. This badly hurts the performance of multi-threaded programs, 
as we know mmap()/munmap() has to down_write(mm->mmap_sem) and play VM games.


So to avoid this, I manually call setvbuf() in my own programs, to provide a 
suitable buffer to glibc, because of its suboptimal default allocation, a 
vestige of an old epoch...

</rant_mode>
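
For illustration, a minimal sketch of that workaround (the file name and the
64KB buffer size are arbitrary choices, not recommendations):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	static char buf[64 * 1024];	/* our own stdio buffer for the stream */
	char line[256];
	FILE *f = fopen("/etc/services", "r");

	if (!f)
		return EXIT_FAILURE;
	/* Hand glibc a suitable buffer before the first I/O operation,
	 * so it does not mmap()/munmap() a 4KB chunk of its own. */
	setvbuf(f, buf, _IOFBF, sizeof(buf));
	while (fgets(line, sizeof(line), f))
		;	/* process each line */
	fclose(f);
	return EXIT_SUCCESS;
}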

Eric



Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Evgeniy Polyakov
On Sun, Oct 15, 2006 at 04:22:45PM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
 Evgeniy Polyakov wrote:
 Existing design does not allow overflow.
 
 And I've pointed out a number of times that this is not practical at 
 best.  There are event sources which can create events which cannot be 
 coalesced into one single event as it would be required with your design.
 
 Signals are one example, specifically realtime signals.  If we do not 
 want the design to be limited from the start this approach has to be 
 thought over.

The whole idea of the mmap buffer seems to be broken, since those who asked
for its creation do not like the existing design and do not show theirs...

Regarding signals and the possibility of overflow in the existing ring buffer
implementation:
You seem to not have checked the code - each event can be marked as ready 
only once, which means only one copy and so on.
It was done _intentionally_. And it is not a limitation, but a new approach.
A queue of the same signals or any other events has a fundamental flaw
(as does any other ring buffer implementation with a bounded queue size) -
the size of the queue and the extremely bad case of overflow.
So, the same event may not be ready several times. Any design which
allows creating an infinite number of events generated for the same case
is broken, since the consumer can end up in a situation where it cannot handle
that flow. That is why poll() returns only POLLIN when data is ready in
the network stack, instead of trying to generate some kind of signal for 
each byte/packet/MTU/MSS received.
RT signals have design problems, and I will not repeat the same error
with similar limits in kevent.
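
In code, the set-once semantics amount to roughly the following (a simplified
sketch: KEVENT_READY and ready_entry are from the posted patch, while the
kevent_user queue fields here are illustrative):

/* Mark a kevent ready at most once; a second trigger while the event
 * is still queued coalesces into the already-ready entry. */
static void kevent_mark_ready(struct kevent_user *u, struct kevent *k)
{
	unsigned long flags;

	spin_lock_irqsave(&u->ready_lock, flags);
	if (!(k->flags & KEVENT_READY)) {
		k->flags |= KEVENT_READY;
		list_add_tail(&k->ready_entry, &u->ready_list);
		u->ready_num++;
	}
	spin_unlock_irqrestore(&u->ready_lock, flags);
}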

 So zap mmap() support completely, since it is not usable at all. We won't 
 discuss it.
 
 Initial implementation did not have it.
 But I was requested to do it, and it is ready now.
 No one likes it, but no one provides an alternative implementation.
 We are stuck.
 
 We need the mapped ring buffer.  The current design (before it was 
 removed) was broken but this does not mean it shouldn't be implemented. 
  We just need more time to figure out how to implement it correctly.

In the latest patchset it was removed. I'm waiting for your code.

Mmap implementation can be added separately, since it does not affect
kevent core.

 -- 
 ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, 
 CA ❖

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Ulrich Drepper

Evgeniy Polyakov wrote:

The whole idea of the mmap buffer seems to be broken, since those who asked
for its creation do not like the existing design and do not show theirs...


What kind of argumentation is that?

   Because my attempt to implement it doesn't work and nobody right
away has a better suggestion this means the idea is broken.

Nonsense.

It just means that time should be spent thinking about this.  You cut 
all this short by rushing out your attempt without any discussions. 
Unfortunately nobody else really looked at the approach, so it lingered 
around for some weeks.  Well, now it is clear that it is not the right 
approach and we can start thinking about it again.



You seem to not have checked the code - each event can be marked as ready 
only once, which means only one copy and so on.

It was done _intentionally_. And it is not a limitation, but a new approach.


I know that it is done deliberately and I tell you that this is wrong 
and unacceptable.  Realtime signals are one event source which needs to have 
more than one event queued.  This is not a description of what you have 
implemented, it's a description of the reality of realtime signals.


RT signals are queued.  They carry a data value (the sigval_t object) 
which can be unique for each signal delivery.  Coalescing the signal 
events therefore leads to information loss.
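
For illustration, this is the standard POSIX API that makes RT signals
non-coalescable; each delivery carries its own payload (the example simply
targets the current process):

#include <signal.h>
#include <unistd.h>

int main(void)
{
	union sigval v;
	int i;

	/* Queue three RT signals; each carries a distinct sigval, so
	 * folding them into one event would lose two payloads. */
	for (i = 0; i < 3; i++) {
		v.sival_int = i;	/* per-delivery data */
		sigqueue(getpid(), SIGRTMIN, v);
	}
	return 0;
}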


Therefore, at the very least for signals we need to have the ability to 
queue more than one event for each event source.  Not having this 
functionality means that signals and likely other types of events cannot 
be implemented using kevent queues.




A queue of the same signals or any other events has a fundamental flaw
(as does any other ring buffer implementation with a bounded queue size) -
the size of the queue and the extremely bad case of overflow.


Of course there are additional problems.  Overflows need to be handled. 
 But this is nothing which is unsolvable.




So, the same event may not be ready several times. Any design which
allows creating an infinite number of events generated for the same case
is broken, since the consumer can end up in a situation where it cannot handle
that flow.


That's complete nonsense.  Again, for RT signals it is very reasonable 
and not broken to have multiple outstanding signals.




That is why poll() returns only POLLIN when data is ready in
the network stack, instead of trying to generate some kind of signal for 
each byte/packet/MTU/MSS received.


It makes no sense to drag poll() into this discussion.  poll() is a very 
limited interface.  The new event handling is supposed to be the 
opposite, namely, usable for all kinds of events.  Arguing that because 
poll() does it like this just means you don't see what a big step is 
needed to get to the goal of a unified event handling.  The shackles of 
poll() must be left behind.




RT signals have design problems, and I will not repeat the same error
with similar limits in kevent.


I don't know what to say.  You claim to be the source of all wisdom in 
OS design.  Maybe you should design your own OS, from the ground up.  I 
wonder how many people would like that, since all your arguments are 
squarely geared towards optimizing the implementation.  But: the 
implementation is irrelevant without users.  The functionality users (= 
programmers) want and need is what must drive the implementation.  And 
RT signals are definitely heavily used and liked by programmers.  You 
have to accept that you are trying to modify an OS which has that functionality 
regardless of how much you hate it and want to fight it.




Mmap implementation can be added separately, since it does not affect
kevent core.


That I doubt very much, and it is why I would not want the kevent stuff 
to go into any released kernel until that detail is resolved.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Evgeniy Polyakov
On Mon, Oct 16, 2006 at 03:16:15AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
 Evgeniy Polyakov wrote:
 The whole idea of mmap buffer seems to be broken, since those who asked
 for creation do not like existing design and do not show theirs...
 
 What kind of argumentation is that?
 
Because my attempt to implement it doesn't work and nobody right
 away has a better suggestion this means the idea is broken.
 
 Nonsense.

Ok, let's reformulate:
My attempt works, but nobody around likes it, so I removed it and will wait
until someone else implements it.

 It just means that time should be spent thinking about this.  You cut 
 all this short by rushing out your attempt without any discussions. 
 Unfortunately nobody else really looked at the approach, so it lingered 
 around for some weeks.  Well, now it is clear that it is not the right 
 approach and we can start thinking about it again.

I talked about it in the last 13 releases of kevent, and _no one_
offered any comments. And now I get 'it is broken, it does not
work, there are problems, we do not want it' and the like. I tried
hard to show that it does work and that the problems shown cannot happen, but
still no one hears me. Since I think it is not an interface which is
100% required for correct functionality, I removed it. When there are
better suggestions and implementations we can return to them, of course.

 You seem to not have checked the code - each event can be marked as ready 
 only once, which means only one copy and so on.
 It was done _intentionally_. And it is not a limitation, but a new approach.
 
 I know that it is done deliberately and I tell you that this is wrong 
 and unacceptable.  Realtime signals are one event source which needs to have 
 more than one event queued.  This is not a description of what you have 
 implemented, it's a description of the reality of realtime signals.
 
 RT signals are queued.  They carry a data value (the sigval_t object) 
 which can be unique for each signal delivery.  Coalescing the signal 
 events therefore leads to information loss.
 
 Therefore, at the very least for signals we need to have the ability to 
 queue more than one event for each event source.  Not having this 
 functionality means that signals and likely other types of events cannot 
 be implemented using kevent queues.

Well, my point about rt-signals is that they do not deserve to be
resurrected, but it is only my point :)
In case they are still used, each signal setup should create an event - many
signals mean many events; each signal can be sent with different
parameters, and each event should correspond to one unique case.

 A queue of the same signals or any other events has a fundamental flaw
 (as does any other ring buffer implementation with a bounded queue size) -
 the size of the queue and the extremely bad case of overflow.
 
 Of course there are additional problems.  Overflows need to be handled. 
  But this is nothing which is unsolvable.

I strongly disagree that having a design which allows overflows is
acceptable - do we really want rt-signal queue overflow problems in a new
place? Instead, some complex allocation scheme can be created.

 So, the same event may not be ready several times. Any design which
 allows creating an infinite number of events generated for the same case
 is broken, since the consumer can end up in a situation where it cannot handle
 that flow.
 
 That's complete nonsense.  Again, for RT signals it is very reasonable 
 and not broken to have multiple outstanding signals.

The same signal with a different payload is acceptable, but when the number of
them exceeds the ulimit and they start to be forgotten - that's what
I call a broken design.

 That is why poll() returns only POLLIN when data is ready in
 the network stack, instead of trying to generate some kind of signal for 
 each byte/packet/MTU/MSS received.
 
 It makes no sense to drag poll() into this discussion.  poll() is a very 
 limited interface.  The new event handling is supposed to be the 
 opposite, namely, usable for all kinds of events.  Arguing that because 
 poll() does it like this just means you don't see what a big step is 
 needed to get to the goal of a unified event handling.  The shackles of 
 poll() must be left behind.

Kevent is that subsystem, and for now it works quite well.

 RT signals have design problems, and I will not repeat the same error
 with similar limits in kevent.
 
 I don't know what to say.  You claim to be the source of all wisdom in 
 OS design.  Maybe you should design your own OS, from the ground up.  I 
 wonder how many people would like that since all your arguments are 
 squarely geared towards optimizing the implementation.  But: the 
 implementation is irrelevant without users.  The functionality users (= 
 programmers) want and need is what must drive the implementation.  And 
 RT signals are definitely heavily used and liked by programmers.  You 
 have to accept that you are trying to modify an OS which has that functionality 
 regardless of how 

Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Johann Borck
Ulrich Drepper wrote:
 Evgeniy Polyakov wrote:
 Existing design does not allow overflow.

 And I've pointed out a number of times that this is not practical at
 best.  There are event sources which can create events which cannot be
 coalesced into one single event as it would be required with your design.

 Signals are one example, specifically realtime signals.  If we do not
 want the design to be limited from the start this approach has to be
 thought over.


  So zap mmap() support completely, since it is not usable at all. We
  won't discuss it.

 Initial implementation did not have it.
 But I was requested to do it, and it is ready now.
 No one likes it, but no one provides an alternative implementation.
 We are stuck.

 We need the mapped ring buffer.  The current design (before it was
 removed) was broken but this does not mean it shouldn't be
 implemented.  We just need more time to figure out how to implement it
 correctly.

Considering the 'if at all' and 'if so, how' of the ring buffer implementation,
I'd like to throw in some ideas I had when reading the discussion and the
respective code. If I understood Ulrich Drepper right, his notion of a
generic event handling interface is that it has to be flexible enough
to transport additional info from origin to userspace, and to support
queuing of events from the same origin, so that additional
per-event-occurrence data doesn't get lost, which would happen when
coalescing multiple events into one until delivery. From what I read, he
says the ring buffer is broken because of insufficient space for additional
data (mukevent) and the limited number of events that can be put into the
ring buffer. Another argument is the missing notification of userspace about
dropped events in case the ring buffer limit is reached. (Is that right?)
I see no reason why kevent couldn't be modified to fit (all) these
needs. While modifying the server example and writing a client using
kevent, I came across the coalescing problem: there were more incoming
connections than accept events, and I had to work around that. In this
case the pure number of coalesced events would suffice, while it
wouldn't for the example of RT signals that Ulrich Drepper gave. So whether
coalescing can be done at all, or is impossible, depends on the type
of event. The same goes for additional data delivered with the events.
There might be no panacea for all possible scenarios with one fixed
design. Either performance suffers for 'lightweight' events (those which
don't need additional data, where coalescing is not problematic, or where
no ring buffer is needed), or kevent is not usable for other types of events.
Why not treat different things differently, and let the (kernel-)user decide?
I don't know if I got all this right, but if so, then the ring buffer is needed
especially for cases where coalescing is not possible and additional
data has to be delivered for each triggered notification (so the pure
number of events is not enough; other reasons? performance?). To me it
doesn't make sense to have kevent fill memory and use processor time if the
buffer is not used at all, which is the case when using kevent_get_events.
So here are my ideas:
Make usage of the ring buffer optional; if it is not required for a specific
event type, that might be chosen by userspace code.
Make the limit of events in the ring buffer optional and controllable from
userspace.
Regarding mukevent, I'm thinking of an event-type-specific struct that is
filled by the originating code and placed into a per-event-type ring
buffer (which requires modification of kevent_wait). To my limited
understanding it seems that alternative or modified versions of
kevent_storage_ready, (__)kevent_requeue and kevent_user_ring_add_event
could return a void pointer to the position in the buffer, and all kevent
has to know about is the size of the struct (see the sketch below).
If coalescing doesn't hurt for a specific event type, it might just be
modified to notify userspace about the number of coalesced events. Make
it depend on the type of event.
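
A rough sketch of that idea (all names and the fixed-size slot layout are
hypothetical, not part of the posted patch):

/* One ring per event type; the slot size is fixed per type, so the
 * core only needs the struct size to hand out a position. */
struct kevent_type_ring {
	unsigned int	slot_size;	/* sizeof the type-specific struct */
	unsigned int	num_slots;
	unsigned int	head, tail;	/* kernel produces, user consumes */
	void		*slots;		/* page-backed, mmap'ed to userspace */
};

/* Return a pointer to the next free slot for the originating code to
 * fill in, or NULL if this type's ring is currently full. */
static void *type_ring_add_event(struct kevent_type_ring *r)
{
	void *pos;

	if ((r->head + 1) % r->num_slots == r->tail)
		return NULL;	/* full: fall back to coalescing/counting */
	pos = (char *)r->slots + r->head * r->slot_size;
	r->head = (r->head + 1) % r->num_slots;
	return pos;
}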

I know this doesn't address all objections that have been made, and
Evgeniy, big sorry for this being just talk again, and maybe not even
applicable for reasons I have overlooked, but maybe it's worth
consideration. I'll gladly try to put that into code, and see where it
leads. I think kevent is great, and if things can be done to increase
its genericity without sacrificing performance, why not.
Sorry for the length of the post and the repetitions,

Johann


Re: [take19 1/4] kevent: Core files.

2006-10-15 Thread Ulrich Drepper

Evgeniy Polyakov wrote:

Existing design does not allow overflow.


And I've pointed out a number of times that this is not practical at 
best.  There are event sources which can create events which cannot be 
coalesced into one single event as it would be required with your design.


Signals are one example, specifically realtime signals.  If we do not 
want the design to be limited from the start this approach has to be 
thought over.



So zap mmap() support completely, since it is not usable at all. We won't 
discuss it.


Initial implementation did not have it.
But I was requested to do it, and it is ready now.
No one likes it, but no one provides an alternative implementation.
We are stuck.


We need the mapped ring buffer.  The current design (before it was 
removed) was broken but this does not mean it shouldn't be implemented. 
 We just need more time to figure out how to implement it correctly.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take19 1/4] kevent: Core files.

2006-10-05 Thread Evgeniy Polyakov
On Wed, Oct 04, 2006 at 10:57:32AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
 On 10/3/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:
 http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
 http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c
 
 These are simple programs which by themselves have problems.  For
 instance, I consider a very bad idea to hardcode the size of the ring
 buffer.  Specifying macros in the header file counts as hardcoding.
 Systems grow over time and so will the demand of connections.  I have
 no problem with the kernel hardcoding the value internally (or having
 a /proc entry to select it) but programs should be able to dynamically
 learn about the value so they don't have to be recompiled.

Well, it is possible to create a /sys/proc entry for that, and even now 
userspace can grow the mapping ring until it is forbidden by the kernel, which
means the limit is reached.

Actually, the whole idea of a global limit on kevents does not sound very
good to me, but it is required to prevent overflow in the mapped buffer.

 But more problematic is that I don't see how the interfaces can be
 efficiently used in multi-threaded (or multi-process) programs.  How
 would multiple threads using the same kevent queue and running in the
 same kevent_get_events() loop work out?  How do they guarantee that
 each request is only handled once?

kqueue_dequeue_ready() is atomic, and this function removes the kevent from
the ready queue so another thread cannot get it.

 From what I see now this means a second data structure is needed to
 keep track of the state of each entry.  But even then, how do we even
 recognized used ring buffer entries?
 
 For instance, assume two threads.  Both call get_events, one event is
 reported, both threads are woken up (which is another thing to
 consider, more later).  One thread uses ring buffer entry, the other
 goes back to sleep in get_events.  Now, how does the kernel know when
 the other thread is done working on the ring buffer entry?  There
 might be lots of entries coming in overflowing the entire buffer.
 Heck, you don't even need two threads for this scenario.

Are you talking about the mapped buffer or the syscall interface?
The former has a special syscall, kevent_wait(), which reports the number of
'processed' events and the first processed number, so the kernel can remove all
appropriate events. The latter is described above -
kqueue_dequeue_ready() is atomic, so the event will be removed from the
ready queue and optionally from the whole kevent tree.

It is possible to work with both interfaces at the same time, since the
mapped buffer contains a copy of the event, which is potentially freed
and processed by another thread. 
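
A hypothetical userspace consumption loop under that scheme could look like
this (the kevent_wait() argument convention, first index plus count of
processed entries, is taken from the description above; everything else,
including the 'ready' field of mukevent, is assumed):

extern void process_event(struct mukevent *e);
extern int kevent_wait(int fd, unsigned int first, unsigned int num);

/* Drain the mapped ring, then tell the kernel which entries were
 * consumed so it can remove the corresponding kevents. */
static void drain_ring(int kevent_fd, struct mukevent *ring,
		       unsigned int ring_size, unsigned int *idx)
{
	unsigned int first = *idx, num = 0;

	while (num < ring_size && ring[*idx].ready) {
		process_event(&ring[*idx]);
		*idx = (*idx + 1) % ring_size;
		num++;
	}
	if (num)
		kevent_wait(kevent_fd, first, num);
}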

Actually, I do not like the idea of the mapped ring anyway: if an application 
uses a lot of events, it will batch them into big chunks, so the syscall 
overhead is negligible; if an application uses a small number of events, 
syscalls will be rare and will not hurt performance.

 When I was thinking about this (and discussing it in Ottawa) I was
 always assuming that we have a status field in the ring buffer entry
 which lets the userlevel code indicate whether the entry is free again
 or not.  This requires a writable mapping, yes, and potentially causes
 cache line ping-pong.  I think Zach mentioned he has some ideas about
 this.

As far as I can see, there are no other ideas on how to implement the ring
buffer, so I did it the way I wanted. It has some limitations indeed, but
since I do not see any other code, how can I say what is better or
worse?
 
 As for the multiple thread wakeup, I mentioned this before.  We have
 to avoid the trampling herd problem.  We cannot wakeup all waiters.
 But we also cannot assume that, without protocols, waking up just one
 for each available entry is sufficient.  So the first question is:
 what is the current policy?

It is a good practice to _not_ share the same queue between a lot of
threads. Currently all waiters are awakened.

 AIO was removed from patchset by request of Cristoph.
 Timers, network AIO, fs AIO, socket nortifications and poll/select
 events work well with existing structures.
 
 Well, excuse me if I don't take your word for it.  I agree, the AIO
 code should not be submitted along with this.  The same for any other
 code using the event handling.  But we need to check whether the
 interface is generic enough to accomodate them in a way which actually
 makes sense.  Again, think highly threaded processes or multiple
 processes sharing the same event queue.

You missed the point.
I implemented _all_ of the above and it does work,
although it was removed from the submission patchset.
You can find all the patches on the kevent homepage; they were posted to lkml@
and netdev@ too many times to miss them.
 
 It is even possible to create variable-sized kevents - each kevent
 contains a pointer to the user's data, which can be considered a pointer to
 an additional area (its size the kernel implementation for a given kevent type
 can determine from other parameters or use 

Re: [take19 1/4] kevent: Core files.

2006-10-05 Thread Eric Dumazet
On Thursday 05 October 2006 10:57, Evgeniy Polyakov wrote:

 Well, it is possible to create /sys/proc entry for that, and even now
 userspace can grow mapping ring until it is forbiden by kernel, which
 means limit is reached.

No need for yet another /sys/proc entry.

Right now, I (for example) may have a use for Generic event handling, but for 
a program that needs XXX.XXX handles, and about XX.XXX events per second.

Right now, this program uses epoll, and reaches no limit at all, once you pass 
the ulimit -n, and other kernel wide tunes of course, not related to epoll.

With your current kevent, I cannot switch to it, because of hardcoded limits.

I may be wrong, but what is currently missing for me is :

- No hardcoded limit on the max number of events. (A process that can open 
XXX.XXX files should be allowed to open a kevent queue with at least XXX.XXX 
events). Right now it's not clear what happens IF the current limit is 
reached.

- In order to avoid touching the whole ring buffer, it might be good to be 
able to reset the indexes to the beginning when the ring buffer is empty. (So if 
userland is responsive enough to consume events, only the first pages of the 
mapping would be used: that saves L1/L2 CPU caches.)
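
A tiny sketch of that policy on the consumer side (the index fields are
hypothetical):

struct ring {
	unsigned int head, tail, num_slots;
};

/* Consume one entry; when the ring drains completely, rewind both
 * indexes so the next burst starts on the cache-warm first pages. */
static void ring_consume(struct ring *r)
{
	r->tail = (r->tail + 1) % r->num_slots;
	if (r->tail == r->head)		/* ring is now empty */
		r->tail = r->head = 0;	/* reset to the beginning */
}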

A plus would be

- A working/usable mmap ring buffer implementation, but I think it's not 
mandatory. System calls are not that expensive, especially if you can batch 
XX events per syscall (like epoll). The nice thing with a ring buffer is that we 
touch fewer cache lines than, say, epoll, which has a lot of linked structures.

About mmap, I think you might want a hybrid thing:

One writable page where userland can write its index (and hold one or more 
futexes shared with the kernel), with appropriate thread locking in case multiple 
threads want to dequeue events. In the fast path, no syscalls are needed to 
maintain this user index.

XXX read-only pages (for the user, but r/w for the kernel), where the kernel 
writes its own index, and the events of course.

Using separate cache lines avoids false sharing: the kernel can update its own 
index and events without having to pay the price of cache line ping-pong.
It could use the futex infrastructure to wake up one thread 'only', instead of all 
threads waiting for an event.
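
A sketch of that layout (sizes and field names are illustrative only):

#define CACHE_LINE 64

/* Page 0: writable by userland, read by the kernel. */
struct kevent_user_page {
	unsigned int	user_index;	/* last entry consumed by userland */
	unsigned int	wait_futex;	/* kernel wakes one waiter on this */
} __attribute__((aligned(CACHE_LINE)));

/* Following pages: read-only for userland, written by the kernel.
 * The kernel index and the events live on separate cache lines, so
 * userland writes to user_index never ping-pong these lines. */
struct kevent_kernel_pages {
	unsigned int	kernel_index __attribute__((aligned(CACHE_LINE)));
	struct mukevent	events[];	/* ring of reported events */
};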


Eric


Re: [take19 1/4] kevent: Core files.

2006-10-05 Thread Eric Dumazet
On Thursday 05 October 2006 12:55, Evgeniy Polyakov wrote:
 On Thu, Oct 05, 2006 at 12:45:03PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
 
  What is missing or not obvious is: if events are skipped because of
  overflows, what happens? Connections stuck forever? Hope that
  everything will restore itself? Is the kernel able to SIGNAL this problem to
  user land?

 Existing code does not overflow by design, but it can consume a lot of
 memory. I talked about the case when there will be some limit on the
 number of entries put into the mapped buffer.

You still don't answer my question. Please answer the question.
Recap: you have a max of  events queued. A network message comes and 
the kernel wants to add another event. It cannot because the limit is reached. How does 
the user program know that this problem was hit?


 It is the same.
 What if the ring buffer was grown up to 3 entries, and is now empty, and we
 need to put 4 entries there? Grow it again?
 It can be done easily, but it looks like a workaround, not a solution.
 And it is highly unlikely that in a situation when there are a lot of
 events the ring can be empty.

I don't speak of re-allocation of the ring buffer. I don't mind allocating a 
big enough buffer at startup.

Say you have allocated a ring buffer of 1024*1024 entries.
Then you queue 100 events per second, and dequeue them immediately.
No need to blindly use all 1024*1024 slots in the ring buffer, doing 
index = (index+1)%(1024*1024)



  epoll() does not have mmap.
  The problem is not how many events can be put into the kernel, but how
  many of them can be put into the mapped buffer.
  There is no problem if mmap is turned off.

So zap mmap() support completely, since it is not usable at all. We won't 
discuss it.


Re: [take19 1/4] kevent: Core files.

2006-10-05 Thread Evgeniy Polyakov
On Thu, Oct 05, 2006 at 02:09:31PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
 On Thursday 05 October 2006 12:55, Evgeniy Polyakov wrote:
  On Thu, Oct 05, 2006 at 12:45:03PM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
  
   What is missing or not obvious is: if events are skipped because of
   overflows, what happens? Connections stuck forever? Hope that
   everything will restore itself? Is the kernel able to SIGNAL this problem to
   user land?
 
  Existing code does not overflow by design, but it can consume a lot of
  memory. I talked about the case when there will be some limit on the
  number of entries put into the mapped buffer.
 
 You still don't answer my question. Please answer the question.
 Recap: you have a max of  events queued. A network message comes and 
 the kernel wants to add another event. It cannot because the limit is reached. How does 
 the user program know that this problem was hit?

Existing design does not allow overflow.
If an event was added into the queue (like a user-requested notification
when new data has arrived), it is guaranteed that there will be a place to
put that event into the mapped buffer when it is ready.

If the user wants to add another event (for example, after accept() the user
wants to add another socket with a request for notification about data
arrival on that socket), that can fail though. This limit is introduced
only because of the mmap buffer.
 
  It is the same.
  What if the ring buffer was grown up to 3 entries, and is now empty, and we
  need to put 4 entries there? Grow it again?
  It can be done easily, but it looks like a workaround, not a solution.
  And it is highly unlikely that in a situation when there are a lot of
  events the ring can be empty.
 
 I don't speak of re-allocation of the ring buffer. I don't mind allocating a 
 big enough buffer at startup.
 
 Say you have allocated a ring buffer of 1024*1024 entries.
 Then you queue 100 events per second, and dequeue them immediately.
 No need to blindly use all 1024*1024 slots in the ring buffer, doing 
 index = (index+1)%(1024*1024)

But what if they are not dequeued immediately? What if the rate is high and,
while one tries to dequeue, the system adds more events?

  epoll() does not have mmap.
  The problem is not how many events can be put into the kernel, but how
  many of them can be put into the mapped buffer.
  There is no problem if mmap is turned off.
 
 So zap mmap() support completely, since it is not usable at all. We won't 
 discuss it.

Initial implementation did not have it.
But I was requested to do it, and it is ready now.
No one likes it, but no one provides an alternative implementation.
We are stuck.

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-05 Thread Hans Henrik Happe
On Thursday 05 October 2006 12:21, Evgeniy Polyakov wrote:
 On Thu, Oct 05, 2006 at 11:56:24AM +0200, Eric Dumazet ([EMAIL PROTECTED]) 
wrote:
  On Thursday 05 October 2006 10:57, Evgeniy Polyakov wrote:
  
   Well, it is possible to create a /sys/proc entry for that, and even now
   userspace can grow the mapping ring until it is forbidden by the kernel, which
   means the limit is reached.
  
  No need for yet another /sys/proc entry.
  
  Right now, I (for example) may have a use for Generic event handling, but
  for a program that needs XXX.XXX handles, and about XX.XXX events per second.
  
  Right now, this program uses epoll, and reaches no limit at all, once you
  pass the ulimit -n, and other kernel wide tunes of course, not related to
  epoll.
  
  With your current kevent, I cannot switch to it, because of hardcoded
  limits.
  
  I may be wrong, but what is currently missing for me is :
  
  - No hardcoded limit on the max number of events. (A process that can open
  XXX.XXX files should be allowed to open a kevent queue with at least
  XXX.XXX events). Right now it's not clear what happens IF the current limit
  is reached.
 
This forces overflows in a fixed-size memory-mapped buffer.
If we remove the memory-mapped buffer or allow overflows (and
thus skipped entries), kevent can easily scale to those limits (tested with
xx.xxx events though).
 
  - In order to avoid touching the whole ring buffer, it might be good to be
  able to reset the indexes to the beginning when the ring buffer is empty. (So
  if userland is responsive enough to consume events, only the first pages of
  the mapping would be used: that saves L1/L2 CPU caches.)
 
 And what happens when there are 3 empty entries at the beginning and we need to
 put 4 ready events there?

Couldn't there be 3 areas in the mmap buffer:

- Unused: entries that the kernel can alloc from.
- Alloced: entries alloced by the kernel but not yet used by the user. The kernel can 
update these if new events require that.
- Consumed: entries that the user is processing.

The user takes a set of alloced entries and makes them consumed. Then it 
processes the events, after which it makes them unused. 

If there are no unused entries and the kernel needs some, it has to wait for free 
entries. The user has to notify it when unused entries become available. It 
could set a flag in the mmap'ed area to avoid unnecessary wakeups.
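
A sketch of those states as a per-entry field (all names are hypothetical, and
the 'state' member assumes a writable mapping):

enum entry_state {
	ENTRY_UNUSED,	/* kernel may allocate from these */
	ENTRY_ALLOCED,	/* filled by the kernel, not yet picked up */
	ENTRY_CONSUMED,	/* currently being processed by the user */
};

extern void process_event(struct mukevent *e);

/* User side: claim a batch of alloced entries, process them, then
 * hand them back so the kernel can reuse the slots. */
static void consume_batch(struct mukevent *ring, unsigned int from,
			  unsigned int n)
{
	unsigned int i;

	for (i = from; i < from + n; i++)
		ring[i].state = ENTRY_CONSUMED;	/* alloced -> consumed */
	for (i = from; i < from + n; i++) {
		process_event(&ring[i]);
		ring[i].state = ENTRY_UNUSED;	/* consumed -> unused */
	}
	/* If the kernel flagged that it is waiting for free entries,
	 * notify it here (e.g. via a syscall or futex wake). */
}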

There are some details with indexing and wakeup notification that I have left 
out, but I hope my idea is clear. I could give a more detailed description if 
requested. Also, I'm a user-level programmer, so I might not get the whole 
picture.

Hans Henrik Happe


Re: [take19 1/4] kevent: Core files.

2006-10-05 Thread Evgeniy Polyakov
On Thu, Oct 05, 2006 at 04:01:19PM +0200, Hans Henrik Happe ([EMAIL PROTECTED]) 
wrote:
  And what happens when there are 3 empty entries at the beginning and we need to
  put 4 ready events there?
 
 Couldn't there be 3 areas in the mmap buffer:
 
 - Unused: entries that the kernel can alloc from.
 - Alloced: entries alloced by the kernel but not yet used by the user. The kernel can 
 update these if new events require that.
 - Consumed: entries that the user is processing.
 
 The user takes a set of alloced entries and makes them consumed. Then it 
 processes the events, after which it makes them unused. 
 
 If there are no unused entries and the kernel needs some, it has to wait for 
 free entries. The user has to notify it when unused entries become available. It 
 could set a flag in the mmap'ed area to avoid unnecessary wakeups.
 
 There are some details with indexing and wakeup notification that I have left 
 out, but I hope my idea is clear. I could give a more detailed description if 
 requested. Also, I'm a user-level programmer, so I might not get the whole 
 picture.

This looks good on paper, but how can you put it into page-based
storage without major and complex shared structures, which would have to be
properly locked between kernelspace and userspace?

 Hans Henrik Happe

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-04 Thread Ulrich Drepper

On 9/20/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

This patch includes core kevent files:
[...]


I tried to look at the example programs before and failed.  I tried
again.  Where can I find up-to-date example code?

Some other points:

- I really would prefer not to rush all this into the upstream kernel.
The main problem is that the ring buffer interface is a shared data
structure.  These are always tricky.  We need to find the right
combination between size (as small as possible) and supporting all the
interfaces.

- so far only the timer and aio notifications are speced out.  What
about the rest?  Are we sure all aspects can be expressed?  I am not
yet.

- we need an interface to add an event from userlevel.  I.e., we need
to be able to synthesize events.  There are events (like, for instance
the async DNS functionality) which come from userlevel code.

I would very much prefer we look at the other events before setting
the data structures in stone.


Re: [take19 1/4] kevent: Core files.

2006-10-04 Thread Evgeniy Polyakov
On Tue, Oct 03, 2006 at 11:34:02PM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
 On 9/20/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:
 This patch includes core kevent files:
 [...]
 
 I tried to look at the example programs before and failed.  I tried
 again.  Where can I find up-to-date example code?

http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c

The structures have not changed since the beginning of the kevent project.

 Some other points:
 
 - I really would prefer not to rush all this into the upstream kernel.
 The main problem is that the ring buffer interface is a shared data
 structure.  These are always tricky.  We need to find the right
 combination between size (as small as possible) and supporting all the
 interfaces.

The mmap interface itself is in question, since it allows one to create a DoS,
given that there are no rlimits for pinned memory.

 - so far only the timer and aio notification is speced out.  What
 about the rest?  Are we sure all aspects can be expressed?  I am not
 yet.

AIO was removed from the patchset at the request of Christoph.
Timers, network AIO, fs AIO, socket notifications and poll/select
events work well with the existing structures.

 - we need an interface to add an event from userlevel.  I.e., we need
 to be able to synthesize events.  There are events (like, for instance
 the async DNS functionality) which come from userlevel code.
 
 I would very much prefer we look at the other events before setting
 the data structures in stone.

Signals and userspace events (hello Solaris) easily fit into the existing
structures.

It is even possible to create variable-sized kevents - each kevent
contains a pointer to the user's data, which can be considered a pointer to
an additional area (its size the kernel implementation for a given kevent type
can determine from other parameters, or it can use a predefined one and fetch
additional data in the ->enqueue() callback).
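
A sketch of that idea for a hypothetical event type (the field names and the
payload struct are illustrative, not from the posted patch):

struct my_timer_payload {
	unsigned int	period_msec;	/* type-specific additional data */
};

/* ->enqueue() for this type knows its payload size in advance and
 * fetches the additional data through the user-supplied pointer. */
static int my_timer_enqueue(struct kevent *k)
{
	struct my_timer_payload payload;

	if (copy_from_user(&payload, (void __user *)k->event.ptr,
			   sizeof(payload)))
		return -EFAULT;
	/* ... arm the timer from payload.period_msec ... */
	return 0;
}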

-- 
Evgeniy Polyakov


Re: [take19 1/4] kevent: Core files.

2006-10-04 Thread Ulrich Drepper

On 10/3/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c


These are simple programs which by themselves have problems.  For
instance, I consider it a very bad idea to hardcode the size of the ring
buffer.
Systems grow over time and so will the demand of connections.  I have
no problem with the kernel hardcoding the value internally (or having
a /proc entry to select it) but programs should be able to dynamically
learn about the value so they don't have to be recompiled.

But more problematic is that I don't see how the interfaces can be
efficiently used in multi-threaded (or multi-process) programs.  How
would multiple threads using the same kevent queue and running in the
same kevent_get_events() loop work out?  How do they guarantee that
each request is only handled once?


From what I see now this means a second data structure is needed to
keep track of the state of each entry.  But even then, how do we even
recognize used ring buffer entries?

For instance, assume two threads.  Both call get_events, one event is
reported, both threads are woken up (which is another thing to
consider, more later).  One thread uses ring buffer entry, the other
goes back to sleep in get_events.  Now, how does the kernel know when
the other thread is done working on the ring buffer entry?  There
might be lots of entries coming in overflowing the entire buffer.
Heck, you don't even need two threads for this scenario.

When I was thinking about this (and discussing it in Ottawa) I was
always assuming that we have a status field in the ring buffer entry
which lets the userlevel code indicate whether the entry is free again
or not.  This requires a writable mapping, yes, and potentially causes
cache line ping-pong.  I think Zach mentioned he has some ideas about
this.
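
A sketch of such a status field (values and names are illustrative):

/* Each ring entry carries a status word; the kernel reuses an entry
 * only after userlevel marks it free again. This requires the ring
 * to be mapped writable, as discussed. */
enum { RING_ENTRY_FREE = 0, RING_ENTRY_BUSY = 1 };

struct ring_entry {
	volatile unsigned int	status;
	struct mukevent		ev;
};

/* Userlevel, after handling the event: */
static void release_entry(struct ring_entry *e)
{
	__sync_synchronize();	/* order reads of e->ev before the release */
	e->status = RING_ENTRY_FREE;
}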


As for the multiple thread wakeup, I mentioned this before.  We have
to avoid the thundering herd problem.  We cannot wake up all waiters.
But we also cannot assume that, without protocols, waking up just one
for each available entry is sufficient.  So the first question is:
what is the current policy?



AIO was removed from the patchset at the request of Christoph.
Timers, network AIO, fs AIO, socket notifications and poll/select
events work well with the existing structures.


Well, excuse me if I don't take your word for it.  I agree, the AIO
code should not be submitted along with this.  The same for any other
code using the event handling.  But we need to check whether the
interface is generic enough to accommodate them in a way which actually
makes sense.  Again, think highly threaded processes or multiple
processes sharing the same event queue.



It is even possible to create variable-sized kevents - each kevent
contains a pointer to the user's data, which can be considered a pointer to
an additional area (its size the kernel implementation for a given kevent type
can determine from other parameters, or it can use a predefined one and fetch
additional data in the ->enqueue() callback).


That sounds interesting and certainly helps with securing the
interface for the future.  But if there is anything we can do to avoid
unnecessary costs we should do it, even if this means investigating
all this further.


[take19 1/4] kevent: Core files.

2006-09-20 Thread Evgeniy Polyakov

Core files.

This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..c10698e 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,6 @@ ENTRY(sys_call_table)
.long sys_tee   /* 315 */
.long sys_vmsplice
.long sys_move_pages
+   .long sys_kevent_get_events
+   .long sys_kevent_ctl
+   .long sys_kevent_wait   /* 320 */
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..a06b76f 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -710,7 +710,10 @@ #endif
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
-   .quad sys_tee
+   .quad sys_tee   /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
+   .quad sys_kevent_get_events
+   .quad sys_kevent_ctl
+   .quad sys_kevent_wait   /* 320 */
 ia32_syscall_end:  
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..68072b5 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,13 @@ #define __NR_sync_file_range  314
 #define __NR_tee   315
 #define __NR_vmsplice  316
 #define __NR_move_pages317
+#define __NR_kevent_get_events 318
+#define __NR_kevent_ctl319
+#define __NR_kevent_wait   320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 321
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..ee907ad 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice 278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait   282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 000..24ced10
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,195 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+   kevent_callback_t   callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY   0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER0x4
+
+struct kevent
+{
+   /* Used for kevent freeing.*/
+   struct rcu_head rcu_head;
+   struct ukevent  event;
+   /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+   spinlock_t  ulock;
+
+   /* Entry of user's tree. */
+   struct rb_node  kevent_node;
+   /* Entry of origin's queue. */
+   struct list_headstorage_entry;
+   /* Entry of user's ready. */
+   struct list_headready_entry;
+
+   u32 flags;
+
+   /* User who requested this kevent. */
+