Re: Buffer management - interesting idea

2001-06-15 Thread Ivan Schreter

Hello,

please CC: replies to me, since I'm not subscribed to the list.

On Fri, 15 Jun 2001 15:50:33 -0300 (BRST)
Rik van Riel <[EMAIL PROTECTED]> wrote:

> On Fri, 15 Jun 2001, Ivan Schreter wrote:
>> In 2Q, when a page is present in LRU queue, you move it to the front of
>> [...]

> This description has me horribly confused. Do you have
> any pointers to state diagrams of this thing? ;)

Well, I posted a URL to the complete 2Q paper earlier; here it is again:

http://citeseer.nj.nec.com/63909.html

(click the PS or PDF download link in the top right corner).

In general, it works like LRU, but pages make it into the LRU only after
going through a FIFO. A page which is requested only once (or a few times
in a row) passes through this FIFO and is swapped out. When a page is
actively used, it passes through the FIFO, and on a new request it is
loaded not into the FIFO but into the LRU.

Since the FIFO is small relative to the LRU (10% or so), you don't waste
buffer space for a long time on pages used only once (or a few times)
which are not needed anymore (as with a big find, cat, dd, etc.).

The trick is determining whether a page should be loaded into the FIFO or
into the LRU at swap-in. This is where another queue comes in - the
original paper calls it A1out. This queue (a cyclic buffer, or whatever)
is another FIFO which stores the INDICES of physical pages that were
swapped out of the FIFO queue (and thus lived only shortly). When a page
is found in this A1out queue, it is put into the LRU (it is a "hot" page)
and will live longer there. The A1out queue size is a matter of
experimentation; I used 50% of the memory buffer count and it performed
well.
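A minimal sketch of such an A1out ring, with made-up names and a toy size
(this is an illustration, not code from the paper or from my program; a
real one would track thousands of indices and need a faster membership
test):

```c
#include <assert.h>

#define A1OUT_SIZE 8            /* toy size; I used ~50% of buffer count */

static int a1out[A1OUT_SIZE];   /* IDs of pages recently evicted from the FIFO */
static int a1out_pos;           /* next slot to overwrite (the oldest entry) */

static void a1out_init(void)
{
    for (int i = 0; i < A1OUT_SIZE; ++i)
        a1out[i] = -1;          /* -1 = empty slot; valid page IDs are >= 0 */
}

/* Remember a page ID evicted from the FIFO, forgetting the oldest one. */
static void a1out_remember(int page_id)
{
    a1out[a1out_pos] = page_id;
    a1out_pos = (a1out_pos + 1) % A1OUT_SIZE;
}

/* Linear scan for clarity; a real system needs an O(1) lookup here. */
static int a1out_contains(int page_id)
{
    for (int i = 0; i < A1OUT_SIZE; ++i)
        if (a1out[i] == page_id)
            return 1;
    return 0;
}
```

At swap-in, a1out_contains() decides between FIFO and LRU; at FIFO
eviction, a1out_remember() records the page's index.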

And yes, look into the program I posted last time, specifically the
functions r2q_page(), which loads a page into the buffer, and
r2q_reclaim(), which swaps out a page to make space.

I also experimented with not swapping hot pages out of the FIFO queue
(the USE_FASTPAGE conditional), but it didn't bring any reasonable
improvement (a few tenths of a percent up or down from the original
performance), so it's probably not worth it.

BTW, this 2Q algorithm lends itself well to a madvise() syscall
implementation, like this:

- NORMAL - no change
- RANDOM - no change
- SEQUENTIAL - load pages in FIFO and DON'T put them in A1out
  after expiry
- WILLNEED - load pages directly in LRU
- DONTNEED - move pages to FRONT of FIFO (or TAIL of LRU)

(see the BSD madvise() syscall, but I believe you know what I'm talking
about ;-)
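The mapping above could be expressed as a simple table. The advice names
mirror BSD's madvise() constants, but the action names and the function
are purely hypothetical placeholders for whatever queue operations the
kernel would actually perform:

```c
#include <assert.h>

/* Hypothetical sketch of the madvise() -> 2Q mapping described above.
   The enums and function are illustrative, not kernel API. */
enum madv_advice { MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL,
                   MADV_WILLNEED, MADV_DONTNEED };

enum page_action {
    ADMIT_VIA_FIFO,          /* default 2Q path: FIFO first, LRU on re-reference */
    ADMIT_VIA_FIFO_NO_A1OUT, /* sequential: never remember evicted pages */
    ADMIT_DIRECT_TO_LRU,     /* will-need: skip the FIFO trial period */
    EXPIRE_SOON              /* don't-need: queue for early eviction */
};

static enum page_action advice_to_action(enum madv_advice adv)
{
    switch (adv) {
    case MADV_SEQUENTIAL: return ADMIT_VIA_FIFO_NO_A1OUT;
    case MADV_WILLNEED:   return ADMIT_DIRECT_TO_LRU;
    case MADV_DONTNEED:   return EXPIRE_SOON;
    default:              return ADMIT_VIA_FIFO;   /* NORMAL, RANDOM */
    }
}
```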

BTW2, I'm quite happy that people care about new ideas :-) I would be
happy to hack the kernel full-time and implement them myself, but
unfortunately I need to make money somehow, so no time :-)

--
Ivan Schreter
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Buffer management - interesting idea

2001-06-15 Thread Ivan Schreter

Hello,

I'm not subscribed to the list, so please CC: replies to me.

On Fri, 15 Jun 2001 12:20:52 -0400 (EDT)
John Clemens <[EMAIL PROTECTED]> wrote:

> Is there any way for you to measure the relative computational overhead
> required for the two algorithms? The benefit of the higher hit rate may
> be lost if the computational time increases by too much.

Well, *computational* overhead is negligible - processing is O(1), like
LRU. Look at the program attached to my previous message; it contains an
implementation of this algorithm.

In LRU, all you have to do is move a page to the front of a doubly-linked
list when it is accessed, and swap out the last page when the buffer is
full and a request for a new page comes in.

In 2Q, when a page is present in the LRU queue, you move it to the front
of the LRU queue as usual. Otherwise, if it is in memory, it must be in
the A1 queue (the FIFO one), so you DON'T do anything. When the page is
NOT in memory, you load it either into the Am LRU queue (if it is present
in the A1out queue) or into the A1 queue (FIFO) if not.

It gets interesting when you need to swap out a page from memory. When
the size of the A1 FIFO is greater than a limit (e.g., 10% of buffer
space), a page from A1 is swapped out (and put into the A1out list);
otherwise a page from the Am LRU is swapped out (and NOT put into A1out,
since it hasn't been accessed for a long time).

So the general algorithm is not much more complicated than LRU.
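The rules above can be condensed into a small sketch. Everything here is
illustrative (tiny sizes, arrays instead of the linked lists my program
uses, and victim selection ignores queue order for brevity); it only
demonstrates the admission and reclaim decisions:

```c
#include <assert.h>

/* Toy model of the 2Q access/reclaim rules described above. */
#define NPAGES   16      /* total distinct pages */
#define NBUFS    4       /* in-memory buffer slots */
#define A1_LIMIT 1       /* max pages kept in the A1 FIFO */

enum where { ABSENT, IN_A1, IN_AM };   /* where a page currently lives */
static enum where state[NPAGES];
static int in_a1out[NPAGES];           /* 1 if page ID is remembered in A1out */
static int resident, a1_count;

/* Pick any page in queue q; real 2Q evicts in FIFO/LRU order instead. */
static int find_victim(enum where q)
{
    for (int i = 0; i < NPAGES; ++i)
        if (state[i] == q)
            return i;
    return -1;
}

static void access_page(int x)
{
    if (state[x] == IN_AM) { /* hit in Am: would move to front */ return; }
    if (state[x] == IN_A1) { /* hit in A1: deliberately do nothing */ return; }

    if (resident == NBUFS) {                    /* must reclaim one page */
        if (a1_count > A1_LIMIT) {              /* shrink A1 first */
            int v = find_victim(IN_A1);
            state[v] = ABSENT;
            in_a1out[v] = 1;                    /* remember it in A1out */
            --a1_count;
        } else {
            int v = find_victim(IN_AM);
            state[v] = ABSENT;                  /* cold page: not remembered */
        }
        --resident;
    }
    if (in_a1out[x]) { state[x] = IN_AM; in_a1out[x] = 0; } /* hot: to Am */
    else             { state[x] = IN_A1; ++a1_count; }      /* cold: trial in A1 */
    ++resident;
}
```

A once-touched page lives and dies in A1 without ever polluting Am; a
page that faults again after its A1 eviction is found in A1out and goes
straight to Am.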

There is only one interesting question: how to implement the A1out queue
(also a FIFO) so that we can quickly tell whether a page is in A1out. In
the sample implementation I currently use a cyclic buffer for A1out and a
flag in the buffer map indicating whether a page is in A1out. In a real
system this flag would have to be replaced by some kind of hashtable
containing an entry for every page in the A1out queue, so that we can
decide in O(1) whether a page is in A1out when loading it from disk. The
alternative is a bitmask, but then we would need, e.g., for 64GB at 4kB
page size, 2MB of RAM (which is too much). Considering we may have up to
32K pages mapped in memory (a 128MB buffer), a hashtable with, say, 64K
entries at 50% A1out (16K entries) is much more memory efficient (512kB).
If we limit the A1out queue to something less (maybe 10-20% of the buffer
block count), we need even less space.
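The bitmask arithmetic checks out (64GB / 4kB = 16M pages, i.e. 16M bits
= 2MB). The O(1) membership test could be a simple open-addressing hash
set along these lines (the size, hash function and the no-deletion
simplification are all mine; a real A1out would also need to remove
expired entries):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal open-addressing hash set over page indices: the O(1)
   "is this page in A1out?" test discussed above. */
#define TABLE_SIZE 65536u              /* power of two, ~4x the A1out entries */

static uint32_t slots[TABLE_SIZE];     /* 0 = empty; stored value is page+1 */

static uint32_t hash_page(uint32_t page)
{
    page *= 2654435761u;               /* Knuth multiplicative hash */
    return page & (TABLE_SIZE - 1);
}

static void a1out_insert(uint32_t page)
{
    uint32_t h = hash_page(page);
    while (slots[h] != 0 && slots[h] != page + 1)
        h = (h + 1) & (TABLE_SIZE - 1);   /* linear probing */
    slots[h] = page + 1;
}

static int a1out_member(uint32_t page)
{
    uint32_t h = hash_page(page);
    while (slots[h] != 0) {
        if (slots[h] == page + 1)
            return 1;
        h = (h + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

At 4 bytes per slot this particular table costs 256kB; entries with a
chaining link, as assumed in the 512kB figure above, cost proportionally
more.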

Since I use it for a database application and know the size of the
underlying dataspace in advance, I use the bitmask method in my program.
This is probably not suitable for the kernel.

Somebody should investigate the potential impact of the algorithm on a
real system. If you send me a trace from a live system, I am willing to
analyze it.

Form for the trace:

A text file with lines in the format:

time blockid buffersizecur buffersizemax
time blockid touch

where time is monotonically increasing, blockid is a unique ID of the
requested block (also for blocks already in memory!), buffersizecur is
the current size of the buffer space (since buffer space is variable
under Linux) and buffersizemax is buffersizecur + the size of free memory
(or a flag for whether a new block can be put into a free memory block).

The second form (with the word "touch") is for counting the extra time
needed to swap the block back out to disk.

Optionally, the time may be left out.
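A parser for the two line forms could look like this (the field widths
and record layout are my guesses at a concrete encoding; the optional
missing-time variant is not handled):

```c
#include <stdio.h>
#include <string.h>

/* Sketch of a parser for the two trace-line forms described above. */
typedef struct {
    long time;          /* monotonically increasing timestamp */
    long blockid;       /* unique ID of the requested block */
    long cur, max;      /* buffersizecur / buffersizemax (load record) */
    int  is_touch;      /* 1 for the "touch" form (write-back accounting) */
} trace_rec;

/* Returns 1 on success and fills *r; returns 0 on a malformed line. */
static int parse_trace_line(const char *line, trace_rec *r)
{
    char word[16];

    memset(r, 0, sizeof *r);
    if (sscanf(line, "%ld %ld %15s", &r->time, &r->blockid, word) == 3
        && strcmp(word, "touch") == 0) {
        r->is_touch = 1;
        return 1;
    }
    return sscanf(line, "%ld %ld %ld %ld",
                  &r->time, &r->blockid, &r->cur, &r->max) == 4;
}
```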

--
Ivan Schreter
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Buffer management - interesting idea

2001-06-15 Thread Ivan Schreter

Hello,

Please CC: replies to me, since I am not subscribed.

>> http://citeseer.nj.nec.com/63909.html

> The "resistance to scanning" seemed interesting, maybe one-time
> activities like a "find" run or big cat/dd will have less impact with
> this.

Exactly. But not only that.

I have run some experiments. With totally random access to data, you get
degradation to LRU performance (but no worse). However, for normal work
the results are quite encouraging. The speedup percentage is computed
assuming that processing an in-memory buffer takes time x and loading an
on-disk buffer takes 100x:

#blocks buffers P   hitLRU  hit2Q   2Q+ speedup
262400  16384   8   19.29%  24.89%  29% 7.365%
262400  16384   4   11.56%  14.23%  23% 3.084%
1024K   16384   32  16.90%  22.82%  35% 7.573%
1024K   16384   16  9.00%   11.94%  33% 3.305%
1024K   16384   8   4.94%   6.38%   29% 1.515%
4096K   16384   64  8.40%   11.39%  36% 3.339%
4096K   16384   32  4.32%   5.79%   34% 1.536%

I used reasonable figures for filesystem size (1G, 4G and 16G) and buffer
cache (64MB), with a blocksize of 4K. All simulations used 10% for the
A1in and 25% for the A1out queue. Almost all simulations show an over 30%
better hit rate than LRU, translating to a ~3% real-time speedup for a
normal workload. The speedup for scanning-type workloads (find, cat on a
large file, etc.) should be even better - write an access pattern
generator and try it :-)
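For reference, the speedup column follows from the stated cost model:
with hit rate h, the expected cost per access is h*x + (1-h)*100x, and
the speedup is LRU's expected cost over 2Q's, minus one. The formula is
my reconstruction, but it reproduces the table to within rounding of the
displayed hit rates:

```c
#include <assert.h>
#include <math.h>

/* Expected cost per access under the stated model: a buffer hit costs
   x (= 1 here), a miss costs 100x. */
static double expected_cost(double hit_rate)
{
    return hit_rate * 1.0 + (1.0 - hit_rate) * 100.0;
}

/* Speedup of 2Q over LRU, in percent, from the two hit rates. */
static double speedup_pct(double hit_lru, double hit_2q)
{
    return 100.0 * (expected_cost(hit_lru) / expected_cost(hit_2q) - 1.0);
}
```

For the first table row, speedup_pct(0.1929, 0.2489) gives roughly the
listed 7.365%.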

The constant P determines the "randomness" of accesses as follows:

int x = random() % (NR * LEVEL);
for (int i = LEVEL - 1; i > 0; --i)
    if (x >= NR * i)
        x = (x - NR * i) / (i + 1);

where x is the generated block # to be accessed, LEVEL is P, and NR is
the # of disk blocks.

I attach a crude program that simulates the LRU, 2Q and FIFO buffer
management policies. If you want, you can play with the parameters a bit
and simulate other workloads (e.g., record what's going on in a real
system and then replay its block accesses).

I would like to hear from you if you have some interesting results.

--
Ivan Schreter
[EMAIL PROTECTED]


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <signal.h>

int ops = 0, hit_lru = 0, hit_2q = 0, hit_fifo = 0;

#define BUFS    16384   // # of buffer entries in memory
#define NR  262400  // # of DB buffers
#define LEVEL   8   // max. test level (1=total random)

//#define   USE_FASTPAGE


#define BUFSA1IN    (BUFS/10)
#define BUFSA1OUT   (BUFS/4)

typedef struct {

int next;
int prev;
int id;
int age;

} lru_entry;


lru_entry   lru[BUFS];
int lru_idx[NR];
int lru_free = 0, lru_head = 0, lru_tail = 0;


void lru_page(int x)
{
if (lru_idx[x] >= 0) {
hit_lru++;

// move to beginning of the buffer
x = lru_idx[x];
if (x == lru_head) return;

lru[lru[x].prev].next = lru[x].next;
if (x == lru_tail) {
lru_tail = lru[x].prev;
} else {
lru[lru[x].next].prev = lru[x].prev;
}
lru[x].prev = -1;
lru[x].next = lru_head;
lru[lru_head].prev = x;
lru_head = x;

return;
}

if (lru_free == -1) {

// remove one from tail
lru[lru_tail].next = lru_free;
lru[lru_tail = lru[lru_free = lru_tail].prev].next = -1;
lru_idx[lru[lru_free].id] = -1;
}

// add to head
int r = lru_free;
lru_free = lru[r].next;

lru[r].prev = -1;
lru[r].next = lru_head;
lru[r].id = x;
lru[lru_head].prev = r;
if (lru_head < 0) lru_tail = r;
lru_head = r;

lru_idx[x] = r;
}


lru_entry   r2q[BUFS];
int r2q_a1out[BUFSA1OUT];
int r2q_idx[NR];
int r2q_free = 0, r2q_head = -1, r2q_tail = -1;
int r2q_a1i_head = -1, r2q_a1i_tail = -1, r2q_a1i_cnt = 0;
int r2q_a1o_head = 0, r2q_a1o_tail = 0, r2q_age = 0;

#define R2Q_MASK        0xff
#define R2Q_FLG_MASK    0xff00
#define R2Q_A1I_FLG 0x100
#define R2Q_A1O_FLG 0x200
#define R2Q_AM_FLG  0x400

void r2q_reclaim()
{
// free one page

if (r2q_a1i_cnt > BUFSA1IN) {

// remove last page from A1in and put to A1out

int r = r2q_a1i_tail;
#if 0
if (r < 0) {
printf("Reclaim err: %d - %d\n", r2q_a1i_tail, r2q_a1i_head);
raise(SIGSEGV);
}
#endif
r2q_a1i_tail = r2q[r].prev;
r2q[r2q_a1i_tail].next = -1;
--r2q_a1i_cnt;
//printf("Del2 %d, cnt = %d\n", r, r2q_a1i_cnt);

int x = r2q[r].id;
r2q[r].next = r2q_free;
r2q_free = r;

r2q_idx[x] = R2Q_MASK | R2Q_A1O_FLG;


Buffer management - interesting idea

2001-06-06 Thread Ivan Schreter

Hello,

I'm working on some high-speed DB projects under Linux and have been
researching various buffer-replacement algorithms. I found the 2Q buffer
replacement policy at

http://citeseer.nj.nec.com/63909.html

Maybe it would be interesting to use it instead of LRU for disk buffer
replacement. It seems relatively easy to implement, and the costs are
about the same as for LRU.

I'm not subscribed to the list, so please CC: replies to me ([EMAIL PROTECTED]).


Ivan Schreter
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[patch] sched_yield in 2.2.x

2001-05-29 Thread Ivan Schreter

Hello,

I'm not subscribed, so please CC: any replies to me ([EMAIL PROTECTED]).

Here is a two-line patch that fixes the sched_yield() call so that it
actually yields the processor in 2.2.x kernels. I am using 2.2.16 and
have tested it there only, but I suppose it should work with other
kernels as well (there don't seem to be many changes in this area).

Bug description: When a process called sched_yield(), it was properly
marked for reschedule and the SCHED_YIELD bit was properly set in
p->policy. However, prev_goodness() cleared this bit and returned
goodness 0 (I changed it to -1 just to be sure this process isn't
accidentally picked when there is another process to run - maybe I'm
wrong here, but 2.4.5 also gives it goodness -1, so it should be OK).
That alone was not so bad, but successive calls to goodness() while
searching for the next process to run included the last process, which by
then had the YIELD bit cleared, so its goodness value was calculated as
better. Hence the second line of the fix: do not consider the prev
process when searching for the next process to run, since it is already
selected as next by default when no better process is found.

I hope it will work in an SMP environment as well (it should, since
schedule() seems to be mostly independent of UP/SMP).

And why do I want to use sched_yield()? Well, to implement
longer-duration user-space locks which don't consume the whole timeslice
when waiting, but instead relinquish the processor to another task, so
that it finishes its work in the critical region sooner.
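Such a lock might look roughly like this (a sketch only; the atomic
builtins are modern GCC stand-ins for whatever atomic operation a 2001
program would have used, and a production lock needs more care):

```c
#include <sched.h>

/* Sketch of the user-space lock use case described above: try the
   lock, and if it is taken, yield the CPU to whoever holds it instead
   of burning the rest of the timeslice spinning. */
typedef struct { volatile int locked; } ylock;

static void ylock_acquire(ylock *l)
{
    while (__sync_lock_test_and_set(&l->locked, 1)) {
        /* Lock is taken: give the holder a chance to finish its
           critical region sooner. */
        sched_yield();
    }
}

static void ylock_release(ylock *l)
{
    __sync_lock_release(&l->locked);
}
```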

It's funny nobody has fixed this by now, but as I've seen, there have
been a couple of discussions about sched_yield() already... I'm probably
coming too late...

Ivan Schreter
[EMAIL PROTECTED]


--- kernel/sched.c.orig Wed May 30 01:17:24 2001
+++ kernel/sched.c  Wed May 30 01:41:34 2001
@@ -196,7 +196,7 @@
 {
if (p->policy & SCHED_YIELD) {
p->policy &= ~SCHED_YIELD;
-   return 0;
+   return -1;
}
return goodness(prev, p, this_cpu);
 }
@@ -729,7 +729,7 @@
  * list, so our list starting at "p" is essentially fixed.
  */
while (p != &init_task) {
-   if (can_schedule(p)) {
+   if (p != prev && can_schedule(p)) {
int weight = goodness(prev, p, this_cpu);
if (weight > c)
c = weight, next = p;


