Re: Buffer management - interesting idea
Hello,

Please CC: replies to me, since I'm not subscribed to the list.

On Fri, 15 Jun 2001 15:50:33 -0300 (BRST) Rik van Riel <[EMAIL PROTECTED]> wrote:
> On Fri, 15 Jun 2001, Ivan Schreter wrote:
>> In 2Q, when a page is present in LRU queue, you move it to the front of
>> [...]
> This description has me horribly confused. Do you have
> any pointers to state diagrams of this thing? ;)

Well, I posted a URL to the complete paper about 2Q; here it is again:
http://citeseer.nj.nec.com/63909.html (click the "download PS or PDF" link
in the top right corner).

In general, it is like LRU, but pages make it into the LRU only after going
through a FIFO. So a page which is requested only once (or a few times in a
row) will pass through this FIFO and be swapped out. When a page is actively
used, it will pass through the FIFO, and on a new request for this page it
will not be loaded into the FIFO, but into the LRU. Since the FIFO is small
relative to the LRU (10% or so), you don't waste buffer space for a long
time on pages used only once (or a few times) which are not needed anymore
(like a big find, cat, dd, etc.).

The trick is how to determine whether a page should be loaded into the FIFO
or into the LRU at swap-in. Here comes another queue - they call it A1out
in the original paper. This queue (a cyclic buffer or whatever) is another
FIFO queue, which stores INDICES of physical pages that were swapped out of
the FIFO queue (and lived only shortly). When a page is found in this A1out
queue, it is put into the LRU (it is a "hot" page) and will live longer in
the LRU list. The A1out queue size is up to experiments; I used 50% of the
memory buffer count and it performed well.

And yes, look into the program I posted last time - specifically the
functions r2q_page(), which loads a page into the buffer, and
r2q_reclaim(), which swaps out a page to make space.
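To make the promotion rules above concrete, here is a minimal C sketch of
the decision path (my own illustrative names and simplifications - the
queues themselves are not modelled, only where each page currently lives):

```c
#include <assert.h>

#define NPAGES 64           /* size of the simulated page-ID space */

/* Where (if anywhere) a page currently lives. */
enum where { NOWHERE, IN_FIFO, IN_LRU, IN_A1OUT };
enum where state[NPAGES];   /* NOWHERE == 0, so zero-init is correct */

/* One access to page `id`: returns 1 on a buffer hit, 0 on a miss. */
int access_page(int id)
{
    switch (state[id]) {
    case IN_LRU:             /* hit in Am: would move to LRU front */
        return 1;
    case IN_FIFO:            /* hit in A1in: deliberately do nothing */
        return 1;
    case IN_A1OUT:           /* died young but came back: it is hot */
        state[id] = IN_LRU;  /* promote straight into the Am LRU queue */
        return 0;            /* still a miss: data must be re-read */
    default:                 /* first sighting: enter through the FIFO */
        state[id] = IN_FIFO;
        return 0;
    }
}

/* Eviction from the FIFO remembers only the page's ID in A1out. */
void evict_from_fifo(int id)
{
    assert(state[id] == IN_FIFO);
    state[id] = IN_A1OUT;    /* the index survives, the data does not */
}
```

A once-used page thus goes NOWHERE -> FIFO -> A1OUT and stops bothering the
LRU; only a page re-requested after its short FIFO life gets promoted.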
I was also doing some experiments with not swapping hot pages out of the
FIFO queue (the USE_FASTPAGE conditional), but they didn't bring any
reasonable improvement (a few tenths of a percent up or down from the
original performance), so it's probably not worth it.

BTW, this 2Q algorithm could serve well for a madvise() syscall
implementation, like this:

- NORMAL     - no change
- RANDOM     - no change
- SEQUENTIAL - load pages into the FIFO and DON'T put them in A1out after expiry
- WILLNEED   - load pages directly into the LRU
- DONTNEED   - move pages to the FRONT of the FIFO (or TAIL of the LRU)

(See the BSD madvise() syscall, but I believe you know what I'm talking
about ;-)

BTW2, I'm quite happy that people care about new ideas :-) I would be
happy to hack the kernel full-time and implement them myself, but
unfortunately I need to make money out of something, so no time :-)

--
Ivan Schreter
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Buffer management - interesting idea
Hello,

I'm not subscribed to the list, so please CC: replies to me.

On Fri, 15 Jun 2001 12:20:52 -0400 (EDT) John Clemens <[EMAIL PROTECTED]> wrote:
> Is there any way for you to measure the relative computational overhead
> required for the two algorithms? the benefit of the higher hit rate may
> be lost if the computational time increases by too much.

Well, the *computational* overhead is negligible - processing is O(1),
like LRU. Look at the program attached to my previous message; it contains
an implementation of this algorithm.

In LRU, all you have to do is move a page to the front of a doubly-linked
list when the page is accessed, and swap out the last page when the buffer
is full and a request for a new page comes in.

In 2Q, when a page is present in the LRU queue, you move it to the front
of the LRU queue as usual. Otherwise, if it is in memory, it must be in
the A1 queue (the FIFO one), so you DON'T do anything. When the page is
NOT in memory, you load it either into the Am LRU queue (if it is present
in the A1out queue) or into the A1 queue (FIFO) if not.

It gets interesting when you need to swap a page out of memory. When the
size of the A1 FIFO is greater than a limit (e.g., 10% of buffer space), a
page from A1 is swapped out (and put into the A1out list); otherwise a
page from the Am LRU is swapped out (and NOT put into A1out, since it
hasn't been accessed for a long time). So the general algorithm is not
much more complicated than LRU.

There is only one interesting question: how to implement the A1out queue
(also a FIFO) so that we can quickly tell whether a page is in A1out. I'm
currently using a cyclic buffer for A1out in the sample implementation,
plus a flag in the buffer map marking whether a page is in A1out. In a
real system this flag must be replaced by some kind of hashtable
containing an entry for every page in the A1out queue, so that we can
decide in O(1) whether a page is in A1out when loading it from disk.
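The reclaim decision described above fits in a few lines; a sketch (the
threshold constant and names are illustrative, not from the paper):

```c
#include <assert.h>

#define BUFS     16384           /* total buffer entries, as in the demo */
#define A1_LIMIT (BUFS / 10)     /* ~10% of buffer space for the A1 FIFO */

/* Pick the victim queue when a page must be reclaimed.
 * Returns 0: evict the tail of A1 (FIFO) and remember its ID in A1out.
 * Returns 1: evict the tail of Am (LRU) and forget it entirely. */
int evict_from_am(int a1_count)
{
    /* Only shrink the FIFO while it is over its quota; otherwise the
     * cold page at the LRU tail is the better victim. */
    return a1_count <= A1_LIMIT;
}
```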
The alternative is a bitmask, but then we'd need, e.g., for 64GB at 4kB
page size, 2MB of RAM (which is too much). Considering we may have up to
32K pages mapped in memory (a 128MB buffer), a hashtable with say 64K
entries at 50% A1out (16K entries) is much more memory efficient (512kB).
If we limit the A1out queue to something less (maybe 10-20% of the buffer
block count), we need even less space.

Since I use it for a database application and know the size of the
underlying dataspace in advance, I use the bitmask method in my program.
This is probably not good for the kernel.

Somebody should investigate the potential impact of the algorithm on a
real system. If you send me a trace from a live system, I am willing to
analyze it. Format for the trace: a text file with lines in one of two
forms:

    time blockid buffersizecur buffersizemax
    time blockid touch

where time is monotonically increasing, blockid is some unique ID of the
block requested (also for blocks already in memory!), buffersizecur is the
current size of buffer space (since we have variable buffer space under
Linux) and buffersizemax is buffersizecur + the size of free memory (or a
flag saying whether we can put a new block into a free memory block). The
second form (with the word "touch") is for counting the extra time needed
to write the block back to disk. Optionally, the time may be left out.

--
Ivan Schreter
[EMAIL PROTECTED]
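The arithmetic behind the bitmask-versus-hashtable figures above can be
checked directly; the 8-byte hashtable entry size (block ID plus a chain
link) is my assumption, chosen to reproduce the 512kB figure:

```c
#include <assert.h>

/* Bytes for a bitmask with one membership bit per page on the device. */
long long bitmask_bytes(long long dev_bytes, long long page_bytes)
{
    return dev_bytes / page_bytes / 8;
}

/* Bytes for a hashtable tracking only the pages actually in A1out. */
long long htab_bytes(long long entries, long long entry_bytes)
{
    return entries * entry_bytes;
}
```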
Re: Buffer management - interesting idea
Hello,

Please CC: replies to me, since I am not subscribed.

>> http://citeseer.nj.nec.com/63909.html
> The "resistance to scanning" seemed interesting, maybe one-time
> activities like a "find" run or big cat/dd will have less impact with
> this.

Exactly. But not only that. I have made some experimental tests. When you
have totally random access to data, you degrade to LRU performance (but no
worse). However, when doing normal work, the results are quite
encouraging. The speedup percentage is computed assuming that processing
an in-memory buffer takes x and loading an on-disk buffer takes 100x time:

    #blocks  buffers   P   hitLRU   hit2Q   2Q+   speedup
    262400   16384     8   19.29%  24.89%   29%    7.365%
    262400   16384     4   11.56%  14.23%   23%    3.084%
    1024K    16384    32   16.90%  22.82%   35%    7.573%
    1024K    16384    16    9.00%  11.94%   33%    3.305%
    1024K    16384     8    4.94%   6.38%   29%    1.515%
    4096K    16384    64    8.40%  11.39%   36%    3.339%
    4096K    16384    32    4.32%   5.79%   34%    1.536%

I used reasonable figures for filesystem size (1G, 4G and 16G) and buffer
cache (64MB), with a block size of 4K. All simulations used 10% for the
A1in queue and 25% for the A1out queue. Almost all simulations show an
over 30% better hit rate than LRU, translating to a ~3% real-time speedup
for a normal workload. The speedup for a scanning-type workload (find, cat
of a large file, etc.) should be even better - write an access pattern
generator and try it :-)

The constant P determines the "randomness" of accesses as follows:

    int x = random() % (NR * LEVEL);
    for (int i = LEVEL - 1; i > 0; --i)
        if (x >= NR * i)
            x = (x - NR * i) / (i + 1);

where x is the generated block # to be accessed, LEVEL is P and NR is the
# of disk blocks.

I attach a crude program to simulate the LRU, 2Q and FIFO buffer
management policies. If you want, you can play with the parameters a bit
and simulate other workloads (e.g., record what's going on in the system
and then replay the real system's block accesses). I would like to hear
from you if you get some interesting results.
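The speedup column follows from the stated 1x/100x cost model; I infer the
formula as the ratio of average access costs, which reproduces the table
within rounding of the printed hit rates:

```c
#include <assert.h>
#include <math.h>

/* Average cost per access: a buffer hit costs 1 unit, a disk load
 * costs 100 units (the model stated for the table above). */
double avg_cost(double hit_rate)
{
    return hit_rate * 1.0 + (1.0 - hit_rate) * 100.0;
}

/* Relative speedup of policy B (hit rate hb) over policy A (ha). */
double speedup(double ha, double hb)
{
    return avg_cost(ha) / avg_cost(hb) - 1.0;
}
```

For the first row, avg_cost(0.1929) = 80.90 and avg_cost(0.2489) = 75.36,
giving about a 7.36% speedup - consistent with the 7.365% in the table.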
--
Ivan Schreter
[EMAIL PROTECTED]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <signal.h>

int ops = 0, hit_lru = 0, hit_2q = 0, hit_fifo = 0;

#define BUFS 16384      // # of buffer entries in memory
#define NR 262400       // # of DB buffers
#define LEVEL 8         // max. test level (1=total random)
//#define USE_FASTAGE

#define BUFSA1IN (BUFS/10)
#define BUFSA1OUT (BUFS/4)

typedef struct {
    int next;
    int prev;
    int id;
    int age;
} lru_entry;

lru_entry lru[BUFS];
int lru_idx[NR];
int lru_free = 0, lru_head = 0, lru_tail = 0;

void lru_page(int x)
{
    if (lru_idx[x] >= 0) {
        hit_lru++;
        // move to beginning of the buffer
        x = lru_idx[x];
        if (x == lru_head) return;
        lru[lru[x].prev].next = lru[x].next;
        if (x == lru_tail) {
            lru_tail = lru[x].prev;
        } else {
            lru[lru[x].next].prev = lru[x].prev;
        }
        lru[x].prev = -1;
        lru[x].next = lru_head;
        lru[lru_head].prev = x;
        lru_head = x;
        return;
    }
    if (lru_free == -1) {
        // remove one from tail
        lru[lru_tail].next = lru_free;
        lru[lru_tail = lru[lru_free = lru_tail].prev].next = -1;
        lru_idx[lru[lru_free].id] = -1;
    }
    // add to head
    int r = lru_free;
    lru_free = lru[r].next;
    lru[r].prev = -1;
    lru[r].next = lru_head;
    lru[r].id = x;
    lru[lru_head].prev = r;
    if (lru_head < 0) lru_tail = r;
    lru_head = r;
    lru_idx[x] = r;
}

lru_entry r2q[BUFS];
int r2q_a1out[BUFSA1OUT];
int r2q_idx[NR];
int r2q_free = 0, r2q_head = -1, r2q_tail = -1;
int r2q_a1i_head = -1, r2q_a1i_tail = -1, r2q_a1i_cnt = 0;
int r2q_a1o_head = 0, r2q_a1o_tail = 0, r2q_age = 0;

#define R2Q_MASK 0xff
#define R2Q_FLG_MASK 0xff00
#define R2Q_A1I_FLG 0x100
#define R2Q_A1O_FLG 0x200
#define R2Q_AM_FLG 0x400

void r2q_reclaim()
{
    // free one page
    if (r2q_a1i_cnt > BUFSA1IN) {
        // remove last page from A1in and put to A1out
        int r = r2q_a1i_tail;
#if 0
        if (r < 0) {
            printf("Reclaim err: %d - %d\n", r2q_a1i_tail, r2q_a1i_head);
            raise(SIGSEGV);
        }
#endif
        r2q_a1i_tail = r2q[r].prev;
        r2q[r2q_a1i_tail].next = -1;
        --r2q_a1i_cnt;
        //printf("Del2 %d, cnt = %d\n", r, r2q_a1i_cnt);
        int x = r2q[r].id;
        r2q[r].next = r2q_free;
        r2q_free = r;
        r2q_idx[x] = R2Q_MASK | R2Q_A1O_FLG;
        //if (x
/* (remainder of the attachment is truncated in the archive) */
Buffer management - interesting idea
Hello,

I'm working on some high-speed DB projects under Linux and I was
researching various buffer-replacement algorithms. I found the 2Q buffer
replacement policy at http://citeseer.nj.nec.com/63909.html

Maybe it would be interesting to use it instead of LRU for disk buffer
replacement. It seems relatively easy to implement, and its costs are
about the same as LRU's.

I'm not subscribed to the list, so please CC: replies to me
([EMAIL PROTECTED]).

Ivan Schreter
[EMAIL PROTECTED]
[patch] sched_yield in 2.2.x
Hello,

I'm not subscribed, so please CC: eventual replies to me
([EMAIL PROTECTED]).

Here is a 2-line patch that fixes the sched_yield() call so that it
actually yields the processor in 2.2.x kernels. I am using 2.2.16 and so
have tested it in 2.2.16 only, but I suppose it should work with other
kernels as well (there seem not to be many changes there).

Bug description: when a process called sched_yield(), it was properly
marked for reschedule and the SCHED_YIELD bit was properly set in
p->policy. However, prev_goodness() cleared this bit while returning
goodness 0 (I changed it to -1 just to be sure this process isn't
accidentally picked when there is another process to run - maybe I'm wrong
here, but 2.4.5 also gives it goodness -1, so it should be OK). This alone
was not that bad, but successive calls to goodness() while searching for
the next process to run included the last process, which meanwhile had its
YIELD bit cleared, so its goodness value was calculated as better. Hence
the second line of the fix: do not consider the prev process when
searching for the next process to run, as it is already selected as next
by default when no better process is found.

I hope that it will work in an SMP environment as well (it should, since
schedule() seems to be mostly independent of UP/SMP).

And why do I want to use sched_yield()? Well, to implement longer-duration
user-space locks which don't consume the whole timeslice when waiting, but
rather relinquish the processor to another task, so that the lock holder
finishes its work in the critical region sooner.

It's funny nobody has fixed this by now, but as I've seen, there were a
couple of discussions about sched_yield() already... I probably come too
late...
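The yielding user-space lock mentioned above might look roughly like this
toy C11 sketch (my own illustrative code, not the author's; a modern
implementation would of course use futexes instead):

```c
#include <sched.h>       /* sched_yield() */
#include <stdatomic.h>   /* C11 atomic_flag */

/* A toy user-space lock: spin briefly, then yield the CPU to the lock
 * holder instead of burning the rest of the timeslice. */
typedef struct { atomic_flag held; } ylock_t;

void ylock_acquire(ylock_t *l)
{
    int spins = 0;
    while (atomic_flag_test_and_set(&l->held)) {
        if (++spins > 100) {   /* bounded spin, then give up the CPU */
            sched_yield();     /* only useful if yield really yields */
            spins = 0;
        }
    }
}

void ylock_release(ylock_t *l)
{
    atomic_flag_clear(&l->held);
}
```

The whole point of the patch is the sched_yield() call in the slow path:
without the fix, the yielding waiter could be rescheduled immediately and
keep spinning against the very task it is waiting for.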
Ivan Schreter
[EMAIL PROTECTED]

--- kernel/sched.c.orig	Wed May 30 01:17:24 2001
+++ kernel/sched.c	Wed May 30 01:41:34 2001
@@ -196,7 +196,7 @@
 {
 	if (p->policy & SCHED_YIELD) {
 		p->policy &= ~SCHED_YIELD;
-		return 0;
+		return -1;
 	}
 	return goodness(prev, p, this_cpu);
 }
@@ -729,7 +729,7 @@
 	 * list, so our list starting at "p" is essentially fixed.
 	 */
 	while (p != &init_task) {
-		if (can_schedule(p)) {
+		if (p != prev && can_schedule(p)) {
 			int weight = goodness(prev, p, this_cpu);
 			if (weight > c)
 				c = weight, next = p;