Re: More on 2.2.18pre2aa2

2000-09-12 Thread Matthew Hawkins

On 2000-09-11 09:22:23 -0400, Chris Mason wrote:
> Thanks Andrea, Andi, new patch is attached, with the warning messages
> removed.  The first patch got munged somewhere between test machine and
> mailer, please don't use it.

I've been hammering this all day installing the relevant tools and
building win32 mozilla under vmware (the drives being 2Gb files on a
reiserfs partition).  This partition also doubles as /home.

Very stable so far, and having Andrea's VM patches in (I usually didn't
put them in) has made a noticeable difference - xmms has rarely skipped
and things start faster and run smoother.  Hopefully I'll see the same
(or better) results from Rik's new VM in 2.4 when that's released.

# free
              total       used       free     shared    buffers     cached
Mem:         392792     390232       2560          0      25948     261484
-/+ buffers/cache:      102800     289992
Swap:        196548          0     196548

# vmstat 2
   procs                  memory      swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs  us  sy  id
 1  0  0  0   2412  26084 261496   0   0 817  454   922  17  40  44
 1  0  0  0   2292  26204 261496   0   0 213  443   516   6  50  45
 2  0  0  0   2244  26252 261496   0   0 0 0  439   367  13  38  49
 1  0  0  0   2784  26016 261192   0   059 0  453   815   8  50  42
 1  0  0  0   2420  26152 261420   0   03023  460   675   4  59  37
 1  0  0  0   2252  26208 261532   0   015 0  445   465   5  55  40
 3  0  0  0   2084  26252 261656   0   01819  450   446   4  56  40
 3  0  0  0   2056  26280 261656   0   0 1 0  442   366   3  54  43
 4  0  0  0   3012  25868 261108   0   018 0  443   423   9  45  46
 2  0  0  0   2348  25932 261708   0   07519  463   703   6  51  44

FWIW the reiserfs partition is on a 30Gb IBM 75GXP attached to a Promise
ATA100 PCI controller.  I have Andre's 2904 ide patch in.  Nice
drive if you have to use IDE, and I have no complaints about the
controller either.

-- 
* Matthew Hawkins <[EMAIL PROTECTED]> :(){ :|:&};:
** Information Specialist, tSA Group Pty. Ltd.   Ph: +61 2 6257 7111
*** 1 Hall Street, Lyneham ACT 2602 Australia.   Fx: +61 2 6257 7311



Re: More on 2.2.18pre2aa2

2000-09-12 Thread Rik van Riel

On Tue, 12 Sep 2000, Matthew Hawkins wrote:

> Very stable so far, and having Andrea's VM patches in (I usually
> didn't put them in) has made a noticeable difference - xmms has
> rarely skipped and things start faster and run smoother.
> Hopefully I'll see the same (or better) results from Rik's new
> VM in 2.4 when that's released.

I'm working on it. ;)

I've just uploaded a new snapshot of my new VM for
2.4 to my home page, this version contains a
wakeup_kswapd() function (copied from wakeup_bdflush)
and should balance memory a bit better.
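
For readers who haven't seen the bdflush-style wakeup pattern, here is a
minimal sketch of what such a helper generally looks like; the wait-queue
name is an assumption, this is not the code from Rik's snapshot.

/* Minimal sketch of a wakeup_kswapd() in the style of wakeup_bdflush().
 * The wait-queue name is an assumption; this is not Rik's actual patch. */
static DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);

void wakeup_kswapd(void)
{
    /* kswapd sleeps on kswapd_wait when it has nothing to do; memory
     * pressure paths poke it instead of waiting for the next timer tick. */
    if (waitqueue_active(&kswapd_wait))
        wake_up_interruptible(&kswapd_wait);
}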

Also, streaming IO seems to be almost back on track.

The large IO delays I'm seeing in certain tests have
been traced back to the /elevator/ code. I think I'll
be playing with the "close the door on timeout" idea
Jeff Merkey proposed some time ago...

(I talked about this with Jens Axboe on irc ... a patch
to solve the elevator problems and maybe make a self-tuning
elevator (with maximum latency specified in /proc?) should
be ready later today)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Andrea Arcangeli

On Tue, 12 Sep 2000, Rik van Riel wrote:

> The large IO delays I'm seeing in certain tests have
> been traced back to the /elevator/ code. I think I'll

Actually the elevator works as in 2.2.15 (before any fix). The latency
settings are too high. They should be around 250 for reads and 500 for
writes.

If you want better latency than that you should reinsert the stuff that we
had until test1 that accounted for the position where the request is put on
the queue. We backed it out because it was making a performance difference.

Even if you disable the elevator completely (so if you use elevator_noop),
with the long queues we have these days you'll still get bad latency.

Please try to decrease the size of the queue if even "enabling" the
latency control doesn't make a difference.

BTW, 2.2.18pre2aa2 puts the elevator in sync with 2.4.x using the settings
250 for reads and 500 for writes. Note that 250 for reads - without the
logic that takes into account the position the request is put in the queue
- is much more than 250 with the previous logic included.

Andrea




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Rik van Riel

On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> 
> > The large IO delays I'm seeing in certain tests have
> > been traced back to the /elevator/ code. I think I'll
> 
> Actually the elevator works as in 2.2.15 (before any fix). The
> latency settings are too high. They should be around 250 for
> reads and 500 for writes.

There's a much much simpler solution.

We simply keep track of how old the oldest request
in the queue is, and when that request is getting
too old (say 1/2 second), we /stop/ all the others
from entering their request into the queue (except
if it can be merged with another request ???).

And when either the queue gets emptied, OR the oldest
request is below some other threshold (say, 1/10th of
a second), then we wake up the other tasks and let them
put their requests on the queue.

This is a very simple idea that gets the /number/ of
requests batched automagically right for the configuration
of disks on the machine it's running on.

No need for a magic number of requests to be bypassed or
not, the user can specify the wanted latency and the system
automatically gets right what the user specifies...

(courtesy of Jeff Merkey who did this for Netware?)
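
A toy, self-contained C sketch of the idea described above; all names and
thresholds are illustrative assumptions, not code from any posted patch.

/* Toy model of the "close the door" idea: refuse new, non-mergeable
 * requests while the oldest queued request has waited too long, and let
 * submitters back in once the queue drains or the oldest request is
 * young again. */
#include <stdbool.h>

#define CLOSE_AGE_MS 500   /* ~1/2 second: stop letting new requests in */
#define OPEN_AGE_MS  100   /* ~1/10 second: let submitters back in      */

struct io_queue {
    long oldest_enqueue_ms;   /* enqueue time of the oldest request, or -1 */
    bool door_closed;
};

/* Decide whether a new, non-mergeable request may enter the queue now. */
bool may_enqueue(struct io_queue *q, long now_ms)
{
    if (q->oldest_enqueue_ms < 0) {     /* queue empty: door opens */
        q->door_closed = false;
        return true;
    }
    long age = now_ms - q->oldest_enqueue_ms;
    if (age >= CLOSE_AGE_MS)
        q->door_closed = true;          /* oldest request is too old */
    else if (age <= OPEN_AGE_MS)
        q->door_closed = false;         /* oldest is young again     */
    return !q->door_closed;
}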

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Andrea Arcangeli

On Tue, 12 Sep 2000, Rik van Riel wrote:

> We simply keep track of how old the oldest request
> in the queue is, and when that request is getting
> too old (say 1/2 second), we /stop/ all the others

Going by time is obviously wrong. A blockdevice can write 1
request every two seconds or 1 request every millisecond. You can't assume
anything as a function of time _unless_ you have per-harddisk timing
information in the kernel.

Andrea




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Rik van Riel

On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> 
> > We simply keep track of how old the oldest request
> > in the queue is, and when that request is getting
> > too old (say 1/2 second), we /stop/ all the others
> 
> Going by time is obviously wrong. A blockdevice can
> write 1 request every two seconds or 1 request every millisecond.
> You can't assume anything as a function of time _unless_ you have
> per-harddisk timing information in the kernel.

Uhmmm, isn't the elevator about request /latency/ ?

And if so, in what unit do you want to measure
latency if it isn't time?

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Andrea Arcangeli

On Tue, 12 Sep 2000, Rik van Riel wrote:

> Uhmmm, isn't the elevator about request /latency/ ?

Yes, but definitely not absolute "time" latency.

How do you get a 1msec latency for a read request out of a blockdevice
that writes 1 request in 2 seconds? See?

That was one of the first issues I was thinking about when I started
playing with the elevator. (and yes, some of my early patches were setting
a per-request timestamp using jiffies)

Note: I understand you can make something time-based work OK for a normal
10-20 Mbyte/sec harddisk, but since the elevator is used every time you
write to any blockdevice out there, you also have to take into account
things like ZIP drives and whatever other slow devices that do less than
1 Mbyte/sec of I/O, or even slower (as well as faster devices). A ZIP
drive is slow at writing, and it's even slower while seeking.

Andrea




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Alan Cox

> Going by time is obviously wrong. A blockdevice can write 1
> request every two seconds or 1 request every millisecond. You can't assume
> anything as a function of time _unless_ you have per-harddisk timing
> information in the kernel.

Andrea - latency is time measured and perceived. Doing it time-based seems to
make reasonable sense. I grant you might want to play with the weighting per
device, but actually keeping device-based dirty list length limits based on
average throughput is probably a lot more productive.




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Andrea Arcangeli

On Tue, 12 Sep 2000, Alan Cox wrote:

> Andrea - latency is time measured and perceived. Doing it time-based seems to
> make reasonable sense. I grant you might want to play with the weighting per

When you have a device that writes a request every two seconds you still
want it not to seek all the time, because that would make it even
slower. No? The point is very simple: if you want good latency, buy
faster hardware (and with faster hardware our current elevator can
become even more aggressive than the 1/2 second thing). You can't
work around the slowness of a slow device by making the elevator a
function of time; that will only make the global system even slower.

BTW, about the rest of the proposal: we're already doing that. It's just that
we don't do it as a function of time, but as a function of how many requests
are passing a certain request. When too many requests have passed a certain
request, we stop further requests from passing it. Simple. You control
the "how many requests can pass a request" factor via elvtune. Reads and
writes have different factors.
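
A toy sketch of the accounting just described; field names are illustrative,
not the real 2.4 elevator structures. Each queued request carries a pass
budget taken from the elvtune value, pays one unit for every request that
jumps ahead of it, and once the budget is gone nothing may be sorted in
front of it.

/* Toy model of "how many requests can pass a request": a singly linked
 * queue kept in ascending sector order. */
#include <stddef.h>

struct toy_request {
    struct toy_request *next;
    long sector;
    int passes_left;    /* e.g. 250 for reads, 500 for writes */
};

void toy_insert(struct toy_request *head, struct toy_request *new_rq)
{
    struct toy_request *insert_after = head, *rq;
    int allowed = 1;

    /* Candidate position by ascending sector order. */
    for (rq = head->next; rq; rq = rq->next) {
        if (new_rq->sector < rq->sector)
            break;
        insert_after = rq;
    }
    /* Everything after that point would be passed; only allowed if none
     * of those requests has run out of budget. */
    for (rq = insert_after->next; rq; rq = rq->next)
        if (rq->passes_left <= 0)
            allowed = 0;

    if (!allowed) {
        while (insert_after->next)      /* cannot pass: append at the tail */
            insert_after = insert_after->next;
    } else {
        for (rq = insert_after->next; rq; rq = rq->next)
            rq->passes_left--;          /* charge everyone we jump ahead of */
    }
    new_rq->next = insert_after->next;
    insert_after->next = new_rq;
}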

Andrea




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Rik van Riel

On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> 
> > Uhmmm, isn't the elevator about request /latency/ ?
> 
> Yes, but definitely not absolute "time" latency.
> 
> How do you get a 1msec latency for a read request out of a
> blockdevice that writes 1 request in 2 seconds? See?

Of course, if you set a ridiculous latency figure you'll
get ridiculously bad performance. However, that doesn't
say /anything/ about whether the idea is a good one or not...

Along the same lines, I could ridicule the current setup
by running one of the following lines and then pointing
out how bad performance would be.

# elvtune -r 1000 -w 1000 /dev/hda
# elvtune -r 1 -w 1 /dev/hda

> That was one of the first issues I was thinking about when I
> started playing with the elevator. (and yes, some of my early
> patches were setting a per-request timestamp using jiffies)
> 
> Note: I understand you can make something time-based work OK
> for a normal 10-20 Mbyte/sec harddisk, but since the
> elevator is used every time you write to any blockdevice out
> there, you also have to take into account things like ZIP drives
> and whatever other slow devices that do less than 1 Mbyte/sec
> of I/O, or even slower (as well as faster devices). A ZIP drive
> is slow at writing, and it's even slower while seeking.

We can already set different figures for different drives.
Would it really be more than 30 minutes of work to put in
a different request # limit for each drive that automatically
satisfies the latency specified by the user?

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/





Re: More on 2.2.18pre2aa2

2000-09-12 Thread Andrea Arcangeli

On Tue, 12 Sep 2000, Rik van Riel wrote:

> We can already set different figures for different drives.

Right.

> Would it really be more than 30 minutes of work to put in
> a different request # limit for each drive that automatically
> satisfies the latency specified by the user?

Note that if you know the transfer rate of each of your harddisks you can
almost emulate the "as a function of time" behaviour by converting from
"throughput" and "latency as a function of time" to "latency as a function
of bh-sized requests". (one userspace script could do that automatically,
for example using hdparm to first benchmark the throughput)

The reason I prefer working with "latency as a function of bh-sized
requests" in kernel space is that I can just use one constant for most of
the blockdevices. Using the same latency setting with the "as a function of
time" approach would be wrong instead (and you would have to gather some
hardware info from the device before you could decide on a good constant).
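
As a rough illustration of that userspace conversion (the numbers are
assumptions for the example, not recommended settings):

/* Rough userspace illustration of the conversion Andrea describes: measure
 * roughly how many requests per second the device completes (e.g. derive
 * it from an `hdparm -t` run), then turn a wanted worst-case latency into
 * an elvtune-style request count.  The numbers are assumptions, not advice. */
#include <stdio.h>

int main(void)
{
    double requests_per_sec = 500.0;  /* assumed measured request rate      */
    double wanted_latency_s = 0.5;    /* worst-case read latency you accept */

    printf("elvtune -r %.0f   (approximate)\n",
           requests_per_sec * wanted_latency_s);
    return 0;
}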

Andrea




Re: More on 2.2.18pre2aa2

2000-09-12 Thread Jamie Lokier

Andrea Arcangeli wrote:
> Andrea - latency is time measured and perceived. Doing it time-based
> seems to make reasonable sense. I grant you might want to play with
> the weighting per [device]

Right.  Perception.

> When you have a device that writes a request every two seconds you still
> want it not to seek all the time, because that would make it even
> slower. No? The point is very simple: if you want good latency, buy
> faster hardware (and with faster hardware our current elevator can
> become even more aggressive than the 1/2 second thing). You can't
> work around the slowness of a slow device by making the elevator a
> function of time; that will only make the global system even slower.

Sure the global system is slower.  But the "interactive feel" is faster.
If I type "find /" I want it to go quickly.  But I still want Emacs to
start up in a reasonable time, even if that means the overall time for
both processes is slower.

-- Jamie



Re: More on 2.2.18pre2aa2

2000-09-12 Thread Rik van Riel

On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> 
> > But you don't. Transfer rate is very much dependent on the
> > kind of load you're putting on the disk...
> 
> Transfer rate means `hdparm -t` in single user mode. Try it and
> you'll see you always get the same result.

*sigh*

People don't use their machine to run `hdparm -t` all day.
They use their machine for different things, and user
perceived latency varies wildly between different loads
users put on the machine...

> > Throughput really isn't that relevant here. The problems are
> 
> Throughput is relevant. Again, how do you get a 1msec latency out
> of a blockdevice that writes 1 request every two seconds?

Why do you always come up with impossible examples?
If you had any realistic example I'd be inclined to
believe your argument, but this is just not realistic
enough to be taken seriously ...

> > With equally horrible results for most machines I've
> > seen. For a while I actually thought the bug /must/
> > have been somewhere else because I saw processes
> > 'hanging' for about 10 minutes before making progress
> > again ...
> 
> As said in my earlier email the current 2.4.x elevator scheduler
> is _disabled_. I repeat: you should change
> include/linux/elevator.h and set the read and write latency to
> 250 and 500 respectively. You won't get latency as good as in
> test1, but it won't hang for 10 minutes.

Not for 10 minutes no, but even with the latency set to 100
and 100 I sometimes get /bad stalls/ in just one or two
processes (while the rest of the system happily runs on).

On the other hand, when I do something different with my
machine, the settings 250 and 500 (or even higher) give
perfectly fine latency...

By applying a /different/ IO load, the same settings with
the current elevator tuning give wildly different results.

What I'd like to see is an elevator where the settings set
by the user have a direct influence in the behaviour observed.

Doing time-based request sorting should give us that behaviour
and a default of say 1/2 second would work fine for all the
hard disks and cdrom drives I've seen in the last 6 years.

Of course you can dream up an imaginary device where it won't
work, but in that case there's always the possibility of tuning
the elevator (like we do right now) ...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Chris Evans


On Tue, 12 Sep 2000, Rik van Riel wrote:

> On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> > On Tue, 12 Sep 2000, Rik van Riel wrote:
> > 
> > > Uhmmm, isn't the elevator about request /latency/ ?
> > 
> > Yes, but definitely not absolute "time" latency.
> > 
> > How do you get a 1msec latency for a read request out of a
> > blockdevice that writes 1 request in 2 seconds? See?
> 
> Of course, if you set a ridiculous latency figure you'll
> get ridiculously bad performance. However, that doesn't
> say /anything/ about whether the idea is a good one or not...

People,

Remember why the elevator algorithm was changed in the first place? It was
introduced to solve a very specific problem.

That problem: the original elevator code did not schedule I/O particularly
fairly under certain I/O usage patterns. So it got fixed.

Now, I see people trying to introduce the concept of elapsed time into
that fix, which smells strongly of hack. How will this hack be cobbled
into the elevator code so that it copes with block devices from fast RAID
arrays to slow floppies to network block devices?

So I have to agree with Andrea that the concept of time does not belong in
the elevator code. Keep it to a queue management system, and suddenly it
scales to slow or fast devices without any gross device-type specific
tuning.

Cheers
Chris




Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Hans Reiser

I really think Rik has it right here.  In particular, an MP3 player needs to
be able to say, I have X milliseconds of buffer so make my worst case latency
X milliseconds.  The number of requests is the wrong metric, because the time
required per request depends on disk geometry, disk caching, etc.

Hans



Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Chris Evans


On Tue, 12 Sep 2000, Hans Reiser wrote:

> I really think Rik has it right here.  In particular, an MP3 player
> needs to be able to say, I have X milliseconds of buffer so make my
> worst case latency X milliseconds.  The number of requests is the
> wrong metric, because the time required per request depends on disk
> geometry, disk caching, etc.

Hi,

We need to separate "what's a good idea" from "what's the best way to do
it".

Sure, it's a good idea to get an mp3 player's I/O request serviced within
a certain amount of time. Is the best way to do that hacking the concept
of time into the elevator algorithm? That's currently under debate.


In fact, the guarantee of I/O service time for a single process (mp3
player) is pretty orthogonal to the per-device elevator settings. If you
have a certain block device set to a max latency of 0.25s, and lots of
processes are hammering the disk, then something will have to give,
i.e. under heavy load this setting will be useless and not honoured.

The solution to this particular mp3 player scenario would be something
like the task scheduler policies we have. For example, maybe we could flag
a given process so that all its I/O requests go to the head of the queue.
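
Nothing like this existed in the kernel at the time; purely as a
hypothetical sketch of what such a flag could look like:

/* Hypothetical sketch only: a per-task "low latency I/O" flag that the
 * block layer could check when queueing a request.  The flag name and the
 * struct are made up; no such interface existed in 2.2/2.4. */
#define PF_IO_LOWLAT  0x01000000          /* made-up task flag */

struct toy_task {
    unsigned long flags;
};

/* Would this task's requests skip the elevator and go to the queue head? */
static int wants_head_of_queue(const struct toy_task *task)
{
    return (task->flags & PF_IO_LOWLAT) != 0;
}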

Cheers
Chris




Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Martin Dalecki

Chris Evans wrote:
 
 On Tue, 12 Sep 2000, Hans Reiser wrote:
 
  I really think Rik has it right here.  In particular, an MP3 player
  needs to be able to say, I have X milliseconds of buffer so make my
  worst case latency X milliseconds.  The number of requests is the
  wrong metric, because the time required per request depends on disk
  geometry, disk caching, etc.
 
 Hi,
 
 We need to separate "what's a good idea" from "what's the best way to do
 it".
 
 Sure, it's a good idea to get an mp3 player's I/O request serviced within
 a certain amount of time. Is the best way to do that hacking the concept
 of time into the elevator algorithm? That's currently under debate.
 
 In fact, the guarantee of I/O service time for a single process (mp3
 player) is pretty orthogonal to the per-device elevator settings. If you
 have a certain block device set to a max latency of 0.25s, and lots of
 processes are hammering the disk, then something will have to give,
 i.e. under heavy load this setting will be useless and not honoured.
 
 The solution to this particular mp3 player scenario would be something
 like the task scheduler policies we have. For example, maybe we could flag
 a given process so that all it's I/O requests go to the head of the queue.
 
 Cheers
 Chris

First of all: in the case of the mp3 player and such, there is already a
fine, proper way to give it better chances of getting its job done smoothly -
RT kernel scheduler priorities and proper IO buffering. I did something
similar for a GDI printer driver...

Second: the concept of time can give you very, very nasty behaviour in edge
cases. Assume that a disc can only do 1 request per second, and then imagine
scheduling based on 1+epsilon seconds... basically the disc will be run at
half the speed it could do. Those nasty integer arithmetics can catch you
easily, and mostly always entirely unexpectedly.

Third: all you try to improve is the boundary case between an entirely
overloaded system and a system which has a huge reserve to get the task done.
I don't think you can find any "improvement" which will not just improve some
cases and hurt some only slightly different cases badly. That's basically the
same problem as with the paging strategy to follow. (However we have some
kind of "common sense" in respect of this, despite the fact that linux does
ignore it...)

Fourth: the most common solution for such boundary cases is some notion of
cost optimization, like the nice value of a process or page age for example,
or alternatively some kind of choice between entirely different strategies
(remember the term strategy routine)
- all of them are just *relative* measures, not absolute time constraints.

Fifth: I think that this kind of IO behaviour control isn't something
generic enough for the elevator - it should all be done at the
device driver level, if at all. In fact you already have bad
interactions between strategies of low level drivers and the high
level code in Linux - like for example the "get from top of queue" or
"don't get it from top of the IO queue" mess between the
IDE and SCSI midlayers... (However this got a bit better recently.)

-- 
- phone: +49 214 8656 283
- job:   STOCK-WORLD Media AG, LEV .de (MY OPPINNIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R



Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Chris Evans


On Tue, 12 Sep 2000, Alan Cox wrote:

> > Now, I see people trying to introduce the concept of elapsed time into
> > that fix, which smells strongly of hack. How will this hack be cobbled
> 
> Actually my brain says that elapsed time based scheduling is the right
> thing to do. It certainly works for networks

Interesting, I'll try and run with this. The mention of networks reminds
me that any "max service time" variable is a tunable quantity depending on
current conditions..

.. and sct's block device I/O accounting patches give us the current
average request service time on a per-device basis. Multiply that up a bit
and maybe you have your threshold for moving things to the head of the
queue.
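
A small sketch of that calculation; the field names and the multiplier are
assumptions, not sct's actual accounting structures.

/* Sketch: derive a per-device "move to head of queue" threshold from the
 * average request service time that per-device I/O accounting could
 * provide. */
struct toy_disk_stats {
    unsigned long nr_completed;      /* requests completed so far             */
    unsigned long total_service_ms;  /* summed service time of those requests */
};

/* Requests older than this (in ms) would be promoted to the queue head. */
static unsigned long promote_threshold_ms(const struct toy_disk_stats *s)
{
    unsigned long avg_ms = s->nr_completed
                         ? s->total_service_ms / s->nr_completed
                         : 10;       /* arbitrary fallback before any data */
    return avg_ms * 8;               /* "multiply that up a bit" */
}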

Cheers
Chris




(reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Alan Cox

> That problem: the original elevator code did not schedule I/O particularly
> fairly under certain I/O usage patterns. So it got fixed.

No it got hacked up a bit.

> Now, I see people trying to introduce the concept of elapsed time into
> that fix, which smells strongly of hack. How will this hack be cobbled

Actually my brain says that elapsed time based scheduling is the right thing
to do. It certainly works for networks





Re: More on 2.2.18pre2aa2

2000-09-12 Thread Andrea Arcangeli

On Tue, 12 Sep 2000, Jamie Lokier wrote:

> Sure the global system is slower.  But the "interactive feel" is faster.
> If I type "find /" I want it to go quickly.  But I still want Emacs to

You always want it to go quickly. But when you're in the blockdevice
layer you've lost all the semantics of such an I/O request. You have no idea
if somebody is running a background `cp` or if it's your `find` that's doing
the I/O.

> start up in a reasonable time, even if that means the overall time for
> both processes is slower.

And you also want your `cp` not to run two times slower, right?

Then you have to choose. And you can choose with elvtune. (currently you
have to choose on a per-blockdevice basis, maybe in the future you'll be able
to choose on a per-'struct file' basis)

That's not a matter of the algorithm. The only difference between the "as a
function of time" latency and the "as a function of blocksize" latency is
that in kernel space we don't need to know the internal timings of the
hardware. If you know the timings and you want a 2 second latency
(assuming your harddisk writes 1 block in less than 2 seconds) then do the
calc in userspace and run the blkelvset ioctl and you're almost happy. You
won't be completely happy because, as said, the elevator will give a two
second latency also to requests that are at the end of the queue, but
this is a matter of the algorithm, not really of the unit of measure of
the latency in kernel space.

The current unit of measure allows us to use the same latency settings for
most devices out there while still providing good throughput (and avoiding
huge stalls). That's why I prefer it. But you can convert it easily in
userspace (or you could even change the unit in the kernel, but I don't see
the advantage; I'd rather have elvtune do the conversion and talk in terms
of time instead of putting the timing stuff into the kernel).

Andrea




Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Andrea Arcangeli

On Tue, 12 Sep 2000, Chris Evans wrote:

> the elevator code. Keep it to a queue management system, and suddenly it
> scales to slow or fast devices without any gross device-type specific
> tuning.

Yep, that was the object.

Andrea




(reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Rik van Riel

On Tue, 12 Sep 2000, Martin Dalecki wrote:

> Second: The concept of time can give you very, very nasty
> behaviour in edge cases. [integer arithmetic]

Point taken.

> Third: all you try to improve is the boundary case between an
> entirely overloaded system and a system which has a huge reserve
> to get the task done. I don't think you can find any
> "improvement" which will not just improve some cases and hurt
> some only slightly different cases badly. That's basically the
> same problem as with the paging strategy to follow. (However we
> have some kind of "common sense" in respect of this, despite the
> fact that linux does ignore it...)

Please don't ignore my VM work  ;)
http://www.surriel.com/patches/

> Fourth: The most common solution for such boundary cases is some
> notion of cost optimization, like the nice value of a process or
> page age for example, or alternatively some kind of choice between
> entirely different strategies (remember the term strategy
> routine) - all of them are just *relative* measures, not
> absolute time constraints.

Indeed, we'll need to work with relative measures to
make sure both throughput and latency are OK. Some kind
of (very simple) self-tuning system is probably best here.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/





(reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Rik van Riel

On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> 
> > Also, this possibility is /extremely/ remote, if not
> > impossible. Well, it could happen at one point in time,
> 
> It's not impossible. Think when you run a backup of your home
> directory while you're listening to an mp3. Both `tar` and `xmms` will
> read the same file that is out of cache.
> 
> `tar` will be the first one who will read the next out-of-cache
> data-page of the file. The I/O will be issued with low prio,
> but then, as soon as `tar` has issued the read I/O, `xmms` will
> also wait on the same page and it will skip the next deadline
> because the I/O has been issued with low prio.

Indeed, this could be an issue...

 To make it work right is not simple.

I don't know if we really have to care about this
case. The process queueing the IO is more than
likely a good guess, and a good guess is (IMHO)
better than not guessing at all and hoping things
will be ok.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/





Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Alan Cox

> Why do you say it's not been fixed? Can you still reproduce hangs long as
> a write(2) can write? I certainly can't.

I cant reproduce long hangs. Im not seeing as good I/O throughput as before
but right now Im quite happy with the tradeoff. If someone can make it better
then Im happier still

 




(reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Andrea Arcangeli

On Tue, 12 Sep 2000, Martin Dalecki wrote:

> First of all: in the case of the mp3 player and such, there is already a
> fine, proper way to give it better chances of getting its job done
> smoothly - RT kernel scheduler priorities and proper IO buffering. I did
> something similar for a GDI printer driver...

Take 2.2.15, set a buffer of 128mbyte (of course assume your mp3 is larger
than 128mbyte :) and then run `cp /dev/zero .` in the background in the same
fs where your out-of-cache mp3 file is living. Then you'll see why a large
buffer is useless if there's no kind of fair I/O scheduling in the
elevator. Then repeat the same test on 2.2.16.

The I/O latency Hans was talking about for the mp3 player is the time it
takes for the buffer to become empty.

> device driver level, if at all. In fact you already have bad
> interactions between strategies of low level drivers and the high
> level code in Linux - like for example the "get from top of queue" or
> "don't get it from top of the IO queue" mess between the
> IDE and SCSI midlayers... (However this got a bit better recently.)

That's historic cruft, it's unrelated to controlling the elevator
algorithm on a per-task/per-file basis IMHO.

Andrea





Re: More on 2.2.18pre2aa2 (summary of elevator ideas)

2000-09-12 Thread Ed Tomlinson

Hi,

Geez, a simple comment on IRC can _really_ generate lots of feedback.
(There were over 50 messages about this in my queue - did not help
that some were duplicated three times <grin>).

I made the comment because I remember back when the discussion was current
on linux-kernel.  I thought Jeff Merkey's message was to the point.
Paraphrasing from memory, it was something to the effect that Novell had
tried many elevators.  All had problems with some loads.  The best they
had found was the 'close the door' idea.  I do not remember if the door
was based on requests or time.  Another point to remember is that the
NetWare people came up with what they considered a good solution.  From
Jeff's comments they arrived at this solution by experiments and bitter
experience.  Maybe we can learn something from their research?

Here is what I glean from the thread:

From all the discussion I find this suggestion from Alan to make lots of
sense.  Think it can be made to work with number of requests almost as easily
as with time...

> When you do the scan to decide where to insert the entry you dont consider
> insertion before the time. Also you keep two queue heads the real and the
> insert head. Whenever you walk from the insert head and find it points to
> a 'too old' entry you update the insert_head.

And this suggestion from Rik should counter most of Andrea's time vs requests
vs slow block devices issues.  We just have to be sure to close the door after
at least n requests or m time.  However, as pointed out by Chris Evans, later
we may not have to do this - there are stats that can give a good idea of a
device's latency.

> Not really. What about just using something like
> "half a second, but at least 10 requests liberty to
> reorder" ?
> 
> It's simple, should be good enough for the user and
> allows for a little bit of reordering even on very
> slow devices ...

As Andrea points out, it's easy enough to do some sort of test with
the current code.

> Well changing that is very easy, you only have to change the unit of
> measure w/o changing one bit in the algorithm that is just implemented
> indeed.

How?

> Just assume the req->elevator_sequence to be calculated in jiffies and in
> elevator.c change the check for `!req->elevator_sequence' to
> `time_before(req->elevator_sequence, jiffies)'. Then of course change the
> initialization of req->elevator_sequence to be done with `jiffies +
> elevator->read_latency'. Then also elvtune will talk in jiffies and not in
> requests.

I wonder if using a wandering insert pointer, as Alan suggests, would give 
lower overhead than the current implementation (and would it really help)?
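
A self-contained toy sketch of that two-head scheme; the names and the age
limit are illustrative assumptions, not code from any posted patch.

/* Toy sketch of Alan's two-head idea: besides the real queue head we keep
 * an "insert head"; once an entry has waited too long the insert head moves
 * past it, so later requests can never be sorted in front of it. */
#include <stddef.h>

#define MAX_AGE_MS 500

struct toy_req {
    struct toy_req *next;
    long enqueue_ms;
};

struct toy_queue {
    struct toy_req *head;          /* real head: next request to be serviced */
    struct toy_req *insert_head;   /* insertion may only happen after this   */
};

/* Walk from the insert head and step it past entries that are already
 * "too old"; new requests are then inserted somewhere after insert_head. */
void update_insert_head(struct toy_queue *q, long now_ms)
{
    struct toy_req *rq = q->insert_head ? q->insert_head : q->head;

    while (rq && now_ms - rq->enqueue_ms >= MAX_AGE_MS) {
        q->insert_head = rq;
        rq = rq->next;
    }
}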

Again from Alan,

Andrea> Now, I see people trying to introduce the concept of elapsed time into
Andrea> that fix, which smells strongly of hack. How will this hack be cobbled

Actually my brain says that elapsed time based scheduling is the right
thing to do. It certainly works for networks

And from Chris Evans,

> Interesting, I'll try and run with this. The mention of networks reminds
> me that any "max service time" variable is a tunable quantity depending on
> current conditions..
> 
> .. and sct's block device I/O accounting patches give us the current
> average request service time on a per-device basis. Multiply that up a bit
> and maybe you have your threshold for moving things to the head of the
> queue.

So we could end up using a figure that builds in both number of requests and
time on a per-device level without too much effort...

And Alan sums up the whole thing nicely with:

Andrea> Why do you say it's not been fixed? Can you still reproduce hangs long
Andrea> as a write(2) can write? I certainly can't.

I cant reproduce long hangs. Im not seeing as good I/O throughput as before
but right now Im quite happy with the tradeoff. If someone can make it better
then Im happier still

If the idea works, leads to simpler code and a more responsive system, maybe
with better benchmarks, then it's a winner.  The only way we can be sure is to
try it.

Thanks,

Ed Tomlinson [EMAIL PROTECTED] (ontadata on IRC)




Re: More on 2.2.18pre2aa2 (summary of elevator ideas)

2000-09-12 Thread Jeff V. Merkey


One important point on remirroring I did not mention in my post.  In
NetWare, remirroring scans the disk BACKWARDS (n->0) to prevent
artificial starvation while remirroring is going on.  This was another
optimization we learned the hard way by trying numerous approaches to
the problem.

Jeff

Ed Tomlinson wrote:
 
 Hi,
 
 Geez, A simple comment on IRC can _really_ generate lots of feedback.
 (There were over 50 messages about this in my queue - did not help
 that some were duplicated three times grin).
 
 I made the comment because I remember back when the discussion was current
 on linux kernel.  I thought Jeff Merkey's, message was to the point.  Para-
 phrasing from memory, it was something to the effect that novell had
 tried many elevators.  All had problems with some loads.  The best they
 had found was the 'close the door' idea.  I do not remember if the door
 was based on requests or time.  Another point to remember is that the
 netware people came up with a what they considered a good solution.  From
 Jeff's comment they arrived at this solution by experiments and bitter
 experience.  Maybe we can learn something from their research?
 
 Here is what I glean from the thread:
 
 From all the discussion I find this suggestion from Alan to make lots of
 sense.  Think it can be made to work with number of request almost as easily
 as with time...
 
 When you do the scan to decide where to insert the entry you dont consider
 insertion before the time. Also you keep two queue heads the real and the
 insert head. Whenever you walk from the insert head and find it points to
 a 'too old' entry you update the insert_head.
 
 And this suggestion from Rik should counter most of Andrea's time vs requests
 vs slow block devices issues.  We just have to be sure to close the door after
 atleast n request or m time.  However, as pointed out by Chris Evan, later we
 may not have to do this - there are stats that can give a good idea of a
 device's latency.
 
 Not really. What about just using something like
 "half a second, but at least 10 requests liberty to
 reorder" ?
 
 It's simple, should be good enough for the user and
 allows for a little bit of reordering even on very
 slow devices ...
 
 As Andrea points out its easy enought to do some sort of test with
 the current code.
 
 Well changing that is very easy, you only have to change the unit of
 measure w/o changing one bit in the algorithm that is just implemented
 indeed.
 
 How
 
 Just assume the req-elevator_sequence to be calculate in jiffy and in
 elevator.c change the check for `!req-elevator_sequence' to
 `time_before(req-elevator_sequence, jiffies)'. Then of course change the
 initialization of req-elevator_sequence to be done with `jiffies +
 elevator-read_latency'. Then also elvtune will talk in jiffies and not in
 requests.
 
 I wonder if using a wandering insert pointer, as Alan suggests, would give
 lower overhead than the current implementation (and would it really help)?
 
 Again from Alan,
 
 Andrea Now, I see people trying to introduce the concept of elapsed time
 Andrea that fix, which smells strongly of hack. How will this hack be cobbled
 
  Actually my brain says that elapsed time based scheduling is the right
  thing to do. It certainly works for networks
 
 And from Chris Evans,
 
 Interesting, I'll try and run with this. The mention of networks reminds
 me that any "max service time" variable is a tunable quantity depending on
 current conditions..
 
 .. and sct's block device I/O accounting patches give us the current
 average request service time on a per-device basis. Multiply that up a bit
 and maybe you have your threshold for moving things to the head of the
 queue.
 
 So we could end up using a figure that builds in both number of request and
 time on a per device level without to much effort...
 
 And Alan sums up the whole thing nicly with:
 
 Andrea Why do you say it's not been fixed? Can you still reproduce hangs long
 Andrea a write(2) can write? I certainly can't.
 
 I cant reproduce long hangs. Im not seeing as good I/O throughput as before
 but right now Im quite happy with the tradeoff. If someone can make it better
 then Im happier still
 
 If the idea works, lead to simpler code, a more reponsive system maybe with
 better benchmarks then its a winner.  Only way we can be sure is to try it.
 
 Thanks,
 
 Ed Tomlinson [EMAIL PROTECTED] (ontadata on IRC)



Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Martin Dalecki

Andrea Arcangeli wrote:
 
> On Tue, 12 Sep 2000, Martin Dalecki wrote:
> 
> > First of all: in the case of the mp3 player and such, there is already a
> > fine, proper way to give it better chances of getting its job done
> > smoothly - RT kernel scheduler priorities and proper IO buffering. I did
> > something similar for a GDI printer driver...
> 
> Take 2.2.15, set a buffer of 128mbyte (of course assume your mp3 is larger
> than 128mbyte :) and then run `cp /dev/zero .` in the background in the same
> fs where your out-of-cache mp3 file is living. Then you'll see why a large
> buffer is useless if there's no kind of fair I/O scheduling in the
> elevator. Then repeat the same test on 2.2.16.
> 
> The I/O latency Hans was talking about for the mp3 player is the time it
> takes for the buffer to become empty.

I was talking about *proper* buffering, not necessarily *big* buffers.



Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Mitchell Blank Jr

Alan Cox wrote:
> > Now, I see people trying to introduce the concept of elapsed time into
> > that fix, which smells strongly of hack. How will this hack be cobbled
> 
> Actually my brain says that elapsed time based scheduling is the right thing
> to do.

No, Andrea is right here.  The argument that everyone is using ("Our target -
latency - is measured in time") is utterly bogus.  Yes, it's measured in
time, but remember that there are two things measured in time here:
  A. The time for the whole queue of requests to run (this is what Rik is
 proposing using to throttle)
  B. The time an average request takes to process.

If we limit on the depth of queue we're (to some level of approximation)
making our decision based on A/B.  It's still a magic constant, but at
least it's scaled to take into account the speed of the drive.  And
underneath, it's still based on time.

> It certainly works for networks

Well, actually just about any communications protocol worth its salt
uses some sort of windowing throttle based on the amount of data
outstanding, not the length of time it's been in the queue.  Which
is why TCP works well over both GigE and 28.8. [*]  Now substitute
"big fiberchannel RAID" for GigE and "360K floppy" for 28.8 and
you've got the same problem.

*  -- Yes, for optimal TCP over big WAN pipes you may want to use a
      larger buffer size, but that's a matter of the bandwidth-delay
      product, which isn't relevant for talking about storage

If we move to a "length of queue in time" as Rik suggests then we're
going to have to MAKE the user set it manually for each device.
There are too many orders of magnitude difference between even just SCSI
disks (10 yr old drive?  16-way RAID?  Solid state?) to make
supplying any sort of default with the kernel impractical.  The end
result might be a bit better behaved, but only just slightly.
If people absolutely need this behavior for some reason, the current
algorithm should stay as the default.

-Mitch



Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Alan Cox

> time, but remember that there are two things measured in time here:
>   A. The time for the whole queue of requests to run (this is what Rik is
>      proposing using to throttle)
>   B. The time an average request takes to process.

Your perceived latency is based entirely on A.

> If we limit on the depth of queue we're (to some level of approximation)
> making our decision based on A/B.  It's still a magic constant, but at

I dont suggest you do queue limiting on that basis. I suggest you do order
limiting based on time slots

> Well, actually just about any communications protocol worth its salt
> uses some sort of windowing throttle based on the amount of data

Im talking about flow control/traffic shaping

> If we move to a "length of queue in time" as Rik suggests then we're
> going to have to MAKE the user set it manually for each device.

No

> There are too many orders of magnitude difference between even just SCSI
> disks (10 yr old drive?  16-way RAID?  Solid state?) to make
> supplying any sort of default with the kernel impractical.  The end

The same argument is equally valid for the current scheme, and I think you'll
find equally bogus




Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Martin Dalecki

Hans Reiser wrote:
 
> I really think Rik has it right here.  In particular, an MP3 player needs
> to be able to say, I have X milliseconds of buffer so make my worst case
> latency X milliseconds.  The number of requests is the wrong metric,
> because the time required per request depends on disk geometry, disk
> caching, etc.

No, the problem is that an application should either:

1. Take full control of the underlying system.
2. Not care about self-tuning the OS.

Because that's what operating systems are for in the first place: letting
the applications run without caring about the underlying hardware.

Linux is just mistaken by design in that there is a generic elevator
for any block device, sitting on a single queue, for any kind of attached
device. Only the device drivers know best how to handle queueing and stuff
like this. The upper layers should only care about the semantic correctness
of the request ordering, not about optimizing it.



Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-12 Thread Mitchell Blank Jr

Alan Cox wrote:
> > time, but remember that there are two things measured in time here:
> >   A. The time for the whole queue of requests to run (this is what Rik is
> >      proposing using to throttle)
> >   B. The time an average request takes to process.
> 
> Your perceived latency is based entirely on A.

Yes, but "how hard is it reasonable for the kernel to try" is based on
both items.  A good first order approximation is number of requests.

> > If we limit on the depth of queue we're (to some level of approximation)
> > making our decision based on A/B.  It's still a magic constant, but at
> 
> I dont suggest you do queue limiting on that basis. I suggest you do order
> limiting based on time slots

It's still a queue - the queue of things we're going to take on this
elevator swipe, right?  And the problem is one of keeping a sane
watermark on this queue - not too many requests to destroy latency
but enough to let the elevator do some good.

> > Well, actually just about any communications protocol worth its salt
> > uses some sort of windowing throttle based on the amount of data
> 
> Im talking about flow control/traffic shaping

...where the user sets a number explicitly for what performance they
want.  Again, if we're going to make the user set this latency
variable for each of their devices, then doing it based on time will
work great.

> > There are too many orders of magnitude difference between even just SCSI
> > disks (10 yr old drive?  16-way RAID?  Solid state?) to make
> > supplying any sort of default with the kernel impractical.  The end
> 
> The same argument is equally valid for the current scheme, and I think you'll
> find equally bogus

There will always need to be tunables - and it's fine to say "if you've
got oddball hardware and/or workload and/or requirements then you should
twiddle this knob".  But it seems to me that the current scheme works
well for a pretty big range of devices.  If you do the setting based
on time, I think it'll be a lot more sensitive since there's nothing
that will scale based on the speed of the device.

-Mitch



Re: More on 2.2.18pre2aa2

2000-09-12 Thread David S. Miller

   Date: Tue, 12 Sep 2000 04:23:05 -0300 (BRST)
   From: Rik van Riel [EMAIL PROTECTED]

   I've just uploaded a new snapshot of my new VM for
   2.4 to my home page, this version contains a
   wakeup_kswapd() function (copied from wakeup_bdflush)
   and should balance memory a bit better.

How can drop_behind() work properly?

You do not recompute the hash chain head for each decreasing 'index'
in the main while loop, and thus you search potentially the wrong hash
chain each time.

Thus, you need to change:

+   page = __find_page_nolock(mapping, index, *hash);

to something more like:

+   hash = page_hash(mapping, index);
+   page = __find_page_nolock(mapping, index, *hash);

and remove the now-spurious local variable initialization of
'hash' at the top of this function.

Also, why this?

+   page = NULL;

That's spurious, you set it on the next line, probably this is
from some older revision of this function :-)
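
Putting both remarks together, the loop presumably ends up looking roughly
like this; a sketch of the shape only, not Rik's actual code, with locking
and the per-page work omitted.

	/* Sketch of the corrected loop shape: recompute the hash chain head
	 * for every index, and don't pre-initialise 'page'. */
	while (index-- > start) {
		struct page **hash = page_hash(mapping, index);
		struct page *page = __find_page_nolock(mapping, index, *hash);

		if (page) {
			/* ... whatever drop_behind() does with the page ... */
		}
	}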

Later,
David S. Miller
[EMAIL PROTECTED]



Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-11 Thread Michael T. Babcock

Considering there are a lot of people still using 2.0.x because they find it
more stable than the 2.2.x series, doesn't it make sense to give this
scalability to people who are already running SMP boxes on 2.2.x and who may
decide to use ReiserFS?

- Original Message -
From: "Andrea Arcangeli" <[EMAIL PROTECTED]>


> On Mon, 11 Sep 2000, Andi Kleen wrote:
>
> >BTW, there is a another optimization that could help reiserfs a lot
> >on SMP settings: do a unlock_kernel()/lock_kernel() around the user
> >copies. It is quite legal to do that (you have to handle sleeping
> >anyways in case of a page fault), and it allows CPUs to run in parallel
> >for long running copies.
>
> I'd prefer not to spend time making 2.2.x scale better in SMP; 2.4.x
> just fixed that problem by dropping the big lock in the first place in the
> read/write paths :). The copy-user reschedule points were bugfixes
> instead.




Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-11 Thread Andrea Arcangeli

On Mon, 11 Sep 2000, Andi Kleen wrote:

>BTW, there is another optimization that could help reiserfs a lot
>in SMP settings: do an unlock_kernel()/lock_kernel() around the user
>copies. It is quite legal to do that (you have to handle sleeping
>anyway in case of a page fault), and it allows CPUs to run in parallel
>for long-running copies.

I'd prefer not to spend time making 2.2.x scale better in SMP; 2.4.x
just fixed that problem by dropping the big lock in the first place in the
read/write paths :). The copy-user reschedule points were bugfixes
instead.

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: More on 2.2.18pre2aa2

2000-09-11 Thread Chris Mason


--On 09/11/00 15:02:34 +0200 Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> In 2.2.18pre2aa2.bz2 there's a latency bugfix, now a:
> 
> read(fd, , 0x7fffffff)
> write(fd, , 0x7fffffff)
> sendfile(src, dst, NULL, 0x7fffffff)
> 
> doesn't hang the machine anymore for several seconds. (well, really they
> are all three still a bit buggy because they don't interrupt themselves when
> a signal arrives, so we can't use sendfile for `cp` of files smaller than 2g
> yet... but at least now you can do something in parallel even if you are
> lucky enough to have a few gigabytes of fs cache)
> 

Thanks Andrea, Andi, new patch is attached, with the warning messages
removed.  The first patch got munged somewhere between test machine and
mailer, please don't use it.

-chris

 reiserfs-2.2.18p2aa2-2.diff.gz


Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-11 Thread Andi Kleen

On Mon, Sep 11, 2000 at 08:15:15AM -0400, Chris Mason wrote:
> LFS changes for filldir and reiserfs_readpage, plus limit checking in
> file_write to make sure we don't go above 2GB (Andi Kleen).  Also fixes
> include/linux/fs.h, which does not patch cleanly against 3.5.25 because of usb.
> 
> Note, you might see debugging messages about items moving during 
> copy_from_user.  These are safe, but I'm leaving them in for now as I'd 
> like to find out why copy_from_user is suddenly scheduling much more than 
> it used to.

That's easy to explain. Andrea's latest aa contains some low-latency
patches, which add an if (current->need_resched) schedule() to copy*user
to avoid bad scheduling latencies for big copies. The result is that you
see a lot more schedules, every time the copy*user happens to hit the
end of a time slice.
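
Roughly, the pattern those patches add looks like this user-space sketch
(copy_chunked(), need_resched and yield_cpu() are invented stand-ins for the
arch copy_*_user loop, current->need_resched and schedule(); this is an
illustration, not the real code):

#include <sched.h>
#include <string.h>

#define COPY_CHUNK 4096

static volatile int need_resched;       /* stand-in for current->need_resched */

static void yield_cpu(void)             /* stand-in for schedule() */
{
        sched_yield();
}

static void copy_chunked(void *dst, const void *src, size_t len)
{
        while (len) {
                size_t n = len < COPY_CHUNK ? len : COPY_CHUNK;

                memcpy(dst, src, n);
                dst = (char *)dst + n;
                src = (const char *)src + n;
                len -= n;

                /* the low-latency check: give up the CPU at a chunk boundary */
                if (need_resched)
                        yield_cpu();
        }
}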

BTW, there is another optimization that could help reiserfs a lot
in SMP settings: do an unlock_kernel()/lock_kernel() around the user
copies. It is quite legal to do that (you have to handle sleeping
anyway in case of a page fault), and it allows CPUs to run in parallel
for long-running copies.
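
A sketch of that locking pattern, with stub functions standing in for
lock_kernel(), unlock_kernel() and copy_from_user() (the shape of the
suggestion, not reiserfs code):

#include <pthread.h>
#include <string.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;  /* stand-in for the BKL */

static void lock_big(void)   { pthread_mutex_lock(&big_lock); }
static void unlock_big(void) { pthread_mutex_unlock(&big_lock); }

/* stand-in for copy_from_user(): returns the number of bytes NOT copied */
static unsigned long copy_from_user_stub(void *to, const void *from, unsigned long n)
{
        memcpy(to, from, n);
        return 0;
}

/* called with the big lock held, as the write path would be */
static int write_chunk(void *kbuf, const void *ubuf, unsigned long n)
{
        unsigned long left;

        unlock_big();                               /* let other CPUs into the kernel */
        left = copy_from_user_stub(kbuf, ubuf, n);  /* may sleep on a page fault */
        lock_big();                                 /* retake it before touching shared state */

        return left ? -1 : 0;
}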


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: (reiserfs) Re: More on 2.2.18pre2aa2

2000-09-11 Thread Andrea Arcangeli

On Mon, 11 Sep 2000, Chris Mason wrote:

>reiserfs-3.5.25, this patch.  I tested against pre3-aa2.

BTW, pre3-aa2 means 2.2.18pre2aa2.bz2 applied on top of 2.2.18pre3.

>Note, you might see debugging messages about items moving during 
>copy_from_user.  These are safe, but I'm leaving them in for now as I'd 
>like to find out why copy_from_user is suddenly scheduling much more than 
>it used to.

In 2.2.18pre2aa2.bz2 there's a latency bugfix, now a:

read(fd, , 0x7fffffff)
write(fd, , 0x7fffffff)
sendfile(src, dst, NULL, 0x7fffffff)

doesn't hang the machine anymore for several seconds. (well, really they
are all three still a bit buggy because they don't interrupt themselves when
a signal arrives, so we can't use sendfile for `cp` of files smaller than 2g
yet... but at least now you can do something in parallel even if you are
lucky enough to have a few gigabytes of fs cache)
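
Until that is fixed, a cp-like program that wants to stay interruptible can
simply avoid the one huge call and loop over smaller chunks itself.  A rough
user-space sketch (copy_file() and CHUNK are invented here, and the error
handling is abbreviated):

#include <sys/types.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <errno.h>

#define CHUNK (1UL << 20)               /* 1MB per call, an arbitrary choice */

static int copy_file(int src, int dst)
{
        struct stat st;
        off_t off = 0;

        if (fstat(src, &st) < 0)
                return -1;

        while (off < st.st_size) {
                ssize_t n = sendfile(dst, src, &off, CHUNK);

                if (n < 0) {
                        if (errno == EINTR)
                                continue;   /* a signal arrived between chunks */
                        return -1;
                }
                if (n == 0)
                        break;
        }
        return 0;
}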

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: More on 2.2.18pre2aa2

2000-09-11 Thread Chris Mason


--On 09/11/00 07:45:16 -0400 Ed Tomlinson <[EMAIL PROTECTED]> wrote:

> Hi Chris,
>
>>> Something between bigmem and his big VM changes makes reiserfs
>>> uncompilable. [..]
>
>> It's due to LFS. Chris should have a reiserfs patch that compiles on top of
>> 2.2.18pre2aa2, right? (if not Chris, I can surely find it, because the
>> server that was reproducing the DAC960 SMP lock inversion was running
>> 2.2.18pre2aa2+IKD on top of a huge reiserfs fs)
>
> Chris, is this patch posted anywhere?  Alternatively, can you take a look
> at the updated patch Brian posted and comment/correct it (if required)?
>

Patch attached.  The patch order should be 2.2.18-prex-aa2, 
reiserfs-3.5.25, this patch.  I tested against pre3-aa2.

LFS changes for filldir and reiserfs_readpage, plus limit checking in
file_write to make sure we don't go above 2GB (Andi Kleen).  Also fixes
include/linux/fs.h, which does not patch cleanly against 3.5.25 because of usb.

Note, you might see debugging messages about items moving during 
copy_from_user.  These are safe, but I'm leaving them in for now as I'd 
like to find out why copy_from_user is suddenly scheduling much more than 
it used to.

> Has the 'atomic' delete code made it into the main reiserfs trees?
>

No, it is still only 90% done.

-chris


diff -urN diff/linux/fs/reiserfs/dir.c linux/fs/reiserfs/dir.c
--- diff/linux/fs/reiserfs/dir.cSun Sep 10 21:58:07 2000
+++ linux/fs/reiserfs/dir.c Mon Sep 11 01:04:19 2000
@@ -177,7 +177,7 @@
// user space buffer is swapped out. At that time
// entry can move to somewhere else
memcpy (local_buf, d_name, d_reclen);
-   if (filldir (dirent, local_buf, d_reclen, d_off, d_ino) < 0) {
+   if (filldir (dirent, local_buf, d_reclen, d_off, d_ino, DT_UNKNOWN) < 0) {
pathrelse (&path_to_entry);
filp->f_pos = next_pos;
if (local_buf != small_buf) {
diff -urN diff/linux/fs/reiserfs/file.c linux/fs/reiserfs/file.c
--- diff/linux/fs/reiserfs/file.c   Sun Sep 10 21:58:07 2000
+++ linux/fs/reiserfs/file.cSun Sep 10 22:20:48 2000
@@ -40,6 +40,8 @@
 int reiserfs_readpage (struct file * file, struct page * page);


+/* LFS: should we add a open function that gives EINVAL for O_LARGEFILE? The specs
+   are not clear here. -AK */
 static struct file_operations reiserfs_file_operations = {
 NULL,  /* lseek */
 reiserfs_file_read, /* read */
@@ -743,40 +745,58 @@
   int windex ;
   struct reiserfs_transaction_handle th ;
   unsigned long limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
-  unsigned long pos = *p_n_pos;
-
   struct buffer_head *buffer_list[REISERFS_NBUF] ;
   int buffer_count = 0 ;
   int n_blocks_flushed = 0 ; /* tracks i/o errors during O_SYNC */
-
-/*
-  if (!p_s_inode->i_op || !p_s_inode->i_op->updatepage)
-  return -EIO;
-  */
+
   if (p_s_filp->f_error) {
   int error = p_s_filp->f_error;
   p_s_filp->f_error = 0;
   return error;
   }
+  if (n_count == 0)
+  return 0;
+  if ((signed)n_count < 0)
+  return -EINVAL;
+
+/*
+  if (!p_s_inode->i_op || !p_s_inode->i_op->updatepage)
+  return -EIO;
+  */

   /* Calculate position in the file. */
+  /* Make sure nothing overflows before converting loff_t to unsigned long */
   if ( p_s_filp->f_flags & O_APPEND ) {
-n_pos_in_file = p_s_inode->i_size + 1;
-pos = p_s_inode->i_size;
-  } else
-n_pos_in_file = *p_n_pos + 1;
-
-  if (pos >= limit)
+/* notify_change should guard against that */
+if (p_s_inode->i_size > (0x7fffffff - 1))
+  return -EFBIG;
+n_pos_in_file = ((unsigned long)p_s_inode->i_size) + 1;
+  } else {
+if (*p_n_pos > (0x7fffffff - 1))
   return -EFBIG;
+n_pos_in_file = ((unsigned long)*p_n_pos) + 1;
+  }

-  if (n_count > limit - pos)
-  n_count = limit - pos;
+  if (limit >= 0x7fffffff)
+limit = 0x7fffffff - 1;
+
+  if (n_pos_in_file + n_count >= limit) {
+/* Should send SIGXFSZ when above rlim */
+if (n_pos_in_file - 1 < limit)
+  n_count = limit - n_pos_in_file + 1;
+else
+  return -EFBIG;
+  }
+
+  if (n_count > limit - n_pos_in_file)
+  n_count = limit - n_pos_in_file;

   n_written = 0;
   n_tail_bytes_written = 0;
   p_s_sb = p_s_inode->i_sb;

   remove_suid(p_s_inode) ;
+
   journal_begin(&th, p_s_sb, jbegin_count) ;

   if (p_s_filp->f_flags & O_SYNC) {
diff -urN diff/linux/fs/reiserfs/inode.c linux/fs/reiserfs/inode.c
--- diff/linux/fs/reiserfs/inode.c  Sun Sep 10 21:58:07 2000
+++ linux/fs/reiserfs/inode.c   Mon Sep 11 00:33:07 2000
@@ -217,10 +217,12 @@
 inode = file->f_dentry->d_inode;

 increment_i_read_sync_counter(inode) ;
-if (has_tail (inode) && tail_offset (inode) < page->offset + PAGE_SIZE) {
+if (has_tail (inode) &&
+tail_offset(inode) < pgoff2ulong(page->index) * PAGE_SIZE + PAGE_SIZE) {
/* there is a tail and it is in this page */
