Re: [PATCH] Re: BUG: race-cond with partition-check

2001-06-08 Thread Malcolm Beattie

[EMAIL PROTECTED] writes:
> --- partitions/check.c~   Thu May 31 22:26:56 2001
> +++ partitions/check.c    Fri Jun  8 10:44:02 2001
> @@ -418,11 +418,10 @@
>   blk_size[dev->major] = NULL;
>  
>   dev->part[first_minor].nr_sects = size;
> - /* No Such Agen^Wdevice or no minors to use for partitions */
> + /* No such device or no minors to use for partitions */


Any reason why you're silently removing a good old anti-NSA joke?
Conspiracy theorists may have fun with that... :-)

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-24 Thread Malcolm Beattie

[cc list reduced]

Andreas Dilger writes:
> PS - I used to think shrinking a filesystem online was useful, but there
>  are a huge amount of problems with this and very few real-life
>  benefits, as long as you can at least do offline shrinking.  With
>  proper LVM usage, the need to shrink a filesystem never really
>  happens in practise, unlike the partition case where you always
>  have to guess in advance how big a filesystem needs to be, and then
>  add 10% for a safety margin.  With LVM you just create the minimal
>  sized device you need now, and freely grow it in the future.

In an attempt to nudge you back towards your previous opinion: consider
a system-wide spool or tmp filesystem. It would be nice to be able to
add in a few extra volumes for a busy period but then shrink it down
again when usage returns to normal. In the absence of the ability to
shrink a live filesystem, storage management becomes a much harder job.
You can't throw in a spare volume or two where it's needed without
careful thought because you'll be ratcheting up the space on that one
filesystem without being able to change your mind and reduce it again
later. You'll end up with stingy storage admins who refuse to give you
a bunch of extra filesystem space for a while because they can't get it
back again afterwards.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: LANANA: To Pending Device Number Registrants

2001-05-16 Thread Malcolm Beattie

Alexander Viro writes:
> thing, we could turn mount(2) into
>   open appropriate fs type
>   convince the sucker that you are allowed, tell which device you want,
> etc.
>   open mountpoint
>   mount(fs_fd, dir_fd)
> Would work like charm, especially since we could fit the network filesystems
> into the same scheme and get rid of the kludges a-la ncpfs mount sequence.
> 
> There's only one sore spot: how'd you mount _that_ fs? ;-)

Start up init with fs_fd on file descriptor 3 and init can put it
where it likes.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Not a typewriter

2001-05-11 Thread Malcolm Beattie

Jonathan Lundell writes:
> FWIW, the comment in errno.h under Solaris 2.6 is "Inappropriate 
> ioctl for device". I believe that's the POSIX interpretation.

POSIX has

  [ENOTTY]  Inappropriate I/O control operation
A control function was attempted for a file or special file
for which the operation was inappropriate.

which is quite a nice way of putting it.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Wow! Is memory ever cheap!

2001-05-09 Thread Malcolm Beattie

Larry McVoy writes:
> On Wed, May 09, 2001 at 12:24:25AM -0400, Marty Leisner wrote:
> > My understanding is suns big machines stopped using ecc and they
> 
> The SUN problem was a cache problem and there is no way that I believe
> that SUN would turn off ECC in the cache.  There are good reasons for
> not doing so.  If you think through the end to end argument, you will
> see that you have no way to do checks on the data path into/out of the
> processor.  If that part of the datapath is not checked then no amount
> of checking elsewhere does any good, the processor can be corrupting
> your data and never know it.  If SUN was so stupid as to remove this,
> then it is a dramatically different place.  I heard that there was a
> bug in the cache controller, I never heard that they had removed ECC.

There are issues with error detection/correction/recovery with
different designs of L1 and L2 caches. There's a good paper:

IBM S/390 storage hierarchy - G5 and G6 performance considerations
IBM Journal of Research and Development
Vol 43 No. 5/6
available at
http://www.research.ibm.com/journal/rd/435/jackson.html

which covers IBM's choice of L1 and L2 design for S/390. The section on
"S/390 reliability and performance implications" is relevant here. In
particular, they use a solution which isn't best from the performance
point of view but ensures you don't discover "too late" about an error.
I heard a rumour (now I get to the unsubstantiated part :-) that Sun
chose a higher-performing design for their cache subsystem but which has
a nastier failure mode in the case of cache errors.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Block device strategy and requests

2001-04-26 Thread Malcolm Beattie

I'm designing a block device driver for a high performance disk
subsystem with unusual characteristics. To what extent is the
limited number of "struct request"s (128 by default) necessary for
back-pressure? With this I/O subsystem it would be possible for the
strategy function to rip the requests from the request list straight
away, arrange for the I/Os to be done to/from the buffer_heads (with
no additional state required) with no memory "leak". This would
effectively mean that the only limit on the number of I/Os queued
would be the number of buffer_heads allocated; not a fixed number of
"struct request"s in flight. Is this reasonable or does any memory or
resource balancing depend on the number of I/Os outstanding being
bounded?

Also, there is a lot of flexibility in how often interrupts are sent
to mark the buffer_heads up-to-date. (With the requests pulled
straight off the queue, the job of end_that_request_first() in doing
the linked list updates and bh->b_end_io() callbacks would be done by
the interrupt routine directly.) At one extreme, I could take an
interrupt for each 4K block issued and mark it up-to-date very
quickly making for very low-latency I/O but a very large interrupt
rate when I/O throughput is high. The alternative would be to arrange
for an interrupt every n buffer_heads (or based on some other
criterion) and mark buffers up-to-date only at each of those
completions. Are there any rules of thumb on which is best or
doesn't it matter too much?

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: zSeries

2001-04-18 Thread Malcolm Beattie

Frank Fiene writes:
> Who can tell me how the performance of IBM's big iron (zSeries)
> compares with a PC server running Linux?

Different. Very, very different. Elaborate on what problem you're
trying to solve and then there'll be more chance of comparing
platforms.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: ftruncate not extending files?

2001-03-02 Thread Malcolm Beattie

bert hubert writes:
> I would've sworn, based on the fact that I saw people do it, that ftruncate
> was a legitimate way to extend a file

Well, it's not SUSv2-compliant:

http://www.opengroup.org/onlinepubs/007908799/xsh/ftruncate.html

If the file previously was larger than length, the extra data is
discarded. If it was previously shorter than length, it is
unspecified whether the file is changed or its size increased. If
^^^
the file is extended, the extended area appears as if it were
zero-filled.

How "legitimate" relates to "SUSv2-compliant" is your call.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Serious reproducible 2.4.x kernel hang

2001-02-01 Thread Malcolm Beattie

Chris Evans writes:
> 
> On Thu, 1 Feb 2001, Malcolm Beattie wrote:
> 
> > Mapping the addresses from whichever ScrollLock combination produced
> > the task list to symbols produces the call trace
> >  do_exit <- do_signal <- tcp_destroy_sock <- inet_ioctl <- signal_return
> >
> > The inet_ioctl is odd there--vsftpd doesn't explicitly call ioctl
> > anywhere at all and the next function before it in memory is
> > inet_shutdown which looks more believable. I have checked I'm looking
> 
> Probably, the empty SIGPIPE handler triggered. The response to this is a
> lot of shutdown() close() and finally an exit().
> 
> The trace you give above looks like the child process trace. I always see
> the parent process go nuts. The parent process is almost always blocking
> on read() of a unix dgram socket, which it shares with the child. The
> child does a shutdown() on this socket just before exit().

We've done some more detective work. I can reproduce the hang too
by quitting the ftp client abruptly (^Z and kill %1 in my case).
Inducing the hang while stracing the daemon shows a recv returning 0
as expected when the socket closes. The daemon then calls "die":

die(const char* p_text)
{
  /* Going down hard... */
#ifdef DIE_DEBUG
  bug(p_text);
#endif

and DIE_DEBUG is defined. bug() writes an error message and then does
three things:
shutdown(2) on the sockets
close(2) on the sockets
abort()
the last of which libc implements as
rt_sigprocmask(SIG_UNBLOCK, [SIGABRT])
kill(getpid(), SIGABRT)

Here's the interesting thing: doing an exit(0) before the shutdowns
and abort gets rid of the hang. The only unusual and potentially
untested thing I could find about the program was that it uses
capset() and prctl(PR_SET_KEEPCAPS). However, replacing the
"retval = capset(...)" call with a dummy "retval = 0" doesn't get
rid of the hang. So it looks as though some combination of
shutdown(2) and SIGABRT is at fault. After the hang the kernel-side
stack trace is always either the one I gave above (and I *did*
write down the address for inet_ioctl correctly; it's definitely
not inet_shutdown) or else
  do_exit <- do_signal <- schedule <- syscall_trace <- signal_return
(with exactly the same addresses as above except for the differing
schedule and syscall_trace ones) which appeared after the hang while
vsftpd was being run under strace.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Serious reproducible 2.4.x kernel hang

2001-02-01 Thread Malcolm Beattie

Malcolm Beattie writes:
> Chris Evans writes:
> > I've just managed to reproduce this personally on 2.4.0. I've had a report
> > that 2.4.1 is also affected. Both myself and the other person who
> > reproduced this have SMP i686 machines, which may or may not be relevant.
> > 
> > To reproduce, all you need to do is get my vsftpd ftp server:
> > ftp://ferret.lmh.ox.ac.uk/pub/linux/vsftpd-0.0.9.tar.gz
[...]
> As in Chris' case, vsftpd was a zombie (so Foo-ScrollLock told me) and
> all other processes were looking OK in R or S state.

Mapping the addresses from whichever ScrollLock combination produced
the task list to symbols produces the call trace
 do_exit <- do_signal <- tcp_destroy_sock <- inet_ioctl <- signal_return

The inet_ioctl is odd there--vsftpd doesn't explicitly call ioctl
anywhere at all and the next function before it in memory is
inet_shutdown which looks more believable. I have checked I'm looking
at the right System.map but I suppose I may have mis-transcribed the
address when writing it down. vsftpd doesn't make use of signal
handlers except to unset some existing ones and a SIGALRM handler
which I don't think would have triggered. Something like a seg fault
may have caused it (I should have seen an oops if it had happened in
kernel space) perhaps?

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Serious reproducible 2.4.x kernel hang

2001-02-01 Thread Malcolm Beattie

Chris Evans writes:
> I've just managed to reproduce this personally on 2.4.0. I've had a report
> that 2.4.1 is also affected. Both myself and the other person who
> reproduced this have SMP i686 machines, which may or may not be relevant.
> 
> To reproduce, all you need to do is get my vsftpd ftp server:
> ftp://ferret.lmh.ox.ac.uk/pub/linux/vsftpd-0.0.9.tar.gz

I got this just before lunch too. I was trying out 2.4.1 + zerocopy
(with netfilter configured off, see the sendfile/zerocopy thread for
more details and hardware specs) and tried running vsftpd on the
slower machine instead of the faster machine as before. I connected
to vsftpd with an ftp client and got a
500 OOPS: chdir
Login failed.
421 Service not available, remote server has closed connection
(ftpd's idea of an OOPS; not the kernel's idea of an oops, of course).
That was probably because I hadn't configured the directory properly
but following that the machine hung, in the following way: userland
froze: no more logins, and existing xterm processes didn't refresh
their windows on my (remote) display. The machine was still pingable,
though.

I configured Magic SysRq into the kernel but hadn't played with it
before so I hadn't enabled it in /proc (D'oh. Next time I'll know.)
As in Chris' case, vsftpd was a zombie (so Foo-ScrollLock told me) and
all other processes were looking OK in R or S state.

Looking at the kernel's EIP every so often to see what was going
showed remove_wait_queue, add_wait_queue, skb_recv_datagram and
wait_for_packet mostly. Random thought: if vsftpd did a sendfile and
then exited, becoming a zombie, could there be a problem with
tearing down a sendfile mapping? I'm off to read some code.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [UPDATE] Fresh zerocopy patch on kernel.org

2001-02-01 Thread Malcolm Beattie

David S. Miller writes:
> 
> Malcolm Beattie writes:
>  > David S. Miller writes:
>  > > 
>  > > At the usual place:
>  > > 
>  > > ftp://ftp.kernel.org/pub/linux/kernel/people/davem/zerocopy-2.4.1-1.diff.gz
>  > 
>  > Hmm, disappointing results here; maybe I've missed something.
> 
> As discussed elsewhere there is a 10% to 15% performance hit for
> normal write()'s done with the new code.
> 
> If you do your testing using sendfile() as the data source, your
> results ought to be wildly different and more encouraging.

I did say that the ftp test used sendfile() as the data source and
it dropped from 86 MB/s to 62 MB/s. Alexey has mailed me suggesting
the problem may be that netfilter is turned on. It is indeed turned
on in both the 2.4.1 config and the 2.4.1+zc config but maybe it has
a far higher detrimental effect in the zerocopy case. I'm currently
building new non-netfilter kernels and I'll go through the exercise
again. I'm confident I'll end up being impressed with the numbers
even if it takes some tweaking to get there :-)

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Serious reproducible 2.4.x kernel hang

2001-02-01 Thread Malcolm Beattie

Chris Evans writes:
>
> On Thu, 1 Feb 2001, Malcolm Beattie wrote:
>
> > Mapping the addresses from whichever ScrollLock combination produced
> > the task list to symbols produces the call trace
> >  do_exit -> do_signal -> tcp_destroy_sock -> inet_ioctl -> signal_return
> >
> > The inet_ioctl is odd there--vsftpd doesn't explicitly call ioctl
> > anywhere at all and the next function before it in memory is
> > inet_shutdown which looks more believable. I have checked I'm looking
>
> Probably, the empty SIGPIPE handler triggered. The response to this is a
> lot of shutdown() close() and finally an exit().
>
> The trace you give above looks like the child process trace. I always see
> the parent process go nuts. The parent process is almost always blocking
> on read() of a unix dgram socket, which it shares with the child. The
> child does a shutdown() on this socket just before exit().

We've done some more detective work. I can reproduce the hang too
by quitting the ftp client abruptly (^Z and kill %1 in my case).
Inducing the hang while stracing the daemon shows a recv returning 0
as expected when the socket closes. The daemon then calls "die":

die(const char* p_text)
{
  /* Going down hard... */
#ifdef DIE_DEBUG
  bug(p_text);
#endif

and DIE_DEBUG is defined. bug() writes an error message and then does
three things:
shutdown(2) on the sockets
close(2) on the sockets
abort()
the last of which libc implements as
rt_sigprocmask(SIG_UNBLOCK, [SIGABRT])
kill(getpid(), SIGABRT)

Here's the interesting thing: doing an exit(0) before the shutdowns
and abort gets rid of the hang. The only unusual and potentially
untested thing I could find about the program was that it uses
capset() and prctl(PR_SET_KEEPCAPS). However, replacing the
"retval = capset(...)" call with a dummy "retval = 0" doesn't get
rid of the hang. So it looks as though some combination of
shutdown(2) and SIGABRT is at fault. After the hang the kernel-side
stack trace is always either the one I gave above (and I *did*
write down the address for inet_ioctl correctly; it's definitely
not inet_shutdown) or else
  do_exit -> do_signal -> schedule -> syscall_trace -> signal_return
(with exactly the same addresses as above except for the differing
schedule and syscall_trace ones) which appeared after the hang while
vsftpd was being run under strace.

--Malcolm

-- 
Malcolm Beattie [EMAIL PROTECTED]
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [UPDATE] Fresh zerocopy patch on kernel.org

2001-01-31 Thread Malcolm Beattie

David S. Miller writes:
> 
> At the usual place:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/davem/zerocopy-2.4.1-1.diff.gz

Hmm, disappointing results here; maybe I've missed something.

Setup is a Pentium II 350MHz (tusk) connected to a Pentium III
733MHz (heffalump) (both 512MB RAM) with SX fibre, each with a
3Com 3C985 NIC. Kernels compared are 2.4.1 and 2.4.1+zc
(the 2.4.1-1 diff above) using acenic driver with MTU set to 9000.
Sysctls set are
# Raise socket buffer limits
net.core.rmem_max = 262144
net.core.wmem_max = 262144
# Increase TCP write memory
net.ipv4.tcp_wmem = 10 10 10
on both sides.

Comparison tests done were
gensink4: 10485760 (10MB) buffer size, 262144 (256K) socket buffer
ftp: server does sendfile() from a 300MB file in page cache,
 client does read from socket/write to /dev/null in 4K chunks.

                            2.4.1                       2.4.1+zc
                   KByte/s  tusk%CPU  heff%CPU  KByte/s  tusk%CPU  heff%CPU
gensink4
  tusk->heffalump    94000    58-100        93    54000    98-102     11-45
  heffalump->tusk    72000    86-100     46-59        7     71-93     53-71

                           2.4.1     2.4.1+zc
                         KByte/s      KByte/s
ftp heffalump->tusk        86000        62000


I was impressed with the raw 2.4.1 figures and hoped to be even more
impressed with the 2.4.1+zc numbers. Is there something I'm missing or
can change or do to help to improve matters or track down potential
problems?

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN)

2001-01-31 Thread Malcolm Beattie

Ingo Molnar writes:
> 
> On Tue, 30 Jan 2001, jamal wrote:
> 
> > > - is this UDP or TCP based? (UDP i guess)
> > >
> > TCP
> 
> well then i'd suggest to do:
> 
>   echo 10 10 10 > /proc/sys/net/ipv4/tcp_wmem
> 
> does this make any difference?

For the last week I've been benchmarking Linux network and I/O on a
couple of machines with 3c985 gigabit cards and some other stuff
(see below). One of the things I tried yesterday was a beta test
version of a secure ftpd written by Chris Evans which happens to use
sendfile() making it a convenient extra benchmark. I'd already put
net.core.{r,w}mem_max up to 262144 for the sake of gensink and other
benchmarks which raise SO_{SND,RCV}BUF. I hadn't however, tried
raising tcp_wmem as per your suggestion above.

Currently the systems are linked back to back with fibre with jumbo
frames (MTU 9000) on and running pure kernel 2.4.1. I transferred a 300
MByte file repeatedly from the server to the client with an ftp "get"
client-side. The file will have been completely in page cache on the
server (both machines have 512MB RAM) and was written to /dev/null on
the client side. (Yes, I checked the client was doing ordinary
read/write and not throwing it away).

Without the raised tcp_wmem setting I was getting 81 MByte/s.
With tcp_wmem set as above I got 86 MByte/s. Nice increase. Any other
setting I can tweak apart from {r,w}mem_max and tcp_{w,r}mem? The CPU
on the client (350 MHz PII) is the bottleneck: gensink4 maxes out at
69 Mbyte/s pulling TCP from the server and 94 Mbyte/s pushing. (The
other system, 733 MHz PIII pushes >100MByte/s UDP with ttcp but the
client drops most of it).

I'll be following up Dave Miller's "please benchmark zerocopy"
request when I've got some more numbers written down since I've only
just put the zerocopy patch in and haven't rebooted yet.

If anyone wants any other specific benchmarks done (I/O or network)
I may get some time to do them: the PIII system has an 8-port
Escalade card with 8 x 46GB disks (117 MByte/s block writes as
measured by Bonnie on a RAID1/0 mixed RAIDset) and there are also
four dual-port eepro fast ethernet cards, a Cisco 8-port 3508G gigabit
switch and a 24-port 3524 fast ethernet switch (gigastack linked to
the 3508G).  I'm benchmarking and looking into the possibility of a DIY
NAS or SAN-type thing.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Modprobe local root exploit

2000-11-14 Thread Malcolm Beattie

Keith Owens writes:
> All these patches against request_module are attacking the problem at
> the wrong point.  The kernel can request any module name it likes,
> using any string it likes, as long as the kernel generates the name.
> The real problem is when the kernel blindly accepts some user input and
> passes it straight to modprobe, then the kernel is acting like a setuid
> wrapper for a program that was never designed to run setuid.

Rather than add sanity checking to modprobe, it would be a lot easier
and safer from a security audit point of view to have the kernel call
/sbin/kmodprobe instead of /sbin/modprobe. Then kmodprobe can sanitise
all the data and exec the real modprobe. That way the only thing that
needs auditing is a string munging/sanitising program.
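
That sanitising step is easy to sketch. The fragment below is purely
illustrative Python (the proposed /sbin/kmodprobe would in reality be a
small C program, and "kmodprobe" is the hypothetical name from the
paragraph above, not an existing binary); it shows the kind of whitelist
check such a wrapper could apply before exec'ing the real modprobe with
the "-s -k --" argument vector the kernel's kmod conventionally passes:

```python
import os
import re

# Module names the kernel legitimately requests look like
# "scsi_hostadapter", "net-pf-10" or "char-major-10-144":
# an alphanumeric first character, then alphanumerics, '-' and '_'.
SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{0,63}$")

def sanitized(name):
    """True if `name` is safe to hand to the real modprobe: no shell
    metacharacters, no option-like leading '-', no path components."""
    return bool(SAFE_NAME.match(name))

def kmodprobe(name):
    """Refuse suspicious names, then exec the real modprobe with a
    fixed argument vector (no shell is ever involved)."""
    if not sanitized(name):
        raise ValueError("refusing suspicious module name: %r" % name)
    os.execv("/sbin/modprobe", ["modprobe", "-s", "-k", "--", name])

print(sanitized("char-major-10-144"))   # True
print(sanitized("net-pf-$(reboot)"))    # False
print(sanitized("-C /tmp/evil.conf"))   # False
```

Because the argument vector is fixed and exec'd directly, even a name
that slipped through could not be interpreted by a shell; the pattern
check merely narrows the attack surface further.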

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Topic for discussion: OS Design

2000-10-23 Thread Malcolm Beattie

Marty Fouts writes:
> I have had the good fortune of working with one architecture (PA-RISC) which
> gets the separation of addressability and accessability 'right' enough to be
> able to partition efficiently and use ordinary procedure calls (with some
> magic at server boundaries) rather than IPCs.  There are others, but PA-RISC
> is the one I am aware of.

Like S/390 secondary address space and cross-address-space services?

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: New Benchmark tools, lookie looky........

2000-10-17 Thread Malcolm Beattie

Larry McVoy writes:
> On Tue, Oct 17, 2000 at 09:21:00AM -0700, Andre Hedrick wrote:
> > Expand 'traces' ... O-SCOPE analyizer?
> 
> Insert a ring buffer into the disk sort entry point.  Add a userland process
> which reads this ring buffer and gets the actual requests in the actual order
> they are sent to the drive[s].  Then take that data and write a simulator into
> which you can plug in different algs.  I have all this crud for SunOS if you
> want it, including elevator.c, hacksaw.c, and inorder.c.

I wrote a lightweight kernel->userland ring buffer device for Linux
called bufflink and a block-request logger that uses it called
reqlog. reqlog writes a structure
struct reqlog {
unsigned intmajor;
unsigned intminor;
unsigned long   sector;
longnr_sectors;
};
to the ring buffer when an ioctl is done to enable logging. The
current patch isn't quite what you were suggesting since it does
roughly

 add_request() {
...
elevator_queue(req, tmp, latency, dev, current_request);
+   if (bl_reqlog && enable_reqlog) {
+   ...
+   bufflink_append(bl_reqlog, (unsigned char *)&rl, sizeof rl);
+   }

if (queue_new_request)
(dev->request_fn)();
 }
but it would be easy to write the record (instead or as well)
before the elevator_queue(). The patches are available from
http://users.ox.ac.uk/~mbeattie/linux-kernel.html

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: mapping user space buffer to kernel address space

2000-10-13 Thread Malcolm Beattie

[EMAIL PROTECTED] writes:
> I have a user buffer and i want to map it to kernel address space
> can anyone tell how to do this like in AIX we have xmattach

In 2.2, you're better off providing a fake character device driver
which allocates the memory in kernel space and lets the user mmap it.
In 2.4, you could try out kiobufs which, if my reading of the section
marked "#ifdef HACKING", "case BTTV_JUST_HACKING" and also
   /* playing with kiobufs and dma-to-userspace */
is correct, goes (modulo error handling) roughly like:

struct kiobuf *iobuf;
alloc_kiovec(1, &iobuf); /* allocate a(n array of one) kiobuf */
map_user_kiobuf(READ, iobuf, va, len); /* userland vaddr and length */
   /* s/READ/WRITE/ for write */
/* now you have an iobuf containing pinned down user pages */
...
lock_kiovec(1, &iobuf, 1); /* Lock pages down for I/O */
   /* first 1 is vector count */
   /* second means wait for lock */
..  /* do I/O on it */
free_kiovec(1, &iobuf);/* does an implicit unlock_kiovec */

It doesn't do an unmap_kiobuf(iobuf) so I don't understand where
the per-page map->count that map_user_kiobuf incremented gets
decremented again. Anyone? Lowlevel I/O on a kiovec can be done
with something like an ll_rw_kiovec which sct said was going to get
put in but since I haven't read anything more recent than
2.4.0-test5 at the moment, I can't say if it's there or what it
looks like.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: large memory support for x86

2000-10-12 Thread Malcolm Beattie

Timur Tabi writes:
> ** Reply to message from Jeff Epler <[EMAIL PROTECTED]> on Thu, 12 Oct 2000
> 13:08:19 -0500
> > What the support for >4G of memory on x86 is about, is the "PAE", Page Address
> > Extension, supported on P6 generation of machines, as well as on Athlons
> > (I think).  With these, the kernel can use >4G of memory, but it still can't
> > present a >32bit address space to user processes.  But you could have 8G
> > physical RAM and run 4 ~2G or 2 ~4G processes simultaneously in core.
> 
> How about the kernel itself?  How do I access the memory above 4GB inside a
> device driver?

It depends on what you have already. If you're given a (kernel)
virtual address, just dereference it. The unit of currency for
physical pages is the "struct page". If you want to allocate a
physical page for your own use (from anywhere in physical memory)
then you do

struct page *page = alloc_page(GFP_FOO);

If you want to read/write to that page directly from kernel space
then you need to map it into kernel space:

char *va = kmap(page);
/* read/write from the page starting at virtual address va */
kunmap(va);

The implementations of kmap and kunmap are such that mappings are
cached (within reason) so it's reasonably fast doing kmap/kunmap.
If you want to do something else with the page (like get some I/O
done to/from it) then the new (and forthcoming) kiobuf functions
take struct page units and handle all the internal mapping gubbins
without you having to worry about it.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: want tool to open RPM package on Window 95

2000-10-12 Thread Malcolm Beattie

Michal Jaegermann writes:
> > Somewhere floating around there is a perl version of rpm2cpio.
> 
> This is what I wrote one day a long time ago:
> 
> #!/usr/bin/perl -w
> use strict;
> 
> my ($buffer, $pos, $gzmagic);
> $gzmagic = "\037\213";
> open OUT, "| gunzip" or die "cannot find gunzip; $!\n";
> while(1) {
>   exit 1 unless defined($pos = read STDIN, $buffer, 8192) and $pos > 0;
>   next unless ($pos = index $buffer, $gzmagic) >= 0;
>   print OUT substr $buffer, $pos;
>   last;
> }
> print OUT <STDIN>;
> exit 0;
> 
> Yes, I know that I should not mix 'read' with stdio but it worked
> every time I tried the above. :-)

The good news is that "read" does use stdio (along with seek and print).
The syscall ones are sys{read,write,seek}. The less good news is that
your "print OUT " sucks up all the RPM file into memory before
dumping it out again which is inelegant and leads those who copy the
idiom without understanding it to run into problems when they use
similar code on large files. One way of doing it a bit differently is

#!/usr/bin/perl
die "Usage: rpm2cpio foo.rpm | cpio ...\n" unless @ARGV == 1;
open(RPM, $ARGV[0]) or die "$ARGV[0]: $!\n";
open(STDOUT, "| gunzip") or die "cannot find gunzip: $!\n";
while (read(RPM, $_, 8192)) {
if (!$found_gzmagic) {
s/^.*?(?=\037\213)//s or next;
$found_gzmagic = 1;
    }
print;
}

> Can we go back now to kernel issues?

Oops, yes, we now return you to your regularly scheduled kgcc wars.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Availability of kdb

2000-09-18 Thread Malcolm Beattie

Marty Fouts writes:
> Here's another piece of free advice, worth less than you paid for it: in 25
> years, only the computer history trivia geeks are going to remember you,
> just as only a very small handful of us now remember who wrote OS/360.

You mean like Fred Brooks who managed the development of OS/360, had
some innovative ideas about how large software projects should be run,
whose ideas clashed with contemporary ones, who became a celebrity?
You don't spot any parallels there? He whose book "Mythical Man Month"
with "No Silver Bullet" and "The Second System Effect" are quoted
around the industry decades later? And you think that's only a small
handful of people?

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Adding set_system_gate fails in arch/i386/kernel/traps.

2000-09-12 Thread Malcolm Beattie

Petr Vandrovec writes:
> On 12 Sep 00 at 21:25, Keith Owens wrote:
> > >0x85) vanish after the system has booted further. printk shows that
> > >idt_table is correctly updated immediately after the set_system_gate
> > >but once the system has booted the entries for my new traps have
> > >reverted. (printk telemetry available on request). However, once the
> > >system has booted, a little module which simply updates
> > >idt_table[MY_NEW_VECTOR] directly works fine and "sticks". Help?
> > >(Or, more accurately "Aaarrrgh?").
> > 
> > I can confirm that this sometimes occurs in 2.4.0-testx, AFAIK I have
> > only seen the problem in SMP kernels.
> 
> What about arch/i386/kernel/io_apic.c:assign_irq_vector() ?

I'm not using SMP but you both put me on the right track.
init/main.c:start_kernel does:
setup_arch(&command_line);
trap_init();
init_IRQ();

trap_init does the set_system_gate(FOO_VECTOR, &handler) lines I
extended but then init_IRQ() does

for (i = 0; i < NR_IRQS; i++) {
int vector = FIRST_EXTERNAL_VECTOR + i;
if (vector != SYSCALL_VECTOR) 
set_intr_gate(vector, interrupt[i]);
}

and promptly zaps everything except SYSCALL_VECTOR.
(FIRST_EXTERNAL_VECTOR is 0x20). Many thanks.

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Flavours of deceased bovine

2000-09-08 Thread Malcolm Beattie

Keith Owens writes:
> Just had an ext2 filesystem on SCSI that was corrupt.  The first two
> words of the group descriptor had been overwritten with 0xdeadbeef,
> 0x.  The filesystem is fixed now but trying to track down the
> problem is difficult, there are 50+ places in the kernel that use
> 0xdeadbeef.
> 
> I strongly suggest that people use different variants of dead beef to
> make it easier to work out where any corruption is coming from.
> Perhaps change the last 2-3 digits so magic values would be 0xdeadb000
> to 0xdeadbfff, assuming it does not affect any other code.

Nah, choose new words which stand out. There are plenty of them and
it avoids the problem of a 0xdeadb001 being decremented before being
noticed and thus confused with a 0xdeadb000. Be inventive:
egrep -x '[abcdefilos]{3,8}' /usr/dict/words
and make one up whenever needed. For example,
0baff1ed
acce55ed
decea5ed
d15ab1ed
d15ea5ed
along with multiword ones like
fee1dead
dead1055
badca5e5

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services


