Re: How to visit physical memory above 4G?

2001-08-01 Thread Terry Lambert

 craig wrote:
 
 
 I know the PIII can support 64G of physical memory. In FreeBSD, how can I
 access memory in that range (4G-64G)?

The short answer is you can't.

The longer answer is that you end up having to window it using
segmentation; if you are familiar with the 4k window on video
memory in the TI 99/4A, or the bank select on the 6510 (e.g.
the ability to select between 32K of RAM, and 32K of ROM, but
not both at the same time, on the Commodore C-64 and the similar
arrangement on the C-128, etc.), then you'll have an idea of how
the thing works... assuming you can find a motherboard that can
handle it.

This basically means that the memory is useless as a DMA target
or source for disk controllers or gigabit ethernet cards, and is
pretty useless for swap, if you ever have to copy from one section
to another (e.g. for IPC, SYSV shared memory, mmap'ed files, VM,
or buffer cache, etc.).

So for limited uses in data intensive applications, it might be
usable, but in general, it's nothing more than a hack so that
they can claim to support more than 4G, for some extremely
limited definition of support.

But to directly answer your question: by rewriting much of the
low core virtual memory and page mapping handling code to know
about segmentation.

Have fun doing this, since by the time you are done, you will
probably be able to get IA64 machines for something less than
the $7000/unit that you have to pay today, and they will likely
have PCI/X, so you have enough bus bandwidth to actually make
the RAM halfway usable.

-- Terry

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-hackers in the body of the message



Re: Finding filesizes in C++ for files greater than 4gb

2001-08-02 Thread Terry Lambert

Joseph Gleason wrote:
 Alright, I made a mistake.  But I did read the man page.  Where does it say
 off_t is 64bits?

The same place it says char is 8 bits, short is 16 bits, and int
and long are 32 bits: in your assumptions.

It might be useful (for some definitions of useful) to have a
man page which tells you the size of these things, but FreeBSD
runs on architectures where these are different, so you would
not be able to have one man page that does it.

The correct thing to do is to use off_t when speaking of file
lengths/offsets, and then let the machine support what it can
support.

If you simply must know, there's always sizeof(off_t).
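
As a minimal sketch of that advice (the function names here are
made up for illustration), the following C keeps a file's length
in an off_t from start to finish, and reports the width of off_t
on whatever platform it is compiled for, rather than assuming one:

```c
#include <limits.h>     /* CHAR_BIT */
#include <sys/types.h>  /* off_t */
#include <sys/stat.h>   /* stat(), struct stat */

/* Width of off_t in bits on the platform this is compiled for. */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}

/*
 * Size of the file at `path` as an off_t, or (off_t)-1 if stat() fails.
 * Keeping the value in an off_t avoids truncating it to a narrower type.
 */
off_t file_size(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return (off_t)-1;
    return sb.st_size;
}
```

To print the result without guessing whether off_t is an int, a
long, or a long long, cast it to intmax_t and use the %jd format,
e.g. printf("%jd\n", (intmax_t)file_size(path)).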

-- Terry




Re: Finding filesizes in C++ for files greater than 4gb

2001-08-02 Thread Terry Lambert

Chirag Kantharia wrote:
 
 On Wed, Aug 01, 2001 at 11:25:40PM -0700, Terry Lambert wrote:
 | Uh, st_size is an off_t, which is a signed 64 bit value,
 | not an unsigned 32 bit value...
 
 Going off-topic: why should it be `signed' 64 bit and not unsigned?

Return value for lseek is off_t.  -1 indicates error, therefore
the sign bit is reserved.

This is a historical UNIX-ism, having to do with functions not
taking the addresses of their return values as parameters, so
that the actual return value could be a pure success/failure
code (e.g. like VMS, where you will get SYS$SUCCESS on success,
and the actual error code on an error -- this also does away
with the need for an errno, which has historically been a huge
pain, particularly in threaded programs).
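
To make the convention concrete, here is a small sketch (fd_length
is a hypothetical helper, not a libc function) that uses lseek()
to find a file's length.  Every call is compared against
(off_t)-1, which is only possible because off_t is signed; the
reason for a failure is then left in errno.

```c
#include <sys/types.h>  /* off_t */
#include <unistd.h>     /* lseek(), SEEK_* */

/*
 * Length of the file behind `fd`, found by seeking to the end.
 * Returns (off_t)-1 on failure, with errno set by lseek().
 * The caller's current file position is restored before returning.
 */
off_t fd_length(int fd)
{
    off_t here, end;

    here = lseek(fd, 0, SEEK_CUR);      /* remember where we were */
    if (here == (off_t)-1)
        return (off_t)-1;

    end = lseek(fd, 0, SEEK_END);       /* offset of the end == length */
    if (end == (off_t)-1)
        return (off_t)-1;

    if (lseek(fd, here, SEEK_SET) == (off_t)-1)
        return (off_t)-1;
    return end;
}
```

Note that the data value and the failure indication share one
return channel here, which is exactly the UNIX-ism described
above.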

-- Terry




Re: How to visit physical memory above 4G?

2001-08-03 Thread Terry Lambert

mark tinguely wrote:
   Also, the PIII CAN'T natively support more than 4GB of ram. If a
   particular PIII motherboard supports this, then it's using some kind of
   weird chipset that allows this to happen. 4GB is the limit with a 32 bit
   chip I believe; and the PIII is a 32-bit chip.
 
 Since the Pentium Pro processor, the Intel chipsets support a
 physical address extension (PAE) which has 4 extra addressing
 bits, and a third level of page table indirection called the
 page-directory-pointer-table base address field.

It is enabled by bit 5 of control register 4, and it then uses
the top 27 bits of control register 3 to select a 32 byte aligned
region in the lower 4G.  It also changes the PSE bit to refer to
2M instead of 4M pages, so you would need to DISABLE_PSE, or the
FreeBSD kernel would freak when it enabled the 4M page on
the kernel itself.

Then the high 4 bits are used to pick a pointer entry (which
is effectively a software segment register select, for all
practical purposes), giving you 64G of addressable space,
in chunks of 4G at a time.

Practically, you end up having to overlap this, which tends
to cut you down to 32G.
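
As a rough illustration of the register fields and sizes just
mentioned, here is a minimal C sketch.  The macro and function
names are made up for this example and are not FreeBSD kernel
identifiers; it only does bit arithmetic on plain integers and
never touches a real control register.

```c
#include <stdint.h>

#define CR4_PAE        (UINT32_C(1) << 5)    /* bit 5 of CR4 enables PAE */
#define CR3_PDPT_MASK  UINT32_C(0xFFFFFFE0)  /* top 27 bits of CR3 */

/* Nonzero if the PAE bit is set in a CR4 value. */
int pae_enabled(uint32_t cr4)
{
    return (cr4 & CR4_PAE) != 0;
}

/* 32-byte-aligned address of the pointer table selected by CR3. */
uint32_t pdpt_base(uint32_t cr3)
{
    return cr3 & CR3_PDPT_MASK;
}

/* Large-page size in bytes: 2M with PAE enabled, 4M without. */
uint32_t large_page_bytes(uint32_t cr4)
{
    return pae_enabled(cr4) ? (UINT32_C(1) << 21) : (UINT32_C(1) << 22);
}

/* Number of 4G chunks in a physical space of phys_bits (32..95) bits. */
uint64_t chunks_of_4g(unsigned phys_bits)
{
    return UINT64_C(1) << (phys_bits - 32);
}
```

With 36 physical address bits, chunks_of_4g(36) comes out to 16,
which is the 64G total referred to above.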


 The addressing uses 64 bits for a memory pointer and the additional
 page indirection adds to the overhead. The stickler is the MMU is
 still 32 bits. This means the PAE must segment the 64GB space into
 4GB segments or 4 1GB segments. The OS must manage which pages are
 viewable to the process at this time.

Not only that: you reload CR3, and none of these pages are
really global, so you can't set the PG_G bit, and so you
get the full TLB shootdown on everything, so a segment switch
ends up shooting _everything_ down.


 There is a third mode of addressing using 2MB pages (similar to the
 4MB page addressing mode for the 32 bit addressing scheme) that will
 only give a process access to 4GB of memory (not segmentable to
 a larger space, but can address physical memory located above the 4GB
 address).

Not really useful, unless you go back to a task gate, which
itself will limit you to 1024 things; with code + data, you
end up halving that to 512, minus overhead drops it to 510,
so you end up with a limitation on number of processes.  You
could do all the switching manually, but it is very, very hard.

Further, you can write off shared memory and shared libraries,
and some types of IPC (e.g. descriptor passing), unless you
want to rework everything.

IMO, the resulting kernel would be so slow as to prevent the
changes from being useful, due to their expense.

-- Terry




Re: How to visit physical memory above 4G?

2001-08-03 Thread Terry Lambert

John Baldwin wrote:
 Err. hang on.  This has zero to do with segmentation.  Zip, nada.
 PAE is completely in the paging side of things.  No matter what
 fun games you play with segmentation, you still end up with a
 32-bit linear address that gets handed off to the paging translations.
 PAE just allows you to use more backing store across multiple
 processes, but you are still stuck with a 4gb virtual address
 space for processes.  (Including KVM)

IMO, the 4 bit selector register is the moral equivalent
of a segment register.

Personally, I think it's much less useful to run the kernel
out of KVA space, than it is to have more memory available
to the kernel for things like mbufs, so I'm not really very
interested in trying to raise the per process address space
limits this way.  You could actually get 4G for the kernel
and 4G for processes using this, but you would only need two
segments to make this happen; mapping the other stuff at
the same time makes little sense: you just map a window on
it to implement region overlays for user or kernel data
paging.

Given the compiler tools we have, this still limits you to
using only 4G in a given VA space, unless you did something
evil, like add HLOCK/HUNLOCK, etc..


  But to directly answer your question: by rewriting much of the
  low core virtual memory and page mapping handling code to know
  about segmentation.
 
 No, to rewrite said code to handle a different type of page table
 structure.

Virtual table structure/segments: same difference: I'm now
doing in software what I bought hardware to get away from
having to do in software.

Given the vastly simplified page management in Linux, I
could see how there wouldn't really be a big performance
loss over the way Linux does things without this, so it
might be OK, so long as there were no shared memory regions,
semaphores, etc..  That really makes it pretty useless.

-- Terry




Re: How to visit physical memory above 4G?

2001-08-03 Thread Terry Lambert

Rik van Riel wrote:
[ ...  4G on 32 bit machines ... ]
  The short answer is you can't.
 
  The longer answer is that you end up having to window it using
  segmentation;
 
 Only if you want to use it all within one process.

No.  It still bites you if you want to do IPC, etc., since you
can not guarantee the structures used for this are all within
the non-segmented region of memory.

 You can have multiple 2 GB (that's the maximum
 process size in FreeBSD, right?) programs at the
 same time, happily using all physical memory.

The default maximum size for FreeBSD is 3G.  You can tune this
up or down, with the limit being that the larger the user space,
the smaller the KVA space, and vice versa.


 Only the FreeBSD memory management subsystem doesn't
 support it (yet?).

It's not a question of supporting it, it's a question of
whether or not it's a useful idea at all.

  This basically means that the memory is useless as a DMA target
  or source for disk controllers or gigabit ethernet cards, and is
  pretty useless for swap, ...
 
  So for limited uses in data intensive applications, it might be
  usable,
 
 And for those data intensive applications, it is very
 useful indeed...

I have yet to see one person using it for anything.  So far,
it is nothing more than marketing fodder: I haven't seen one
motherboard capable of more than 4G worth of SIMMs.


  But to directly answer your question: by rewriting much of the
  low core virtual memory and page mapping handling code to know
  about segmentation.
 
 Not just that.  There is a more insidious problem with
 the FreeBSD VM code and support of huge machines.

Not really.


 The part of handling the PAE extended page table format
 and mapping high memory pages in and out of KVA (kernel
 virtual address) memory to copy stuff is easy.

Yes.


 Problem is that you'll have to fit all of FreeBSD's VM
 data structures in the 2GB of KVA. This just isn't going
 to fit with the size the data structures have today ...

I currently run with 3G+ of KVA; it would be simple to
invert this, but this leaves me a 1G user space window,
with 3G available for kernel structures, etc..  It takes
about 1G for all of the kernel support stuff for 4G, with
an allowance for 1/8th million open network connections.

So it's not unreasonable to think of putting 8G or 16G in
a box, and being able to map it all.

 So in order to support huge memory machines right,
 you'd have to put a number of FreeBSD's VM data structures
 on a rather strict diet.

Not really.  There's always 4M pages.

-- Terry




Re: How to visit physical memory above 4G?

2001-08-03 Thread Terry Lambert

Rik van Riel wrote:
   Only if you want to use it all within one process.
 
  No.  It still bites you if you want to do IPC, etc., since you
  can not guarantee the structures used for this are all within
  the non-segmented region of memory.
 
 Wrong. Your process can have pages from all over the
 64 GB mapped into its page tables.

Try doing this with code pages in one segment, the stack in
another, and the data being referenced in a third.  It will
not work.


 Each process has 3 GB of virtual memory mapped to any
 of the pages of the 64 GB of physical space.

Like I said before, this is not useful.  The only marginal
use for such a thing is for implementing an L3 cache in
software, or for implementing multiple virtual machines on
one box.


   Only the FreeBSD memory management subsystem doesn't
   support it (yet?).
 
  It's not a question of supporting it, it's a question of
  whether or not it's a useful idea at all.
 
  I have yet to see one person using it for anything.  So far,
  it is nothing more than marketing fodder: I haven't seen one
  motherboard capable of more than 4G worth of SIMMs.
 
 I've seen a bunch of the machines. They're rather
 popular with the database folks.

Name an OS that supports this; more than likely, you will
have to appeal to a purpose built embedded system.


   Problem is that you'll have to fit all of FreeBSD's VM
   data structures in the 2GB of KVA. This just isn't going
   to fit with the size the data structures have today ...
 
  So it's not unreasonable to think of putting 8G or 16G in
  a box, and being able to map it all.
 
 You can never map it all, since your virtual address space
 is limited to 4GB...

Overlays.  A technology from the dawn of time, I know, but
so are segment registers.


 Basically the database folks are really keen on keeping
 their 3GB user addressable memory, so the kernel will
 remain limited to 1GB of KVA.

They shouldn't care, since they are getting more memory.


 On the really large machines, this can lead to the
 situation where even the page tables hardly fit into
 KVA. 4MB pages seem like the only solution ...

This is why we use 64 bit processors for really large
machines, and we say that 36 bit address spaces are
really pretty useless, and will be obsolete by the time
the code is complete, because of ia64 being a cheaper
and faster solution.  8-p.

36 bits only gives you 2^4 * 4G, or 64G, anyway, and it
is hardly worth the segment thrashing and instruction and
data cache shootdowns to be able to handle it.  You are
better off throwing 16 machines at the problem: your major
cost item is going to be memory, anyway, and getting 64G
in one box is going to cost you significantly more than
just putting together multiple boxes.  If the locality is
such that 2G per process is OK, you might as well be on
separate boxes with non-segment-swapped memory.


 (well, there's also the mess of shared page tables,
 but nobody is keen on the locking issues those imply)

It's much, much worse than that.  Like I said before, you
could do it pretty trivially, but the expense of doing it
will be so high, relatively, that you might as well buy an
Alpha or PA-RISC box and call it a day, if you need it now,
or just wait for IA64, rather than throwing development
effort into something that will end up in a scrap heap before
it gets reasonable performance, or perhaps before it's even
deployed.

-- Terry




Re: How to visit physical memory above 4G?

2001-08-03 Thread Terry Lambert

Rik van Riel wrote:
  BUT, don't the motherboards also have to support this? And isn't
  it only supported through some weird segmentation thing?
 
 Yes, the mainboard needs to support the memory.
 
 No, there is no weird segmentation thing, at least
 not visible from software.

Last time I looked, the kernel was software...

-- Terry




Re: How to visit physical memory above 4G?

2001-08-03 Thread Terry Lambert

Julian Elischer wrote:
 
 No
 The space is linear in physical space and if you have PCI/64
 capable devices they can access it all too.
 
 (In fact 64 bit addresses have been supported even in 32 bit wide PCI
 since day 1).

It's been my experience that the TIGON cards take a 32 bit
DMA target address, not a 64 bit DMA target address, and
that the 64 bit width was only used for the data transfer,
not for the address offset.

Correct me if I'm wrong, and 64 bit PCI cards can in fact
DMA at offsets above 4G, in the physical address space...

-- Terry




Re: How to visit physical memory above 4G?

2001-08-03 Thread Terry Lambert

Charles Randall wrote:
 
 From: Terry Lambert [mailto:[EMAIL PROTECTED]]
 I have yet to see one person using it for anything.  So far,
 it is nothing more than marketing fodder: I haven't seen one
 motherboard capable of more than 4G worth of SIMMs.
 
 The Dell PowerEdge 6450 supports 8 GB of RAM.
 
 http://www.dell.com/us/en/biz/products/model_pedge_pedge_6400.htm
 
 If I understand your comments in a few follow-up messages
 correctly you're saying that this effort may be better spent
 by working on an IA-64 port and making it support large memory
 configurations?

The IA64 intrinsically supports a physical address space of 2^64,
so an address space of 2^36 vs. 2^32 is spectacularly unimpressive.

 Can you elaborate?

Yeah, the overhead in doing this will up the CPU utilization
to the point where it becomes fairly useless to do the swapping
to and from above 4G, vs. just swapping normally.

The costs involved in doing DMA to/from the memory region
above 4G will be incredible, unless the address space is
both exported, and known, to the PCI bus; even then, it
could only work for 64 bit cards, since 32 bit cards will
only be able to address the first 4G of physical memory.

I can think of one or two uses for the memory, assuming
the ability to DMA into and out of it with a 64 bit card,
and the ability to shove a 1G or 2G window around in it
so the kernel can get at the memory when it needs to, but
the overhead seems to me to be high enough that you are
better off buying a Sibytes card, running NetBSD on the
MIPS processors on the thing, plugging in 16G of RAM, and
calling your PC a control processor.

-- Terry




Re: PR 25958

2001-08-03 Thread Terry Lambert

Nate Dannenberg wrote:
 I'd be glad to, however I no longer run FreeBSD.  I have since switched to
 Linux.

[ ... ]

 Not being much of a C programmer
 anymore I can't really say for certain though :)

Are these two statements related by cause and effect?

8-) 8-)

-- Terry




Re: How to visit physical memory above 4G?

2001-08-03 Thread Terry Lambert

Rik van Riel wrote:
  This is a trivial implementation.  I'm not very impressed.
 
  Personally, I'm not interested in a huge user space,
 
 Maybe not you, but I bet the database and scientific
 computing people will be interested in having 64 GB
 memory support in this simple way.

You mean 4G, of course, since the process address space
remains limited to 32 bits...


  Fully populating both the transmit and receive windows for
  1M connections is 32G of RAM, right there... and it better
  be kernel RAM, or you're screwed.
 
 Well, you _could_ store this memory in files, which
 get mapped and unmapped by the same code the filesystem
 code uses to access file data in non-kernel-mapped RAM.
 
 *runs like hell*

That's the entire problem: it has to be performant, or I'm
just not interested in it.

Using the memory as a software L3 would make a lot more
sense to me... a 3G user space is pretty useless, from my
point of view, and I'd much rather spend the space on the
kernel.  Cutting that to 2G/2G might be OK, with 1G in the
user used for mapping regions in and out.

You are still limited to how much RAM you have, but at
least you aren't shooting yourself in the foot trying to
make it work.

You still haven't told me what Linux does for 2x4G processes
and a 1G kernel with only 8G of physical RAM.  I rather
suspect that as soon as your usage exceeds real memory, it
all goes to hell very quickly, since your L1 and L2 caches
are effectively disabled by the frequent reloading of CR3
and CR4 on context switches...

-- Terry




Re: gethostbyXXXX_r()

2001-08-06 Thread Terry Lambert

Alexander Litvin wrote:
 As for bind9 -- this has AFAIK totally rewritten resolver,
 which doesn't even resemble bind8. IMHO, to incorporate
 it into FreeBSD might take a tremendous effort.

Not really.

Just import it on a vendor branch as /usr/src/lib/libresolv,
and then things that want it can link to it, no problem.

Once everything is converted over, then it can be diked out
of libc.

Pretty straightforward...

-- Terry




Re: Allocate a page at interrupt time

2001-08-07 Thread Terry Lambert

Matt Dillon wrote:
 Yes, that is precisely the reason.  In -current this all changes, though,
 since interrupts are now threads.  *But*, that said, interrupts cannot
 really afford to hold mutexes that might end up blocking them for
 long periods of time so I would still recommend that interrupt code not
 attempt to allocate pages out of PQ_CACHE.

I keep wondering about the sagacity of running interrupts in
threads... it still seems like an incredibly bad idea to me.

I guess my major problem with this is that by running in
threads, it's made it nearly impossible to avoid receiver
livelock situations, using any of the classical techniques
(e.g. Mogul's work, etc.).
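
For readers without the papers to hand: one of the classical
techniques in that literature is, roughly, to have the receive
interrupt do nothing but mask itself, and to do the real work in
a polling pass that is bounded by a quota.  The following is only
a toy user-space model of that idea; struct rx_ring, rx_intr()
and rx_poll() are hypothetical names, not FreeBSD driver code.

```c
#include <stddef.h>

#define RING_SLOTS 256          /* hypothetical ring size */

struct rx_ring {
    size_t head;                /* next slot the driver will consume */
    size_t tail;                /* next slot the hardware will fill */
    int    intr_enabled;        /* models the device's interrupt mask */
};

/* Number of packets waiting in the ring. */
size_t rx_pending(const struct rx_ring *r)
{
    return (r->tail + RING_SLOTS - r->head) % RING_SLOTS;
}

/*
 * Interrupt handler: do no packet work here, just mask further
 * receive interrupts and leave the processing to rx_poll().
 */
void rx_intr(struct rx_ring *r)
{
    r->intr_enabled = 0;
}

/*
 * Polling pass: consume at most `quota` packets, then return so that
 * other work (transmit, protocol processing, user code) gets a turn.
 * Interrupts are re-enabled only once the ring has been drained.
 */
size_t rx_poll(struct rx_ring *r, size_t quota)
{
    size_t done = 0;

    while (done < quota && rx_pending(r) > 0) {
        r->head = (r->head + 1) % RING_SLOTS;   /* "process" one packet */
        done++;
    }
    if (rx_pending(r) == 0)
        r->intr_enabled = 1;
    return done;
}
```

The point of the quota is that the amount of receive work done
per pass stays bounded no matter how fast packets arrive, so the
rest of the system is never starved outright.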

It also has the unfortunate property of locking us into virtual
wire mode, when in fact Microsoft demonstrated that wiring down
interrupts to particular CPUs was good practice, in terms of
assuring best performance.  Specifically, running in virtual
wire mode means that all your CPUs get hit with the interrupt,
whereas running with the interrupt bound to a particular CPU
reduces the overall overhead.  Even what we have today, with
the big giant lock and redirecting interrupts to the CPU in
the kernel is better than that...

-- Terry




Re: timing question

2001-08-07 Thread Terry Lambert

Jeff Behl wrote:
 please excuse and direct me to the right place if this isn't the appropriate
 place to post this sort of question
 
 we're looking into moving to freebsd (yea!), but found the following
 problem.  It seems that the shortest amount of time the below code will
 sleep for is 20 milliseconds!  any call to nanosleep for 5, 10, etc. milliseconds
 returns a 20 ms delay.  are we doing something wrong?

You appear to be measuring the quantum.

You can decrease the quantum size via sysctl, or at compile
time.

Realize that your timing is tight enough that if you are
running any other code, you can't expect that your process
will get the quantum next, unless you use rtprio.

Also note that your timer granularity might be somewhat less
than you would expect: in other words it could be returning
before, but since the sleep is woken up as the result of
a timer interrupt firing, you may need to increase the rate
your clock runs at (search for HZ in /sys/i386/conf/LINT)
to make your timer interrupts faster, which will in turn
increase your timeout resolution.

-- Terry




Re: Allocate a page at interrupt time

2001-08-07 Thread Terry Lambert

Bosko Milekic wrote:
  I keep wondering about the sagacity of running interrupts in
  threads... it still seems like an incredibly bad idea to me.
 
  I guess my major problem with this is that by running in
  threads, it's made it nearly impossible to avoid receiver
  livelock situations, using any of the classical techniques
  (e.g. Mogul's work, etc.).
 
 References to published works?

Just do an NCSTRL search on receiver livelock; you will get
over 90 papers...

http://ncstrl.mit.edu/

See also the list of participating institutions:

http://ncstrl.mit.edu/Dienst/UI/2.0/ListPublishers

It won't be that hard to find... Mogul has only published 92
papers.  8-)


  It also has the unfortunate property of locking us into virtual
  wire mode, when in fact Microsoft demonstrated that wiring down
  interrupts to particular CPUs was good practice, in terms of
  assuring best performance.  Specifically, running in virtual
 
 Can you point us at any concrete information that shows
 this?  Specifically, without being Microsoft biased (as is most
 data published by Microsoft)? -- i.e. preferably third-party
 performance testing that attributes wiring down of interrupts to
 particular CPUs as _the_ performance advantage.

FreeBSD was tested, along with Linux and NT, by Ziff Davis
Labs, in Foster City, with the participation of Jordan
Hubbard and Mike Smith.  You can ask either of them for the
results of the test; only the Linux and NT numbers were
actually released.  This was done to provide a non-biased
baseline, in reaction to the Mindcraft benchmarks, where
Linux showed so poorly.  They ran quad ethernet cards, with
quad CPUs; the NT drivers wired the cards down to separate
INT A/B/C/D interrupts, one per CPU.


  wire mode means that all your CPUs get hit with the interrupt,
  whereas running with the interrupt bound to a particular CPU
  reduces the overall overhead.  Even what we have today, with
 
 Obviously.

I mention it because this is the direction FreeBSD appears
to be moving in.  Right now, Intel is shipping with separate
PCI busses; there is one motherboard from their serverworks
division that has 16 separate PCI busses -- which means that
you can do simultaneous gigabit card DMA to and from memory,
without running into bus contention, so long as the memory is
logically separate.  NT can use this hardware to its full
potential; FreeBSD as it exists, can not, and FreeBSD as it
appears to be heading today (interrupt threads, etc.) seems
to be in the same boat as Linux, et al.  PCI-X will only
make things worse (8.4 gigabit, burst rate).

-- Terry




Re: Allocate a page at interrupt time

2001-08-07 Thread Terry Lambert

Mike Smith wrote:
 
  It also has the unfortunate property of locking us into virtual
  wire mode, when in fact Microsoft demonstrated that wiring down
  interrupts to particular CPUs was good practice, in terms of
  assuring best performance.  Specifically, running in virtual
  wire mode means that all your CPUs get hit with the interrupt,
  whereas running with the interrupt bound to a particular CPU
  reduces the overall overhead.  Even what we have today, with
  the big giant lock and redirecting interrupts to the CPU in
  the kernel is better than that...
 
 Terry, this is *total* garbage.
 
 Just so you know, ok?

What is, exactly?

That virtual wire mode is actually a bad idea for some
applications -- specifically, high speed networking with
multiple gigabit ethernet cards?

That Microsoft demonstrated that wiring down interrupts
to a particular CPU was a good idea, and kicked both Linux'
and FreeBSD's butt in the test at ZD Labs?

That taking interrupts on a single directed CPU is better
than taking an IPI on all your CPUs, and then sorting out
who's going to handle the interrupt?

Can you name one SMP OS implementation that uses an
interrupt threads approach that doesn't hit a scaling
wall at 4 (or fewer) CPUs, due to heavier weight thread
context switch overhead?

Can you tell me how, in the context of having an interrupt
thread doing scheduled processing, you could avoid an
interrupt overhead livelock, where the thread doesn't get the
opportunity to run because you're too busy taking interrupts
to be able to get any work done?

FWIW, I would be happy to cite sources to you, off the
general list.

-- Terry




Re: Allocate a page at interrupt time

2001-08-07 Thread Terry Lambert

Matt Dillon wrote:
 :What is, exactly?
 :
 :That virtual wire mode is actually a bad idea for some
 :applications -- specifically, high speed networking with
 :multiple gigabit ethernet cards?
 
 All the cpu's don't get the interrupt, only one does.

I think that you will end up taking an IPI (Inter Processor
Interrupt) to shoot down the cache line during an invalidate
cycle, when moving an interrupt processing thread from one
CPU to another.  For multiple high speed interfaces (disk or
network; doesn't matter), you will end up burning a *lot*
of time, without a lockdown.

You might be able to avoid this by doing some of the tricks
I've discussed with Alfred to ensure that there is no lock
contention in the non-migratory case for KSEs (or kernel
interrupt threads) to handle per CPU scheduling, but I
think that the interrupt masking will end up being very hard
to manage, and you will get the same effect as locking the
interrupt to a particular CPU... if you are lucky.

Any case which _did_ invoke a lock and resulted in contention
would require at least a barrier instruction; I guess you
could do it in a non-cacheable page to avoid the TLB
interaction, and another IPI for an update or invalidate
cycle for the lock, but then you are limited to memory speed,
which is getting down to around a factor of 10 (133MHz) slower
than CPU speed, these days, and that's actually one heck of a
stall hit to take.


 :That Microsoft demonstrated that wiring down interrupts
 :to a particular CPU was a good idea, and kicked both Linux'
 :and FreeBSD's butt in the test at ZD Labs?
 
 Well, if you happen to have four NICs and four CPUs, and
 you are running them all full bore, I would say that
 wiring the NICs to the CPUs would be a good idea.  That
 seems like a rather specialized situation, though.

I don't think so.  These days, interrupt overhead can come
from many places, including intentional denial of service
attacks.  If you have an extra box around, I'd suggest that
you install QLinux, and benchmark it side by side against
FreeBSD, under an extreme load, and watch the FreeBSD system's
performance fall off when interrupt overhead becomes so high
that NETISR effectively never gets a chance to run.

I also suggest using 100Base-T cards, since the interrupt
coalescing on Gigabit cards could prevent you from observing
the livelock from interrupt overload, unless you could load
your machine to full wire speed (~950Mbits/S) so that your
PCI bus transfer rate becomes a barrier.

I know you were involved in some of the performance tuning
that was attempted immediately after the ZD Labs tests, so I
know you know this was a real issue; I think it still is.

-- Terry




Re: Allocate a page at interrupt time

2001-08-07 Thread Terry Lambert

Zach Brown wrote:
  That Microsoft demonstrated that wiring down interrupts
  to a particular CPU was a good idea, and kicked both Linux'
  and FreeBSD's butt in the test at ZD Labs?
 
 No, Terry, this is not what was demonstrated by those tests.  Will this
 myth never die?  Do Mike and I have to write up a nice white paper? :)

That would be nice, actually.

 
 The environment was rigidly specified:  quad cpu box, four eepro 100mb
 interfaces, and a _heavy_ load of short lived connections fetching static
 cached content.  The test was clearly designed to stress concurrency in
 the network stack, with heavy low latency interrupt load.  Neither Linux
 nor FreeBSD could do this well at the time.  There was a service pack
 issued a few months before the test that 'threaded' NT's stack.
 
 It was not a mistake that the rules of the tests forbid doing the sane
 thing and running on a system with a single very fast cpu, lots of mem,
 and gigabit interface with an actual published interface for coalescing
 interrupts.  That would have performed better and been cheaper.

I have soft interrupt coalescing changes for most FreeBSD
drivers written by Bill Paul; the operation is trivial, and Bill
has structured his drivers well for doing that sort of thing.

I personally don't think the test was unfair; it seems to me
to be representative of most web traffic, which averages 8k a
page for most static content, according to published studies.

 That's what pisses me off about the tests to this day.  The problem
 people are faced with is "how do I serve this static content
 reliably and cheaply", not "what OS should I serve my content
 with, now that I've bought this ridiculous machine?".

8-) 8-).


  It's sad that people consistently insist on drawing insane
 conclusions from these benchmark events.

I think that concurrency in the TCP stack is something that
needs to be addressed; I'm glad they ran the benchmark, if
only for that.

Even if we both agree on the conclusions, agreeing isn't
going to change people's perceptions, but beating them on
their terms _will_, so it's a worthwhile pursuit.

I happen to agree that their test indicated some shortcomings
in the OS designs; regardless of whether we think they were
carefully chosen to specifically emphasize those shortcomings,
it doesn't change the fact that they are shortcomings.

There's no use crying over spilt milk: the question is what
can be done about it, besides trying to deny the validity of
the tests.

-- Terry




Re: Kernel stack size

2001-08-07 Thread Terry Lambert

Julian Elischer wrote:
 
 the kernel stack is a VERY LIMITED resource
 basically you have about 4 or 5 Kbytes per process.
 if you overflow it you write over your signal information..
 
 you should MALLOC space and use a pointer to it..

Would adding an unmapped or read-only guard page be
unreasonable?


The only thing I could see it doing would be panic'ing,
so it's not like it'd be possible to dump the process,
without handling the double fault and hoping it doesn't
go over 4k of overage (or you'd need 2...N guard pages).

-- Terry




Re: Allocate a page at interrupt time

2001-08-08 Thread Terry Lambert

void wrote:
  Can you name one SMP OS implementation that uses an
  interrupt threads approach that doesn't hit a scaling
  wall at 4 (or fewer) CPUs, due to heavier weight thread
  context switch overhead?
 
 Solaris, if I remember my Vahalia book correctly (isn't that a favorite
 of yours?).

As usual, IMO...

Yes, I like the Vahalia book; I did technical review of
it for Prentice Hall before its publication.

Solaris hits the wall a little later, but it still hits the
wall.  On Intel hardware, it has historically hit it at the
same 4 CPUs where everyone else tends to hit it, for the same
reasons; as of Solaris 2.6, they have adopted the hybrid per
CPU pool model recommended in Vahalia (Chapter 12).

While I'm at it, I suppose I should recommend reading the
definitive Solaris internals book, to date:

Solaris Internals, Core Kernel Architecture
Jim Mauro, Richard McDougall
Prentice Hall
ISBN: 0-13-022496-0

Solaris does use interrupt threads for some interrupts; I
don't like the idea, for the reasons stated previously.

Solaris claims to scale to 64 processors while maintaining
SMP, rather than real or virtual NUMA.  It's been my own
experience that this scaling claim is not entirely accurate,
if what you are doing is a lot of kernel processing.  On the
other hand, if you are running a lot of non-intersecting
user space code (e.g. JVM's or CGI's), it's not as bad (and
realize that FreeBSD is not that bad in the same situation,
either: it's just not as common in practice as it is in
theory).

It should be noted that Solaris Interrupt threads are only
used for interrupts of priority 10 and below: higher priority
interrupts are _NOT_ handled by threads (interrupts at a
priority level from 11 to 15).  10 is the clock interrupt.

It should also be noted that Solaris maintains a per processor
pool of interrupt threads for each of the lower priority
interrupts, with a global thread that is used for handling of
the clock interrupt.  This is _very_ different than taking an
interrupt thread, and rescheduling it on an arbitrary CPU,
and as others have pointed out, the hardware used to do the
scheduling is very different.

In the 32 processor Sequent boxes, the actual system bus was
different, and directly supported message passing.

There is also specific hardware support for handling interrupts
via threads, which is really not applicable to x86 or even the
Alpha architectures on which FreeBSD currently runs, nor to the
IA64 architecture (port in progress).  In particular, there is
a single system wide table, introduced with the UltraSPARC, that
doesn't need to be locked to support interrupt handling.

Also, the Sun system is still an IPL system, using level based
blocking, rather than masking, and these threads can find
themselves blocked on a mutex or condition variable for a
relatively long time; if this happens, it resumes the previous
thread _but does not drop its IPL below that of the suspended
thread_, which is basically the Dijkstra Banker's Algorithm
method of avoiding priority inversion on interrupts (i.e. ugly).

Finally, the Sun system borrows the context of the interrupted
process (thread) for interrupt handling (the LWP).  This is very
similar to the technique employed with kernel vs. user space
thread associations within the Windows kernels (this was one of
the steps I was referring to when I said that NT had dealt with
a number of scaling issues before it needed to, so that they
would not turn into problems on 8-way and higher systems).

Personally, I think that the Sun system is extremely susceptible
to receiver livelock (Network interrupts are at 7, and disk
interrupts are at 5, which means that so long as you are getting
pounded with network interrupts for e.g. NFS read or write
requests, you're not going to service the disk interrupts that
will let you dispose of the traffic, nor will you run the user
space code for things like CGI's or Apache servers trying to
service a heavy load of requests for content).

I'm also not terrifically impressed with their callout mechanism,
when applied to networking, which has a preponderance of fixed,
known interval timers, but FreeBSD's isn't really any better,
when it comes to huge numbers of network connections, since it
will end up hashing 2/4/6/8/... into the same bucket, unordered,
which means traversing a large list of timers which are not
going to end up expiring (callout wheels are not a good thing to
mix with fixed interval timers of relatively long durations,
like the 2MSL timers that live in the networking code, or most
especially the TIME_WAIT timers).
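
To put a rough number on the callout wheel complaint: a timer is
hashed into a bucket by its expiry tick modulo the wheel size, so a
long timer sits in its bucket while the wheel goes around many times,
and every pass has to step over it.  The sketch below is just that
arithmetic; the function names are hypothetical, and the example
figures (hz = 100, a 60 second timer, 256 buckets) are assumptions
for illustration, not FreeBSD's actual tuning:

```c
/* Bucket index that a timer expiring at 'expire_tick' hashes to. */
unsigned long
callout_bucket(unsigned long expire_tick, unsigned long wheel_size)
{
	return (expire_tick % wheel_size);
}

/*
 * Number of times the per-tick scan reaches this timer's bucket and
 * has to skip the timer because it is not due yet.  The scan visits
 * the bucket once per wheel revolution; only the last visit, at
 * expire_tick itself, does any useful work.
 */
unsigned long
wasted_visits(unsigned long now, unsigned long expire_tick,
    unsigned long wheel_size)
{
	if (expire_tick <= now)
		return (0);
	return ((expire_tick - now - 1) / wheel_size);
}
```

With the assumed figures, a 60 second timer is 6000 ticks out, so
wasted_visits(0, 6000, 256) reports 23 passes over the entry before
it fires; multiply that by the number of connections sitting in
TIME_WAIT to see why the list traversal hurts.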

-- Terry




Re: -Stable, apache, ldap and shlibs

2001-08-08 Thread Terry Lambert

Julian Elischer wrote:
 Who is the expert on apache, modules and shlibs?
 (I'll go offline to discuss the problem if I can find
 an appropriate person.. (can't get ldap module to work with apache
 under freebsd.)

Build Apache from your own sources, and not from ports.

You will also need to use the Netscape library to get
LDAP support, unless things have changed very recently,
and the OpenLDAP library has been supported without me
knowing it.

I have Apache + SSL + LDAP + IMAP + PHP4 + mod_auth +
PAM_LDAP working on my laptop.  I built from source,
not from ports.

-- Terry




Re: Allocate a page at interrupt time

2001-08-08 Thread Terry Lambert

Mike Smith wrote:
 Terry; all this thinking you're doing is *really*bad*.
 
 I appreciate that you believe you're trying to educate us somehow. But
 what you're really doing right now is filling our list archives with
 convincing-sounding crap.  People that are curious about this issue are
 likely to turn up your postings, and get *really* confused.
 
 Please.  Just stop, ok?

Mike, I know you are convinced you know everything, and that
of all the people who have worked professionally on SMP
systems before, FreeBSD has only one guy I'm aware of in a
design position for the SMP project, and a lot of students
who think they know what they are doing, even though they
can't cite the literature, but please...

Read the email threads all the way through before commenting
on my postings; the IPI issue is real for TLB shootdown, as
was pointed out by others; it was quite late, and it's very
understandable, given that I have aphasic dyslexia, that I
substituted the wrong word.

Rather than correcting things, as others have done, you have
insisted that no issue exists.

Effectively calling me an idiot in a public forum doesn't
help your credibility, and you're doing more damage by
denying that there is any issue whatsoever to be concerned
about, and being pedantic about precise word usage, instead
of addressing the issues and correcting my unintentional
spoonerisms out of concern for the archives.


Also please read the white paper reference I gave you about
receiver livelock: interrupt threads were, and are, a bad
idea, particularly on stock Intel SMP hardware -- so Solaris
using that approach doesn't justify it any more than antique
versions of IRIX using that approach do.

If you don't want to believe me, then believe Jeff Mogul,
but don't pretend that simply because I chose the wrong word,
that there is no issue to consider.

Thanks,
-- Terry




Re: Tuning the 4.1-R kernel for networking

2001-08-08 Thread Terry Lambert

Brian O'Shea wrote:
 On this machine I run a program which simulates many (~150) simultaneous
 TCP clients.  This is actually a multithreaded Linux binary, and one
 thread per simulated TCP client is created.  After a few seconds the
 system runs out of mbuf clusters:
 
 # netstat -m
 3231/3392/6144 mbufs in use (current/peak/max):
 1641 mbufs allocated to data

This is 25 connections worth of full TCP window in a single
direction.  For 150 connections, you will need 150*16k*2/256,
or 19,200 mbufs for a 16k window size, with both transmit and
receive buffers full, or 9,600 mbufs, if you only have the
receive windows full.


 182 mbufs allocated to packet headers

This is really allocation for tcptempl structures for use in
packet keepalive; it basically indicates that you have a total
of 182 sockets open.

 1408 mbufs allocated to socket names and addresses

Don't know why this is, but it means you need at least another
1408 mbufs...

 1536/1536/1536 mbuf clusters in use (current/peak/max)

Your cluster count is way, way too small.  The fact that you
are maxing out at 1536 in all categories implies that you have
filled your clusters out with full send or receive windows.  An
mbuf cluster is 2k, and is generally only filled with 1536 (MTU)
bytes, unless it has been coalesced.  Assuming that these are
the result of full write windows, then you need _at least_ 2400
clusters (150x16k*2/2k) for bidirectional window fill.

 3920 Kbytes allocated to network (98% in use)

Means you've only got 4M assigned to network resources, which
is not a lot; I can see you using 5M for window buffers, alone,
ignoring cluster headers and mbufs for sockets and the 1408
mbufs you disappeared into socket names and addresses.
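
The sizing arithmetic used above can be written down as a small
sketch.  The 256 byte mbuf and 2k cluster sizes are the ones used in
the text; the function names are hypothetical:

```c
#define MBUF_SIZE	256UL	/* bytes per mbuf, as used above */
#define CLUSTER_SIZE	2048UL	/* bytes per mbuf cluster, as used above */

/*
 * Bytes of buffer space consumed when 'nconn' connections each have
 * a full window of 'window' bytes in 'dirs' directions (1 = receive
 * only, 2 = send and receive).
 */
unsigned long
window_bytes(unsigned long nconn, unsigned long window, unsigned long dirs)
{
	return (nconn * window * dirs);
}

/* The same load expressed in mbufs, rounded up. */
unsigned long
mbufs_needed(unsigned long nconn, unsigned long window, unsigned long dirs)
{
	return ((window_bytes(nconn, window, dirs) + MBUF_SIZE - 1) /
	    MBUF_SIZE);
}

/* The same load expressed in mbuf clusters, rounded up. */
unsigned long
clusters_needed(unsigned long nconn, unsigned long window, unsigned long dirs)
{
	return ((window_bytes(nconn, window, dirs) + CLUSTER_SIZE - 1) /
	    CLUSTER_SIZE);
}
```

For 150 connections and a 16k window this gives 19,200 mbufs or 2,400
clusters bidirectionally, and 4,915,200 bytes of window data, which
is where the "5M for window buffers" figure comes from.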


 96993 requests for memory denied

These are the number of MGET/MCLGET failures.

 0 requests for memory delayed

None of them were malloc calls, they were all allocations of
objects which can only be allocated from memory which you must
reserve by compiling your kernel with the correct tuning
parameters (or in some cases sysctls in loader.conf, but I
can't tell you what works in 4.1, and what has to be done at
config time; sorry).

 0 calls to protocol drain routines
 
 Also, I see a steady stream of these messages on the console:
 
 xl0: no memory for rx list -- packet dropped!
 
 From the xl(4) man page:
 
 xl%d: no memory for rx list  The driver failed to allocate an mbuf
 for the receiver ring.

Yeah; you are out of mbufs and cluster headers.  We knew that.

When the driver can't allocate a replacement mbuf, it drops the
receive packet data, since it can't really safely leave an empty
receive ring slot, since those are receive interrupt driven.

Basically, you have exhausted all your receive resources.  If
I had to guess, I would say that your program was an HTTP load
program of some kind, since you have a grundle of data packed
up in your receive window, which is common when you receive data,
but never bother actually reading it to make it go away.

You could also change your program to set the window size down
to a smaller size than the default (also a sysctl, to set it
globally for all programs, as an administrative limit), and that
would keep the sender from taking up as much memory, since you
would advertise a smaller window to the sender.
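
In the program itself, the per-socket version of this is a
setsockopt() call on the receive buffer, since the receive buffer
size bounds the window the socket can advertise.  A minimal sketch;
limit_recv_buffer() is a hypothetical wrapper name, and the 8k figure
in the usage note is only an example:

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Ask for a smaller receive buffer on one socket.  This is normally
 * done before connect() or listen(), so that the smaller window is
 * what gets advertised from the start.  Returns 0 on success, -1 on
 * failure (with errno set by setsockopt()).
 */
int
limit_recv_buffer(int sock, int bytes)
{
	return (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
	    &bytes, sizeof(bytes)));
}
```

Calling limit_recv_buffer(s, 8192) on each of the 150 client sockets
would roughly halve the worst case receive side memory compared to a
16k window.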


 Looking at the xl_newbuf() function in the xl driver, there are two
 places where this message can be generated.  It looks like the problem
 is with the second case where the MCLGET fails, since we are running out
 of those.

It could still be either one, actually; your message that you
quoted didn't match either one of the printf's...


 I increased maxusers to 128 (2560 mbuf clusters) and it ran out of mbuf
 clusters again.  Then I increased it to 256 (4608 mbuf clusters), with
 the same results.  I don't have any sense of what is reasonable mbuf
 cluster usage for the application that I am running, but the system
 never seems to recover from the condition, which would seem to point to
 an mbuf cluster leak.
 
 Does this sound like a problem with the driver (mbuf cluster leak), or
 with the way that I have tuned this system? (the kernel config file for
 this system is attached)

Don't tune it this way: leave it at 16 or 32 users, and specifically
increase the networking resources and MAXFILES (see /sys/i386/conf/LINT).

Frankly, it sounds like your application is bad; does it limit
itself to 150 connections, or is it trying to make as many
connections as it possibly can make?  If so, then no matter
how you tune the system, your program will always hit its head
on a resource limit eventually.


 I compiled a debug kernel and panicked the system while it was in the
 state described above, in case that is any use.  I don't know how to
 analyze the crash dump to determine where the problem is.  Any
 suggestions are welcome.

Did it panic from being in the state, or did you break to the
debugger 

Re: Why page enable in Kernel space?

2001-08-08 Thread Terry Lambert

 craig wrote:
 In general a address in a process is just a linear address which
 refer to physical address indirectly by page directory.

Or a virtual address that does not have a physical page behind
it.  Some kernel memory is swappable, and some is overcommitted,
and the pages backing the page entries aren't committed until
they are needed.


 This is reasonable in user space. However is it necessary to do
 such thing in kernel?

Only if you want to be able to handle differential loads.  8-).


 It is sure to have penalty when converting a linear address to
 physical thing.

Actually, the hardware does it for free, for the most part.
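
For reference, this is the split the i386 MMU performs on a linear
address with ordinary 4K pages (no PAE, no 4M pages); the TLB caches
the result, which is why the lookup is free most of the time.  The
struct and function names here are hypothetical, and the code only
illustrates the bit layout, it does not walk real page tables:

```c
#include <stdint.h>

struct i386_linear {
	uint32_t pde_index;	/* index into the page directory */
	uint32_t pte_index;	/* index into the selected page table */
	uint32_t offset;	/* byte offset within the 4K page */
};

/* Split a 32 bit linear address the way two-level 4K paging does. */
struct i386_linear
split_linear(uint32_t va)
{
	struct i386_linear l;

	l.pde_index = va >> 22;			/* top 10 bits */
	l.pte_index = (va >> 12) & 0x3ff;	/* middle 10 bits */
	l.offset = va & 0xfff;			/* low 12 bits */
	return (l);
}
```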


 Is it worth doing such thing in kernel.

Yes.  Consider how you would write fault on copyout in order to
cause the user space buffer you were copying kernel data to to
be there to be written to otherwise.

Also, realize that the KVA space starts where the user space ends,
and that the virtual address spaces are effectively butted up
against each other, with the user space being readable/writable
from the kernel, but not vice versa.  This would be very hard to
do; in fact, you would end up doing a lot more work for paging,
and you would have to ptov/vtop to convert between kernel and
user space addresses, where right now, you only have to do that
to give physical addresses to devices.

Likewise, the virtual space doesn't care if the physical space
is fragmented (and most drivers will do scatter/gather anyway),
but real fragmentation might mean that you would be unable to
allocate kernel memory larger than your largest fragment (you
could do it, but you'd have to defrag physical memory, which
could be hard).


 I think the performance is the most important in kernel, other
 thing is second. I remember in linux linear address is real
 physical address in kernel space(is it true?).  Why freebsd
 does not do in the same way?

It's not true; Linux does the same thing FreeBSD does.  When a
protected mode OS that uses VM starts up, it generally loads
itself into memory, builds a virtual address space that looks
exactly like the physical address space where it was loaded,
but relocated, and then relocates itself and starts using the
virtual memory system instead of the physical one.  This is
really a gross oversimplification, but you get the idea.  If
you want to get a better understanding of this, get the book:

Protected Mode Software Architecture
Tom Shanley
MindShare, Inc.
Addison-Wesley Publishing Co.
ISBN: 020155447X

It costs about US$30, new on Amazon.

-- Terry




Re: timing question

2001-08-09 Thread Terry Lambert

Rolf Neugebauer wrote:
   NB. for achieving higher timer resolutions you might find it
 interesting to look at Soft-Timers at Rice [2]. Events are scheduled
 at the usual timer interrupt frequency but the time wheels are also
 checked at system-call and other interrupt times, thus, depending on
 load, achieving finer grained timer resolutions. The TOCS paper,
 referenced on that site, also describes a mixed approach between
 Soft-Timers and hardware timers to achieve tighter delay bounds.

I like most of Mohit Aron's papers, and Peter Druschel's work
is pretty much landmark.

The real problem with them is that their implementations are
for very old versions of FreeBSD, or they are licensed with a
pretty restrictive license that would prevent them from being
incorporated in FreeBSD, or both.  The soft timers aren't the
only code, either.  There are two different LRP implementations
they have released in conjunction with the Scala server project;
the one for FreeBSD 4.x is under a very restrictive license,
while the other one is very old, and both of them are graduate
student code, i.e., they are not anywhere near usable in a
production system.

The LRP paper is the one I have been obliquely referencing with
regard to receiver livelock avoidance in the thread allocating
at interrupt (or something like that).

Unfortunately, Mohit is now working for Zambeel on a Linux based
product (I have a guess about it, actually), but they are still
in stealth mode, so we aren't going to be seeing much from him
for a while.

Druschel is still working for Rice, last I heard, but has also
formed a forward proxy cache company with some other academics,
and they consistently win the performance numbers in the cache
bakeoffs for forward proxy caches tested via the polygraph
benchmark (polygraph is a typical "your product is a bad idea,
so I am going to write something to kill it" benchmark, but
everyone publishes numbers, so it's accepted as a figure of
merit, even though it goes incredibly out of its way to first
characterize how a forward proxy cache works, and then pick a
pessimal usage pattern to bust it... so there).

Most of the Scala ideas don't require 5 PhD's to understand, and
someone well versed in the literature could implement them, with
effort (for example, I've implemented LRP for FreeBSD 4.3-STABLE
for the company I work for; LRP deals with one of the seven deadly
receiver livelock issues).

With HP selling a 10Gbit copper ethernet card on 64 bit PCI, even
without the standards ratified (the optical geeks can't get their
act together to get their components specified for that rate), it
would be very easy to overwhelm your bus with DMA from the copper
card, if you didn't handle each of the seven deadly issues in the
livelock case correctly.

For those who don't know the term, think denial of service attack.

-- Terry




Re: Allocate a page at interrupt time

2001-08-09 Thread Terry Lambert

Weiguang SHI wrote:
 
 I found an article on livelock at
 
 http://www.research.compaq.com/wrl/people/mogul/mogulpubsextern.html
 
 Just go there and search for livelock.
 
 But I don't agree with Terry about the interrupt-thread-is-bad
 thing, because, if I read it correctly, the authors themselves
 implemented their ideas in interrupt thread of the Digital Unix.

Not quite.  These days, we are not necessarily talking about
just interrupt load limitations.

Feel free to take the following with a grain of salt; but
realize, I have personally achieved more simultaneous connections
on a FreeBSD box than anyone else out there without my code in
hand, and this was using gigabit ethernet controllers on modern
hardware, and further, this code is in shipping product today.

--

The number one way of dealing with excess load is to load-shed
it before the load causes problems.
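
In its simplest form, load shedding is a bounded input queue that
refuses new work at the door, before any processing has been spent on
it.  The following is only a sketch of that idea; struct inq,
inq_offer() and inq_take() are hypothetical names, and a real
implementation would hang packets off the queue rather than just
counting them:

```c
#include <stddef.h>

struct inq {
	size_t len;		/* items currently queued */
	size_t limit;		/* high-water mark */
	unsigned long dropped;	/* items shed so far */
};

/*
 * Offer one item to the queue.  Returns 1 if it was accepted, 0 if
 * it was shed because the queue is already at its limit.  The point
 * is that the drop happens here, at the earliest possible moment.
 */
int
inq_offer(struct inq *q)
{
	if (q->len >= q->limit) {
		q->dropped++;
		return (0);
	}
	q->len++;
	return (1);
}

/* The consumer finished one item, which frees a slot. */
void
inq_take(struct inq *q)
{
	if (q->len > 0)
		q->len--;
}
```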

In an interrupt threads implementation, you can't really do
this, since the only option you have is when to schedule a
polling operation.  This leads to several inefficiencies,
all of which negatively impact the top end performance you
are going to be able to achieve.

Use of interrupt threads suffers from a drastically increased
latency in reenabling of interrupts, and can generally only
perform a single polling cycle, without running into the problem
of not making forward progress at the application level (they
run at IPL 0, which is effectively the same time at which NETISR
is currently run).  This leads to a tradeoff in increased
interrupt handling latency (e.g. the Tigon II Gigabit ethernet
driver in FreeBSD sets the Tigon II card firmware to coalesce at
most 32 interrupts), vs. the transmit starvation problem noted in
section 4.4 of the paper.

It should also be noted that, even if you have not reenabled
interrupts, the DMA engine on the card will still be DMA'ing
data into your receiver ring buffer.  The burst data rate on
a 66MHz, 64 bit PCI bus is just over 4Gbits/S, and the sustainable
data rate is much lower than that.

This means a machine acting as a switch or firewall with two
of these cards on board will not really have much time for
doing anything at all, except DMA transfers, if they are run
at full burst speed all the time (not possible).  Running an
application which requires disk activity will further eat into
the available bandwidth.
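
The bus arithmetic behind this is simple enough to sketch.  The
function names are hypothetical, and the clock is rounded to 66MHz
(nominal 66MHz PCI is slightly faster), so treat the result as a
ceiling for burst transfers and nothing more:

```c
/* Theoretical burst rate of a parallel bus: one transfer per clock. */
unsigned long long
pci_burst_bps(unsigned long long clock_hz, unsigned int width_bits)
{
	return (clock_hz * width_bits);
}

/*
 * Worst case traffic that 'cards' network cards can push across the
 * bus, counting both directions when the links are full duplex.
 */
unsigned long long
offered_bps(unsigned int cards, unsigned long long link_bps, int full_duplex)
{
	return (cards * link_bps * (full_duplex ? 2 : 1));
}
```

A 66MHz, 64 bit bus bursts at about 4.2 Gbit/S, while two full duplex
gigabit cards can offer 4 Gbit/S, so the cards alone can come close
to the burst ceiling before any disk traffic is counted.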

So this raises the spectre of DMA-based bus transfer livelock:
not just interrupt based livelock, if one is scheduling interrupt
threads to do event polling, instead of using one of the other
approaches outlined in the paper.

In the DEC UNIX case, they mitigated the problem by getting rid
of the IP input queue, and getting rid of NETISR (I agree that
these are required of any code with these goals).  The use of
the polling thread is really just their way of implementing the
polling approach, from section 5.3.  This does not address the
problems I noted above, and in particular, does not address the
latency vs. bus livelock tradeoff problem with modern hardware
(they were using an AMD LANCE Ethernet chip; this was a 10Mb
chip, and it doesn't support interrupt coalescing).  They also
assumed the use of a user space forwarding agent (screend):
a single process.

Further, I think that the feedback mechanism selected is not
really workable, without rewriting the card firmware, and
having a significant memory buffer on the card, something which
is not available on the market yet today.  This is because,
in practice, you can't stop all incoming packet processing just
because one user space program out of dozens has a full input
queue that the user space program has not processed yet.  It's
not reasonable to ignore new incoming requests to a web server,
or to disable card interrupts, or to (for example) drop all ARP
packets until TCP processing for that one application is complete:
their basic assumption -- which they admit, in section 6.6.1, is
that the screend is the only application running on the system.
This is simply not the case with a high traffic web server, a
database system, or any other work-to-do-engine model of several
processes (or threads) with identical capability to service the
incoming requests.

Further, these applications use TCP, and thus have explicitly
application bound socket endpoints, and there is no way to
guarantee client load.  We could trivially DOS attack an Apache
server running SSL via mod_proxy, for example, by sending a flood
of intentionally bad packets.  The computation expense would keep
its input queue full, and therefore, the feedback mechanism noted
would starve the other Apache processes of legitimate input.
There are other obvious attacks, which are no less damaging in
their results, which attack other points in the assumption of a
single process queue feedback mechanism.

Their scheduler in section 7, which is in effect identical to the
fixed scheduling class in SVR4 (which was used by USL to avoid
the move mouse, wiggle cursor problem when using the 

Re: Allocate a page at interrupt time

2001-08-09 Thread Terry Lambert

Greg Lehey wrote:
  Solaris hits the wall a little later, but it still hits the
  wall.
 
 Every SMP system experiences performance degradation at some point.
 The question is a matter of the extent.

IMO, 16 processors is not unreasonable, even with standard APIC
based SMP.  32 is out of the question, but that's mostly because
you can't have more than 32 APIC ID's in the current 32 bit
processors, and still give one or more away to an IO APIC.  8-).


  On Intel hardware, it has historically hit it at the same 4 CPUs
  where everyone else tends to hit it, for the same reasons;
 
 This is a very broad statement.  You contradict it further down.

I contradict it for SPARC; I don't think I contradicted it
for Intel, but am willing to take correction...


  Solaris claims to scale to 64 processors while maintaining SMP,
  rather than real or virtual NUMA.  It's been my own experience that
  this scaling claim is not entirely accurate, if what you are doing
  is a lot of kernel processing.
 
 I think that depends on how you interpret the claim.  It can only mean
 that adding a 64th processor can still be of benefit.

The 4 processors Intel claim is a point of diminishing
returns, and is well enough known that it is almost passed
into folklore (which might not bode well for finding people
building boards with more, which would be unfortunate).  My
SPARC experience is likewise one of diminishing returns, where
it becomes cheaper to buy another box to get the performance
increment, than to stick more processors in the same box.
It's definitely anecdotal on my part.


  On the other hand, if you are running a lot of non-intersecting user
  space code (e.g. JVM's or CGI's), it's not as bad (and realize that
  FreeBSD is not that bad in the same situation, either: it's just not
  as common in practice as it is in theory).
 
 You're just describing a fact of life about UNIX SMP support.

Practice vs. Theory?  Or the inevitability of UNIX SMP support
having the performance characteristics it has most places?  I don't
buy the "we must live with the performance because it's UNIX"
argument, if you meant the latter.


  It should be noted that Solaris Interrupt threads are only
  used for interrupts of priority 10 and below: higher priority
  interrupts are _NOT_ handled by threads (interrupts at a
  priority level from 11 to 15).  10 is the clock interrupt.
 
 FreeBSD also has provision for not using interrupt threads for
 everything.  It's clearly too early to decide which interrupts should
 be left as traditional interrupts, and we've done some shifting back
 and forth to get things to work.  Note that the priority numbers are
 noise.  In this statement, they're just a convenient way to
 distinguish between threaded and non-threaded interrupts.

FreeBSD masks, Solaris IPLs.  In context, this was meant to
show why Solaris' approach is not directly translatable to
FreeBSD.

I really can't buy the idea that interrupt threads are a good
idea for anything that can flood your bus or interrupt bandwidth,
or have tiny/non-existent FIFOs, relative to the speeds they are
being pushed; right now that means might be OK for disks; not OK
for really fast network controllers, not OK for sorta fast network
controllers without a lot of adapter RAM, not OK for serial ports
and floppies, at least in my mind.



 I think somebody else has pointed out that we're very conscious of CPU
 affinity.

I think affinity isn't enough; I've expressed this to Alfred on a
number of occasions already, when I see him in the hallway or at
lunch.  Dealing with the problem is kind of an all-or-nothing bid.


  In the 32 processor Sequent boxes, the actual system bus was
  different, and directly supported message passing.
 
 Was this better or worse?

For the intent, much better.  It meant that non-intersecting
CPU/peripheral paths could run simultaneously.  The Harris
H1000 and H1200 had a similar thing (big iron real time
systems used on Navy ships and at the college where Wes and
I went to school).


  Also, the Sun system is still an IPL system, using level based
  blocking, rather than masking, and these threads can find
  themselves blocked on a mutex or condition variable for a
  relatively long time; if this happens, it resumes the previous
  thread _but does not drop its IPL below that of the suspended
  thread_, which is basically the Dijkstra Banker's Algorithm
  method of avoiding priority inversion on interrupts (i.e. ugly).
 
 So you're saying we're doing it better?

Long term priority lending is the real problem I'm noting; this
is an artifact of context borrowing, more than anything else
(more below).

I think the FreeBSD use of masking is better than IPL'ing, and is
an obvious win in the case of multiple cards, since you can run
multiple interrupt handlers at the same time, but wonder what will
happen when/if it gets to UltraSPARC hardware.  I think the Dijkstra
algorithm, in which contended resources are prereserved based on an
anticipated need, 

Re: need help

2001-08-10 Thread Terry Lambert

smail wrote:
 
 Hello freebsd-hackers,
 
 i need some help. my problem is about memory limit in mmap function.
 i can't mmap files infinitely, after some number of file mmaped in
 memory i've got an error, probably causing memory limit of 2 or 4 Gb.
 can you help me? my platform is FreeBSD 4.3/i386 [128Mb RAM, 4Gb
 swap].

This is probably your homework, isn't it?  8-)


Your address space is by default limited to 3G (the kernel virtual
address space is 1G and the user address space is 3G: the kernel
needs to be able to get at all memory).

You can change this a little, but you will still bump your head
on the limiting fact that you have only 32 bits with which to
address all the memory you use.

To exceed this limit, you will have to go to indirect mappings;
using indirect mappings, you divide your address space up into
chunks, and then when you need to access an additional chunk of
data, and all your chunks are busy, you evict a previous mapping
and map the chunk there instead.  A common algorithm for this
type of eviction is LRU.

Effectively, you are managing the mapping to implement in software
a virtual address space in excess of your 3G limit; in effect,
there is no practical limit on the amount of data you can access
in this way, since you could go to eviction of your table of ranges,
after going to a hierarchy of tables of ranges of file data that
you map.

A simple example of this technique would be to map a smaller
portion of the file (say 8MB of the file) at a time, and keep a
quad word (C type long long, FreeBSD type off_t) of the offset.
To access the first 8MB of the file, mmap it in memory, and iterate
through it.  Now iterate through 8GB of the file by moving the
mapping, only when necessary.

This technique is known as windowing; here is a simple example.
It will be incredibly slow, since the algorithm parameters should
be tuned to the data you will be accessing:

/* I call this program "How much is that file in the window?" */

#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void *basep;                            /* window address */
off_t relbase;                          /* window base */
off_t eight_meg = 0x00800000LL;         /* window size */
int memfd;                              /* lazy global */

unsigned char byte_at_offset(off_t offset);

/*
 * Stupid program to dump out the first 8G of a file, one
 * byte at a time, using a windowed get byte function.
 * The file is assumed to be at least 8G long.
 */
int
main(void)
{
        off_t curoff;
        unsigned char c;

        memfd = open("my_overly_large_file", O_RDONLY, 0);
        if (memfd < 0) {
                perror("open");
                return 1;
        }

        /* Go from 0 to 8G, one byte at a time... */
        for (curoff = 0; curoff < 0x0000000200000000LL; curoff++) {
                c = byte_at_offset(curoff);
                putchar(c);
        }
        return 0;
}

/*
 * Access a character in an arbitrary length file by mapping
 * it into memory in 8M windows, changing the mapping when
 * the requested character lands before or after the current
 * window, so that a lot of mapping and unmapping isn't needed.
 */
unsigned char
byte_at_offset(off_t offset)
{
        off_t reloffset = offset - relbase;

        /*
         * if the offset is before, or after the window, or
         * if we haven't yet set up a window, then we need to
         * modify the window we are using.
         */
        if (relbase > offset || reloffset >= eight_meg || basep == NULL) {
                /* if this is the first time, there is nothing to unmap */
                if (basep != NULL) {
                        munmap(basep, eight_meg);
                }
                /* get a new relative base, offset, and a new window */
                relbase = offset - (offset % eight_meg);
                reloffset = offset - relbase;
                basep = mmap(NULL, eight_meg, PROT_READ, MAP_SHARED,
                             memfd, relbase);
                if (basep == MAP_FAILED) {
                        perror("mmap");
                        exit(1);
                }
        }

        /*
         * return the requested character from its relative
         * location in the window.
         */
        return ((unsigned char *)basep)[reloffset];
}


-- Terry




Re: Allocate a page at interrupt time

2001-08-10 Thread Terry Lambert

Mike Smith wrote:
 The basic problem here is that you have decided what interrupt threads
 are, and aren't interested in the fact that what FreeBSD calls interrupt
 threads are not the same thing, despite being told this countless times,
 and despite it being embodied in the code that's right under your nose.
 
 You believe that an interrupt results in a make-runnable event, and at
 some future time, the interrupt thread services the interrupt request.
 
 This is not the case, and never was.  The entire point of having
 interrupt threads is to allow interrupt handling routines to block in the
 case where the handler/driver design does not allow for nonblocking
 synchronisation between the top and bottom halves.

So enlighten me, since the code right under my nose often
does not run on my dual CPU system, and I like prose anyway,
preferably backed by data and repeatable research results.

What do interrupt threads buy you that isn't there in 4.x,
besides being one hammer among dozens that can hit the SMP
nail?

Why don't I want to run my interrupt to completion, and want
to use an interrupt thread to do the work instead?

On what context do they block?

Why is it not better to change the handler/driver design to
allow for nonblocking synchronization?


Personally, when I get an ACK from a SYN/ACK I sent in
response to a SYN, and the connection completes, I think
that running the stack at interrupt all the way up to
the point of putting the completed new socket connection
on the associated listening socket's accept list is the
correct thing to do; likewise anything else that would
result in a need for upper level processing, _at all_.
This lets me process everything I can, and drop everything
I can't, as early as possible, before I've invested a lot
of futile effort in processing that will come to naught.

This is what LRP does.

This is what Van Jacobson's stack ([EMAIL PROTECTED])
does.

Why are you right, and Mohit Aron, Jeff Mogul, Peter
Druschel, and Van Jacobson, wrong?


 Most of the issues you raise regarding livelock can be
 mitigated with thoughtful driver design.  Eventually,
 however, the machine hits the wall, and something has to
 break.  You can't avoid this, no matter how you try; the
 goal is to put it off as long as possible.
 
 So.  Now you've been told again.

Tell me why it has to break, instead of me disabling receipt
of the packets by the card in order to shed load before it
becomes an issue for the host machine's bus, interrupt
processing system, etc.?

Are you claiming that dropping packets that are physically
impossible to handle, as early as possible, while handing
_all_ packets that are physically possible to handle, is
broken, or is somehow unpossible?

Thanks for any light you can shed on the subject,
-- Terry

PS: If you want to visit me at work, I'll show you code
running in a significantly modified FreeBSD 4.3 kernel.




Re: the =+ operator

2001-08-13 Thread Terry Lambert

John Merryweather Cooper wrote:
   Prototypes are an overwhelmingly Good Thing(tm)
   as behind-your-back implicit parameter conversion is death to serious
   numerical work.  At least now, some control can be exercised over
  parameter
   conversions . . .
 
  Who ever said anything about not being able to do that in Terry's view?
  You are taking one statement and running wildly with it.
 
 In my view, he was advocating chucking ANSI-89 and returning to the wild
 days of K&R.  I think that would be very bad.  Clearly, you disagree with
 my understanding.

Not my intent; I'm well known to dislike many of the decisions
that the X3J11 committee made; in comp.lang.c, there was a long
firefight, which only ended after [EMAIL PROTECTED] came down
on my side of the argument and said:

Let me begin by saying that I'm not convinced that
even the pre-December qualifiers (`const' and `volatile')
carry their weight; I suspect that what they add to the
cost of learning and using the language is not repaid in
greater expressiveness. `Volatile', in particular, is a
frill for esoteric applications, and much better expressed
by other means.  Its chief virtue is that nearly everyone
can forget about it.  `Const' is simultaneously more
useful and more obtrusive; you can't avoid learning about
it, because of its presence in the library interface.
Nevertheless, I don't argue for the extirpation of
qualifiers, if only because it is too late. 

The fundamental problem is that it is not possible to
write real programs using the X3J11 definition of C.
The committee has created an unreal language that no
one can or will actually use.  While the problems of
`const' may owe to careless drafting of the
specification, `noalias' is an altogether mistaken
notion, and must not survive.

See http://www.lysator.liu.se/c/dmr-on-noalias.html for the
full text of his posting.

He also has a couple of choice words on prototypes requiring
themselves.

 In benchmarking IBM's VisualAge C++ (Version 4.0), this seems to be the
 case, at least for me.  I chose this compiler because it is easy, with the
 tools available for me to monitor the stages of compilation since each
 stage has a separate DLL.  Using SciTech's MGL 5.0 Beta 2 Library, it is
 clear that Lexing/pre-processing take up the lion's share of the time.
 Obviously, your mileage differs.  I would like to have your understanding
 of what's happening--and not this troll-mine.

What's happening is that compiler users outnumber compiler
writers, 100,000 to 1.

Ergo, if a compiler writer can make a change that saves 1 hour
of user time, he has saved 100,000 hours of user time.

That is over 11 man years.

Clearly, all tradeoffs should be made in favor of compiler users,
not in favor of compiler writers, for the betterment of mankind.


 I know it's ambiguous.  In fact, I think it's the most poorly
 standardized/described language to date.  However, since C++ is quite
 popular, apparently my opinion doesn't carry much weight.  :)

The popularity of C++ was one of the driving factors behind
the inclusion of prototypes.

Even using symbol decoration, the approach used by both GCC
and Microsoft Visual C++, there is enough information present
for parameter errors to be identified and corrected at link
time.  The problem at the time, however, is that there was a
race between Microsoft and Borland to have the fastest
compiler; note that this doesn't mean the compiler that puts
out the fastest code or the compiler that makes the jobs
done by the programmers who use it take the shortest amount
of time.  So we got standardization of a language in which
there were exposed bolts like volatile and const, which
make the compiler writer's job (defined as compiling fast)
easier, since barring these keywords, they were allowed to
make assumptions that broke previously working code.


 From the reading I've done, I believe this conclusion is justified.
 Doubtless there are other opinions though . . .

The C Programming Language
Brian W. Kernighan, Dennis M. Ritchie
Prentice-Hall
ISBN: 0-13-110163-3

P. 212:

17. Anachronisms

Since C is an evolving language, certain obsolete
constructions may be found in older programs.  Although
most versions of the compiler support such anachronisms,
ultimately they will disappear, leaving only a portability
problem behind.
Earlier versions of C used the form =op instead of
op= for assignment operators.  This leads to ambiguities,
typified by
x=-1
Which actually decrements x since the = and the - are
adjacent, but which might easily be intended to assign -1
to x.
The syntax of initializers has changed: previously,
the equals sign that introduces an 

Re: can't generate vnode_if.h automatically

2001-08-13 Thread Terry Lambert

Rohit Grover wrote:
 
 On Sun, 12 Aug 2001, Dima Dorfman wrote:
  Rohit Grover [EMAIL PROTECTED] writes:
   Interestingly, when I executed the command  'make depend',
   vnode_if.h was correctly created for me. I'd like to know why I don't
   need to do a 'make depend' for modules like 'vn' or 'nfs' before
   building them.
 
  Perhaps because it was done before?  Check to see if you have a
  '.depend' file in those directories.  I'm pretty sure you do.
 
 I did not find a .depend in sys/modules/vn.

It's because of the order of declaration of the variables
containing the source and the objects vs. the .include
directive(s) in your Makefile.

-- Terry




Re: mtio questions

2001-08-13 Thread Terry Lambert

Bernd Walter wrote:
 
 On Sun, Aug 12, 2001 at 11:46:57AM -0700, Terry Lambert wrote:
  Bernd Walter wrote:
   Another point:
   Can we '#define MTEOM MTEOD' as MTEOM is used on NetBSD and Solaris?
 
  End of Message is not the same as End of Data for some
  drives; this could break old 8-track (no, not the music, and
  not a typo for 9-track!) drives, e.g. Zilog and Cypher.
 
 Well that's what Solaris 8 sys/mtio.h tells about MTEOM:
 #define MTEOM   10  /* position to end of media */
 And here NetBSD 1.5:
 #define MTEOM   10  /* forward to end of media */
 
 Neither of them is saying Message.

I was thinking Media, but wrote Message, since that's what
the ASCII character EOM means; my bad.

The end of the media can be interpreted as after the first
EOF, before the second, in order to permit the tape volume to
be extended.

It can also be interpreted to mean before the first of two EOFs,
such that the last extent can be extended.

It's really hardware dependent, and ambiguous.

 Please correct me if I'm wrong:
 If I want to append to a tape I would MTEOM on Solaris and MTEOD on
 FreeBSD so it's supposed to be used for the same reason.
 None of the OS I looked into had both.
 
 But well - that's what HP-UX define:
 #define MTEOD   8   /* DDS, QIC and 8MM only - seek to end-of-data */

These devices are not absolutely positionable to EOM; they
leave you after the last data block; on QIC, which records
like:

---.
  ,---+--.
  `---'  |
-'

It's nearly impossible to position to an exact location.

DEC MT-50, MT-75, and 9-Track drives, on the other hand, were
absolutely positionable, and often were written with a real
filesystem on them (FILES-11 format instead of ANSI format).

You are delving into an area where things vary widely by vendor
and the crossproduct of drivers and hardware...

-- Terry




Re: pthreads and poll()

2001-08-14 Thread Terry Lambert

Daniel M. Eischen wrote:
 We don't provide locking for fd's any longer (I thought this was only in
 -current, but your results seem to indicate otherwise).  If we did, only
 one thread would wake up.  The mistake in your sample seems to be that
 you're having all threads block on the same fd.  Why?

Probably he has a bunch of daemons waiting around for work to
do (e.g. HTTP daemons all listening for connections to accept
on the same fd).  Lots of applications could use this model
to get a performance boost.

-- Terry




Re: ncurses

2001-08-16 Thread Terry Lambert

Hans Zaunere wrote:
 
 I'm sorry that this is offtopic, but I've looked/asked
 everywhere and no one has a clue.
 
 Once a program does initscr(), is it possible to
 printf()?  I can printf() stuff without a problem, but
 it doesn't get to the screen until the program exits?
 
 I've done every ncurses function I can think of,
 endwin(), etc.  However if there is a printf()
 anywhere after ncurses stuff has happened, nothing is
 printed to the screen until the program exits.  What
 am I missing?  Is there a trick to this, as it must be
 possible, right?

Printf goes to a buffer, which is not necessarily flushed
until you do input using a stdio function, or explicitly
call fflush(stdout).
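
To see the buffering for yourself, here is a small sketch; it uses a
temporary file rather than the terminal so that the effect can be measured,
and the helper names (bytes_on_disk, flush_demo) are made up for the
illustration:

```c
#include <stdio.h>
#include <sys/stat.h>

/* Bytes that have actually reached the file behind a stdio stream. */
static long
bytes_on_disk(FILE *fp)
{
    struct stat st;

    if (fstat(fileno(fp), &st) != 0)
        return -1;
    return (long)st.st_size;
}

/*
 * Write through a fully buffered stream and record how much has been
 * written out before and after an explicit fflush().
 */
static int
flush_demo(long *before, long *after)
{
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;
    setvbuf(fp, NULL, _IOFBF, BUFSIZ);   /* full buffering, set explicitly */
    fprintf(fp, "hello");
    *before = bytes_on_disk(fp);         /* text is still in the stdio buffer */
    fflush(fp);
    *after = bytes_on_disk(fp);          /* now it has been written out */
    fclose(fp);
    return 0;
}
```

The same applies to stdout: a printf() followed by fflush(stdout) pushes
the text out immediately instead of leaving it in the buffer.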

Generally speaking, if you have started curses, all your
output should be done using curses routines, so that the
curses library knows about it, and does the right thing
for refresh, maintaining the current cursor location so
the next printf() goes to the right place, etc..

-- Terry




Re: Writing a packet alias translator, need help

2001-08-16 Thread Terry Lambert

Joe Clarke wrote:
 
 I'm trying to write a packet alias translator for a protocol that uses TCP
 to setup a UDP streaming session (much like the smedia driver that's
 already there).  I'm having a problem getting the translated port to mesh
 with the actual port.  Here's what I've done:
 
 /* msg is a TCP setup packet
  struct msg {
 u_int32_t ipAddr;
 u_int32_t portNumber;
  };
 */

One obvious thing is that ports are 16 bits, not 32 but...

 is UDP 16704, but the translation puts 50535 in the packet.

The bit patterns these make are not even remotely similar,
meaning that this isn't a byte order issue; I think you will
need to run the code in a debugger (or add printf's).
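
For what it's worth, the check is easy to do mechanically.  A small sketch
(the function names are made up for this example): 16704 is 0x4140, which
byte-swaps to 0x4041 (16449), while 50535 is 0xC567, so no 16 bit or 32 bit
swap turns one into the other.

```c
#include <stdint.h>

/* Swap the two bytes of a 16 bit value. */
static uint16_t
swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the four bytes of a 32 bit value. */
static uint32_t
swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/*
 * Return 1 if `seen` is what `expected` would look like after a 16 bit
 * swap, a 32 bit swap, or a 32 bit swap truncated to either half.
 */
static int
looks_like_byte_order_bug(uint32_t expected, uint32_t seen)
{
    uint32_t s32 = swap32(expected);

    return seen == swap16((uint16_t)expected) ||
           seen == s32 ||
           seen == (s32 >> 16) ||
           seen == (s32 & 0xffffu);
}
```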

-- Terry




Re: IP address on bridge

2001-08-17 Thread Terry Lambert

Eugene L. Vorokov wrote:
 I'm observing some strange problem when I have an IP address on one card
 on a bridge machine and want to telnet in. I have 4.2-RELEASE box with
 two network cards: Realtek 8139 (rl0) and 3Com 3C905B (xl0). rl0 is connected
 to the world, and xl0 to the intranet switch. FreeBSD handbook says that
 I'm allowed to assign an IP address to one of the two interfaces. Okay,
 so I assign the address to xl0. But I'm unable to access it from a machine
 on xl0 side. arp is found properly, and packets are sent, but somehow
 bridge machine just ignores those packets (tcpdump shows nothing).
 
 If I assign IP address to rl0 rather than xl0, it works for short time,
 then machine I telnet from says that arp of the bridge is moved to xl0
 arp again, and packets get lost. ifconfig rl0 down/up and ping'ing the
 machine I telnet from (so it gets proper arp) heals, but for the short time
 again.

1)  The xl0 interface is working for transmit but not receive,
or it would keep working after the ARP move.

2)  You are putting both interfaces on the same wire; this
means you have another bridge out there somewhere, or
the wire doesn't need to be bridged, and is why the ARP
is claimed to have moved.

3)  A gratuitous ARP is sent when you ifconfig an interface
to add an IP address (e.g. when you add an alias, or
bring the interface from down to up).  This is why the
pinging heals when you reset the interface.

So...

A)  Do not put two interfaces on the same wire, particularly
if you have not set your netmask to make their listen
ranges non-intersecting.

B)  Make sure the xl0 interface is correctly assigning an
interrupt, etc..  You can check this by making it the
default gateway for the machine, not configuring the
other interface, and seeing if things work.  If they
don't, you have a bad card, driver, or BIOS (the BIOS
does the IRQ assignment, if you have PNP OS enabled
in the BIOS configuration, and some BIOS' do it wrong).

C)  Find the other bridge, if you haven't put both cards
on the same wire segment, since _someone_ is forwarding
those ARP packets, if that's the case.

D)  Realize you can only have one default interface on a
machine at a time, so correctly use your subnet masks.

E)  Consider doing routing instead of bridging.


-- Terry




Re: function calls/rets in assembly

2001-08-26 Thread Terry Lambert

David O'Brien wrote:
  If gcc team wants to implement proper
  alignment to work with SSE and other high-specialized stuff,
  they should learn commands for bitwise AND, and use only where
  really needed.
 
 Perhaps you'd like to send your patch to [EMAIL PROTECTED]
 Perhaps you'd like to explain to them why they are so wrong about this?
 You'd do that at [EMAIL PROTECTED]
 
 They have their reasons for this, and I'll let them explain them.

If it's anything like their reasons for the per-thread exception
stack allocation being static, and thus causing additional overhead
for non-C++ and non-threads-using programs (i.e., mainly because
they already had it implemented that way in egcs when Jeremy Allison
made the patches to do it right for gcc when we were making ACAP run
for the first time under the GNU tool chain), then you might as well
not bother sending them email.

-- Terry




Re: PCI Enumeration

2001-08-26 Thread Terry Lambert

Ronald G Minnich wrote:
 On Sat, 25 Aug 2001, Mike Smith wrote:
 
  I/O space is easy, but memory space is hard.  Userspace access to
  physical memory is a big no-no in the *nix world.
 
 I want to disagree just a bit. If you look at myrinet, or the many fpga
 cards, it's the standard modus operandi. You have to do it that way.

I think that Mike's point is that even when you access
physical memory from user space, you are accessing it as
virtual memory.

Normally, the device RAM window is mapped into the KVA,
and such access is done via /dev/kmem, which is translated
through the KVA virtual mapping.  Even if you mmap such a
region into a user space process's address space, you end
up translating through the UVA of the process to get at the
RAM (X does this for linear frame buffers, etc.).

If you are trying to do this windowed, or through another
mapping (e.g., for a frame buffer card with 16M of RAM
plugged into a 4G system, with no way to map the full 16M
into the KVA, no matter how you slice it), it's not just
hard, it's nearly impossible to do correctly.

-- Terry




Re: function calls/rets in assembly

2001-08-27 Thread Terry Lambert

John Baldwin wrote:
  Well, now you should add wanted options to /etc/make.conf and avoid
  seeing of such nightmares.
 
 Erm, the original topic of this discussion was about attempting to use the
 assembly from the C compiler to see how things work when writing one's own
 assembly functions.  Having to know magical extra parameters to pass to the
 compiler to make this a fruitful exercise doesn't help.  If the compiler were
 more intelligent about the code it output by default in the first place, then
 that would help.

Should we all start chanting now?

Sign extend to int!
Sign extend to int!
K&R were right!
K&R were right!
Sign extend to int!

8^P

-- Terry




Re: Portability of #warning in /usr/include

2001-08-28 Thread Terry Lambert

Mark D. Anderson wrote:
  This may not work.
 ...
  Some of those compilers
  would NOT let you '#ifdef' out the version that it did not recognize
  (perhaps thinking that '#warn' or '#warning' might be some gross typo
  for '#else' or '#endif', I guess...).
 
 this is true; some compilers seem to require that #ifdef'd out code
 be syntactically correct.

This can be handled by using an external preprocessor, before
handing the code to the compiler.

From my recollection, the only thing a preprocessor is required
to pass through is #pragma directives.

-- Terry




Re: Undefined symbol _ZTVN10__cxxabiv117__class_type_infoE

2001-08-31 Thread Terry Lambert

Jan Mikkelsen wrote:
 
 You probably have the system default libstdc++.so.3 in your library search
 path before the GCC 3 libstdc++.so.3.  Try setting LD_LIBRARY_PATH to the
 GCC 3 lib directory.

NOTE:

If you are using the FreeBSD .mk files to build this, and you
are setting DESTDIR, you can set your library and include path
until the cows come home, and it won't help: you will still get
the system default C++ includes and libraries, no matter what
(the .mk files are broken for use with a compiler other than
the system default).

-- Terry




Re: FW: Interesting Router Question

2001-08-31 Thread Terry Lambert

Deepak Jain wrote:
 We've got a customer running a FreeBSD router with 2 x 1GE interfaces [ti0
 and ti1]. At no point was bandwidth an issue.
 
 The router was under some kind of ICMP attack:
 
 For about 30 minutes:
 icmp-response bandwidth limit 96304/200 pps


I've seen this happen in a lab when there are a large number
of ICMP redirects coming into the machine from the next hop,
which doesn't believe itself to be the next hop, directing
you to the real next hop.

This can happen with asymmetric routes.

You can also see this in the NAT case, where you get a
gateway redirect to the NAT box from the local gateway,
with a ping.

Stopping and restarting the ping makes it honor the
redirect for subsequent packets, but the initial ping
program does not honor it after the first (or nth) time
it gets the redirect: it merrily pounds away at the
redirecting machine.

I don't know why the route does not get adjusted like it
should, so that subsequent attempts don't trigger the
redirect, but it doesn't (this seems to be a problem with
the FreeBSD routing code).

-- Terry




Re: What is VT_TFS?

2001-09-03 Thread Terry Lambert

Zhihui Zhang wrote:
 
 What is the file system that uses VT_TFS in vnode.h? Is it still available
 on FreeBSD?  Thanks.

Julian added it for TRW Financial Services; the first public
reference machine for 386BSD (which later became FreeBSD and
NetBSD) was ref.tfs.com.  TRW supported a lot of the early
386BSD/FreeBSD effort, back before Walnut Creek CDROM threw
in and had us change the version number from 0.1 to 1.0 to
make it a bit easier to sell.  The version numbers have been
bloating ever since...

The purpose of the new vnode type was to permit the VFS to own
the vnode, instead of having it owned by the OS, as a contended
resource (System V based systems, including UnixWare, Solaris,
etc., all give ownership of vnodes to the underlying VFS,
instead of having a system wide free vnode pool, like BSD
uses).  You'd have to ask Julian to be sure, but it may even
have been done to port TFS from a System V derived system.

Julian also did the original Adaptec SCSI controller support
for 386BSD/NetBSD/FreeBSD... this was back when FreeBSD was
really 386BSD (authored by Bill Jolitz) + the patchkit (that I
originally authored, before I foisted it off on Rod Grimes,
Nate Williams, and later Jordan Hubbard, and the original
Unofficial FAQ off on to Dave Burgess).

Technically, having the vnodes owned by the VFS is a much
better design, since it helps scaling to get away from the
global list, and you can allocate the incore inode and the
vnode as a single allocation unit.  It also helps with the
VFS stacking issues, by avoiding a stacked layer race that
can happen when you are low on the vnodes, when you have two
or more stacked layers.  It also lets you proxy calls across
the user/kernel boundary more easily, which lets you do neat
things like source level debug stacking layers entirely in
user space.
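
To illustrate only the "single allocation unit" point, here is a sketch
with made-up stand-in structures (my_vnode, my_inode and my_node are not
the real FreeBSD definitions): the vnode and the incore inode live in one
block, so one allocation covers both, and getting from one to the other is
pointer arithmetic rather than a lookup.

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative stand-ins; not the real kernel structures. */
struct my_vnode { int v_usecount; };
struct my_inode { unsigned long i_number; };

/* One allocation unit holding both halves. */
struct my_node {
    struct my_vnode vn;
    struct my_inode in;
};

/* Allocate vnode and inode together; returns NULL on failure. */
static struct my_node *
my_node_alloc(unsigned long ino)
{
    struct my_node *n = calloc(1, sizeof *n);

    if (n != NULL) {
        n->vn.v_usecount = 1;
        n->in.i_number = ino;
    }
    return n;
}

/* Recover the inode from a vnode pointer without any table lookup. */
static struct my_inode *
my_vtoi(struct my_vnode *vp)
{
    struct my_node *n =
        (struct my_node *)((char *)vp - offsetof(struct my_node, vn));

    return &n->in;
}
```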


-- Terry




Re: sysent in fork()

2001-09-04 Thread Terry Lambert

Evan Sarmiento wrote:
 
 Hey,
 
 I have a question about sysent. If a modification
 to a process's p->p_sysent and associated substructures
 are made, are the changes propagated through fork
 to children?

Yes, for fork().

You probably wanted to ask about exec(), though... the answer
for exec() is it uses the brand on the exec()'ed binary to
decide which call table to use.

This probably means that you are running into a problem with
an unbranded Linux binary being run from another Linux
binary... alternately, you might be confusing the path prefix
lookup for branded binaries, which causes a path search to
look in the /compat/linux directory before looking under /,
if what is being run is a Linux binary.

-- Terry




Re: .so and threads, and stereo /dev/dsp, freebsd 4.3-stable.

2001-09-04 Thread Terry Lambert

Faried Nawaz wrote:
 Next: the OSS plugin builds but doesn't seem to work properly.  At
 some point, it tries to set /dev/dsp to stereo, and fails:
 
     tmp = 0;
     if (shm->channels == 2)
         tmp = 1;
     rc = ioctl (audio_fd, SNDCTL_DSP_STEREO, &tmp);
     if (rc < 0) {
         perror (snd_dev);
         Con_Printf ("Could not set %s to stereo=%d", snd_dev, shm->channels);
         close (audio_fd);
         return 0;
     }
 
 I have a Creative 128 card which identifies itself as
 
 pcm0: AudioPCI ES1373-8 port 0x6800-0x683f irq 10 at device 10.0 on pci0
 
 What can I do here to make quakeforge use the sound card?

I've seen something like this before.

The cause was the sound card being opened twice, and the second
open failing, without the programmer checking the open return
code in the second case.

The fix was to use the already open fd, instead of reopening
the card.
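
A minimal sketch of that kind of fix, assuming the plugin can route every
open through one helper (audio_open_once and the global audio_fd are
hypothetical names, not quakeforge's actual code):

```c
#include <fcntl.h>
#include <stdio.h>

static int audio_fd = -1;   /* -1 means "not opened yet" */

/*
 * Open the device on first use and hand back the same descriptor on
 * every later call, checking the return code from open().
 */
static int
audio_open_once(const char *path)
{
    if (audio_fd >= 0)
        return audio_fd;            /* already open: reuse it */

    audio_fd = open(path, O_WRONLY);
    if (audio_fd < 0)
        perror(path);               /* report the failure to the user */
    return audio_fd;
}
```

Every caller then gets either the one valid descriptor or a visible error,
rather than a second open that fails quietly.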

-- Terry




Re: signal handling descrpancy (FreeBSD oaf fix/Evolution)

2001-09-04 Thread Terry Lambert

David O'Brien wrote:
 
 Hi Hackers, et.al.
 
 The PIM Evolution, http://www.ximian.com/products/ximian_evolution/,
 does not run on FreeBSD.  The authors have made a change so that it will.
 However, we would like to know if FreeBSD is the odd-man-out, or if the
 authors were lucky Evolution ran on Solaris and Linux.

There's some oddity in FreeBSD's SIGCHLD handling.  If you
ignore the signal, it's supposed to magically reap when
child processes exit, according to my copy of Go SOLO 2
(The Single UNIX Specification).  FreeBSD makes you explicitly
set the handler/mask the signal yourself.

Technically, either way could be considered correct; there
was a patch to -current that fixed FreeBSD; I'm not sure
if it was MFC'ed to the 4.4 branch or not...

There was a big discussion on -current about proper handling,
if I recall correctly.  FWIW, I opposed the change (which
implied SA_NOCLDWAIT, I think, with no way to turn it off to get
historical behaviour for programs which needed it).

I hate POSIX signal handling...

-- Terry




Re: general ethernet driver changes

2001-09-04 Thread Terry Lambert

Did you have opportunity to play with the soft interrupt
coalescing we discussed?

I was able to remove a third of the interrupt overhead
from the Tigon III driver, using the approach we discussed
at the user group meeting two months back.

It looks to be a serious win... and it appears to be
applicable to almost every driver you've written (it's
about a 12 line change per driver).

-- Terry




Re: FreeBSD and Athlon Processors

2001-09-04 Thread Terry Lambert

David O'Brien wrote:
   Well, since it didn't, I might as well explain the problem here too.
   There are at least two major problems with VIA chips:
 
  [data corruption on VIA KT133/133A systems by pushing PCI and memory bus]
 
  Are you sure about that?
 
 I am.  I was having data corruption in a terrible way when I added a 2nd
 IDE UDA100 drive to a very plain MSI K7T Pro2-A 1.2GHz Athlon system.

Are you sure it's not just a CMD640 IDE controller?  They are
known to have issues; Linux has a patch... FreeBSD used to, but
I think it got yanked out, or was just turned off by default.

-- Terry




Re: SO_REUSEPORT on unicast UDP sockets

2001-09-04 Thread Terry Lambert

Vladimir A. Jakovenko wrote:
 
 Hello!
 
  According to UNPv1 SO_REUSEPORT on UDP sockets can be used to bind more than
  one socket to the same port (even with same source ip address). But quick
  look on /sys/netinet/udp_usrreq.c function udp_input() shows that this will
  work as expected (data stream duplicate) only on multicast/broadcast local
  addresses. Please pay attention to the following code fragment comments:

[ ... ]

  Is there still any real need in such backward compatibility? Can such
  functionality be added (fixed) with possibility to switch it off using
 sysctl  or kernel-build option?
 
  I find such possibility really useful at least for NetFlow data
 processing and believe that it would be useful for many UDP-based
 protocols.

Bound UDP sockets have always been problematic; there's a lot
of code out there that depends on the historical behaviour for
unicast, unfortunately, including a number of commercial
applications that run on FreeBSD (e.g. Real Server).

If you look at that code for any length of time, you will get
to see it as an armpit: it's not a good place to stick your
nose, and it tends to smell to high heaven.  At my current
job, I'm up to my elbows in it...

Similarly, there are a number of bugs in the TCP sockets as
well; specifically, there's a problem with all sockets being
treated as being in the same collision domain, when doing
automatic port assignment.  This limits you to 65535 outbound
TCP connections, even though you have multiple IP aliases on
an interface (theoretically, you should get 64k connections
per IP address, if you bind _not_ to INADDR_ANY, but instead
use a specific address, but the hash is broken).
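
For reference, pinning the source address of an outbound connection
looks roughly like this.  It is only a sketch: connect_from is a
made-up helper name, and whether each alias really gets its own
64k of ports depends on the kernel's port allocation, which is the
bug being described.

```python
import socket


def connect_from(local_ip, remote):
    """Open a TCP connection whose source address is pinned to local_ip."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 still lets the kernel choose the ephemeral port; only the
    # local address is fixed, so in principle every alias could have
    # its own port space.
    s.bind((local_ip, 0))
    s.connect(remote)
    return s
```

Calling connect_from() once per alias, rather than letting connect()
pick the source address, is the pattern that should scale with the
number of addresses once the allocation problem is fixed.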

There's also another problem with the cloned route, in the
case you get a redirect, since the clone is not properly
updated (e.g. do a ping, get a redirect, and notice that
you keep getting the redirect until you stop and restart the
ping, after which you get the correct route record: there
was a recent thread on this in -current, where someone ICMP'ed
themselves to death using multiple Gigabit interfaces as
unbonded non-VLAN equivalence routes).

-- Terry




Re: What is VT_TFS?

2001-09-04 Thread Terry Lambert

Julian Elischer wrote:
   What is the file system that uses VT_TFS in vnode.h? Is it
   still available on FreeBSD?  Thanks.
 
  Julian added it for TRW Financial Services; the first public
  reference machine for 386BSD (which later became FreeBSD and
  NetBSD) was ref.tfs.com.  TRW supported a lot of the early
  386BSD/FreeBSD effort, back before Walnut Creek CDROM threw
  in and had us change the version number from 0.1 to 1.0 to
  make it a bit easier to sell.  The version numbers have been
  bloating ever since...
 
 I think you are thinking of other stuff I did at TFS, (we had
 something similar) but never committed here.. this was actually
 done in the following commit:

Hunh.  I could have sworn that that was your baby...  I guess
I'm just remembering a conversation about the something similar.

In any case, it's useful to let a VFS layer own its vnodes...
so I'd leave it there: never know when you might need it.

-- Terry




Re: general ethernet driver changes

2001-09-04 Thread Terry Lambert

Luigi Rizzo wrote:
 
  Did you have opportunity to play with the soft interrupt
  coalescing we discussed?
 
 Did this message just leak to a mailing list, or would you
 be able to expand this (or pass a pointer to mailing lists
 where this was discussed) ?

Ignore the man behind the curtain... 8-)

Seriously, it was at two BABUG (Bay Area BSD User's Group)
meetings; no big secret.

-- Terry




Re: Tagged Command Queuing support for IC-35L0?0 ?

2001-09-05 Thread Terry Lambert

Steve Roome wrote:
  Can these newer drives, based on the IC-35L0?0-chipset, also be used
  with TCQ enabled in FBSD? (? is 2, 4 or 6 depending on whether the
  drive has 20, 40 or 60 GB capacity).
 
 I've got one of these :
 
 ad0: 39266MB IC35L040AVER07-0 [79780/16/63] at ata0-master UDMA66
 
 If I turn tagged queueing on, I get an awful lot of write failures and
 ata timeouts and whatnot. Basically it just doesn't work. **For me**
 
 I'm not blaming Søren Schmidt here! it could be due to broken
 hardware, code or just my sheer incompetence, but in the end I gave up
 trying, it didn't work with my last drive either, and that was a 30GP
 type drive (don't remember the model number though).
 
 As far as I remember there are apparently problems with some of these
 drives in terms of whether they even work when they leave the factory,
 but I've only ever heard that here (make what you want of that).

Search for tagged command queueing and DTLA and IBM;
you will be rewarded with many horror stories about the
drive electronics not being able to keep up on these drives,
when writing near the spindle.  This normally doesn't happen
until the disk is almost full, with Windows FS's, which will
usually place your machine safely out of warranty.

-- Terry




Re: What is VT_TFS?

2001-09-05 Thread Terry Lambert

Nate Williams wrote:
  TRW supported a lot of the early
  386BSD/FreeBSD effort, back before Walnut Creek CDROM threw
  in and had us change the version number from 0.1 to 1.0 to
  make it a bit easier to sell.
 
 *Huh*  That's revisionist history if I've ever heard it.  We
 did a 1.0 release for FreeBSD because we wanted to differentiate
 ourselves from 386BSD (lot of bad blood there with the Jolitz's)
 and NetBSD (which had a 0.8 release at that time).

FWIW: This is all archived on Minnie, thanks to Warren Toomey.

I believe that Julian was the first corporately employed
person, who had at least part of his paid job as working on
the 386BSD/FreeBSD code.

Bill Jolitz approved a 0.5 interim release of 386BSD, as
his recent family troubles and recent contract with Sun
precluded him getting the promised 1.0 release out any time
soon.

Some of the people who later split off NetBSD and released the
NetBSD 0.8 release had reverse engineered the patchkit format,
and built tools to do the same thing.  Not understanding the
fact that the patchkit was in fact a simple, single user revision
control system that I had hacked together, they released patches
of their own, starting at #1000.  This resulted in problems with
serialization, and, I believe, was one of the main factors in
their going off on their own.

Progress was made on the 386BSD 0.5 release under the auspices
of the patchkit maintainers, who had their position of control
because I did not distribute the patchkit patch making shell
scripts very widely, in order to ensure serialization, so that
the patches, when applied, would work, have proper dependency
tracking, and not result in conflicts.

There was an angry posting on Usenet by Lynne Jolitz; in it,
she claimed that 1/3 of the patchkit was good, 1/3 was benign
(but unnecessary), and 1/3 was crap.  Then she would not say
which 1/3 was which; this pissed off more people than the
original claim that only 1/3 of the code was any good.

After much sniping back and forth, Bill Jolitz posted, and
revoked his previous permission to use the 386BSD name (a
common law trademark belonging to him), and therefore he had
effectively scuttled the interim release under the 386BSD
name.

Unwilling to throw away many months of work, it was decided to
go forward with the release, under the name FreeBSD 0.1.

Walnut Creek CDROM suggested that the version number be changed
to 1.0, in order to make it an easier sell on CDROM.

Check with Warren, if you don't believe this account.

-- Terry




Re: What is VT_TFS?

2001-09-05 Thread Terry Lambert

Poul-Henning Kamp wrote:
 Nate,
 
 You're replying to Terry for christs sake!  What did you expect if not
 revisionist $anything ?
 
 Which reminds me, Adrian still owes us his story about ref :-)

Poul, you're going off again, without regard for facts.

Remember the last time FreeBSD history came up, I proved Nate
mistaken in his claim that my authorship of the original 386BSD
FAQ was revisionist history.

You can check these facts out in the archives on Minnie; I can
also provide almost every email I ever sent or received (if it
resulted in a response from me to the author), from 1988 forward,
since I have it all archived, since even at the time, I felt it
might end up being an important historical record.  At the very
least, it has provided me with a rich source of information from
which to draw, in order to study Open Source projects in general,
and 386BSD, FreeBSD, and NetBSD, in particular.

I am only willing to open up the non-private email sent or
received, and then only with considerable incentive (it is a
very large archive).

Alternately, you can go to Warren's archive and look there,
before making accusations of revisionism.

However, if you insist, I can and will happily quote large
sections of it to this mailing list, in support of any contended
claims of inaccuracy...

Thanks,
-- Terry




Re: What is VT_TFS?

2001-09-05 Thread Terry Lambert

Nate Williams wrote:
  Bill Jolitz approved a 0.5 interim release of 386BSD
 
 And then Lynne revoked this, and posted a public message to the world
 stating what obnoxious fiends we were.

Actually, Lynne didn't have the right to do this; the trademark
was Bill's, so the revocation wasn't valid until Bill did it.


  Some of the people who later split off NetBSD and released the
  NetBSD 0.8 release had reverse engineered the patchkit format,
  and built tools to do the same thing.
 
 Actually, no.  It was the person who was going to take it from me (I
 could name him, but it wouldn't do much good).  The new maintainer
 didn't do anything or respond to email for over 3 months, so Jordan took
 it over from where I left off.

I was aware that CGD had reverse engineered it.  I wasn't aware
that you had given the tools to the people who later released
the 1000 level patches.


 NetBSD was Chris Demetriou's child after he got fed up with Bill's
 promises never coming true.  I was the third committer on what would
 later become the NetBSD development box, but I still naively assumed
 that Bill's promises would eventually come to fruition.

All of us pretty much assumed that, at the time.  8-(.


 NetBSD happened when Lynne's famous email was sent out claiming we were
 all evil incarnate, and that no-one understood them anymore.

I talked to Lynne and Bill through much of that time; it was
(unfortunately) a discussion well before the fireworks that
resulted in him knowing about common law trademarks.  I was
still on good terms with them, well after the NetBSD 0.8
release, and we mostly just lost touch, rather than letting
the bickering come between us.

One thing that was not commonly known at the time, though I
guess most people know it now, is that they had had a financial
setback, followed by a death in the family, and really weren't
in any condition to be doing anything but picking up the pieces;
the whole incident was really unfortunate.


 Actually, all of the patchkit maintainers (myself, Jordan, and Rod) had
 access to your shell software.  However, it turned out that avoiding
 conflicts was hard, because serialization often required patches upon
 patches upon patches upon patches, and at some point, the
 creation/maintenance of the patchkit was greater than building a new
 release.  (Plus the fact that you couldn't install the patches w/out a
 running system, and the running system couldn't be installed on certain
 hardware w/out patches, causing a catch-22).

Yes.  It was effectively a single author thing.  I always used
it by manually applying the patches and resolving any conflicts
by hand, and then running a diff between the base tree and the
target tree.  I never really claimed it as anything other than a
vehicle for distributing patches (it sure as heck was no CVS!).

As for the binaries, we had a number of patched floppy images
floating around (I personally couldn't boot the thing at all
until I binary edited the floppy to look for 639 instead of
640 in the CMOS base memory data registers).


 Close, but the original posting was by Bill, and the revocation was done
 by Lynne.

I remember it the other way, but would have to go to tape on
it to know for sure... 8-).

Originally, Lynne recommended the patchkit and FAQ -- here's
an excerpt of a usenet posting of hers from 28 January 1993:

| You can get a copy of 386BSD from agate.berkeley.edu (and its mirror
| sites) via anonymous ftp. It is also available on CDROM from Austin
| Code Works ([EMAIL PROTECTED]) [Note -- this is unpatched 0.1 -- you should
| get the patchkit in /unofficial on agate, and also the FAQ]. 


 I was involved with the entire affair, and Warren's archive doesn't
 include much of what later became 'core' email.

Unfortunately, I cut myself out of the loop early on that,
due to the impending purchase of USL by Novell, which went through
in June of 1994, after off shore locations which were not Berne
Convention signatories had been found to house the code in case the
worst happened, so this email is not part of my personal archives.
I hope someone, somewhere has saved it for posterity...


 Also, it doesn't include the phone conversations with Bill and
 Lynne, which (obviously) aren't in the public domain.

Nor mine.

Actually, in California, Utah, and Montana, and now many more
states, so long as one party to the conversation is the one
doing the recording, you don't even have to have the periodic
beep to indicate a recording... even back then.

But I never even considered recording my calls, and I rather
doubt that anyone else had the foresight to do it, either.  It's
annoying in retrospect, because I had the equipment for doing
passive monitoring without violating the phone company rules
on connecting equipment to their wires.  20/20 hindsight... 8-(.


-- Terry




Re: What is VT_TFS?

2001-09-05 Thread Terry Lambert

Nate Williams wrote:
 You're not the only pack-rat around here.  Be careful of your claims,
 since they could come back to bite you.

I'm willing to be bitten in public, if I'm wrong... always have
been.  ;-).


 ps. I still have my phone-logs of my conversations with Bill as well. ;)

Now I'm jealous... I have some yellow legal pads with notes
on them, and two of the archives of the grand unified console
driver online discussions (what a boondoggle that turned out
to be!), but no real phone logs.

-- Terry




Re: What is VT_TFS?

2001-09-05 Thread Terry Lambert

Poul-Henning Kamp wrote:
 *I* worked at TFS, I even kept ref.tfs.com alive after Julian went AWOL.

I'm well aware of your checkered past... 8-).

I guess Julian might pipe up now about the use of the acronym
AWOL...


 Now, remind me again why historians are so picky about primary
 sources and secondary sources for historical information...

That would be... Dennis Ritchie?  8-) 8-).


 Are we done now ?

I guess...


 (Apart from Adrians story of course :-)

If you think you can beat it out of him... I think we'd all
like to sit around the camp fire and listen to it, while
stroking our long grey beards...

-- Terry




Re: auto relaying for subdomains -- why?

2001-09-05 Thread Terry Lambert

Igor Podlesny wrote:
 I noticed that some mailers (sendmail, postfix), in case they allow
 relaying for somedomain.zone, also allow relaying for
 subdomain-of.somedomain.zone.
 
 I can accept this as reasonable behavior but would like to know how to
 deny it! :) Also I wish to know what was the actual idea behind this?

Sendmail does _not_ do this by default; you have to specifically
allow it by adding entries to your M4 file from which you build
your sendmail.cf.

If I had to guess, I'd guess that you enabled the domain via a
sendmail.cw file, rather than a virtusertable, or by setting
yourself up as a promiscuous relay.

-- Terry




Re: local changes to CVS tree

2001-09-05 Thread Terry Lambert

John Polstra wrote:
 CVS claims to support multiple vendor branches, but in practice it
 doesn't work in any useful sense.  There's at least one place in the
 CVS sources where the vendor branch is hard-coded as 1.1.1.  You
 really don't want to use multiple vendor branches -- trust me. :-)
 Use two repositories instead, or use perforce.

I guess I'll ask the usual question:

Any chance of getting CVSup to transfer from a remote repository
to a local vendor branch, instead of from a remote repository to
a local repository?

This would be incredibly useful for building a combined local
source tree from multiple projects' CVS repositories.  It could
be used by FreeBSD for a number of contrib things, as well...

Just a hint hint to the Modula 3 programmers among us...

-- Terry




Re: Posix Threading

2001-09-05 Thread Terry Lambert

[EMAIL PROTECTED] wrote:
 
 Hi All,
 I am trying  to create threads under HP-UX 11 using POSIX threads library and
 using the method pthread_create(...).
 
 But I don't know how can I create a thread in a suspended state.

First the obligatory off topic humor:

This is not the place to ask about HP-UX programming; you
probably want the Hewlett-Compaqard user's mailing list... 8-p.

--

You don't give us enough information about your application
for us to tell you the correct approach to building it; you've
started with a hammer (initially suspended threads) and are
ignoring other tools (e.g. the jaws of life) in your search
for a way to instance your hammer, under the theory that your
problem is a nail (it might not be; you've given us no way to
know).

Probably switching to FreeBSD and the kqueue interface would
better match your problem.


Let's do some setup, and then guess at what you wanted to do,
and give you several approaches to satisfying our guesses...


Short answer: You can't.  The suspended state is an attribute
of the thread that is controlled by the threads scheduler, and
is not directly controllable by you.


A thread is guaranteed to be suspended only when it is waiting
on a mutex, condition variable, or blocking call (such as I/O).

I suggest you rethink your design.


If the intent is to get an immediate context switch back to
yourself, you will need to create a mutex that can not be
acquired by the newly created thread, and attempt to acquire
it as the first thing you do.

You can then immediately release it so as not to block other
threads trying the same dirty trick.  Note that there is not
an explicit yield, so you will have to do something like
this to get a yield equivalent.
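
The trick above can be sketched as follows.  Python's threading
module stands in for pthreads here, and start_held is a made-up
name; with real pthreads the same shape would use
pthread_mutex_lock() in the creator before pthread_create().

```python
import threading


def start_held(target, *args):
    """Start a thread that parks on a gate lock before running target."""
    gate = threading.Lock()
    gate.acquire()          # the creator holds the gate before the thread exists

    def runner():
        gate.acquire()      # first thing the new thread does: block here
        gate.release()      # release at once so nobody else is held up
        target(*args)

    t = threading.Thread(target=runner)
    t.start()
    return t, gate          # the creator "resumes" the thread with gate.release()
```

The new thread exists but makes no progress until the creator calls
gate.release(), which is about as close to "created suspended" as
the mutex approach gets.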


Alternately, if the intent is to create threads so that they
will be around when you need them, you would be better off
delaying their creation until you need them.  The expensive
part of threads creation is _supposed_ to be the allocation
of the stack; so if you keep a pool of preallocated stacks
lying around for your use, you will get only a small startup
latency.

If the intent is to have a pool of idle threads, ready to
go when you get request traffic, and get around the latency,
well, you'd do a lot better in the latency department if you
went to a finite state automaton, instead of messing with
threads.  But if you insist, the best you are going to be
able to get is use of a mutex, since a condition variable will
result in a thundering herd problem.  You will still have to
eat the latency for the mutex trigger to the thread you give
the work to, however (this is how I designed the work-to-do
dispatcher in the Pathworks for VMS (NetWare) product for DEC
and Novell).

If the intent is a horse race, where you create a bunch of
threads, and then open the starting gate and have them all
start going ``at once'', then you want a condition variable.  I
intentionally quoted ``at once'', since the concurrency will
not be real without multiple CPUs and cooperation from the OS,
it will be virtual -- just as if you were running multiple
processes, instead of trying to use threads.  This is an
artifact of the scheduler, and varies from implementation to
implementation.
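
A sketch of the starting-gate arrangement, again using Python's
threading module as a stand-in for pthreads (notify_all() plays the
part of pthread_cond_broadcast()).  StartingGate and race are made-up
names for illustration.

```python
import threading


class StartingGate:
    """Hold every runner until open() is called, then release them all."""

    def __init__(self):
        self._cond = threading.Condition()
        self._open = False

    def wait(self):
        with self._cond:
            while not self._open:       # re-check: guards against spurious wakeups
                self._cond.wait()

    def open(self):
        with self._cond:
            self._open = True
            self._cond.notify_all()     # broadcast: wake every waiter


def race(n):
    """Create n threads that all block at the gate; return the pieces."""
    gate = StartingGate()
    finished = []
    lock = threading.Lock()

    def runner(i):
        gate.wait()
        with lock:
            finished.append(i)

    threads = [threading.Thread(target=runner, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    return gate, threads, finished
```

The order in which runners get past the gate is up to the scheduler,
which is the ``at once'' caveat above.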

Note: I don't know what level of standards compliance the
HP-UX 11 version of pthreads has achieved; some of the things
described above are probably not going to be easy to achieve
in downrev versions of the library.  Last time I looked, the
HP-UX version of pthreads was a Draft-4 version, not a Draft-10
or standard version, so you may be in more trouble than you
originally thought; specifically, you will have a problem with
creating a preinitialized mutex in a Draft-4 version of pthreads,
so you will have to create the mutex (if you use one) first.

-- Terry




Re: local changes to CVS tree

2001-09-05 Thread Terry Lambert

Nate Williams wrote:
  I guess I'll ask the usual question:
 
  Any chance of getting CVSup to transfer from a remote repository
  to a local vendor branch, instead of from a remote repository to
  a local repository?
 
 The problem is that you aren't just transferring bits from the HEAD, but
 from multiple active branches.  As John already stated, CVS doesn't
 handle multiple 'vendor' branches well (and in this case, the FreeBSD
 tree has vendor (CSRG) branches, FreeBSD vendor branches (RELENG_2,
 RELENG_3, ...), and contrib vendor branches (TCSH, GCC, etc.)).
 
 CVS is simply not setup to do what you ask. :(

I know how to make it do it, using a numeric tuple pair
prefix to effectively force things onto a vendor branch;
CVS will just do the right thing with the data: it's
really CVSup, not CVS, which is the bottleneck.

I've actually done this one on an experimental basis, by
using CVSup to mirror the CVS repository, and then using
scripts to hack the holy heck out of the mirror during a
copy, which left me with a local repository containing only
a skeleton and a vendor branch (with ID's up in the 5000's).

It worked, but I got a cramp: the local copy was so
expensive compared to an integrated approach, that it was
not worth maintaining.

It's just been 15 years or so since I did any Modula
programming, and the Modula 3 compiler is a behemoth that
I'd rather not have to slay to get real work done.

-- Terry




Re: local changes to CVS tree

2001-09-05 Thread Terry Lambert

John Polstra wrote:
 No, Terry's idea is sound as long as you only try to track one branch
 of FreeBSD.  I.e., you consider FreeBSD to be your vendor, and you do
 a checkout-mode type of fetch from a branch of the FreeBSD repository
 and directly import it onto your own vendor branch.  This would meet
 the needs of a lot of people, e.g., companies who make products based
 on FreeBSD.

Yes, precisely.  People always complain that companies are
gun-shy of -current; the inability to tag a sufficiently
stable version is why most companies stay away from it.

This means that most commercially funded work occurs on the
-RELEASE/-STABLE branches, for fear of destabilizing their
products.  Everyone in FreeBSD-land always complains about
this, even as they continue to make -current even less
stable, and less likely to result in them getting funded
help to work on it.  So a lot of forward looking research
takes a lot longer than it should to bear fruit (or wither,
if it turns out to be a net loss).


 I have had this on my to-do list for a long time, but I have no idea
 if or when it'll ever get implemented.  It would require a focused
 period of working on it that I just don't have these days.  Maybe if
 the economy gets worse ...

I hear Hewlett-Compaqard is laying off 15,000, if that's any
incentive...

I guess a better question would be whether funding would help?

-- Terry




Re: SO_REUSEPORT on unicast UDP sockets

2001-09-05 Thread Terry Lambert

Mike Silbersack wrote:
  Similarly, there are a number of bugs in the TCP sockets as
  well; specifically, there's a problem with all sockets being
  treated as being in the same collision domain, when doing
  automatic port assignment.  This limits you to 65535 outbound
  TCP connections, even though you have multiple IP aliases on
  an interface (theoretically, you should get 64k connections
  per IP address, if you bind _not_ to INADDR_ANY, but instead
  use a specific address, but the hash is broken).
 
 I like this problem's evil sibling: client side TIME_WAITs.  If
 you build them up, you just sit there unable to allocate outgoing
 ports until they time out.

If you fix or workaround the source IP address problem, and
patch/tune the kernel for enough outbound sockets, you can
go to 250,000 outbound connections very easily.  I used a
couple of 1GB memory systems in this configuration to get my
1M (actually, closer to 2M) inbound server connections...
obviously, a server doesn't have the port limitation, when
it comes to accepting connections.

The client TIME_WAIT problem is more an issue for port reuse;
for a 2MSL timer in the standard 60 second range, this will
basically limit you to 65535/60, or ~1000 outbound connections
a second per IP address, as a sustained rate, with a total
outstanding count of 65535 * IP_address_count.

Unless you set SO_REUSEPORT/SO_REUSEADDR.

So for the client side, you are pretty much limited by the
bug (or your fix), and whatever you set the 2MSL timer down
to, as a sustained rate top end.
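
The arithmetic is simple enough to put in a few lines.  This is only
a back-of-the-envelope sketch using the figures from this message;
sustained_connect_rate is a made-up name, and a real system has
fewer than 65535 usable ephemeral ports.

```python
def sustained_connect_rate(ports=65535, twomsl_seconds=60, ip_count=1):
    """Upper bound on new outbound connections per second.

    Every closed connection ties up its local port for one 2MSL
    interval, so at most `ports` connections per address can be
    started in any window of that length.
    """
    return ports * ip_count / twomsl_seconds
```

With the defaults this comes to roughly 1092 connections a second
per address; halving the 2MSL timer or adding a second address
doubles it.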

For most real world uses, apart from test equipment, which
will usually just use raw sockets directly, and fake the
SYN/ACK for the SYN, and then accept the ACK without an RST,
you never really get up into this number of client connections
on a single box.


 Maybe net or openbsd handle these situations better, I'll have
 to check later.

I doubt it.  Until I did testing on 4.3, no one had really
run over 32,766 open sockets in a production server, since at
that point, the ucred reference count overflowed, which would
result in some strange and very hard to identify crashes, when
closing those connections.
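
For anyone who has not been bitten by this class of bug, here is
what a wrapped 16-bit counter looks like.  The message does not say
how the field was declared, so both the signed and the unsigned case
are shown; the function names are made up for illustration.

```python
import ctypes


def bump_signed16(refcount):
    """Increment a counter stored in a signed 16-bit field."""
    return ctypes.c_short(refcount + 1).value


def bump_unsigned16(refcount):
    """Increment a counter stored in an unsigned 16-bit field."""
    return ctypes.c_ushort(refcount + 1).value
```

Either way, once the count wraps, a later release can drive it to
zero while references are still outstanding, which is consistent
with crashes that only show up when the connections are closed.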

Alfred fixed this in -current, but it wasn't done consciously
to address a known problem, it was done just in case (Alfred
finds problems like that, and fixes them without necessarily
being aware of it... 8-)).  It hadn't been MFC'ed back to 4.3
until I identified an actual problem, and the root cause.

NetBSD and OpenBSD have some hacks on the server side of the
scaling problem (e.g. they have each implemented a SYN cache,
which is OK as far as it goes, but is really inferior to the
SYN cache and rate halving algorithm code (also against FreeBSD)
out of the Pittsburgh Supercomputing Center).

I've done a preliminary port of the PSC code to 4.x, actually,
though I would need to strip out a number of local changes.

One interesting thing about the SYN cache code is that it could
use the tcptmpl allocation until it saw the ACK (or even the
first data, as was suggested by some of the researchers at that
startup in India, a while back, though that's very aggressive).

Mostly, you aren't going to see the hashing on both source and
destination IP's and ports -- what you'd see in an L2/L3 switch,
if you were building one -- which would let you reuse the local
pair, so long as it was associated with a different remote pair.

That's probably the real long term fix, if there is one.
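
A toy model of keying on the full 4-tuple, to show why it permits
reuse of a local address/port pair.  This is an illustration only,
not the kernel's PCB hash; ConnectionTable is a made-up name.

```python
class ConnectionTable:
    """Connections keyed on the full (local, remote) 4-tuple."""

    def __init__(self):
        self._conns = {}

    def add(self, laddr, lport, raddr, rport):
        key = (laddr, lport, raddr, rport)
        if key in self._conns:
            # Only an exact duplicate of all four values collides.
            raise OSError("address already in use: %r" % (key,))
        self._conns[key] = "ESTABLISHED"
        return key

    def __len__(self):
        return len(self._conns)
```

With this keying, 10.0.0.1:40000 can talk to two different remote
endpoints at once; only a second connection with the same local and
remote pair is refused.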

-- Terry




Re: Posix Threading

2001-09-06 Thread Terry Lambert

John Baldwin wrote:
  If the intent is to have a pool of idle threads, ready to
  go when you get request traffic, and get around the latency,
  well, you'd do a lot better in the latency department if you
  went to a finite state automaton, instead of messing with
  threads.  But if you insist, the best you are going to be
  able to get is use of a mutex, since a condition variable will
  result in a thundering herd problem.  You will still have to
  eat the latency for the mutex trigger to the thread you give
  the work to, however (this is how I designed the work-to-do
  dispatcher in the Pathworks for VMS (NetWare) product for DEC
  and Novell).
 
 Most of what you say is ok, but this is wrong.  condition variables
 do not mandate using a wakeup all strategy.  There is such a thing
 as 'signal' instead of 'broadcast', which only wakes up one thread.

My concern over recommending this would be that it is very
implementation dependent as to which thread gets woken up.
In Linux, it could result in a full context switch for it
to be implemented by the threads system.
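
For concreteness, the wake-one pattern John describes looks roughly
like this, with Python's threading module standing in for pthreads
(notify() corresponds to pthread_cond_signal()).  WorkQueue is a
made-up name, and which waiter gets woken is left to the
implementation, which is exactly the concern above.

```python
import collections
import threading


class WorkQueue:
    """Hand work items to idle threads, waking one waiter per item."""

    def __init__(self):
        self._cond = threading.Condition()
        self._items = collections.deque()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()         # "signal": wake at most one waiter

    def get(self):
        with self._cond:
            while not self._items:      # re-check: wakeups may be spurious
                self._cond.wait()
            return self._items.popleft()
```

Because each put() wakes a single thread and every waiter re-checks
the queue under the lock, there is no thundering herd and no item
is handed out twice.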

Also I remembered something about a problem with the
implementation from Draft 2, and as I said previously, I
had no idea of the compliance level (this is from an
experience with adapting the threads in the Standard
Template Library, as distributed by the Moscow Supercomputing
Center, to do correct static mutex initialization).

In FreeBSD, you're certainly right, though it will maybe
end up having the full context switch overhead (or even
CPU selection overhead) once kernel threading via KSE is
the norm (but in FreeBSD's implementation, you might be
able to argue the same thing about mutex based triggers,
if implemented such that the context is not passed off
instead -- except that he wanted initial hibernation, and
I don't think you could guarantee that with a mutex).

FWIW, my implementation in VMS was based on DEC's MTS,
which was a BLISS-based call conversion threading package,
which I had to extend to have timers, and also had to add
all the necessary synchronization primitives.  The basic
implementation was made using ASTs with SYS$WFLOR --
wait-event-flag-OR -- very similar to condition variables.

The new condition variable primitive wasn't enough to
guarantee the necessary semantics for the application
(a port of Mentat Streams to VMS, in support of the SPX
and IPX stacks used by NetWare), and I had to build real
Mutex support on top of the primitives to get the packet
MUX to do the correct thing.

Anyway, there was really not enough information in his
request, or my potentially outdated knowledge of pthreads
on HP-UX for me to recommend condition variables with the
wake one semantics.

But again, your point is 100% valid for the FreeBSD release
version out there, and I *DID* recommend that he switch his
application to FreeBSD.  ;-).

PS: BLISS is ignorance...

Regards,
-- Terry




Re: auto relaying for subdomains -- why?

2001-09-06 Thread Terry Lambert

Igor Podlesny wrote:
 Yes, I saw this info here:
 http://www.sendmail.org/m4/features.html#relay_mail_from but the most
 valuable part of my question was about the purpose or the idea behind
 this, because it's not too clear to me why allowing relaying for domain
 FOO.BAR should allow relaying for SUB.FOO.BAR. I mentioned RFCs
 because I had hoped to find the answer there, but I still haven't
 yet...

[EMAIL PROTECTED]
[EMAIL PROTECTED]

Whose account name at your customer's site are you going to
intentionally render unintelligible, and force them to change
their business cards and stationery?

Alternately, why wouldn't they just say screw you, and set
their masquerade features to make all the machines lie and
say they were sending from the domain?

What are you trying to accomplish by prohibiting some machines
legitimately in a delegated subdomain (for which account and
other authority has been vested in someone other than the main
site administrator, such as a departmental administrator) from
sending legitimate email?

Why do you want them to have to jump through hoops in order to
be able to send email which they will ultimately jump through
the hoops -- and send through your relay anyway?

What possible legitimate purpose is served by letting [EMAIL PROTECTED]
send email, but prohibiting [EMAIL PROTECTED] from sending mail?

I suspect that you are more concerned with having only a single
MAIL_HUB relaying email through you, rather than actually
prohibiting people from using delegated subdomains.  If so,
then your problem is because you are trying to use the wrong
tool to accomplish your task: do not use domain naming to try
to control relaying, or people will simply spoof their source
addresses, and relay an incredible amount of SPAM through your
mail relays, since they will leak like a sieve.

Also note: even if you prohibit outbound, you _can't_ do the
same for inbound, without prohibiting delegation of subdomains.

This would be like me insisting that you not use the email
address [EMAIL PROTECTED], because at the top level, I will
only allow relaying for poige@ru, since morning.ru is a
delegation from ru.


In other words, if you are trying to solve a problem, tell us
the problem, don't ask us how to implement your proposed answer
to a secret problem you won't share with us.

-- Terry




Re: auto relaying for subdomains -- why?

2001-09-06 Thread Terry Lambert

Igor Podlesny wrote:
 Now  it's  all  clear  :)  and  I  understand  that  it was just a way
 SENDMAIL's  is  configured.  Another  question could be why not to use
 syntax  .foo.bar  instead  of foo.bar but I'm quite ready to call it a
 rhetorical one ;-)) (regexps are also there ;-)

The mailertable file syntax is such that:

foo.bar predicate

means relay for foo.bar, but not *.foo.bar, and:

.foo.bar predicate

means relay for *.foo.bar, but not foo.bar, and:

foo.bar         predicate
.foo.bar        predicate

means relay for both foo.bar and *.foo.bar.  The value
of predicate depends on what you want to do with the
email, and it is usually a tuple consisting of a mailer
and a disposition suffix for that mailer, e.g.:

foo.bar         local:bob
.foo.bar        smtp:[EMAIL PROTECTED]

means send all mail with an address in foo.bar to the POP3
mailbox on the local machine for the local user ``bob'', and
send all mail for any delegated subdomains of foo.bar to the
user ``tom'' with a mail account at another ISP named
``isp.com''.
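
As a rough illustration of the matching rule described above (a bare
key matches only the domain itself, a key with a leading dot matches
only its subdomains), here is a minimal C sketch.  The function name
key_matches() is made up for this example; it is not sendmail code,
and it only models the rule as stated in this message.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive string equality; domain names ignore case. */
static int
same_name(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0') {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/*
 * Hypothetical helper: returns 1 if table key `key` covers `domain`.
 *   "foo.bar"   covers only foo.bar
 *   ".foo.bar"  covers any subdomain of foo.bar, but not foo.bar
 */
int
key_matches(const char *key, const char *domain)
{
	size_t klen = strlen(key);
	size_t dlen = strlen(domain);

	if (key[0] != '.')
		return same_name(key, domain);

	/* Leading dot: domain must be strictly longer and end in key. */
	if (dlen <= klen)
		return 0;
	return same_name(domain + (dlen - klen), key);
}
```

Listing both foo.bar and .foo.bar, as in the last example above, is
then just two lookups, one for each form of the key.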

If you need to get this complicated, I suggest you read the
sendmail FAQ, or buy a copy of the O'Reilly Sendmail book.


 P.P.S.  I'm  not  quite  sure  should I start new thread or can remain
 within  it  with another question which is: What MTA software supports
 highly  configurable  relaying...  One  of  the  needed  features is a
 support  for using alternative mail routers (relays) in case when this
 MTA  can't  send  a message by itself because of networks problem.

Sendmail... this is handled by the SMART_HOST feature of
sendmail.


 For example situation could be: MTA is on a network A which is temporarily
 cut  off  from its uplink so it can't transfer mail by itself, but it
 has  a  connection (permanent or dial-up) to another mailer.

Mail routing is via DNS.  If you are on the other side of a
dialup, you should mark the mailer expensive, set HoldExpensive
to True, and then explicitly do the queue run in your link-up
script, or, if you prefer, at intervals.

Generally, what you want to do is a bad idea, since the best way
to handle this if you have an unreliable permanent connection,
is to simply use your other connection to contact the same list
of MX's that it would have contacted anyway.

 Are there such  MTAs  which can be said if you can't send it
 by yourself (would be   cool   if   additional   parameters
 were  some_time_period  and failure_reason) then use that MTA
 (ip-addr) or that (another-ip)?.

By IP address is a bad idea, though it could be done.


 I  suspect in common case such system could easily lead to
 loops and have  other  drawbacks  but  in such simple
 configuration it seems all should work fine...

Not really.  But it will take you some amount of time to
configure this correctly, and to get your back end infrastructure
in place.

I did this work for IBM Web Connections, and it took us 3 months
to do the back end stuff, and 8 months to do all the client side
stuff, so that it was all turn key.

Basically, you are asking for a huge technology transfer,
which generally runs most ISPs several hundreds of thousands
of dollars to acquire.  With the questions you are asking,
you will probably need to buy or license it from someone.

-- Terry




Re: FINAL REMINDER: FreeBSD Monthly Development Status Report -- Request For Submissions

2001-09-07 Thread Terry Lambert

Robert Watson wrote:
 
 Submissions are due this afternoon.  Please submit by e-mail ASAP.  We're
 currently substantially behind prior months -- this is in some ways
 expected due to various people on summer vacations in the Northern
 Hemisphere, but it would be nice to get things a bit more fleshed up.  In
 particular, I'd like to see reports on:

You should add a section for academic research and commercial
users of FreeBSD.

This might not be keeping with the philosophy, though, since
most of us do not trust -current enough to do our PhD Thesis,
Master's Project, or business work on it, and tend to create
derivative works of -stable, instead...

-- Terry




Re: SO_REUSEPORT on unicast UDP sockets

2001-09-07 Thread Terry Lambert

Vladimir A. Jakovenko wrote:
 Terry, I clearly understand all your explanations. Yes, we are living in
 real life and there is a lot of programs with bad design.
 
 But all what I want is possibility to receive UDP packets with
 corresponding dst IP and port by more than one process on a single
 host. This already works for Broadcast and Multicast addresses. If
 I want to get same functionality for unicast (without any kernel
 changes) I have to use UDP-proxy, which redirects given datagrams
 to loopback broadcast address, where they can be received by more
 than one process.

Yes.


 According to comment in udp_input() SO_REUSEPORT hack on unicast
 sockets could be used in single process, right?

Yes.


 How possibility to get duplicate traffic on two UDP sockets, which
 was created in _different_ processes with the same local address
 and port values, would break existing applications?

Consider a UDP based reliable delivery protocol that cares
about key frames, but not about all frames.  A good example
of this would be any RTSP/RTP type protocol implemented on
UDP, which was used to implement streaming video of a live
broadcast, using limited buffer space.

In this situation, the video is delivered by sending a key
frame, and then subsequent data is sent as deltas on that
key frame.  Effectively, this MUXes two protocols: a reliable
datagram protocol containing the key frames, and an unreliable
protocol containing the deltas, over a single channel.  This
method of key frame use is the same method used to encode DVD
data and a number of real time streaming protocols, including
a number of streaming video protocols running over UDP (the
original technique was pioneered by a company named CinemaWare,
a Utah-based developer of Amiga software, which used a technique
called cell animation to reduce image data size requirements).

Your hypothetical two-process-no-proxy program would not
correctly acknowledge key frames consisting of more than
one UDP packet, unless you delivered the unicast to both.

If you delivered the unicast to both, you would need to
build an acknowledgement proxy, which would only
acknowledge when the entire key frame had been received by
*both* processes.
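
The bookkeeping such an acknowledgement proxy would need can be
sketched in a few lines of C.  Everything here is an assumption made
for illustration: the names keyframe_state, keyframe_init() and
keyframe_record(), the limit of 32 packets per key frame, and the
fixed count of two listeners.  It is not taken from any real
streaming protocol implementation.

```c
#include <stdint.h>

#define NLISTENERS 2	/* the two processes sharing the port */

/* Hypothetical state for one key frame of up to 32 packets. */
struct keyframe_state {
	unsigned nfrags;		/* packets in this key frame */
	uint32_t seen[NLISTENERS];	/* bit i set: packet i received */
};

void
keyframe_init(struct keyframe_state *ks, unsigned nfrags)
{
	int i;

	ks->nfrags = nfrags;
	for (i = 0; i < NLISTENERS; i++)
		ks->seen[i] = 0;
}

/*
 * Record that `listener` received packet `frag`.  Returns 1 only when
 * every listener holds every packet of the key frame, which is the
 * point at which the proxy may acknowledge; returns 0 otherwise.
 */
int
keyframe_record(struct keyframe_state *ks, int listener, unsigned frag)
{
	uint32_t full;
	int i;

	if (listener < 0 || listener >= NLISTENERS ||
	    ks->nfrags == 0 || ks->nfrags > 32 || frag >= ks->nfrags)
		return 0;
	full = (ks->nfrags == 32) ? UINT32_MAX :
	    ((UINT32_C(1) << ks->nfrags) - 1);
	ks->seen[listener] |= UINT32_C(1) << frag;
	for (i = 0; i < NLISTENERS; i++)
		if (ks->seen[i] != full)
			return 0;
	return 1;
}
```

Note that the acknowledgement is held back until the slower of the
two processes has caught up, which is exactly the coupling between
the listeners that the paragraph above is warning about.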

Taking an even simpler case, you could build a UDP packet
payload classifier, which would classify UDP packets based
on payload (size, etc.), and report statistics at the end
of a run.  Your change would result in erroneous reporting.


On a philosophical note, it's questionable about whether a
unicast is directed to an IP/port pair, or whether it is
directed to a particular application, and the IP/port pair,
or even the UDP protocol, are just a necessary vehicle for
the delivery of the information.


On a practical note, if you could fix the multiple delivery
problem, so that only one listener got the packet, this would
address many, but not all, of the objections above(*).


On a purely technical note, I think you want to use something
other than unicast for your implementation: multicast group
seems to be the most correct fit (I am in the camp that a
unicast is directed at an application, not merely a machine).


(*) You would still have the problem of a meta relationship
between multiple packets, and you would still have the problem
of correctly selecting who would get the packet; right now,
the behaviour is first listener, not LRU or MRU...

-- Terry




Re: [xine-user] xine on freebsd?

2001-09-12 Thread Terry Lambert

Jason Andresen wrote:
 Are you using XFree 4.x?  What video cards are in both boxes?
 Are they the same box just dual booting?  I've found that XFree
 3.x is a processor pig on my system, but XFree 4.x is nice and
 light, particularly with Xv.


I'll echo the 3.x vs. 4.x observation.


 Sometimes I wonder if Linux puts more of a buffer on DVDs than
 FreeBSD does, given the way that most of the linux DVD programs
 are written (read, decode, display, continue) they tend to IO
 starve themselves under FreeBSD.

Double buffering the I/O is a definite win.  This is really an
application space issue, since you want to buffer at least two
key frames and the associated deltas... Linux tends to do this
automatically.  I've also noticed that Linux tends to precache
the index data (just like the MACH paper that cached the entire
FAT for an MSDOSFS and turned off UFS cacheing as unfair, in
order to prove that MSDOSFS was faster than UFS), which may
be a good idea, or at least a useful mount option.

Also, look at the optimization options chosen by configure in
Linux vs. FreeBSD for compiling the player.

-- Terry




Re: Kernel module debug help

2001-09-13 Thread Terry Lambert

Ah.  Interesting bug; perhaps related to a similar experience
of my own... so let's stare at it!


Zhihui Zhang wrote:
 
 I am debugging a KLD and I have got the following panic inside an
 interrupt context:
 
 fault virtual address = 0x1080050
 ...
 interrupt mask = bio
 kernel trap: type 12, code = 0
 Stopped at vwakeup+0x14: decl 0x44(%eax)
 
 Where eax is 0x108000c and vwakeup() is called from biodone().
 
 Since this panic occurs in an interrupt environment, I have no idea how to
 trace it. Is there a way to find the bug by tracing or what is the prime
 suspect in this case.  Thanks!

The best advice would be to repeat this failure in the
context of linking the module in statically instead of
dynamically.

If it won't repeat for you then, the problem has to be in
the form of memory allocation you are using as part of the
module.

If you want to brute-force the issue, find out what is being
dereferenced at vwakeup+0x14 ...it looks to be:

vp->v_numoutput--;

though mine is at:

0x40189c9c <vwakeup+20>:  decl   0x44(%eax)

which implies you have bad/older/newer vwakeup code.  Maybe
you are just missing the if test that verifies it's non-NULL
vnode pointer being dereferenced???  That would match the number
of bytes your decl instruction is off from mine:

void
vwakeup(bp)
	register struct buf *bp;
{
	register struct vnode *vp;

	bp->b_flags &= ~B_WRITEINPROG;
	if ((vp = bp->b_vp)) {
		vp->v_numoutput--;
		if (vp->v_numoutput < 0)
			panic("vwakeup: neg numoutput");
		if ((vp->v_numoutput == 0) && (vp->v_flag & VBWAIT)) {
			vp->v_flag &= ~VBWAIT;
			wakeup((caddr_t) &vp->v_numoutput);
		}
	}
}


I'll also note that 0x44 is 68, which implies 17 long words
before v_numoutput is declared in struct vnode; this didn't
match my quick count.
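
The arithmetic tying the panic message together can be checked
mechanically.  The two helper names below are invented for this
sketch, and it assumes a 32-bit long word, as on i386.

```c
#include <stdint.h>

/* Address touched by an instruction of the form `decl disp(%reg)`. */
uint32_t
effective_address(uint32_t reg, uint32_t disp)
{
	return reg + disp;
}

/* Number of 32-bit long words preceding a field at byte offset disp. */
uint32_t
longwords_before(uint32_t disp)
{
	return disp / (uint32_t)sizeof(uint32_t);
}
```

With eax = 0x108000c and a displacement of 0x44, effective_address()
gives 0x1080050, the fault address reported in the panic, and
longwords_before(0x44) gives the 17 long words mentioned above.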


I rather expect that it's in a swappable memory region that's
currently not mapped, or NULL (we see it's not NULL), so this
implies that it's an uninitialized vnode from the zone -- a thing
you can't initialize at interrupt.

This can happen as the result of a kevent() completion being
noted (e.g. readable) at interrupt context, since you can get
swappable objects (it also looks like you may be on your way
out of splbio, which implies networking -- my guess is therefore
that you are working on network file system code, and have a
shadow vnode that you are using as a context for the calls
that should have been allocated out of an interrupt zone instead
of out of the main memory allocator, which is not interrupt safe
for new allocations... 8-)).

For example, I use LRP, which drastically increases my connections
per second out of the TCP stack and eliminates receiver livelock
and a number of other problems for heavily loaded servers, but it
means that sockets need to be capable of accept'ing to completion
(creating a new socket) at interrupt context.

But when this happens, I don't have a proc structure handy to
deal with the issue (since I'm at interrupt context).  The
sneaky way around this is to use the proc from the already
existing socket on which the listen for which the accept is
being completed was initially posted -- which gets me the proc
struct, which gets me the ucred, so I have the proc pointer
and the ucred pointer necessary to run the connection to
completion.

I rather expect that if you are depending on the existence of
something similar at interrupt context, that you will have to
either queue it and run to completion at a software interrupt
level (e.g. NETISR -- not recommended, even for networking!),
or just lose the wakeup (which is what the vwakeup code I
have does, with its if test).

Still, your best bet is to compile the thing in static, repeat
the problem, and then look at where things went wrong in the
kernel debugger.

-- Terry




Re: All ok?

2001-09-14 Thread Terry Lambert

Josef Karthauser wrote:
   Hi!
 
  I just wonder if all freebsd developers are ok, due the wtc attack?
 
 We believe so.

Has anyone talked to Loqui Chen since the event?

Loqui was a financial person in New York at one time,
and made significant contributions in the VM system, the
Soft Updates code, and SMP code.  People would post a
bug no one could find, and the first you'd hear from him
(and usually the last) was a bug fix for some insanely
deep problem in code that was opaque to most people on
these lists.

-- Terry




Re: Driver structures alignment

2001-09-14 Thread Terry Lambert

Mike Smith wrote:
 Having said that, I recommend using __attribute__ ((packed))
 to explicitly request that a structure be packed.

Is there a problem with #pragma pack(1)?  I see it in a
lot of header files... do they need changing?
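
For comparison, here are the two spellings side by side, as a sketch
only.  The struct names are invented, and the example assumes a
compiler such as GCC or Clang that accepts both forms.  It uses the
push/pop variant of the pragma, since a bare #pragma pack(1) stays
in effect for every declaration that follows it in the translation
unit.  It does not settle whether existing headers need changing.

```c
#include <stdint.h>

/* Natural layout: the compiler may pad after `tag` to align `value`. */
struct wire_natural {
	uint8_t  tag;
	uint32_t value;
};

/* Attribute form, as recommended in the quoted text. */
struct wire_attr {
	uint8_t  tag;
	uint32_t value;
} __attribute__((packed));

/* Pragma form; push/pop limits the effect to this one declaration. */
#pragma pack(push, 1)
struct wire_pragma {
	uint8_t  tag;
	uint32_t value;
};
#pragma pack(pop)

/* Returns 1 if both packed forms produce the same 5-byte layout. */
int
packed_forms_agree(void)
{
	return sizeof(struct wire_attr) == 5 &&
	    sizeof(struct wire_pragma) == 5;
}
```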

-- Terry




Re: storing routine code in kernel memory using kvm interface

2001-09-15 Thread Terry Lambert

Sansonetti Laurent wrote:
 
 Hi,
 
 Is there a way to store a function in kernel memory using KVM interface ?
 
 I have written a tty spy'er, which simply hijack discipline line entries for
 a tty, and as you know probably, those routines must be situated in kernel
 land.
 
 I know that I should use KLD for that, but i'm still curious..

No.  You can not allocate memory safely to prevent the kernel
reusing it and stomping your code, and you can not guarantee
your hook installation will be done atomically without getting
context switched or interrupted via a hardware interrupt, thus
panic'ing the kernel.  Not to mention that you would have to
know a huge amount about the VM system to establish mappings,
and those mappings wouldn't be atomic, either, and without them,
your kernel would panic with a page not present.

Use a KLD instead, unless this is a cracking tool, in which case
go ahead and use /dev/kmem, if it's writeable, since repeated
crashes with tracebacks pointing to a program using your uid
and having /dev/kmem open will get your admin to you-proof his
box.  8-).

-- Terry




Re: Could not bind

2001-09-16 Thread Terry Lambert

Stephen Montgomery-Smith wrote:
 
 I have written a server program that listens on port 3000.  The program
 works very well except for one feature.  I am asking if that is normal,
 or whether I forgot something.
 
 If I run the program it does fine.  If I then kill the program (after it
 has accepted connections), and then run the program again, the bind
 function fails to work, and I get a message like Could not bind (see
 program below).  If I wait a while, like a minute or two, then the
 program will work again. Is this normal behavior, or did I miss
 something?

This is normal.

When a server closes the connection, which will occur in the
resource-track-cleanup case of you killing the server, the
connections effectively undergo a host close.  If the clients
are still around and responsive, these connections will go away
quickly.  If not, then these connections will hang around a
long time.  In addition, in the case of client initiated closes
prior to your termination of the program, the sockets will be in
TIME_WAIT state for 2MSL -- 60 seconds, by default.

So in normal operation, you should expect that you will not be
able to restart the server for at least 60 seconds, and perhaps
more, unless you have unset the keepalive socket option on the
sockets to prevent the FIN_WAIT_2 state.

A common overreaction to improper state tracking by the programmer,
or improper clean shutdown of a server is to set SO_REUSEADDR
on the listen socket of the server.  This lets you restart the
server.  But it also lets you start multiple instances of the
server, so if you are doing things like cookie state tracking which
are server instance specific (e.g. for an HTTP server), then you
have shot yourself in the foot, unless this state is shared between
all server instances, and your servers are anonymous work-to-do
engines, rather than being specific-purpose (this is because you
can not control the connections to make them go to one server vs.
another, if both are listening on the same port).

Ideally, you would correct the shutdown so that it was clean,
and correct the socket options, if what you are intending is
to abort the server without sending complete data to the client
(e.g. setting SO_LINGER with a zero linger time will cause the
sending of an RST on close, avoiding the TIME_WAIT, but potentially
leaving the client hanging until the longer -- 2 hour, by default --
clock on the client sends a keepalive, and the RST is resent; this
is because RST's are not resent, as they do not get acknowledgement).
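
On BSD-derived stacks, the RST-on-close behaviour is normally
requested by enabling SO_LINGER with a linger time of zero.  A
minimal sketch follows; the function name set_abortive_close() is
invented here, while struct linger and setsockopt(2) are the
standard interfaces.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Ask the stack to abort the connection (send an RST) on close(2)
 * instead of performing the normal FIN handshake.  Returns 0 on
 * success, or -1 with errno set by setsockopt(2).
 */
int
set_abortive_close(int fd)
{
	struct linger lg;

	lg.l_onoff = 1;		/* linger option enabled ...      */
	lg.l_linger = 0;	/* ... with a zero second timeout */
	return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```

This would be applied to each accepted socket before it is closed.
Any data still queued for the client is thrown away, which is the
trade-off described above.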

As a workaround, you can set SO_REUSEADDR on the socket.  Above,
I labelled this as an overreaction... it is.  For this to work,
you will probably need to make sure your server creates a pid
file in /var/run/servername.pid, and then, before you reopen
the socket, verify via kill(2), using a signal argument of 0,
that the process is in fact dead, before grabbing its port out
from under it for half the inbound connections (see the 2 kill
man page for details on the 0 signal; a 0 return or a -1 return
with errno == EPERM mean the process you are trying to replace
is already running).
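
The liveness check and the socket option described above might look
roughly like this.  The function names pid_is_alive() and
allow_address_reuse() are invented for this sketch; reading the pid
out of the pid file and the error handling around it are left out.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Returns 1 if a process with this pid exists, 0 if it does not.
 * kill(2) with signal 0 delivers nothing and performs only the
 * existence and permission checks; EPERM means the process exists
 * but is owned by someone else.
 */
int
pid_is_alive(pid_t pid)
{
	if (pid <= 0)
		return 0;
	if (kill(pid, 0) == 0)
		return 1;
	return errno == EPERM;
}

/* Set SO_REUSEADDR on a socket before bind(2); 0 on success. */
int
allow_address_reuse(int fd)
{
	int on = 1;

	return setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
}
```

A server would call pid_is_alive() on the pid read from its pid file
and refuse to start if it returns 1.  Otherwise it would create the
listen socket, call allow_address_reuse(), and only then bind.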


 I got the programming style from Richard Stevens' book on network
 programming.  The structure of the program is something like this:

[ ... example elided ... ]

That's all fine; the problem is just an incomplete understanding
of the TCP protocol; hopefully the above will fill you in; in the
mean time, you may want to get the internals volumes from the
Stevens book series, and read them, as well, since it's often
useful to understand why you are seeing what you are seeing; the
user space volumes are only half the story.

-- Terry




Re: Cron pickle

2001-09-16 Thread Terry Lambert

Tim Allshorn wrote:
 
 Hello.
 I need to be able to run a particular program at the last
 minute of each month and yes I know it would be much easier to
 run it at the first minute of each month, but my hands are tied
 and my brain is too puny to work it out.

1)  Run it the first minute of each month.

2)  Set your system clock back one minute.

-- Terry




Re: 802.11 with best Apple compatibility?

2001-09-17 Thread Terry Lambert

Rasputin wrote:
 Thanks a lot for the confirmation - for the record, I'm trying
 to replace a basestation, not communicate with one.

The timing for access points requires different firmware; you
can't do it in software alone.

Talk to Julian Elischer; he presented on a company that sells
a FreeBSD driver with firmware (binary only for the whole
thing) that can make a FreeBSD box into an access point,
using specific hardware.

-- Terry




Re: any reason to use m_devget in the dc driver ?

2001-09-21 Thread Terry Lambert

Andrew Gallatin wrote:
 I imagine that this was done to follow alignment constraints on
 non-i386 platforms where having the ip header misaligned is fatal.
 (the tulip is not capable of byte granularity DMA, so you can't
 intentionally misalign the ethernet header & end up with an aligned IP
 header)

This is the reason: the ethernet header is 14 bytes.


 I imagine the i386 should be made an exception. See rev 1.17 of
 sys/dev/nge/if_nge.c

I disagree with this code; the elements in the header
are referenced multiple times.  If you are doing the
checksum check, you might as well be relocating the data,
as well.  The change I would make would be to integrate
the checksum calculation with the m_devget(), to ensure
a single pass, in the case that m_devget() must be used
to get aligned packet payload, and the checksum has not
been offloaded to hardware.
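
To make the single pass idea concrete, here is a user space sketch
that copies a buffer and computes the 16-bit one's-complement
Internet checksum in the same loop.  It is not m_devget() and knows
nothing about mbufs; copy_and_cksum() and sample_ip_header are names
invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Sample IPv4 header with a valid checksum (0xb861) filled in. */
const unsigned char sample_ip_header[20] = {
	0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00,
	0x40, 0x11, 0xb8, 0x61, 0xc0, 0xa8, 0x00, 0x01,
	0xc0, 0xa8, 0x00, 0xc7
};

/*
 * Copy `len` bytes from src to dst and return the Internet checksum
 * of the data, reading each source byte only once.  A result of 0
 * over a header that includes its checksum field means the header
 * verified.  The 32-bit accumulator is enough for packet sized
 * buffers, not for arbitrarily large ones.
 */
uint16_t
copy_and_cksum(unsigned char *dst, const unsigned char *src, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		sum += ((uint32_t)src[i] << 8) | src[i + 1];
	}
	if (i < len) {		/* odd trailing byte, zero padded */
		dst[i] = src[i];
		sum += (uint32_t)src[i] << 8;
	}
	while (sum >> 16)	/* fold carries into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```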

When Bill finishes the Tigon III driver, he will find out
that it does not have the firmware problem the Tigon II
has, and that he can actually leave the checksum offload
active, and still be able to use VLANs (something that you
can not do with the Tigon II without serious changes to
the firmware).

IMO, in the vast majority of cases, it makes sense to do
the m_devget(), even though it looks like you can do the
unaligned access in 2 bus cycles instead of 1, and come
out ahead, for the IP and TCP header elements.  This could
be fixed by reducing the references to the elements so
that they are extracted only once, at which point, the
cost breaks even at a 32 byte or larger payload size, and
it is better to do the unaligned references than the copy
(assuming checksum offload, rather than opportunistic copy
at checksum calculation time).  Reordering the code to do
it this way is an ...interesting... exercise, since you
have to make some assumptions that may not be valid (e.g.
that it is an IP payload) before you get to the ipinput
code.

In any case, I think it would be useful to turn on the CR
bit that causes unaligned access faults on the 486 and
above Intel processors, as well (this discussion has taken
place before).

-- Terry




Re: any reason to use m_devget in the dc driver ?

2001-09-21 Thread Terry Lambert

Andrew Gallatin wrote:
   I disagree with this code; the elements in the header
   are referenced multiple times.  If you are doing the
   checksum check, you might as well be relocating the data,
   as well.  The change I would make would be to integrate
   the checksum calculation with the m_devget(), to ensure
   a single pass, in the case that m_devget() must be used
   to get aligned packet payload, and the checksum has not
   been offloaded to hardware.
 
 Interesting idea... However, what if you're a bridge or a router?
 You've just done a whole lot of work for nothing.  I imagine its just
 this case that Luigi cares about.
 
 If you want to integrate a checksum & a copy, it should really be done
 at the copyout() stage.

You're missing the point.  You do the m_devget() only when
you decide to do a checksum, which means you've decided to handle
the packet yourself.  The alignment is done via a copy of the
header field; specifically, a byte copy of the protocol type,
when you decide how to handle it.

In the case of the bridge, it's very easy: you don't care about
the contents of the packet, unless it's destined for you.  For
a router, you are operating above layer 2, so you _do_ care,
and must therefore do the checksum to be correct (since you
should not reference the field contents without knowledge that
the checksum is correct).  The Cisco approach of ignoring the
checksums is all well and good, if the work is being done in
hardware, but for most software, it very definitely cares (it's
more overhead to relay bad packets onto your network, even if
you assume it's OK for a router to do that).

-- Terry




Re: any reason to use m_devget in the dc driver ?

2001-09-22 Thread Terry Lambert

Luigi Rizzo wrote:
 
 I probably missed some emails ?
 In any case i was only concerned about the additional copy
 done by m_devget when the controller can already DMA into
 an mbuf, and there are no alignment constraints.

I guess we are talking about a protocol other than IP?  The
ethernet header is 14 bytes in length, which means that the
elements in the IP header are not longword aligned, unless
the card can DMA onto a 2 byte boundary, and does so, to
ensure that the IP header starts on a 16 byte boundary, and
therefore the contents are correctly aligned for direct
reference in a single bus cycle (or at all, on Alpha, where
unaligned access is not permitted).
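
The alignment arithmetic is easy to check.  In this sketch the macro
and function names are chosen for illustration, and the receive
buffer itself is assumed to start on a longword boundary.

```c
#include <stddef.h>

#define ETHER_HDR_LEN	14	/* dst MAC 6 + src MAC 6 + type 2 */

/*
 * Given the byte offset within a longword aligned buffer at which
 * the card deposits the start of the frame, return 1 if the IP
 * header following the ethernet header lands on a 4 byte boundary.
 */
int
ip_header_is_aligned(size_t frame_offset)
{
	return (frame_offset + ETHER_HDR_LEN) % 4 == 0;
}
```

With frame_offset 0 the IP header begins at byte 14 and is
misaligned; with a 2 byte offset it begins at byte 16 and is
aligned.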

-- Terry




Re: cvs commit: src/lib/libatm atm_addr.c cache_key.c ioctl_subr.c ip_addr.c ip_checksum.c timer.c

2001-09-16 Thread Terry Lambert

David O'Brien wrote:
 
 On Sun, Sep 16, 2001 at 06:35:27AM +1000, Bruce Evans wrote:
  Especially the empty line after the copyright message:
 
 Agreed.
 
   __FBSDID("$FreeBSD: src/lib/libatm/atm_addr.c,v 1.6 2001/09/15 19:36:55 dillon Exp $");
 
 What about changing this to __FBSD(), which is what I was using in a
 prototype to reduce the number of characters in the macro name (and thus
 reduce the wrap around).

That's a silly reason to change a macro.

If you need two characters back, you should change Matt Dillon's
login from dillon to matt.

(Now see how stupid the original suggestion sounded?).

-- Terry




Re: Disk based file system cache

2001-09-25 Thread Terry Lambert

Attila Nagy wrote:
 
 Hello,
 
 I'm just curious: is it possible to set up an NFS server and a client
 where the client has very big (28 GB maximum for FreeBSD?) swap area on
 multiple disks and caches the NFS exported data on it?
 This could save a lot of bandwidth on the NFS server and also reduce load
 on that.

There is a configuration called dataless, in which you
have local swap for an NFS booted system; this has been
supported by SunOS since 1991/1992.  It can also be used
with FreeBSD, with some rc file tweaking (FreeBSD has
seemingly resisted rc file changes to make this kind of
division easier, as well as diskless).  You are also
allowed local data storage, with the idea that the OS
and utilities, etc. all come off the remote server.

The major value would be to mark the VFS type as precache
executables to swap... the main failure mode for diskless
and dataless SunOS machines has historically been the fact
that, when the machine went to swap in a page of an executable,
the server was down, and therefore, all the engineers sit and
twiddle their thumbs while their machine is locked up sitting
in the page-in path.

To combat this, you could have an attribute flag on the FS
that indicated it was remote, and thus triggered local swap
caching of the executable file image in its entirety, so that
demand paging over the network was held to a minimum.  This
would permit people to continue to do work during a server
outage, since the pages will be there.

This is an idea I've suggested before.

If, on the other hand, you are asking for caching of data
file contents for writable files (unlike executables, which
are read-only except for the server, since when you run a
program, you do not tend to write to its image), then the
answer is no, not unless you implement NFSv4.

The problem is that, prior to NFSv4, there was not a working
distributed cache coherency protocol, so locally cached data
can become stale; writebacks to the server by one client can
therefore overwrite adjacent but unrelated data written back
by another client, if such writes are not restricted to page
boundaries (or worse).
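
The failure mode is easy to reproduce in miniature.  The following
is a toy model, not NFS code: a 16 byte "page", two clients that
each cache it, and whole page writebacks with no coherency protocol
between them.  All names are invented for the illustration.

```c
#include <string.h>

#define PAGE_BYTES 16	/* tiny "page" so the effect is easy to see */

struct cache_client {
	unsigned char cache[PAGE_BYTES];	/* locally cached copy */
};

/* Client reads the whole page from the server into its cache. */
static void
fetch_page(struct cache_client *c, const unsigned char *server)
{
	memcpy(c->cache, server, PAGE_BYTES);
}

/* Client writes its whole cached page back, stale bytes included. */
static void
write_back(const struct cache_client *c, unsigned char *server)
{
	memcpy(server, c->cache, PAGE_BYTES);
}

/*
 * Two clients cache the same page, each changes a different byte,
 * and both write back.  Returns 1 if client A's change was lost
 * while client B's survived, 0 otherwise.
 */
int
lost_update_demo(void)
{
	unsigned char server[PAGE_BYTES];
	struct cache_client a, b;

	memset(server, '.', PAGE_BYTES);
	fetch_page(&a, server);
	fetch_page(&b, server);
	a.cache[0] = 'A';		/* unrelated bytes, same page */
	b.cache[8] = 'B';
	write_back(&a, server);
	write_back(&b, server);		/* B's stale byte 0 clobbers A */
	return server[0] != 'A' && server[8] == 'B';
}
```

In this model lost_update_demo() returns 1: the second writeback
carries a stale copy of byte 0, so the first client's write is lost
even though the two clients never touched the same byte.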

So the answer is that any system that does this, risks the
corruption of its data in the common case (even the simplest
case, a mail server whose mailbox files are accessed by a
single client machine at a time, will get corruption).

-- Terry




Re: got bad cookie vp 0xe2e5ef80 bp 0xcf317328

2001-09-26 Thread Terry Lambert

Brian Reichert wrote:
 /*
  * Yuck! The directory has been modified on the
  * server. The only way to get the block is by
  * reading from the beginning to get all the
  * offset cookies.
  *
  * Leave the last bp intact unless there is an error.
  * Loop back up to the while if the error is another
  * NFSERR_BAD_COOKIE (double yuch!).
  */

This is technically incorrect.  Unless the directory has been
truncated back, the way to do this is to do a mod on the block
size for the offset given, and assume a compression of the
directory block out from under you.  So you do the mod, and
then that gives you the directory block (compression does not
occur across blocks), and then chain forward in the block until
you have an index that exceeds the cookie index, and then go
back one.  The best way to do that is to remember the index one
previous.
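
As a sketch of the procedure just described, under a deliberately
simplified model: a directory block is represented only by the
record lengths of its entries, laid end to end from the start of the
block.  The function name and this representation are assumptions
made for the example, not the FreeBSD NFS code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Given a possibly stale cookie, return the absolute offset of the
 * last entry in the cookie's block that starts at or before it.
 * reclens[] holds the record lengths of the nents entries now in
 * that block.
 */
size_t
resync_cookie(const uint16_t *reclens, size_t nents, size_t blksize,
    size_t cookie)
{
	size_t target, base, off = 0, prev = 0, i;

	if (blksize == 0)
		return cookie;
	target = cookie % blksize;	/* offset within the block */
	base = cookie - target;		/* start of that block */

	/* Chain forward until an entry starts past the cookie. */
	for (i = 0; i < nents && off <= target; i++) {
		prev = off;		/* remember the one previous */
		off += reclens[i];
	}
	return base + prev;
}
```

If the entry the cookie pointed at has been compressed away, the
caller resumes from the entry just before it, which is where the
possible duplicate entry mentioned below comes from.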

In the case of a truncation, you're done.

The ultimate bad result from this will be a duplicate entry
from the iteration of the directory.  Dealing with a duplicate
entry is the job of the NFS client.

Note that a directory iteration is just a snapshot, much like
ps, and is not guaranteed to remain accurate, just as taking
a polaroid picture of a crowd will not necessarily remain an
accurate indicator of who is there, since it could change in
the time between when the picture is taken and it develops.

The best way would be to avoid cookies entirely, but FreeBSD is
not structured for this (cookies were introduced in the 4.4-Lite
release by NetBSD, and adopted by everyone else, including the
4.4-Lite2 release; NetBSD and OpenBSD have a different cookie
parameter order than FreeBSD, as well).

-- Terry




Re: TCPIP cksum offload on FreeBSD 4.2

2001-09-29 Thread Terry Lambert

Kenneth D. Merry wrote:
  Unfortunately, it can not correctly interoperate with a
  number of cards in jumbogram mode, so unless you know the
  card on the other end and manually configure it (it can't
  negotiate properly), you can't really use jumbograms.  Or
  you could rewrite the firmware (of course).
 
 Huh?  Care to explain?  Why wouldn't it interoperate with jumbo frames?
 Why are jumbo frames something that need be negotiated by the card?
 They're not negotiated by the card, but rather by the TCP stack.

Because at the time the Tigon II was released, the jumbogram
wire format had not solidified.  Therefore cards built during
that time used different wire data for the jumbogram framing.


  I have never seen the board with the stock FreeBSD driver
  get better than about half wire rate with normal size
  packets and window depth.
 
 Yep, it doesn't handle large numbers of small packets very well.  You can
 easily get wire rate with jumbo frames, though.

Again, that's not useful, since it requires hand negotiation;
you can't automatically set the MTU up as a result of doing a
driver level negotiation with the other end of the link (unless
the other end is also a Tigon II).


  On the other hand, the Tigon III
  is capable of 960 megabits -- about the wire rate limit --
  with normal size packets, if you implement software interrupt
  coalescing (which doesn't help, unless you crank the load up
  to close to wire speed and/or do more of the stack processing
  at interrupt time).
 
  The Tigon II also has the unfortunate need to download its
  firmware, and a FreeBSD bug -- the lack of a distinct state
  for firmware download -- means that the firmware gets sent
  to the card each time the thing is config'ed up, or an IP
  alias is added to the thing.  Plus the damn things are more
  expensive than the Tigon III based cards.
 
 That's partially because Bill wanted to have a process context
 when he was downloading the firmware, and didn't want to do a
 kernel thread to do it.

Bill and I discussed this at length, in person.  The problem is
that FreeBSD resets the ethernet cards, for fear that it is the
user's intent to recover a hung card this way.

The firmware download issue is because there is not a separate
entry point for firmware download, combined with the initial
gratuitous ARP request being placed on the transmit queue for
the card before it is really up, and the only other alternative
place to hook the firmware download meant that you would have
to hang the ifconfig command -- and therefore the /etc/rc, and
therefore the boot process -- waiting for the download to complete.

The upshot of this is that adding a virtual IP address to the
interface is assumed to be a very infrequent event.

This assumption is incorrect for a large number of IP takeover
protocols (e.g. VRRP), whose timing granularity is small enough
that the firmware download exceeds the interval by a factor of
10 or more.  By the time the alias address assignment (and the
concomitant firmware download) are completed, the takeover
window has been greatly exceeded.

For all three of the takeover protocols for which there is
published documentation available, this results in the address
being configured on, the firmware downloaded, the takeover
window being missed, losing the contention race for the takeover,
and the address being configured off -- and the firmware being
downloaded yet again.

Another amusing consequence is that, if there is traffic at
the time of the event, the driver goes into a perpetual
"watchdog timeout: resetting" loop, which can only be resolved
by a system reboot.


 So it gets done when the card is ifconfiged up.  I wrote some code for my
 former employer to wait up to 10 seconds for link to show up when the board
 is configured.  The alternative is a 'sleep 5' (for fiber boards) or 'sleep
 8' (for copper boards) after it is ifconfiged up.

The kernel code is trivial, but it's not in there by default,
and wouldn't need to be there, if there were a firmware download
phase for the card which could be triggered to occur asynchronously
(e.g. so that the driver did not hang user space in the attach).
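
As a rough sketch of what a distinct firmware state could look
like: the softc below remembers whether the image is already on
the card, so bringing the interface up or adding an alias only
pays for the download once.  All of the names here (nic_softc,
nic_init, nic_add_alias, nic_hard_reset) are hypothetical and are
not taken from the ti driver.

```c
enum fw_state { FW_NOT_LOADED, FW_LOADED };

struct nic_softc {
    enum fw_state fw;         /* distinct firmware download state */
    int           up;         /* interface is marked up */
    unsigned      downloads;  /* counts the slow firmware loads */
};

/* Stand-in for the slow part: pushing the image to the card. */
static void
nic_download_firmware(struct nic_softc *sc)
{
    sc->downloads++;
    sc->fw = FW_LOADED;
}

/* Called for "ifconfig up": only download if the card needs it. */
void
nic_init(struct nic_softc *sc)
{
    if (sc->fw != FW_LOADED)
        nic_download_firmware(sc);
    sc->up = 1;
}

/* Adding an alias re-runs init, but no longer reloads firmware. */
void
nic_add_alias(struct nic_softc *sc)
{
    nic_init(sc);
}

/* An explicit recovery path is the only thing that forces a reload. */
void
nic_hard_reset(struct nic_softc *sc)
{
    sc->fw = FW_NOT_LOADED;
    sc->up = 0;
}
```

With this split, an init followed by two alias additions performs
one download; only nic_hard_reset() followed by another init
performs a second one.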

The sleep workaround doesn't really work, unless you're willing
to hack up your rc files to deal correctly with a mix of cards;
even so, then it adds an additional ~10 seconds per ti interface
to the boot process, and to any reconfiguration process that can
occur later (and later reconfiguration processes risk going into
the "watchdog timeout: resetting" infinite loop, if there are any
processes with network links active at the time -- something which
is almost inevitable).


 Having downloadable firmware is actually a huge advantage.  You can do
 things with the Tigon II that just aren't possible with any other chip,
 either because they don't have downloadable firmware, or because the vendor
 won't release firmware source.

I agree; I'm well aware of a number of people who have done
things to the firmware.  The 

Re: TCPIP cksum offload on FreeBSD 4.2

2001-09-29 Thread Terry Lambert

Bill Paul wrote:
 It is possible for a driver
 to load a custom image into the NIC's memory which will override the
 auto-loaded one, and it's also possible to load a new image into
 the EEPROM, however this requires an additional manual on top of
 the BCM5700 driver developer's guide as well as the firmware development
 kit, which you can only get from Broadcom/3Com/whatever under NDA.

Yes.  This is annoying as hell.  One wonders what they are
thinking.

 These custom images are called value-add firmware which are used to
 provide features like TCP segmentation, which you can't do with the
 default firmware image. Note that the BCM5700/Tigon III only has
 a limited amount of on-board RAM (256KB, I think). You're supposed
 to be able to attach up to 16MB of static SRAM to the BCM5700. The
 BCM5701 doesn't support external SSRAM at all, which I find a little
 confusing.

The hardware based two card failover is based partly on
card firmware changes, as well.  They support this for
Linux because they themselves wrote the code.  They won't
let third parties license this for porting, unfortunately,
even for binary only distributions.  8-(.

-- Terry




Re: sio modification

2001-09-30 Thread Terry Lambert

Bart Kus wrote:
 If I do have to write something, for my work to be included anywhere, I
 should be writing for the -CURRENT kernel, right?  I presently run -STABLE,
 so that would obviously be the more comfortable kernel to write for...but it
 is *STABLE* after all.

Most of us doing commercial development write for -stable, and
give it out to be ported to -current by someone else, if we don't
have the time to do both.

-- Terry




Re: precise timing

2001-09-30 Thread Terry Lambert

Bakul Shah wrote:
 
  Hrm, I was planning on investigating the RT capabilities of fbsd after
  I got
  myself a decent timer mechanism.  I was hoping they would be enough to get
  close to RT.  I have an SMP system I can use, so 1 CPU can be dedicated to
  the task.
 
 I doubt even an SMP system would help.

Plus this is ASMP -- ASymmetric MultiProcessing -- when you
dedicate a CPU to a task.

FreeBSD doesn't support this.

Linux supports this, with the patches from Ingo.  I'm guessing
they will become part of the standard Linux distribution.  He
develops the Tux in-kernel web server, and he has the entire
code path for the thing, including the TCP stack, so it fits in
a single CPU instruction cache.

-- Terry




Re: VM: dynamic swap remapping (patch)

2001-09-30 Thread Terry Lambert

Alfred Perlstein wrote:
[ ... SIGDANGER ... ]
 Well Joe seems to have provided a pretty interesting document on
 how it works in AIX, but I was wondering if they do anything wrt
 low/high watermarks like my idea.
 
 Basically you'd like to inform processes that the danger has been
 alleviated so that they can cautiously start accepting more work
 rather than freaking out and shutting out clients forever...

The process is supposed to return unused memory to the system
when it gets the signal, if it can.

It's not supposed to shed all load until it gets the all clear
signal.

I don't know if there are any good books on Windows Internals,
but the Windows VM system does the same thing: it notifies all
kernel subsystems that they need to free up memory, if they can.
The VFAT32 IFS will basically return exactly one page out of
many thousands it is using for cache, when it gets the request
(it is implemented as a callback, which you must provide when
you register for VM services).


 This might lead to a situation where SIGDANGER starts getting
 sent informing that things are looking bleak, then processes
 start freeing resources, they get the second SIGDANGER to let
 them know that things are looking ok so they ramp up again and
 the cycle repeats, I guess that's not optimal, but I'd like FreeBSD
 to let processes know that things are looking better so they can
 go from scrooge mode to thrifty mode.

The idea is just to free resources, if you can, and to mark the
processes which are precious by whether or not they have a
signal handler.  A close reading of the other document posted
(it seemed to be the admin manual from the URL) will indicate
that the followon SIGKILL is not sent to the processes that have
a SIGDANGER handler registered.  Note that this does not mean
that your process won't be killed off as a result of a page not
present fault, so abusing the interface is not really tolerated
very well by the system.
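
A minimal sketch of the process side of this, assuming the AIX
convention described above.  SIGDANGER does not exist on most
systems, so the sketch substitutes SIGUSR1 when it is missing;
that substitution, and the cache_fill() and
cache_release_if_asked() helpers, are assumptions made for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>

#ifndef SIGDANGER
#define SIGDANGER SIGUSR1   /* assumption: stand-in where AIX's signal is absent */
#endif

static volatile sig_atomic_t danger_seen;
static void  *cache;        /* memory we can give back on request */
static size_t cache_size;

/* Handler does the minimum: note the request and return. */
static void
on_danger(int signo)
{
    (void)signo;
    danger_seen = 1;
}

/* Registering a handler is what marks the process as precious. */
int
install_danger_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = on_danger;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGDANGER, &sa, NULL);
}

/* (Re)allocate the discardable cache; returns 0 on success. */
int
cache_fill(size_t n)
{
    free(cache);
    cache = malloc(n);
    cache_size = (cache != NULL) ? n : 0;
    return (cache != NULL) ? 0 : -1;
}

/* Called from the main loop: free the cache if we were asked to. */
size_t
cache_release_if_asked(void)
{
    size_t freed;

    if (!danger_seen)
        return 0;
    danger_seen = 0;
    freed = cache_size;
    free(cache);
    cache = NULL;
    cache_size = 0;
    return freed;
}
```

The process keeps serving after releasing the cache; it simply
rebuilds the cache lazily, which is the "free what you can, but
do not shed all load" behavior described above.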

I think signalling an all clear is really a bad idea; a soft
hysteresis loop is much less prone to pendulum swings than a
hard hysteresis loop (lesson #1 in the book Fuzzy Logic).

-- Terry




Re: sio modification

2001-10-01 Thread Terry Lambert

Poul-Henning Kamp wrote:
 Submissions should contain a -current version or they are likely
 to never make it into the tree...

I guess this is why the Rice University code that more than
triples the TCP connection rate never made it in the first
time they released it for 2.2 and then again when they
released the updated version with resource containers for
version 4.2 (though the license on the second version was
not happy).

I guess that's also why Luigi Rizzo's SACK and TSACK code
for 2.x was never integrated.

And the Pittsburgh Supercomputing Center code for the
syn cache, which is vastly superior to the code in NetBSD
and BSDI...

And the MIT code for the TIME_WAIT zombies...

It's a great pity that most MA and PhD theses and real world
products have hard deadlines that can't wait for -current to
become stable enough to use in a product sold to people, and
which the company shipping it must support when it has problems,
but it's understandable why companies and people tend to do
their development there instead of -current.

If I were doing a new product, I'd pick 4.4, I think, since
the KSE work and ACPI code has destabilized -current; it
doesn't help, either, that the 5.0 release date was pushed
back another year.

This is not trying to lay blame; it's just pointing out that
most funded work is going to take place in a -stable branch,
since FreeBSD is being used as a platform for other work, and
is not the ends in and of itself (the ends are graduation
with a degree, or making money on a FreeBSD based product).

Going to P4 would help that a little, but not as much as it
would if P4 were free for commercial use; and yes, I understand
their need to make money as well: I'm just pointing out that
it's largely a tools problem.

-- Terry




Re: Memory allocation question

2001-10-03 Thread Terry Lambert

Dwayne wrote:
  I'm creating an app where I want to use memory to store data so I
 can get at it quickly. The problem is, I can't afford the delays that
 would occur if the memory gets swapped out. Is there any way in FreeBSD
 to allocate memory so that the VM system won't swap it out?

Allocate it at boot time in machdep.c; that's one way, but
that's mostly for use in the kernel (though you could set
the PG_U bit on the pages, which would double map them in
both the kernel and all user space processes).

Another way is to put it in a System V shared memory segment,
and set the sysctl option that wires it down.

Have you read the madvise() man page?
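
For the System V route, a minimal sketch of creating and tearing
down the segment follows.  The helper names are hypothetical, and
the code does not wire anything by itself: that depends on the
sysctl mentioned above (believed to be kern.ipc.shm_use_phys on
FreeBSD 4.x; check your release before relying on it).

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stddef.h>

/*
 * Create a private System V segment and attach it.
 * Returns the mapped address, or NULL on failure.
 */
void *
shm_region_create(size_t size, int *idp)
{
    int   id;
    void *p;

    id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1)
        return NULL;

    p = shmat(id, NULL, 0);
    if (p == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);     /* do not leak the segment */
        return NULL;
    }
    *idp = id;
    return p;
}

/* Detach and remove the segment; returns 0 if both steps succeed. */
int
shm_region_destroy(void *p, int id)
{
    int rc = shmdt(p);

    if (shmctl(id, IPC_RMID, NULL) == -1)
        rc = -1;
    return rc;
}
```

The application would call shm_region_create() once at startup
and keep its data in the returned region.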

-- Terry




Re: pkg_create help needed.

2001-10-03 Thread Terry Lambert

Julian Elischer wrote:
 
 I need to take a directory of 'stuff'
 which includes a script install.sh
 and make it into a package..
 I have had some success but it's not quite right..
 
 What I'd like to make it do is:
   unpack the 'stuff' into a temporary directory somewhere.
   run the install script
   delete the install directory
 
 The trouble is that I can't work out how to get the files
 unpacked there and have the install script get them from there..
 
 I can get it to unpack them into the final locations, and I can get the
 install script to run and find them there, but I need the install script
 to modify stuff and I'd rather have it all done in the temp
 directory if possible, and then installed into the final
 location..
 also I can not make the @srcdir option work in the packing list..
 does it work?
 (-s seems to work)

Use a preinstall script for the modifications.  Yes, this
means you will need two scripts.

-- Terry




Re: Question about pthread

2001-10-05 Thread Terry Lambert

Oleg Golovanov wrote:
 
 Dear Sirs:
 
 I am using FreeBSD-2.2.8 and after calling pthread_create()
 my programs get sigfault (SIGSEGV) and exited with core dump.
 
 I should like to ask if somebody know the solve of this problem.
 My example of using pthread is included below.
 
 I ask to answer me directly on my e-mail.

[ ... ]

 pthread_create(NULL, NULL, coms, confd);

2.2.8 is a pre draft 4 standard pthreads implementation.

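#include <pthread.h>

void *
coms(void *confdp)
{
    /* The argument points at the descriptor; copy it out first */
    int confd = *(int *)confdp;
    ...
    return NULL;
}


int
main(void)
{
    pthread_t   id; /* REQUIRED */

    ...
    /* Note: coms, _NOT_ &coms... */
    pthread_create(&id, NULL, coms, (void *)&confd);
    ...
}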

-- Terry




Re: Read only file FSTAB after error config???

2001-10-21 Thread Terry Lambert

This belongs on -questions...

Soweb_Ahfei wrote:
 We have installed the Freebsd4.32 in our server.But we can not
 reboot the system after we made an error configuration  in the
 file FSTAB.Now,we can not delete or rename the error file Fstab
 and the system shown the file is read only.
 
 We would not re-install the system since there are some available
 data.Please give us an instruction how to revise it.

Boot the system single user (boot -s at the boot prompt,
after hitting spacebar during the countdown).

Remount the root partition as read/write (mount -u -o rw /,
after you get to a shell).

Modify the fstab to correct your error; you may need to fsck
the partition where /tmp is located, if it is not /, before
you can run an editor; you will probably need to set the
terminal type, as well, unless you want to use cat, or are
comfortable with ed (setenv TERM cons25).

-- Terry




Re: IPSEC sucking up memory

2001-10-23 Thread Terry Lambert

Shoichi Sakane wrote:
  While investigating a problem, I noticed that the IPSEC code
  is initializing the sp -- even when no one is using IPSEC.
 
  It turns out that this really, really bloats the per socket
  memory requirements, with the only real result being a lot
  of extra processing that could be replaced by a pointer is
  not NULL check.
 
  It seems to me that this could be handled in the TCP, UDP,
  and IP userreq code by only initializing the thing in the
  case that a policy has been set.  Is there some reason why
  this can't be done?
 
 IPsec specification requires to consult the SPD with all of packets
 in order to handling the packet.  it defines RFC2401.
 if a pointer to the entry of the SPD is NULL, it means the security
 policy is not defined.  so the kernel consults the system wide default.
 it never means nothing to do.

So you are saying that I could establish a global default, and
make the sp pointer NULL, and have that mean use the global
default, instead of copying identical policies all over the
place, right?

I think this would be the best approach, and it would get me
all of the redundant deep copy memory back in the default
case.
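
A minimal sketch of that approach, assuming a per-socket policy
pointer where NULL stands for "consult the system wide default".
The struct and function names are hypothetical stand-ins, not the
KAME data structures.

```c
#include <stddef.h>

struct policy {
    int level;                  /* hypothetical policy contents */
};

struct pcb {
    const struct policy *sp;    /* NULL: use the global default */
};

/* One shared default instead of a deep copy per socket. */
static struct policy default_policy = { 0 };

/* Every packet still consults a policy; NULL never means "none". */
const struct policy *
effective_policy(const struct pcb *pcb)
{
    return (pcb->sp != NULL) ? pcb->sp : &default_policy;
}

/* Only sockets that set a policy pay for storing one. */
void
set_policy(struct pcb *pcb, const struct policy *sp)
{
    pcb->sp = sp;
}
```

The per-packet lookup that RFC 2401 requires still happens; the
only change is that sockets without an explicit policy share one
object.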

-- Terry




Re: CFS

2001-10-24 Thread Terry Lambert

Jesús Arnáiz wrote:
 
 Hi!
 
 I want to install a cyphred partition on my system. I use FreeBSD, and I
 want to know what software is avaivle in order to do it.
 
 I heard about CFS and TCFS (but this is not still supported by FreeBSD), is
 there any better bet? If anyone know any good resource (sites, papers, ...)
 on these topics please tell me.

There are several in ports that rely on NFS loopback mounts.

If you are not interested in that, contact John Heidemann,
the original author of FreeBSD's stacking VFS architecture:
several of his students built a crypto stacking layer as one
of their class projects (I've had a copy I can't redistribute
since about 1995).

-- Terry




Re: mountd will not start at boot. Or function later...

2001-10-26 Thread Terry Lambert

Joesh Juphland wrote:
 
 You wouldn't happen to have a portmap_enable=NO line in your rc.conf,
 would you?
 
 No, I do not.  Further, I see 'portmap' in the process list, so it is indeed
 running.

ipfw add 1 allow all from any to any

-- Terry




Re: syslogd and kqueue

2001-10-27 Thread Terry Lambert

Kris Kennaway wrote:
 
 On Fri, Oct 26, 2001 at 11:39:57PM +0100, void wrote:
  If syslogd used the kqueue interface, I believe it could open a new log
  file as soon as it was created, rather than waiting to receive a signal.
  Would this be worth doing, or would it be too big a divergence from the
  traditional behavior?
 
 I assume you mean as soon as the configuration file is modified?
 That would be a big violation of POLA.

You need the mount point.

Several OSs handle this by being able to mount on any
mount point, whether it exists, or not.

You could do this pretty easily in FreeBSD by adding a
directory lookup cache entry for non-existent mount points,
which is never aged out, and then using that as the mount
point (in theory, never aging out mount points is a good
idea in any case, since it protects you from several classes
of NFS deadlocks, as well -- unless they go stale, in which
case, it's no worse than before).

Another alternative I have suggested in the past is to make
the devfs two deep:

o   Have a / that contains a single directory /dev

o   Have /dev be what normally gets mounted on /dev
    as the root of an FS

o   Union mount root over top of / -- in other
    words, mount devfs as /, first

I really like the second one, but I have other obscure uses
in mind, and since it would make chroot jails harder, if you
didn't also permit devfs to mount flat without the /dev/
insert, it's probably better to take the first approach.

-- Terry




Re: syslogd and kqueue

2001-10-27 Thread Terry Lambert

Mike Barcroft wrote:

  I'm suggesting that the kill could be left out if syslogd got the same
  smarts as tail -F.
 
 I recommend using newsyslog(8) for rotating log files.

I recommend _NOT_ using newsyslog for rotating files.

The newsyslog program bit us on the ass numerous times at
Whistle, where if it failed to be called, it would just
build up a big log file, fill up /var, and you'd end up
screwed even after it restarted, since it would leave /var
full.

The problem is that newsyslog doesn't rewrite history.

As an example, say you have a size limit on a log file of
10k, and a number of files to keep of 6, so you never
expect it to take up more than 60k.

Now newsyslog fails, and you end up with the top level
log file being 1M, with 5 10k log files after it:

1M, 10K, 10K, 10K, 10K, 10K

You start newsyslog up again (usually with a reboot, as
the failing program was cron or at), and it moves
the 1M file to the first log file, deletes the oldest,
and then creates a new log file.  You now have:

0K, 1M, 10K, 10K, 10K, 10K

when what you wanted was really:

0K, 10K, 10K, 10K, 10K, 10K

With the 5 10K files being the last 50K of the 1M file.
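
To make the arithmetic concrete, here is a sketch of how a
rotation tool could plan that trim: walk backward from the end
of the oversized log and carve it into at most keep - 1 archives
of at most limit bytes each.  plan_trim() and struct chunk are
hypothetical; this is not how newsyslog behaves today.

```c
struct chunk {
    long start;     /* byte offset into the oversized log */
    long len;       /* number of bytes to copy into the archive */
};

/*
 * Fill out[] with the byte ranges for the rotated archives, newest
 * first.  out[] must have room for keep - 1 entries.  Anything in
 * front of the last range is discarded.  Returns the number of
 * archives planned.
 */
int
plan_trim(long size, long limit, int keep, struct chunk *out)
{
    long end = size;
    int  n = 0;

    if (limit <= 0)
        return 0;

    while (n < keep - 1 && end > 0) {
        long start = (end > limit) ? end - limit : 0;

        out[n].start = start;
        out[n].len = end - start;
        end = start;
        n++;
    }
    return n;
}
```

For the 1M log with a 10K limit and 6 files, this plans five 10K
archives covering the last 50K of the file, leaving the sixth
slot for the new, empty log.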

Now you can only rotate it out with another 10K of data
written to an already full /var (other log files are now
free to consume the 10K you freed up), and then it will
take 5 log rollovers before your /var is down to its
proper disk utilization again, and your system is back
to normal... and these can never happen.

Because of this, /var is still full, so anything that
needs /tmp is still broken, so you end up getting a call
for support about whatever it was that wasn't working.

Very, very ugly.

Until newsyslog is fixed to not be able to stage a
denial of service attack against you, I really, really
recommend against its use.

-- Terry




Re: syslogd and kqueue

2001-10-28 Thread Terry Lambert

Garance A Drosihn wrote:
 Until newsyslog is fixed to not be able to stage a
 denial of service attack against you, I really, really
 recommend against its use.
 
 Seems like it would be more user-friendly (to freebsd users
 in general) to fix newsyslog, instead of just telling people
 that they should not use it...  If people just don't use
 newsyslog, how does that guarantee that whatever they do
 use will not have the same problem that you described?

They will have to make their own solutions correct.

Every engineer is responsible for the correctness of their
code.  All I'm saying here is that there is a known level
of incorrectness in newsyslog: use it at your peril.

If you want to fix it, I'm sure people would appreciate the
effort.

-- Terry




Re: disabling dynamic route addition

2001-10-28 Thread Terry Lambert

Mike Silbersack wrote:
  Also, if this happens again, what additional information could I grab so I
  or others could (hopefully) successfully find the bug?
 
 Many dynamic route related changes have been made since 4.2, your bug may
 already be fixed.  You should invest time in transitioning to 4.4.

There's an interesting bug that appears to still be present
in 4.4, where if you create an IPSEC VPN, a ping to the
other end of the tunnel gets there, comes all the way back,
but is dropped by the local machine, if the default route is
the machine hosting the tunnel.

If you remove the default route, and add a static route to
the other end of the tunnel, pointing through the gateway
host, there is no problem.

Note that leaving a static route while having a default route
still fails.

The tcpdump on the pinging host sees the packet back, but the
network stack of the host does not.

Can't tell you if this is a problem in the gateway host doing
a rewrite when it shouldn't, and the receiving host dropping
it, or the receiving host being too picky about the source of
the next hop for the echo reply...

If you want reproduction directions, I might be able to wrangle
them out of someone, but you will need at least 4 machines to
run them.

-- Terry



