Re: make COMPAT_LINUX match SYSV binaries

2020-10-21 Thread Eduardo Horvath
On Wed, 21 Oct 2020, co...@sdf.org wrote:

> In the event someone adds support for another OS with this problem (say,
> modern Solaris), I don't expect this compat to be enabled by default,
> for security reasons. So the problem will only occur if a user enables
> both forms of compat at the same time.

But Solaris *IS* SYSV.

Eduardo


Re: Straw proposal: MI kthread vector/fp unit API

2020-06-22 Thread Eduardo Horvath
On Mon, 22 Jun 2020, Taylor R Campbell wrote:

> > Date: Mon, 22 Jun 2020 18:45:47 + (UTC)
> > From: Eduardo Horvath 
> > 
> > I think this is sort of a half-measure since it restricts
> > coprocessor usage to a few threads.  If you want to, say, implement
> > the kernel memcpy using vector registers (the way sparc64 does)
> > this doesn't help and may end up getting in the way.
> 
> Why do you think this restricts it to a few threads or gets in the way
> of anything?
> 
> As I wrote in my original message:
> 
>That way, for example, you can use (say) an AES encryption routine
>aes_enc as a subroutine anywhere in the kernel, and an MD definition
>of aes_enc can internally use AES-NI with the appropriate MD
>fpu_kern_enter -- but it's a little cheaper to use aes_enc in an
>FPU-enabled kthread.  This gave a modest measurable boost to cgd(4)
>throughput in my preliminary experiments.
> 
> Note that the subroutine (here aes_enc, but it could in principle be
> memcpy too) works `anywhere in the kernel', not just restricted to a
> few threads.
> 
> The definition of aes_enc with AES-NI CPU instructions on x86 already
> works (https://mail-index.netbsd.org/tech-kern/2020/06/18/msg026505.html
> for details); just putting kthread_fpu_enter/exit around cgd_process
> in cgd.c improved throughput on a RAM-backed disk by about 20%
> (presumably mostly because it avoids zeroing the fpu registers on
> every aes_* call in that thread).

It sounded to me as if you set a flag in the kthread indicating that the 
thread is allowed to use FPU instructions.  Maybe I'm missing something 
but from the description I assumed you created a kthread, set the flag, 
and now you can start using the FPU.

I suppose I could be mistaken and the flag is being controlled by 
kthread_fpu_enter()/_exit(), but in that case you have issues if you ever 
need to nest coprocessor usage.  

> > I'd do something simpler such as adding a MI routine to allocate or 
> > activate a temporary or permanent register save area that can be used by 
> > kernel threads.  
> > 
> > Then, if you want, in the coprocessor trap handler, if you 
> > are in kernel state you can check whether a kernel save area has been 
> > allocated and panic if not.
> 
> This sounds like a plausible alternative to disabling kpreemption in
> some cases, but it is also orthogonal to my proposal -- in an
> FPU-enabled kthread there is simply no need to allocate an extra save
> area at all because it's already allocated in the lwp pcb, so if a
> subroutine does use the FPU then it's cheaper to call that subroutine
> in an FPU-enabled kthread than otherwise.
> 
> You say it would be simpler -- can you elaborate on how it would
> simplify the implementations that already work on x86 and aarch64 by
> just adding and testing a new flag in a couple places, and enabling or
> disabling the CPU's FPU-enable bit?
> 
> https://anonhg.netbsd.org/src-all/rev/e83ef87e4f53
> https://anonhg.netbsd.org/src-all/rev/7ec4225df101

Frankly, I have not looked at either the x86 or aarch64 implementations, 
and it's been a very long time since I last looked at the sparc64 
implementation.

The SPARC has always had lazy FPU save logic.  The fpstate structure is 
not part of the pcb and is allocated on first use.  

When I added the block mem*() routines I piggybacked on that 
implementation.  When a kthread is created the FPU starts out disabled and 
the pointer to the fpstate is NULL.  If userland decides to use the FPU, a 
kernel trap is generated, an fpstate is allocated and added to the 
kthread and the CPU structure, and the FPU is enabled.  On context 
switches the FPU is disabled.

In a very simplistic description of how I implemented the block copy 
operations, they:

1) Check whether the FPU is dirty; if it is, save the state to the fpstate 
in the CPU structure.

2) Allocate a new fpstate (usually on the stack) and store a pointer to it 
in the CPU structure.  Save the current kthread's fpstate pointer on the 
stack and replace it with a pointer to the new fpstate.

3) When the block operation is complete, clear the FPU dirty bits, disable 
the FPU, clear the pointer in the CPU structure, and restore the fpstate 
pointer in the kthread.
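
In rough C it looks something like the sketch below.  This is from memory; 
the helper functions and the md_fpstate/ci_fpstate field names are 
illustrative, not the actual sparc64 code.

/* Sketch of steps 1-3 above; every name here is illustrative. */
void
block_copy(void *dst, const void *src, size_t len)
{
        struct fpstate tmp;             /* temporary save area, on the stack */
        struct fpstate *ofp;
        struct cpu_info *ci = curcpu();

        /* 1) If the FPU is dirty, save its state into the CPU's fpstate. */
        if (fpu_is_dirty())                     /* hypothetical */
                fpu_save(ci->ci_fpstate);       /* hypothetical */

        /* 2) Point the CPU and the kthread at the temporary save area. */
        ofp = curlwp->l_md.md_fpstate;          /* illustrative field names */
        curlwp->l_md.md_fpstate = &tmp;
        ci->ci_fpstate = &tmp;
        fpu_enable();                           /* hypothetical */

        /* ... do the copy using FP/vector registers ... */

        /* 3) Clean up: dirty bits, FPU off, restore the old pointers. */
        fpu_clear_dirty();                      /* hypothetical */
        fpu_disable();                          /* hypothetical */
        ci->ci_fpstate = NULL;
        curlwp->l_md.md_fpstate = ofp;
}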

Remembering all this stuff is making my brain hurt.

Eduardo



Re: Straw proposal: MI kthread vector/fp unit API

2020-06-22 Thread Eduardo Horvath
On Sat, 20 Jun 2020, Taylor R Campbell wrote:

> Here's a straw proposal for an MI API to allow a kthread to use any
> vector or floating-point unit on the CPU -- call it the `FPU' for
> brevity.

Description elided.

> Thoughts?

I think this is sort of a half-measure since it restricts coprocessor 
usage to a few threads.  If you want to, say, implement the kernel memcpy 
using vector registers (the way sparc64 does) this doesn't help and may 
end up getting in the way.

I'd do something simpler such as adding a MI routine to allocate or 
activate a temporary or permanent register save area that can be used by 
kernel threads.  

Then, if you want, in the coprocessor trap handler, if you are in kernel 
state you can check whether a kernel save area has been allocated and 
panic if not.

Eduardo


Re: Am I using bus_dma right?

2020-04-24 Thread Eduardo Horvath


You missed the most important part of my response:

On Fri, 24 Apr 2020, Eduardo Horvath wrote:
> 
> > So I have to treat it like a DMA write even if there is never any
> > write-direction DMA actually going on?
> 
> Yes.
> 
> > Then the problem *probably* is not bus_dma botchery.

Eduardo


Re: Am I using bus_dma right?

2020-04-24 Thread Eduardo Horvath
On Thu, 23 Apr 2020, Mouse wrote:

> Okay, here's the first problem.  There is no clear "transaction
> completes".

Let's clarify that.

> The card has a DMA engine on it (a PLX9080, on the off chance you've
> run into it before) that can DMA into chained buffers.  I set it up
> with a ring of buffers - a chain of buffers with the last buffer
> pointing to the first, none of them with the "end of chain" bit set -
> and tell it to go.  I request an interrupt at completion of each
> buffer, so I have a buffer-granularity idea of where it's at, modulo
> interrupt servicing latency.
> 
> This means that there is no clear "this transfer has completed" moment.
> What I want to do is inspect the DMA buffer to see how far it's been
> overwritten, since there is a data value I know cannot be generated by
> the hardware that's feeding samples to the card (over half the data
> pins are hardwired to known logic levels).
> 
> I've been treating it as though my inspection of a given sample in the
> buffer counts as "transfer completed" for purposes of that sample.

Are you inspecting the buffer only after receipt of an interrupt or are 
you polling?  

> 
> > When you do a write operation you should:
> 
> > 1) Make sure the buffer contains all the data you want to transmit.
> 
> > 2) Do a BUS_DMASYNC_PREWRITE to make sure any data that may remain in
> > the CPU writeback cache is flushed to memory.
> 
> > 3) Tell the hardware to do the write operation.
> 
> > 4) When the write operation completes... well it shouldn't matter.
> 
> ...but, according to the 8.0 manpage, I should do a POSTWRITE anyway,
> and going under the hood (this is all on amd64), I find that PREREAD is
> a no-op and POSTWRITE might matter because it issues an mfence to avoid
> memory access reordering issues.

I doubt the mfence does much of anything in this circumstance, but 
POSTWRITE does tell the kernel it can free up any bounce buffers it 
may have allocated, but I digress.

> 
> > If you have a ring buffer you should try to map it CONSISTENT which
> > will disable all caching of that memory.
> 
> CONSISTENT?  I don't find that anywhere; do you mean COHERENT?

Yes COHERENT.  (That's what I get for relying on my memory.)

> 
> > However, some CPUs will not allow you to disable caching, so you
> > should put in the appropriate bus_dmamap_sync() operations so the
> > code will not break on those machines.
> 
> For my immediate needs, I don't care about anything other than amd64.
> But I'd prefer to understand the paradigm properly for the benefit of
> potential future work.

I believe if you use COHERENT on amd64 none of this matters since it turns 
off caching on those memory regions.  (But I don't have time to grep the 
sources to verify this.)


> > Then copy the data out of the ring buffer and do another
> > BUS_DMASYNC_PREREAD or BUS_DMASYNC_PREWRITE as appropriate.
> 
> Then I think I was already doing everything necessary.  And, indeed, I
> tried making the read routine do POSTREAD|POSTWRITE before and
> PREREAD|PREWRITE after its read-test-write of the samples, and it
> didn't help.

Ah now we're getting to something interesting.

What failure mode are you seeing?

> >> One of the things that confuses me is that I have no write-direction
> >> DMA going on at all; all the DMA is in the read direction.  But
> >> there is a driver write to the buffer that is, to put it loosely,
> >> half of a write DMA operation (the "host writes the buffer" half).
> > When the CPU updates the contents of the ring buffer it *is* a DMA
> > write,
> 
> Well, maybe from bus_dma's point of view, but I would not say there is
> write-direction DMA happening unless something DMAs data out of memory.
>
> > even if the device never tries to read the contents, since the update
> > must be flushed from the cache to DRAM or you may end up reading
> > stale data later.
> 
> So I have to treat it like a DMA write even if there is never any
> write-direction DMA actually going on?

Yes.

> Then the problem *probably* is not bus_dma botchery.

Eduardo


Re: Am I using bus_dma right?

2020-04-23 Thread Eduardo Horvath



Let me try to simplify these concepts.

On Thu, 23 Apr 2020, Mouse wrote:

> I'm not doing read/write DMA.  DMA never transfers from memory to the
> device.  (Well, I suppose it does to a small extent, in that the device
> reads buffer descriptors.  But the buffer descriptors are set up once
> and never touched afterwards; the code snippet I posted is not writing
> to them.)

If you are not doing DMA you don't need to do any memory synchronization 
(modulo SMP issues with other CPUs, but that's a completely different 
topic.)

> The hardware is DMAing into the memory, and nothing else.  The driver
> reads the memory and immediately writes it again, to be read by the
> driver some later time, possibly being overwritten by DMA in between.
> So an example that says "do write DMA" is not directly applicable.

If a (non CPU) device is directly reading or writing DRAM without the 
CPU having to read a register and then write its contents to memory, then 
it is doing DMA.

The problem is many modern CPUs have write-back caches which are not 
shared by I/O devices.  So when you do a read operation (from device to 
CPU) you should:

1) Do a BUS_DMASYNC_PREREAD to make sure there is no data in the cache 
that may be written to DRAM during the I/O operation.

2) Tell the hardware to do the read operation.

3) When the transaction completes issue a BUS_DMASYNC_POSTREAD to make 
sure the CPU sees the data in DRAM not stale data in the cache.


When you do a write operation you should:

1) Make sure the buffer contains all the data you want to transmit.

2) Do a BUS_DMASYNC_PREWRITE to make sure any data that may remain in the 
CPU writeback cache is flushed to memory.

3) Tell the hardware to do the write operation.

4) When the write operation completes... well it shouldn't matter.
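
Something like the following, as a minimal sketch of both sequences; the 
dmat, map, buf, data, len variables and the dev_start_*() calls stand in 
for whatever the driver already has.

/* Device-to-memory (read) transfer. */
bus_dmamap_sync(dmat, map, 0, len, BUS_DMASYNC_PREREAD);    /* step 1 */
dev_start_read(sc);                         /* step 2: hypothetical */
/* ... wait for the transaction to complete ... */
bus_dmamap_sync(dmat, map, 0, len, BUS_DMASYNC_POSTREAD);   /* step 3 */
/* The CPU can now safely read the data in buf. */

/* Memory-to-device (write) transfer. */
memcpy(buf, data, len);                                     /* step 1 */
bus_dmamap_sync(dmat, map, 0, len, BUS_DMASYNC_PREWRITE);   /* step 2 */
dev_start_write(sc);                        /* step 3: hypothetical */
/* step 4: a POSTWRITE after completion mostly lets bounce buffers go. */
bus_dmamap_sync(dmat, map, 0, len, BUS_DMASYNC_POSTWRITE);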


If you have a ring buffer you should try to map it CONSISTENT which will 
disable all caching of that memory.  However, some CPUs will not allow you 
to disable caching, so you should put in the appropriate bus_dmamap_sync() 
operations so the code will not break on those machines.

When you set up the mapping for the ring buffer you should do either a 
BUS_DMASYNC_PREREAD, or if you need to initialize some structures in that 
buffer use BUS_DMASYNC_PREWRITE.  One will do a cache invalidate, the 
other one will force a writeback operation.

When you get a device interrupt, you should do a BUS_DMASYNC_POSTREAD to 
make sure anything that might have magically migrated into the cache has 
been invalidated.  Then copy the data out of the ring buffer and do 
another BUS_DMASYNC_PREREAD or BUS_DMASYNC_PREWRITE as appropriate.
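
For a receive ring mapped with BUS_DMA_COHERENT the interrupt-time part of 
that cycle looks roughly like this sketch (sc and the ring map/size fields 
are placeholders for the driver's own state):

int
xx_intr(void *arg)
{
        struct xx_softc *sc = arg;      /* hypothetical softc */

        /* Make the device's writes visible to the CPU. */
        bus_dmamap_sync(sc->sc_dmat, sc->sc_ringmap, 0, sc->sc_ringsize,
            BUS_DMASYNC_POSTREAD);

        /* ... copy completed entries out of the ring buffer ... */

        /* Re-arm for the next round of device writes. */
        bus_dmamap_sync(sc->sc_dmat, sc->sc_ringmap, 0, sc->sc_ringsize,
            BUS_DMASYNC_PREREAD);
        return 1;
}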

> The example makes it look as though read DMA (device->memory) needs to
> be bracketed by PREREAD and POSTREAD and write DMA by PREWRITE and
> POSTWRITE.  If that were what I'm doing, it would be straightforward.
> Instead, I have DMA and the driver both writing memory, but only the
> driver ever reading it.
> 
> Your placement for PREREAD and POSTREAD confuses me because it doesn't
> match the example.  The example says
> 
>   /* invalidate soon-to-be-stale cache blocks */
>   bus_dmamap_sync(..., BUS_DMASYNC_PREREAD);
>   [ do read DMA ]
>   /* copy from bounce */
>   bus_dmamap_sync(..., BUS_DMASYNC_POSTREAD);
>   /* read data now in driver-provided buffer */
>   [ computation ]
>   /* data to be written now in driver-provided buffer */
>   /* flush write buffers and writeback, copy to bounce */
>   bus_dmamap_sync(..., BUS_DMASYNC_PREWRITE);
>   [ do write DMA ]
>   /* probably a no-op, but provided for consistency */
>   bus_dmamap_sync(..., BUS_DMASYNC_POSTWRITE);
> 
> but what your changes would have my driver doing is
> 
> [read-direction DMA might happen here]
> PREREAD
> driver reads data from driver-provided buffer
> POSTREAD
> [read-direction DMA might happen here]
> PREWRITE
> driver writes data to driver-provided buffer
> POSTWRITE
> [read-direction DMA might happen here]

That bit is not right.

> The conceptual paradigm is
> 
> - at attach time: allocate, set up, and load the mapping
> 

Presumably you should do a BUS_DMASYNC_PREWRITE somewhere in here

> - at open time: tell hardware to start DMAing

and a BUS_DMASYNC_POSTWRITE around here.

> 

Here you need a BUS_DMASYNC_POSTREAD.

> - at read time (ie, repeatedly): driver reads buffer to see how much
>has been overwritten by DMA, copying the overwritten portion out and
>immediately resetting it to the pre-overwrite data, to be
>overwritten again later

If you wrote anything to the ring buffer during this operation you need to 
insert a BUS_DMASYNC_PREWRITE.

> 
> - at close time: tell hardware to stop DMAing
> 
> The map is never unloaded; the driver is not detachable.  The system
> has no use case for that, so I saw no point in putting time into it.
> 
> The code I quoted is the "at read time" part.  My guess based on the
> manpage's example and what you've written is that I need
> 
>   

Re: RFC: New userspace fetch/store API

2019-02-25 Thread Eduardo Horvath
On Mon, 25 Feb 2019, Andrew Cagney wrote:

> On Mon, 25 Feb 2019 at 11:35, Eduardo Horvath  wrote:
> >
> > On Sat, 23 Feb 2019, Jason Thorpe wrote:
> >
> > > int ufetch_8(const uint8_t *uaddr, uint8_t *valp);
> > > int ufetch_16(const uint16_t *uaddr, uint16_t *valp);
> > > int ufetch_32(const uint32_t *uaddr, uint32_t *valp);
> > > #ifdef _LP64
> > > int ufetch_64(const uint64_t *uaddr, uint64_t *valp);
> > > #endif
> >
> > etc.
> >
> > I'd prefer to return the fetched value and have an out of band error
> > check.
> >
> > With this API the routine does a userland load, then a store into kernel
> > memory.  Then the kernel needs to reload that value.  On modern CPUs
> > memory accesses tend to be sloow.
> 
> So the out-of-band error check becomes the slow memory write?

Even cached accesses are slower than register accesses.  And the compiler 
is limited in what it can do to reorder the instruction stream while 
maintaining C language semantics.

> I don't see it as a problem as the data would presumably be written to
> the write-back cache's stack (besides, if the function is short, LTO
> will eliminate it).

The compiler can eliminate the function but needs to keep the defined C 
language synchronization points, so the memory accesses limit the ability 
of the compiler to properly optimize the code.  

Plus, there are a number of applications where you may want to do a 
series of userland fetches in a row and operate on them.  In this case it 
would be much easier to do the series and only check for success when 
you're done.

And all the existing code expects the value to be returned by the function, 
not in one of the parameters, so it would require less 
rewriting.

I'd do something like:

uint64_t ufetch_64(const uint64_t *uaddr, int *errp);

where *errp needs to be initialized to zero and is set on fault so you can 
do:

int err = 0;
long hisflags = ufetch_64(flag1p, &err) | ufetch_64(flag2p, &err);

if (err) return EFAULT;

do_something(hisflags);

Eduardo



Re: RFC: New userspace fetch/store API

2019-02-25 Thread Eduardo Horvath
On Sat, 23 Feb 2019, Jason Thorpe wrote:

> int ufetch_8(const uint8_t *uaddr, uint8_t *valp);
> int ufetch_16(const uint16_t *uaddr, uint16_t *valp);
> int ufetch_32(const uint32_t *uaddr, uint32_t *valp);
> #ifdef _LP64
> int ufetch_64(const uint64_t *uaddr, uint64_t *valp);
> #endif

etc.

I'd prefer to return the fetched value and have an out of band error 
check.  

With this API the routine does a userland load, then a store into kernel 
memory.  Then the kernel needs to reload that value.  On modern CPUs 
memory accesses tend to be sloow.

Eduardo


Re: Missing compat_43 stuff for netbsd32?

2018-09-11 Thread Eduardo Horvath
On Tue, 11 Sep 2018, Paul Goyette wrote:

> While working on the compat code, I noticed that there are a few old
> syscalls which are defined in sys/compat/netbsd32/syscalls.master
> with a type of COMPAT_43, yet there does not exist any compat_netbsd32
> implementation as far as I can see...
> 
>   #64    ogetpagesize
>   #84    owait
>   #89    ogetdtablesize
>   #108   osigvec
>   #142   ogethostid (interestingly, there _is_ an implementation
>          for osethostid!)
>   #149   oquota
> 
> Does any of this really matter?  Should we attempt to implement them?

I believe COMPAT_43 is not NetBSD 4.3, it's BSD 4.3.  Anybody have any old 
BSD 4.3 80386 binaries they still run?  Did BSD 4.3 run on an 80386?  Did 
the 80386 even exist when Berkeley published BSD 4.3?

It's probably only useful for running ancient SunOS 4.x binaries, maybe 
Ultrix, Irix or OSF-1 depending on how closely they followed BSD 4.3.

Eduardo


re: Kernel module framework status?

2018-05-04 Thread Eduardo Horvath
On Fri, 4 May 2018, matthew green wrote:

> John Nemeth writes:
> > On May 3, 10:54pm, Mouse wrote:
> > }
> > } >  There is also the idea of having a module specify the device(s)
> > } > it handles by vendor:product
> > } 
> > } Isn't that rather restrictive in what buses it permits supporting?
> > 
> >  I suppose that other types of identifiers could be used.
> 
> one imagines that it would be done via device properties, and so
> it would work with what ever bus features were provided for this
> sort of lookup.  maybe be useful to have a library backend for
> the common case of vendor:product.

OpenFirmware defines "compatible" properties, a comma separated list 
describing the device from most specific to least specific.  The 1275 
bindings for PCI define a way to map PCI vendor and product IDs to 
"compatible" properties.

What I would do is add a special ELF section to each module with a list of 
the "compatible" properties the driver supports.  Then, when the bus is 
enumerated, generate a "compatible" string for each device and search each 
"compatible" section in each driver module for a match.  If it's PCI you 
probe the vendor and device ID to create the property.  For SBus, you get 
it from the FCode.  (For ISA, you're SOL.)
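
As a purely hypothetical sketch (the section name and the way the strings 
are embedded are invented; nothing like this exists in the tree), a driver 
module might carry something like:

/* In the driver module: NUL-separated OFW-style "compatible" strings. */
static const char xx_compatible[]
    __attribute__((used, section(".compatible"))) =
        "pci108e,1101\0"        /* vendor,device form from the 1275 binding */
        "SUNW,hme\0";           /* most specific name first */

The module loader would then match the "compatible" string generated at 
enumeration time against each module's section.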

That way you don't have to mess around with a bunch of obnoxious config 
files like Solaris does.

Eduardo




Re: amd64: svs

2018-01-11 Thread Eduardo Horvath
On Thu, 11 Jan 2018, Martin Husemann wrote:

> On Thu, Jan 11, 2018 at 08:14:59PM +0100, Jaromír Doleček wrote:
> > Okay, I'll look into this. The feature seems pretty simple to use, though
> > it will need
> > some care to allow inactive processes to relinguish PCID when there is
> > shortage.
> 
> All sane architectures are already using it ;-)
> 
> Sparc64 uses a pretty simple scheme, see pmap.c:ctx_alloc:
> 
> mutex_enter(&curcpu()->ci_ctx_lock);
> ctx = curcpu()->ci_pmap_next_ctx++;
> 
> /*
>  * if we have run out of contexts, remove all user entries from
>  * the TSB, TLB and dcache and start over with context 1 again.
>  */
> 
> if (ctx == curcpu()->ci_numctx) {
> 
> 
> If we start doing something more clever, we should make it MD and change all
> existing architectures to it.

If you look in the CVS history you will see that I originally had a much 
more clever implementation.  The claim back then was that the current 
implementation (stolen from MIPS I believe) was simpler and performed 
better.  I didn't necessarily agree at the time but didn't want to argue 
the point.

Eduardo

Re: uvm page coloring for cache aliasing

2017-12-26 Thread Eduardo Horvath
On Sat, 23 Dec 2017, co...@sdf.org wrote:

> Hi folks,
> 
> as I understand, a reason to have page coloring is extra broken MIPS
> hardware which has cache aliasing issues unless a large page size
> is used. picking the same color avoids aliasing.


Could be to work around cache aliasing, or it could be used to reduce 
cache line contention.  I'd say it's mostly used for the latter.  This is 
an optimization, not functional enforcement.

> now, looking at uvm_pagealloc_pgfl:
> 
> do {
>   /* trying to find pages in color.. */
>   /* goto success */
> 
>   color = (color + 1) & uvmexp.colormask;
>  } while (color != trycolor);
> 
> which means that if we fail to find a page in the requested color, we'll
> try another color. I think this might end up inducing cache aliasing
> issues, and we should instead fail for this case.
> 
> thoughts?

If there are functional issues with page mappings they should be handled 
in the pmap layer, either failing the map request, or disabling any 
potentially aliased caches.

Eduardo


Re: New line discipline flag to prevent userland open/close of the device

2017-10-30 Thread Eduardo Horvath
On Sun, 29 Oct 2017, Martin Husemann wrote:

> Now for those devices we definitively do not want userland access to the com
> device. This would be pretty easy if we add another flag to the tty
> line disciplines t_state member, like TS_NO_USER_OPEN. Then we could
> modify comclose like:

> Does that sound good? Other suggestions?

One other idea we did on SunOS, but I don't know if it'll work on NetBSD, 
is to have the keyboard/mouse driver do an exclusive open on the serial 
device; that way everyone else who tries to open it will get EBUSY.  
That should work with all the different serial drivers without having to 
modify each one's open routine.

Eduardo


Re: Access to DMA memory while DMA in progress?

2017-10-27 Thread Eduardo Horvath
On Fri, 27 Oct 2017, Mouse wrote:

> >> But I'm not sure what sort of sync calls I need to make.  [...]
> > You want to do a bus_dmamap_sync(BUS_DMASYNC_POSTREAD) [...]
> > In the NIC example above, you map the ring buffer with
> > BUS_DMA_COHERENT, fill it up and do a
> > bus_dmamap_sync(BUS_DMASYNC_PREREAD).  When you want to read it
> > (usually after getting an interrupt) you do
> > bus_dmamap_sync(BUS_DMASYNC_POSTREAD) before doing the read.
> 
> Don't you need to PREWRITE after filling it?  Based on the mental
> models I've formed, that feels necessary.

You'd want to do a PREWRITE and a POSTWRITE, but since writing wasn't part 
of your usage model I skipped that part.

Eduardo


Re: Access to DMA memory while DMA in progress?

2017-10-27 Thread Eduardo Horvath
On Fri, 27 Oct 2017, Mouse wrote:

> I would like to read the DMA buffer while DMA is still going on.  That
> is, I have a buffer of (say) 64K and the hardware is busily writing
> into it; I want to read the buffer and see what the hardware has
> written in the memory it has written and what used to be there in the
> memory it hasn't.  I'm fine if the CPU's view lags the hardware's view
> slightly, but I do care about the CPU's view of the DMA write order
> matching the hardware's: that is, if the CPU sees the value written by
> a given DMA cycle, then the CPU must also see the values written by all
> previous DMA cycles.  (This reading is being carried out from within
> the kernel, by driver code.  I might be able to move it to userland,
> but it would surprise me if userland could do something the kernel
> can't.)

This is all very hardware dependent.

Make sure you map that area with the BUS_DMA_COHERENT flag.  It will 
disable as much caching as possible on those sections of memory, and on 
some hardware it may be required: without it the CPU won't be able to read 
the data until the segment is bus_dmamem_unmap()ped, even with the 
bus_dmamap_sync() operations.

Many NICs do something like this.  They have a ring buffer the CPU sets up 
with pointers to other buffers to hold incoming packets.  When a packet 
comes in the NIC writes out the contents and then updates the pointer to 
indicate DMA completion.  The CPU then swaps the pointer with one pointing 
to an empty buffer.

> 
> But I'm not sure what sort of sync calls I need to make.  Because of
> things like bounce buffers and data caches, I presumably need
> bus_dmamap_sync(BUS_DMASYNC_POSTREAD) somewhere in the mix, but it is
> not clear to me how/when, nor how fine-grained those calls can be.  Do
> I just POSTREAD each byte/word/whatever before I read it?  How
> expensive is bus_dmamap_sync - for example, is a 1K sync significantly
> cheaper than four 256-byte syncs covering the same memory?  If I'm
> reading a bunch of (say) uint32_ts, is it reasonable to POSTREAD each
> uint32_t individually?  If I POSTREAD something that DMA hasn't written
> yet, will it work to POSTREAD it again (and then read it) after DMA
> _has_ written it?  Is BUS_DMA_STREAMING relevant?  I will be
> experimenting to see what seems to work, but I'd like to understand
> what is promised, not just what happens to work on my development
> system.
> 
> Of course, there is the risk of reading a partially-written datum.  In
> my case (aligned uint32_ts on amd64) I don't think that can happen.

You want to do a bus_dmamap_sync(BUS_DMASYNC_POSTREAD) for each... let's 
call it a snapshot.  It will try to provide the CPU a consistent view of 
that section of memory at the time the sync call is made.

The cost of these operations is very hardware dependent.  On some machines 
the bus_dmamem_map() operation with or without the BUS_DMA_COHERENT flag 
will turn off all caches and the bus_dmamap_sync() calls are noops.

On hardware that has an I/O cache, bus_dmamap_sync() may need to flush it 
first to get the DMA data into the coherency domain.

If there's a CPU cache that has not been disabled for that section of 
memory, bus_dmamap_sync() may need to invalidate it.

In the NIC example above, you map the ring buffer with BUS_DMA_COHERENT, 
fill it up and do a bus_dmamap_sync(BUS_DMASYNC_PREREAD).  When you want 
to read it (usually after getting an interrupt) you do 
bus_dmamap_sync(BUS_DMASYNC_POSTREAD) before doing the read.

I have long argued that we should also have bus_dma accessor functions 
like the ones bus_space uses to access device registers.  They can do fun 
things like fixing up alignment and endianness swapping without having to 
litter the driver with code only needed for certain hardware.
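
Purely as an illustration of the idea (no such interface exists in 
bus_dma(9) today; every name here is invented), such an accessor might 
look like:

/*
 * Hypothetical: read a 32-bit value at 'off' in a DMA-able buffer,
 * handling sync, alignment and byte order the way bus_space_read_4()
 * does for device registers.
 */
static inline uint32_t
bus_dmamem_read_4(bus_dma_tag_t t, bus_dmamap_t map, const void *kva,
    bus_size_t off)
{
        uint32_t v;

        bus_dmamap_sync(t, map, off, sizeof(v), BUS_DMASYNC_POSTREAD);
        memcpy(&v, (const char *)kva + off, sizeof(v)); /* unaligned-safe */
        return le32toh(v);      /* assuming a little-endian device format */
}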

> The presence of bus_dmamem_mmap seems to me to imply that it should be
> possible to make simple memory accesses Just Work, but it's not clear
> to me to what extent bus_dmamem_mmap supports _concurrent_ access by
> DMA and userland (for example, does the driver have to
> BUS_DMASYNC_POSTREAD after the DMA and before userland access to
> mmapped memory, or does the equivalent happen automagically, eg in the
> page fault handler, or does bus_dmamem_mmap succeed only on systems
> where no such care needs to be taken, or what?).

Trying to do this in userland on a machine with an I/O cache won't work 
too well.

> My impression is that bus_dma is pretty stable, and, thus, version
> doesn't matter much.  But, in case it matters, 5.2 on amd64.

AFAIK amd64 disables all caches with BUS_DMA_COHERENT, so the sync 
operations aren't really necessary.  But jumping through all these hoops 
is important on other hardware.

Eduardo


Re: how to tell if a process is 64-bit

2017-09-14 Thread Eduardo Horvath
On Thu, 14 Sep 2017, Martin Husemann wrote:

> On Thu, Sep 14, 2017 at 02:31:29PM +0200, Thomas Klausner wrote:
> > kp = kvm_getproc2(kvmp, KERN_PROC_PID, pid, sizeof(*kp), &res);
> > if (res != 1)
> > exit(1);
> 
>   if (kp->p_flag & P_32)
>   printf("it is a 32bit process\n");
> 
> Unless you are running with a 32bit kernel, then you'll never see that
> flag (but also the question does not make sense).

In theory you could run 64-bit processes on a 32-bit kernel on CPUs that 
have disjoint user and kernel address spaces, like sparcv9.  In that case 
it would make sense to use the P_32 flag to distinguish processes with 
32-bit address spaces from those with 64-bit address spaces.

However, it would be a lot of work and probably not worth the effort since 
the kernel would be limited to 32-bit data structures such as file 
descriptors and uvm spaces and would be hard pressed to keep up with the demands 
of 64-bit address spaces.  (And implementing copyin/copyout routines would 
be fun.)

So yes, at the moment setting the P_32 flag on every process running on a 
32-bit kernel seems like a waste of CPU cycles.

Eduardo



Re: how to tell if a process is 64-bit

2017-09-08 Thread Eduardo Horvath
On Fri, 8 Sep 2017, Mouse wrote:

> >> ([...] on most "64-bit" ports, a real question on amd64 (and others,
> >> if any) which support 32-bit userland.)
> > actually -- our mips64 ports largely use N32 userland, which is 64
> > bit registers and 32 bit addresses.
> 
> Oh!  Thank you.  Yes, that's an interesting case.
> 
> In addition to amd64/i386, it occurs to me that sparc64/sparc32 is
> another case; IIRC it's possible to take sparc64 hardware and build a
> (special? not sure) kernel that runs sparc32 userland.  I've never
> tried it; I don't know whether sparc32 and sparc64 are as freely
> mixable at runtime as amd64 and i386 are under amd64 kernels.

Yes, that's what compat/netbsd32 is for.  (Of course, there are still a 
few things like mount I never bothered to get working across ABIs.)

Eduardo


re: Can't compile NetBSD kernel in Virtual Box due to assym.h error

2017-07-05 Thread Eduardo Horvath
On Wed, 5 Jul 2017, matthew green wrote:

> Robert Elz writes:
> > Date:Tue, 04 Jul 2017 07:24:43 +0100
> > From:Robert Swindells 
> > Message-ID:  
> > 
> >   | You are running NetBSD/amd64 but trying to build a NetBSD/i386 kernel
> >   | using the native tools, that won't work.
> > 
> > Groan.  So wrapped up in my own issues about this I totally missed that!
> 
> it should be able to work.  all you need to use is -m32 and the
> equiv for as/ld options.  adding them to all i386 builds should
> be OK, they're the default there anyway.

You don't want to make this seamless 'cause then Devin would have 
successfully compiled and installed a brand new i386 kernel and then when 
he tried to reboot with an amd64 root, the system would be dead.  Best to 
force an explicit cross compile in these circumstances.

Eduardo


Re: ptrace(2) interface for hardware watchpoints (breakpoints)

2016-12-15 Thread Eduardo Horvath

On Thu, 15 Dec 2016, Andrew Cagney wrote:

> Might a better strategy be to first get the registers exposed, and then, if
> there's still time start to look at an abstract interface?

That's one way of looking at it.

Another way is to consider that watchpoints can be implemented through 
careful use of the MMU.


> and now lets consider this simple example, try to watch c.a in:
> 
> struct { char c; char a[3]; int32_t i; int64_t j; } c;
> 
> Under the proposed model (it looks a lot like gdb's remote protocol's Z
> packet) it's assumed this will allocate one watch-point:
> 
> address=&c.a, size=3

So when you register the watchpoint, the kernel adds the address and size 
to an internal list and changes the protection of that page.  If it's a 
write watchpoint, the page is made read-only.  If it's a read watchpoint, 
it's made invalid.

The userland program runs happily until it tries to access something on 
the page with the watchpoint.  Then it takes a page fault.  

The fault handler checks the fault address against its watchpoint list, 
and if there's a match, sends a ptrace event and you're done.

If it doesn't match the address, the kernel can either map the address in 
and single step, set a breakpoint on the next instruction, or emulate the 
instruction, and then protect the page again to wait for the next fault.
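
A rough sketch of the fault-handler side of that scheme follows; the 
watchpoint structure, the per-process list and the report function are 
invented for illustration.

/* Called from the page fault path for a traced process. */
static bool
watchpoint_hit(struct proc *p, vaddr_t va)
{
        struct watchpoint *wp;          /* hypothetical structure */

        LIST_FOREACH(wp, &p->p_watchpoints, wp_list) {  /* hypothetical list */
                if (va >= wp->wp_addr && va < wp->wp_addr + wp->wp_size) {
                        /* Hit: stop the process and report via ptrace. */
                        watchpoint_report(p, wp);       /* hypothetical */
                        return true;
                }
        }
        /*
         * No match: restore the mapping, single-step (or emulate) the
         * faulting instruction, then re-protect the page.
         */
        return false;
}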

It has a bit more overhead than using debug registers, but it scales a lot 
better.

Eduardo


Re: "Wire" definitions and __packed

2016-10-05 Thread Eduardo Horvath
On Wed, 5 Oct 2016, Roy Marples wrote:

> On 04/10/2016 23:06, Joerg Sonnenberger wrote:
> > I'd like to address this by cutting down on the first set. For this
> > purpose, I want to replace many of the __packed attributes in the
> > current network headers with CTASSERT of the proper size, especially for
> > those structs that are clearly not wire definitions by themselves.
> 
> I tested the following structs without packed with the latest dhcpcd
> trunk (not yet in NetBSD).
> 
> ip
> udphdr
> arphdr
> in_addr
> nd_router_advert
> nd_opt_hdr
> nd_opt_prefix_info
> nd_opt_mtu
> nd_opt_rdnss
> nd_opt_dnssl
> 
> Works fine so far.

What platforms did you test it on?

I recommend trying it on sparc64.  That's one of the worst cases, being 
big-endian 64-bit with alignment constraints.  And I recall some ABI (was 
it ARM?) has strange alignment restrictions on byte values.

Eduardo


Re: What is the best layer/device for a write-back cache based in nvram?

2016-09-14 Thread Eduardo Horvath
On Wed, 14 Sep 2016, Edgar Fuß wrote:

> > 2- In scattered writes contained in a same slice, it allows to reduce
> > the number of writes. With RAID 5/6 there is a advantage, the parity
> > is written only one time for several writes in the same slice, instead
> > of one time for every write in the same slice.
> > 3- It allows to consolidate several writes that takes the full length
> > of the stripe in one write, without reading the parity. This can be
> > the case for log structured file systems as LFS, and allows to use a
> > RAID 5/6 with the similar performance of a RAID-0.
> You ought to adjust your slice size and FS block size then, I'd suppose.
> 
> I specifically don't get the LFS point. LFS writes in segments, which are 
> rather large. A segment should match a slice (or a number of them)
> I would suppose LFS to perform great on a RAIDframe. Isn't Manuel Bouyer 
> using this in production?
> 
> > 4- Faster synchronous writes.
> Y E S.
> This is the only point I fully aggree on. We've had severe problems with 
> brain-dead software (Firefox, Dropbox) performing tons of synchronous 4K 
> writes (on a bs=16K FFS) which nearly killed us until I wrote Dotcache 
> (http://www.math.uni-bonn.de/people/ef/dotcache) and we set XDG_CACHE_HOME 
> to point to local storage.

Hm...  Maybe what you need to do is make the LFS segment the same size as 
the RAID stripe, then mount LFS async so it only ever writes entire 
segments.

Eduardo

Re: IIs factible to implement full writes of strips to raid using NVRAM memory in LFS?

2016-08-18 Thread Eduardo Horvath
On Thu, 18 Aug 2016, David Holland wrote:

> some quibbles:
> 
> On Thu, Aug 18, 2016 at 05:24:53PM +, Eduardo Horvath wrote:
>  > And you should be able to roll back the 
>  > filesystem to snapshots of any earlier synchronization points.
> 
> In LFS there are only two snapshots and in practice often one of
> them's not valid (because it was halfway through being taken when the
> machine went down) so rolling further back isn't that feasible.

I don't remember seeing any code that overwrites old snapshots, so most of 
them are still on disk.  It's just a question of finding them, which is 
where the first and last superblock come into play.

>  > The problem is that LFS is less a product than a research project:
>  > 
>  > o Although there are multiple super blocks scattered across the disk just 
>  > like FFS, LFS only uses the first and last one.  If both of those are 
>  > corrupt, the filesystem image cannot be recovered.  LFS should be enhanced 
>  > to cycle through all the different super blocks for enhanced robustness.
> 
> This should be left to fsck, like it is in ffs. I don't remember if
> fsck_lfs supports recovering from an alternate superblock, but it
> isn't going to be that hard.

The LFS super block contains a pointer to the end of the log.  Since LFS 
only ever updates that pointer on the firt and last superblock, if you try 
to use any other superblock to recover the filesystem you essentially roll 
it back to just after the newfs_lfs ran.

> 
>  > o The rollback code is quite sketchy.  It doesn't really work so well, so 
>  > LFS has problems recovering from failures.  
> 
> Rolling *back* to the last snapshot is easy. It's the roll-forward
> code that's dodgy, isn't it?

In my experience the rollback code also has issues.  I've seen it get 
badly confused.

Eduardo


Re: IIs factible to implement full writes of strips to raid using NVRAM memory in LFS?

2016-08-18 Thread Eduardo Horvath
On Thu, 18 Aug 2016, Jose Luis Rodriguez Garcia wrote:

> On Thu, Aug 18, 2016 at 7:24 PM, Eduardo Horvath  wrote:
> >
> > LFS writes the metadata at the same time, in the same place as the data.
> > No synchronous writes necessary.
> 
> As I understand it, LFS needs to do synchronous writes when there are
> metadata operations (directories)/fsync operations involved. Instead
> of writing a full segment (1 MB by default), it writes a "small
> segment". It kills performance in RAID 5/6, because the write isn't a
> full stripe: you have to read all the disks to calculate the new
> parity, so 1 write on a RAID of x disks = x reads + 2 writes (data + parity).
> 
> The NVRAM memory solves this problem as buffer/ write cache.

If the rollback code worked properly then sync writes should not be 
necessary.  Looks like the SEGM_SYNC flag is only set when LFS is writing 
a checkpoint.  But I'm not sure there's any guarantee that earlier 
segments have hit the disk.

Anyway, I still think fixing LFS so synchronous writes are not needed is a 
better use of time than making it use a hardware workaround.

I suppose adding code to LFS to copy the sync write out to NVRAM at the 
point where it posts it would be relatively easy.  But then you still need to hack 
the recovery code to look for data in the NVRAM *and* figure out how to 
use it to repair the filesystem.  (Which it should be able to repair just 
fine without the data in the NVRAM BTW.)

Better to fix the recovery code and just turn off sync writes entirely.  I 
suppose other people may have different opinions on the subject.

Eduardo


Re: IIs factible to implement full writes of strips to raid using NVRAM memory in LFS?

2016-08-18 Thread Eduardo Horvath
On Thu, 18 Aug 2016, Jose Luis Rodriguez Garcia wrote:

> I would like to implement this in LFS:
> 
> Write full stripes to RAID 5/6 from LFS using a NVRAM card or similar:
> 
> For this, the segment would be written to a NVRAM card or similar.
> When the full segment is written to the NVRAM card, it would be
> written to the RAID as a full stripe, without the RAID write penalty.
> For an easy implementation I have thought of increasing/decreasing
> the LFS segment size so that it is a multiple of the stripe size of
> the RAID device.
> 
> Question: Is it a small/medium/big project?

Ummm... It's probably a big project.  And I'm not sure it buys you much if 
anything.

A regular unix filesystem will use synchronous metadata writes to keep the 
FS image consistent if the box crashes or loses power.  NVRAM will speed 
up those operations.

LFS writes the metadata at the same time, in the same place as the data.  
No synchronous writes necessary.  In theory, if there is a failure you 
just roll back the log to an earlier synchronization point.  You lose the 
data after that point, but that should be fairly small, and you would have 
lost it anyway with a regular FS.  And you should be able to roll back the 
filesystem to snapshots of any earlier synchronization points.

The problem is that LFS is less a product than a research project:

o Although there are multiple super blocks scattered across the disk just 
like FFS, LFS only uses the first and last one.  If both of those are 
corrupt, the filesystem image cannot be recovered.  LFS should be enhanced 
to cycle through all the different super blocks for enhanced robustness.

o The rollback code is quite sketchy.  It doesn't really work so well, so 
LFS has problems recovering from failures.  

o LFS keeps all of its inodes in a file called the ifile.  It's a regular 
LFS file, so in theory you can scan back to recover earlier revisions of 
that file.  Also, fsck_lfs should be able to reconstruct the ifile from 
scrach by scanning the disk.  This is yet another feature that has not 
been implemented yet.

LFS writes data in what's called a subsegment.  This is essentially an 
atomic operation which contains data and metadata.  The subsegments are 
collected into segments, which contain more metadata, such as a current 
snapshot of the ifile.  All the disk sectors in a subsegment are 
checksummed, so partial writes can be detected.  If the checksum on a 
subsegment is incorrect, LFS should roll back to a previous subsegment 
that does have a correct checksum.  I don't think that code exists, or if 
it does I don't think it works.

Anyway, hacking on LFS is lots of fun.  Enjoy!

Eduardo


Re: UVM and the NULL page

2016-08-01 Thread Eduardo Horvath
On Mon, 1 Aug 2016, Joerg Sonnenberger wrote:

> On Mon, Aug 01, 2016 at 04:46:34PM +0000, Eduardo Horvath wrote:
> > 
> > I don't understand.  If you can't enter the mapping into the TLB, who 
> > cares what UVM does?
> 
> If UVM tries [0, n] and could have picked [1, n+1] as page numbers as
> well, will it try the latter after the pmap insert failed for the
> former?

Can't think of any codepath where that would happen.  

Eduardo


Re: UVM and the NULL page

2016-08-01 Thread Eduardo Horvath
On Mon, 1 Aug 2016, Joerg Sonnenberger wrote:

> I disagree. While it is nice to assert this property in the pmap, it is
> the wrong place. First of all, all pmaps need to be audited, at least on
> platforms with shared address space. It's not specific to x86. Second,
> part of the problem is that UVM does not handle its own constraints
> correctly. That means it is possible that some of the (then failing)
> requests could be fulfilled by correct code. In short, handling it in
> the pmap doesn't actually solve the problem completely either.

I don't understand.  If you can't enter the mapping into the TLB, who 
cares what UVM does?

Eduardo


Re: UVM and the NULL page

2016-08-01 Thread Eduardo Horvath
On Sat, 30 Jul 2016, Thor Lancelot Simon wrote:

> 1) amd64 partially shares VA space between the kernel and userland.  It
>is not unique in this but most architectures do not.

FWIW all the pmaps I worked on have split user/kernel address spaces and 
do not share this vulnerability.

Eduardo


Re: UVM and the NULL page

2016-08-01 Thread Eduardo Horvath
On Sat, 30 Jul 2016, Joerg Sonnenberger wrote:

> For what it is worth, I do believe that the handling of the 0 page
> should be part of UVM and not pmap. I am only objecting to forcing it
> unconditionally.

I disagree.  Based on the number of files you need to touch it's much 
better to implement it in pmap.  All you need to do is check the VA in 
pmap_enter and if it's 0 return an error.  Done.  It only affects the 
particular architectures that have this problem.  You don't further 
complicate the UVM code, you don't need to instrument all the different 
syscall entries, and you don't have to worry about possibly missing some 
obscure path to get page zero mapped in to some process.
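
In other words, something like the following at the top of the affected 
port's pmap_enter() (a sketch; the exact guard and error value would be up 
to the port):

int
pmap_enter(pmap_t pmap, vaddr_t va, paddr_t pa, vm_prot_t prot, u_int flags)
{
        /* Refuse to map page zero into user pmaps on affected ports. */
        if (va == 0 && pmap != pmap_kernel())
                return EPERM;           /* error value is illustrative */

        /* ... the rest of pmap_enter as before ... */
}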

Eduardo


Re: UVM and the NULL page

2016-07-29 Thread Eduardo Horvath
On Fri, 29 Jul 2016, Maxime Villard wrote:

> Anyway, the only thing I'm suggesting is doing it in amd64, so this is a
> little off-topic.

Yes, and if it's for amd64 it should be done in the pmap layer, not 
polluting the UVM layer all other architectures also use.

Eduardo


Re: UVM and the NULL page

2016-07-28 Thread Eduardo Horvath
On Thu, 28 Jul 2016, Maxime Villard wrote:

> Currently, there is no real way to make sure a userland process won't be
> able to allocate the NULL page. There is this attempt [1], but it has two
> major issues.

I don't think this is a good idea.  You should leave this to the pmap 
layer rather than polluting UVM.  There are some architectures that need 
to have page zero mapped in for various reasons.

Eduardo


Re: Scripting DDB in Forth?

2016-05-02 Thread Eduardo Horvath
On Mon, 2 May 2016, Valery Ushakov wrote:

> On Mon, May 02, 2016 at 00:59:06 -0400, Michael wrote:
> 
> > On Mon, 2 May 2016 04:59:32 +0300
> > Valery Ushakov  wrote:

> > > I thought it might be interesting to put it into the kernel so that
> > > it can be hooked into DDB.
> > 
> > I'm afraid my first thought was OF_interpret() on machines that have
> > OF-like device trees but no OF.
> 
> I'm not sure I understand.  OF_interpret() by itself is not that
> useful unless you have the real OF device tree you can talk to.  But
> if you don't care about the abstraction / OOP layer of OF and just
> need ability to execute some code dynamically, then I don't see why
> not.

You can use OF_interpret() to access the underlying Forth engine for 
firmware that has it.  However nowadays IEEE-1275 device trees are used by 
firmware written in C without Forth engines.  Of course the problem with 
that is once the OS boots the ability of the firmware to allocate memory 
is severely limited.

> Or we can port real OF perhaps?  OpenBIOS is GPLv2, but since it will
> not be part of the kernel, that's not an issue.  Also, Sun did release
> OpenBOOT under BSD'ish license, if you want to be a purist about the
> licensing and don't mind doing extra work.

OpenBOOT is very SPARC-centric.  There's lots of SPARC assembly code 
throughout.

Firmworks put together Open Firmware, which is an IEEE-1275 implementation 
released under a BSD license, but last time I looked it seemed mostly 
focused on ARM CPUs.

As far as that goes, about 10 years ago I threw together an IEEE-1275 
implementation based on the pForth engine.  Since the kernel is written in 
C, it's more easily portable.  I probably have the code lying around 
somewhere.

Eduardo



Re: dbcool, envsys, powerd shutting down my machine

2016-02-04 Thread Eduardo Horvath

I really don't know why I'm keeping this discussion going.  I doubt anyone 
else cares, but:

On Wed, 3 Feb 2016, Constantine A. Murenin wrote:

> On 2016-02-03 10:06, Eduardo Horvath wrote:
> > On Tue, 2 Feb 2016, Constantine A. Murenin wrote:
> > 
> > > Wouldn't the correct solution then be to kill the process-intensive jobs,
> > > instead of shutting down the whole system?
> > 
> > That doesn't really make too much sense.
> > 
> > In theory, if the CPU has a low power mode and the machine detects
> > thermal issues, you could lower the temperature by periodically switching
> > into the low power mode.  Some CPUs operate this way.
> > 
> > However, at the moment there's no way that a low power mode will generate
> > less heat than a no power mode.
> 
> I think the issue at stake is the false positives, during which powerd will
> rudely shut down any box (w/ etc/powerd/scripts/sensor_temperature).
> 
> Also, I disagree that no power will always result in lower temperature,
> especially in the non-laptop environment:
> 
> 0. If your CPU is overheating, and you shutdown the whole box, it's quite
> likely that the other components within the enclosure will end up receiving
> extra heat due to the enclosure fans now being powered down.

Although you can always come up with a scenario based on some lousy 
design where shutting off the power is a bad idea, those cases are 
anomalous.  

If the box is overheating, either one of the fans has failed, the intake 
or exhaust ports are blocked, or the outside temperature is high enough 
that the fans are blowing heat into the box.  So no, pumping more energy 
into the box is counterproductive.

> 1. Also, if we talk about a Data Centre environment, wouldn't "graceful"
> shutdown from within the operating system simply initiate a restart sequence,
> potentially in an endless loop?

A shutdown should result in a shutdown.  A reboot should result in a 
reboot.  If a machine detects overtemp it shouldn't do either of those, it 
should power itself off.  Back in the olden days, computers couldn't turn 
off the power, so the best you could do was shut off the OS, put the CPU 
into a tight NOOP loop, and hope someone noticed and turned off the power 
before something fried.  Nowadays most machines can turn themselves off.

In a data center environment, there is usually a separate microcontroller
controlling the machine and doing environmental monitoring.  When it 
detects high temperatures it may try to tell the OS to cool it, but for a 
true overtemp situation it will turn off the power rails.  

> 
> I think system shutdown should be on an opt-in, not an opt-out, basis.
> 
> C.
> 
> 



Re: dbcool, envsys, powerd shutting down my machine

2016-02-03 Thread Eduardo Horvath
On Tue, 2 Feb 2016, Constantine A. Murenin wrote:

> Wouldn't the correct solution then be to kill the process-intensive jobs,
> instead of shutting down the whole system?

That doesn't really make too much sense.

In theory, if the CPU has a low power mode and the machine detects 
thermal issues, you could lower the temperature by periodically switching 
into the low power mode.  Some CPUs operate this way.  

However, at the moment there's no way that a low power mode will generate 
less heat than a no power mode.

Eduardo


Re: Understanding SPL(9)

2015-08-31 Thread Eduardo Horvath
On Mon, 31 Aug 2015, Stephan wrote:

> I´m trying to understand interrupt priority levels using the example
> of x86. From what I´ve seen so far I´d say that all spl*() functions
> end up in either splraise() or spllower() from
> sys/arch/i386/i386/spl.S. What these functions actually do is not
> clear to me. For example, splraise() starts with this:
> 
> ENTRY(splraise)
> movl4(%esp),%edx
> movlCPUVAR(ILEVEL),%eax
> cmpl%edx,%eax
> ja  1f
> movl%edx,CPUVAR(ILEVEL)
> ...
> 
> I´m unable to find out what CPUVAR(ILEVEL) means. I would guess that
> something needs to happen to the APIC´s task priority register.
> However I can´t see any coherence just now.

Don't look at x86, it doesn't have real interrupt levels.  Look at SPARC 
or 68K which do.

Most machines nowadays only have one interrupt line and an external 
interrupt controller.  True interrupt levels are simulated by assigning 
levels to individual interrupt sources and masking the appropriate ones in 
the interrupt controller.  This makes the code rather complicated, 
especially since interrupts can nest.
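
Conceptually the software version boils down to something like this 
simplified sketch (not any particular port's code; the mask table, field 
name and controller call are illustrative):

static uint32_t ipl_to_mask[NIPL];      /* hypothetical per-IPL mask table */

int
splraise(int newipl)
{
        struct cpu_info *ci = curcpu();
        int oldipl = ci->ci_ilevel;     /* field name is illustrative */

        if (newipl > oldipl) {
                ci->ci_ilevel = newipl;
                /* Mask every source whose level is at or below newipl. */
                intr_set_hwmask(ipl_to_mask[newipl]);   /* hypothetical */
        }
        return oldipl;
}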

If you want to see cleaner implementations look at machines with hardware 
interrupt levels.

Eduardo

Re: VOP_PUTPAGE ignores mount_nfs -o soft,intr

2015-06-23 Thread Eduardo Horvath
On Tue, 23 Jun 2015, Emmanuel Dreyfus wrote:

> I note we have this in genfs_do_io(), and I suspect this is the same 2 value:
> 
> if (iowrite) {
> mutex_enter(vp->v_interlock);
> vp->v_numoutput += 2;
> mutex_exit(vp->v_interlock);  
> }   
> mbp = ge
> 
> Why the vp->v_numoutput += 2 ?

From what I recall it's because this is a nested buf.  You need one for 
the biodone of the parent buf, and another for the child.

Look at the bottom of the routine you will see:

loopdone:
if (skipbytes) {
UVMHIST_LOG(ubchist, "skipbytes %d", skipbytes, 0,0,0);
}
nestiobuf_done(mbp, skipbytes, error);
if (async) {
UVMHIST_LOG(ubchist, "returning 0 (async)", 0,0,0,0);
return (0);
}
UVMHIST_LOG(ubchist, "waiting for mbp %p", mbp,0,0,0);
error = biowait(mbp);
s = splbio();
(*iodone)(mbp);
splx(s);
UVMHIST_LOG(ubchist, "returning, error %d", error,0,0,0);
return (error);
}


So nestiobuf_done() will result in a call to biodone(), and there is an 
additional (*iodone)() call at the end for sync operations which also 
calls biodone().  For async operations the second biodone() is called by 
the storage driver when the I/O operation completes.

Eduardo


Re: bottom half

2015-06-19 Thread Eduardo Horvath
On Fri, 19 Jun 2015, Johnny Billquist wrote:

> On 2015-06-19 11:45, Edgar Fuß wrote:
> > > "Runs on kernel stack in kernel space" is not the same thing as the Linux
> > > concept of bottom half. :-)
> > I don't know what the Linux (or VMS or Windows) concept of "bottom half" is.
> > I thought I knew what the BSD concept of kernel halves is.
> 
> I can't comment on Windows - I have no idea.
> VMS uses a very different solution, so it don't make much sense to talk about
> that here (or at least the parts that I know, which might be outdated).
> 
> If I remember Linux right, they have a fixed list of registered bottom half
> handlers (which of course ran out of space a long time ago), and then through
> some tricks extended it to be more general, but in essence the bottom half is
> the part of the device driver that runs after an interrupt to complete an I/O
> request. The top half being also running in the kernel, but in the context of
> a process that does the I/O, and the top half blocks until the I/O completes.
> And the bottom half is the part the unblocks the top half again. And each
> driver has its own bottom and top halves. One major point of the bottom halves
> is that when running the bottom half, interrupts are not blocked. A device
> driver normally do only a minimal amount of work in the interrupt handler
> itself, and then defer the rest of the work to the bottom half code, which
> will run at some later time.
> 
> Of course, I could be remembering this all wrong, and it might be outdated as
> well. So take what I write with a grain of salt. Or rather, read up on it in a
> Linux book instead.
> 
> I would say the bottom half concept in Linux is close to the softint stuff in
> NetBSD. But I might be wrong on that one, as I don't remember all the details
> of that either right now.

From what I remember the BSD book talks about the "top half" and "bottom 
half" of the driver, not just interrupt dispatch the way linux does.

On linux, the top half of the interrupt handler runs in interrupt context, 
either preempting a kernel thread on the kernel stack or on a separate 
interrupt stack depending on the particular architecture.

The bottom half of an interrupt handler is basically a softint that runs 
on a kernel stack.

In general, you register a top half handler to acknowledge the interrupt.  
If the interrupt has any notable amount of processing to do or needs to 
fiddle with locks, the top half schedules a bottom half interrupt to do 
that.  
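
With NetBSD's softint(9) that pattern looks roughly like the sketch below; 
the softc, handler names and the ack routine are placeholders.

/* At attach time: create the "bottom half". */
sc->sc_si = softint_establish(SOFTINT_NET, xx_softintr, sc);

/* Hardware interrupt handler: the "top half". */
int
xx_intr(void *arg)
{
        struct xx_softc *sc = arg;

        xx_ack_interrupt(sc);           /* hypothetical register poke */
        softint_schedule(sc->sc_si);    /* defer the real work */
        return 1;
}

/* Runs later on a kernel stack, below hardware interrupt priority. */
void
xx_softintr(void *arg)
{
        /* ... the bulk of the processing; may take locks, etc. ... */
}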

In addition to that you can have code that runs in a kernel thread, and 
you can have code that runs in the kernel in process context as the result 
of a system call.

It's been a while since I fiddled with interrupts on NetBSD, but ISTR we 
now schedule an interrupt thread to do all of the processing so there is 
no equivalent of the linux interrupt top half and interrupt bottom half.  
Or is that Solaris?  I forget.

Eduardo

Re: mutex(9) on sparc64 RMO [was Re: pserialize(9) vs. TAILQ]

2014-11-25 Thread Eduardo Horvath
On Mon, 24 Nov 2014, Taylor R Campbell wrote:

>Date: Mon, 24 Nov 2014 16:44:41 + (UTC)
>From: Eduardo Horvath 
> 
>I enhanced membar_ops with proper memory barriers and then was looking at 
>the mutex code.  Unfortunately, I didn't get very far.  It seemed at the 
>time that the mutex code has two hooks for memory barriers after the 
>atomic operations, however it's missing memory barrier hooks to ensure 
>consistency before accessing the lock.
> 
> What exactly is the consistency you need before accessing the lock?

I know you need a `membar #Lookaside' before accessing the atomic 
variable.  I don't recall if other memory barriers are needed since it's 
been a while since I last looked at the V9 architecture reference.

Eduardo


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Eduardo Horvath
On Sun, 23 Nov 2014, Dennis Ferguson wrote:

> On 23 Nov, 2014, at 01:01 , Martin Husemann  wrote:
> 
> > On Sat, Nov 22, 2014 at 01:24:42PM +0800, Dennis Ferguson wrote:
> >> I'll guess one problem is in sparc/mutex.h, here:
> >> 
> >>#define MUTEX_RECEIVE(mtx)  /* nothing */
> >>#define MUTEX_GIVE(mtx) /* nothing */
> >> 
> >> This works with TSO, but with RMO they need to somehow generate
> >> hardware memory barriers.  See arm/mutex.h for an example with
> >> them filled in.
> > 
> > Or src/sys/arch/sparc64/include/mutex.h - IIRC sparc v8 and earlier do not
> > have RMO.
> 
> Ah, got it.  I'd now guess that in
> 
>src/common/lib/libc/arch/sparc64/atomic/membar_ops.S
> 
> this comment
> 
> /* These assume Total Store Order (TSO) */
> 
> suggests the problem.

Yes, the existing code assumes TSO.  A while back I was looking into fixing
that.

I enhanced membar_ops with proper memory barriers and then was looking at
the mutex code.  Unfortunately, I didn't get very far.  It seemed at the
time that the mutex code has two hooks for memory barriers after the
atomic operations; however, it's missing memory barrier hooks to ensure
consistency before accessing the lock.
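
For reference, RMO-safe versions of the MUTEX_RECEIVE()/MUTEX_GIVE() hooks
quoted above would look roughly like the sketch below -- just the textbook
acquire/release barriers for V9, not a claim that nothing more is needed
before touching the lock word:

    /* Sketch only; not the actual sparc64 mutex.h. */
    #define MUTEX_RECEIVE(mtx)  /* after the atomic that takes the lock */ \
        __asm volatile("membar #LoadLoad | #LoadStore" ::: "memory")
    #define MUTEX_GIVE(mtx)     /* before the store that releases it */    \
        __asm volatile("membar #LoadStore | #StoreStore" ::: "memory")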

Eduardo


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Eduardo Horvath
On Fri, 21 Nov 2014, Dennis Ferguson wrote:

> On 21 Nov, 2014, at 00:22 , Eduardo Horvath  wrote:
> > Or you could try to get the kernel to run on a SPARC V9 machine running 
> > with RMO memory ordering.  There's a lot more of those around.  I'm not 
> > convinced the existing APIs are sufficient to get that working.
> 
> It would be worrying if the kernel wouldn't run that way.  The table from
> the McKenney paper, reproduced here
> 
> http://en.wikipedia.org/wiki/Memory_ordering
> 
> indicates that SPARC RMO potentially reorders the same operations as
> ARMv7 and POWER.  Someone else would have to comment on how well NetBSD
> runs on multiprocessor versions of the latter processors, but it is supposed
> to and I haven't noticed any code which omits required barriers for machines
> like that (unlike the Alpha, where missing barriers for its unique quirk are
> easy to find).  The API seems sufficient.

The last time I tried it the kernel would only run in TSO mode on SPARC 
machines.  Mutexes just didn't work.

Eduardo


Re: pserialize(9) vs. TAILQ

2014-11-20 Thread Eduardo Horvath
On Thu, 20 Nov 2014, Thor Lancelot Simon wrote:

> On Thu, Nov 20, 2014 at 01:05:05PM +0800, Dennis Ferguson wrote:
> > 
> > find hardware which behaves like this to test on).  I haven't
> > heard anything one way or the other concerning support for MP
> > Alphas, but the implicit message from the current state of
> > the code is that an MP Alpha isn't a NetBSD target.
> 
> We do run on a pretty good variety of multiprocessor Alpha systems.
> Whether it's worth the effort to ensure we continue to do so... well,
> it might be worth at least a policy of "do no harm" (in other words,
> declaring the necessary barrier operation and using it where we notice
> it is needed).

Or you could try to get the kernel to run on a SPARC V9 machine running 
with RMO memory ordering.  There's a lot more of those around.  I'm not 
convinced the existing APIs are sufficient to get that working.

Eduardo


Re: fsck_lfs

2014-07-14 Thread Eduardo Horvath

You probably want to do some testing of the roll-forward code.  Back when 
I was whacking on LFS I noticed that running fsck on a dirty filesystem 
appeared to cause more problems than it fixed.  And running multiple 
passes caused even more damage.

Eduardo

On Sat, 12 Jul 2014, Konrad Schroder wrote:

> I'm sure it does need a complete overhaul.  What I did with fsck_lfs was
> essentially to adapt the parts of fsck_ffs that check what is common to the
> two FSs (directories, inodes, block addresses, sizes, etc.) to LFS's way of
> locating inodes, as well as the additional constraint that blocks should not
> lie in free segments.
> 
> The reason for checking the filesystem before rolling forward is that,
> roll-forward being relatively untested, I wanted to check that the older
> checkpoint was consistent before checking the newer (remember that the newer
> checkpoint's consistency can't be taken for granted).  If the older checkpoint
> was not consistent, it should be fixed before rolling forward.  If it was
> consistent (either checked or, in the case of  fsck -p, assumed) but
> roll-forward between the two checkpoints failed, the older one was a valid
> state of the filesystem; if it succeeded, the newer checkpoint was a valid
> state of the filesystem.  Rolling forward past the newer checkpoint requires
> resizing inodes etc. and only makes sense in the context of a consistent file
> system.
> 
> So, I have no doubt that rewriting fsck_lfs from scratch and/or cleaning it up
> make perfect sense, but there is some reasoning in the madness too.  In
> particular, I don't agree that fsck_lfs should be limited to recreating the
> ifile; it needs to be able to recover as much as possible in the event of bad
> blocks as well.
> 
> The in-kernel roll forward is disabled because it is broken.  It worked
> briefly before LFSv2, but I never got back to fixing it after it broke.  I
> think Dr. Seltzer must be thinking of another OS, since 4.4lite2 did not
> contain any roll forward code at all. It's also worth asking whether the user
> should have control over when and whether roll-forward occurs, which is
> straightforward with a userland fsck but more difficult if it is in-kernel.
> 
> Take care,
> 
> -Konrad
> 
> On 07/12/2014 10:52 AM, David Holland wrote:
> > A long time ago (in )
> > you wrote:
> > 
> >   > I do disable fsck_lfs.  It usually causes more problems than it
> >   > solves.  It needs a complete overhaul.  It tries to act like
> >   > fsck_ffs instead of validating segment checksums and regenerating
> >   > the ifile.
> > 
> > A quick look at fsck_lfs with this in mind suggests that it's full of
> > bull, yes. For some reason it tries to check the fs *before* rolling
> > forward; it seems unlikely that this would ever work properly.
> > 
> > However, as far as I can readily tell the obvious problems are limited
> > to doing a full fsck, and all that the reboot time fsck -p does is
> > roll forward. Given that the kernel roll forward code is disabled by
> > default (does anyone know why? Margo Seltzer says it shouldn't be),
> > disabling boot-time fsck seems like a bad idea.
> > 
> > Unless the roll-forward code is broken, in which case it should be
> > fixed. I don't see any PRs on it though.
> > 
> > Anyhow, in the absence of any specific information, unless testing
> > turns up some issues, I'm inclined to revert the commit I just made
> > and re-enable fsck_lfs -p.
> > 
> 
> 


Re: Making tmpfs reserved memory configurable

2014-06-05 Thread Eduardo Horvath
On Thu, 5 Jun 2014, Martin Husemann wrote:

> On Thu, Jun 05, 2014 at 08:50:07AM -0700, Matt Thomas wrote:
> > 
> > can you try using freetarg?
> 
> Did that and it worked as well.
> 
> Does freetarg ever change after boot?

Maybe.  It's set in uvmpd_tune(), which may be called by the page daemon 
in some circumstances.  Take a look at sys/uvm/uvm_pdaemon.c.

Eduardo


Re: Making tmpfs reserved memory configurable

2014-06-05 Thread Eduardo Horvath
On Thu, 5 Jun 2014, Martin Husemann wrote:

> On Fri, May 30, 2014 at 04:56:01PM +0200, Martin Husemann wrote:
> > I have been on a quest to make the stock vax install CD (-image) usable on
> > VAX machines with 8 MB recently. (8 MB is the lowest I could persuade simh
> > to emulate, for 4 MB we will need a custom kernel anyway and for smaller
> > even a custom /boot - I will cover installing on those machines in an
> > upcoming improvement of the install docs).
> 
> Ok, this (much simpler) patch makes tmpfs work on low memory machines.
> Comments?

Have you tested this?

The way the old scanner used to work, it started when the number of free 
pages hit freemin, and continued scanning until the number of free pages 
hit freetarg.

Looking at the code, it appears now the page scanner starts running when 
the number of free pages goes below uvmpd.freetarg.  I'm a bit concerned 
that if tmpfs allocates enough pages to put the system permanently below 
the uvmpd.freetarg threshold, the page scanner will never stop running.  
I'm not sure if this would be a problem or not, so it would be good to put 
a system into that condition to know for sure.

Eduardo


Re: Making tmpfs reserved memory configurable

2014-05-30 Thread Eduardo Horvath
On Fri, 30 May 2014, Martin Husemann wrote:

> See mount_tmpfs(8), in the paragraph about the -s option:
> 
>Note that four megabytes are always reserved for the system and cannot
>be assigned to the file system.
> 
> Now, with a 3.2 MB text GENERIC kernel and 8 MB RAM, we certainly don't have
> 4 MB available at all - so tmpfs is not usable.

This just doesn't sound right.  Why is tmpfs reserving a fixed amount of 
RAM?  Shouldn't it be using uvmexp.freemin?  That's basically what we're 
reserving for emergencies.

RAM scaling is always a pain in the posterior.  The choices made for a 
system with 16MB RAM don't make sense for a system with 16GB RAM, and vice
versa.
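
To be concrete, something along these lines (a sketch only;
tmpfs_bytes_reserved() is a made-up name, but uvmexp.freemin is the real
counter, counted in pages):

    #include <uvm/uvm_extern.h>

    /* scale the reserve off the emergency free-page floor instead of
     * hard-coding four megabytes */
    static uint64_t
    tmpfs_bytes_reserved(void)
    {
        return (uint64_t)uvmexp.freemin * PAGE_SIZE;
    }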

Eduardo


Re: 4byte aligned com(4) and PCI_MAPREG_TYPE_MEM

2014-02-11 Thread Eduardo Horvath
On Tue, 11 Feb 2014, David Laight wrote:

> On Tue, Feb 11, 2014 at 04:19:26PM +0000, Eduardo Horvath wrote:
> > 
> > We really should enhance the bus_dma framework to add bus_space-like 
> > accessor routines so we can implement something like this.  Using bswap is 
> > a lousy way to implement byte swapping.  Yes, on x86 you have byte swap 
> > instructions that allow you to work on register contents.  But most RISC 
> > CPUs do the byte swapping in the load/store path.  That really doesn't 
> > map well to the bswap API.  Instead of one load or store operation to 
> > swap a 64-bit value, you need a load/store plus another dozen shift and 
> > mask operations.  
> > 
> > I proposed such an extension years ago.  Someone might want to resurrect 
> > it.
> 
> What you don't want to have is an API that swaps data in memory
> (unless that is really what you want to do).
> 
> IIRC modern gcc detects uses of its internal byteswap function
> that are related to memory read/write and uses the appropriate
> byte-swapping memory access.
> 
> I can see the advantage of being able to do byteswap in the load/store
> path, but sometimes that can't be arranged and a byteswap instruction
> is very useful.

When do you ever really want to byte swap the contents of one register to 
another register?  Byte swapping almost always involves I/O, which 
means reading or writing memory or a device register.  In this case we 
are specifically talking about DMA, in which case there is always a load 
or store operation involved.

The current API we have using the bswap routines is a real pain in the 
neck for DMA.  You really want the byte swaps to happen when needed.  They 
should be controlled by the DMA attributes of the device you're talking to 
along with the characteristics of the CPU and page in question.  A 
big-endian CPU talking to a device that runs only little-endian needs to 
do byte swapping when accessing DMA structures.  But what if the device 
can also support big-endian DMA?  So each driver needs to determine 
whether it needs to do byte swapping during setup time and have code to 
conditionally byte swap data if needed for each access to a structure that 
needs DMA.

> I really can't imagine implementing it being a big problem!

Yes, it's a big problem.  For a 2 byte swap you need to do 2 shift 
operations, one mask operation (if you're lucky) and one or operation.  
Double that for a 4 byte swap.  And even if you argue that a dozen CPU 
cycles here or there don't make much difference, the byte swap code is 
replicated all over the place since the routines are macros, so you're 
paying for it with your I$ bandwidth.
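
For concreteness, this is roughly what the portable C fallback expands to
at every call site when the swap can't be done in the load/store path
(generic code, not any particular port's bswap.h):

    #include <stdint.h>

    static inline uint16_t
    my_bswap16(uint16_t x)
    {
        return (uint16_t)((x << 8) | (x >> 8));     /* 2 shifts, 1 or */
    }

    static inline uint32_t
    my_bswap32(uint32_t x)
    {
        return ((x & 0x000000ffU) << 24) |          /* 4 shifts, 3 masks, */
               ((x & 0x0000ff00U) <<  8) |          /* 3 ors -- all in    */
               ((x & 0x00ff0000U) >>  8) |          /* registers at every */
               ((x & 0xff000000U) >> 24);           /* call site          */
    }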

Eduardo


Re: 4byte aligned com(4) and PCI_MAPREG_TYPE_MEM

2014-02-11 Thread Eduardo Horvath
On Tue, 11 Feb 2014, Izumi Tsutsui wrote:

> mrg@ wrote:
> 
> (now completely off topic)
> 
> > > > FYI:  not all -- sparc64 either maps PCI space as little
> > > > endian or uses little endian accesses, both of which give
> > > > you the byte swapped data directly.
> > > 
> > > Even in that case, the hardware checks access width and
> > > all byte accesses can be done at the same address
> > > as the little endian machines by complicated hardware
> > > (i.e. #if BYTE_ORDER in the patch won't work), right?
> > 
> > right.  i just wanted to point out that sometimes the byte
> > swapping occurs in hardware.
> 
> I also wonder if there is any OS implementation that enables
> that feature for DMA and how the hardware designers considered
> about possible use cases.
> 
> Most bus masters use word access even on fetching byte data, and
> there are also many implementation (memcpy(9) etc) which use word
> access even against stream data (which shouldn't be byte-swapped)
> so access width detection for byteswap won't work as expected.

I think Solaris does, but it's been a while since I looked at the guts of 
DDI.

We really should enhance the bus_dma framework to add bus_space-like 
accessor routines so we can implement something like this.  Using bswap is 
a lousy way to implement byte swapping.  Yes, on x86 you have byte swap 
instructions that allow you to work on register contents.  But most RISC 
CPUs do the byte swapping in the load/store path.  That really doesn't 
map well to the bswap API.  Instead of one load or store operation to 
swap a 64-bit value, you need a load/store plus another dozen shift and 
mask operations.  

I proposed such an extension years ago.  Someone might want to resurrect 
it.
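
To give a flavour of what I mean -- purely hypothetical, nothing like this
exists in bus_dma(9) today:

    /*
     * Hypothetical accessors mirroring bus_space(9): the driver asks for a
     * field of a DMA-able structure and the tag/map decide whether to swap,
     * so MD code can use a byte-swapping load instead of shifts and masks.
     */
    uint32_t bus_dmamem_read_4(bus_dma_tag_t t, bus_dmamap_t m,
        bus_size_t off);
    void     bus_dmamem_write_4(bus_dma_tag_t t, bus_dmamap_t m,
        bus_size_t off, uint32_t v);

    /* a driver would then do something like (MYDEV_RXD_STATUS is made up) */
    status = bus_dmamem_read_4(sc->sc_dmat, sc->sc_rxmap,
        MYDEV_RXD_STATUS(i));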

Eduardo


Re: [Milkymist port] virtual memory management

2014-02-10 Thread Eduardo Horvath
On Sun, 9 Feb 2014, Yann Sionneau wrote:

> Thank you for your answer Matt,
> 
> > On 09/02/14 19:49, Matt Thomas wrote:
> > On Feb 9, 2014, at 10:07 AM, Yann Sionneau  wrote:


> > > Since the kernel runs with MMU on, using virtual addresses, it cannot
> > > dereference physical pointers then it cannot add/modify/remove PTEs,
> > > right?
> > Wrong.  See above.
> You mean that the TLB contains entries which map a physical address to itself?
> like 0xabcd. is mapped to 0xabcd.? Or you mean all RAM is always
> mapped but to the (0xa000.000+physical_pframe) kind of virtual address you
> mention later in your reply?

What I did for BookE is reserve the low half of the kernel address space 
for VA=PA mappings.  The kernel resides in the high half of the address 
space.  I did this because the existing PPC port did strange things with 
BAT registers to access physical memory and copyin/copyout operations and 
I couldn't come up with a better way to do something compatible with the 
BookE MMU.  It did limit the machine to 2GB RAM, which wasn't a problem 
for the 405GP.

Also, the user address space is not shared with the kernel address space 
as on most machines.  Instead, user processes get access to their own 4GB 
address space, and the kernel has 2GB to play with when you deduct the 2GB 
VA==PA region.  (It's the same sort of thing I did for sparc64 way back 
when it was running 32-bit userland.  But it doesn't need VA==PA mappings 
and can access physical and userland addresses while the kernel address 
space is active.  Much nicer design.)

When a BookE machine takes an MMU miss fault, the fault handler examines
the faulting address: if the high bit is zero, it synthesizes a TLB entry
where the physical address is the same as the virtual address.  If the
high bit is set, it walks the page tables to find the TLB entry.
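
In (very simplified) C the policy looks like this; the real handler is of
course assembly, and the helper names here are invented:

    void
    tlb_miss(vaddr_t fault_va)
    {
        if ((fault_va & 0x80000000) == 0) {
            /* low half of the address space: VA == PA direct map */
            tlb_load(fault_va, (paddr_t)fault_va, KERN_RW_PERMS);
        } else {
            /* high half: walk the (physically addressed) page tables */
            pte_t pte = pte_lookup(fault_pmap(fault_va), fault_va);
            tlb_load(fault_va, pte_to_pa(pte), pte_to_perms(pte));
        }
    }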

This did make the copyin/copyout operations a bit complicated since it 
requires flipping the MMU between two contexts while doing the copy 
operation.

> > > Also, is it possible to make sure that everything (in kernel space) is
> > > mapped so that virtual_addr = physical_addr - RAM_START_ADDR +
> > > virtual_offset
> > > In my case RAM_START_ADDR is 0x4000 and I am trying to use
> > > virtual_offset of 0xc000 (everything in my kernel ELF binary is mapped
> > > at virtual address starting at 0xc000)
> > > If I can ensure that this formula is always correct I can then use a very
> > > simple macro to translate "statically" a physical address to a virtual
> > > address.
> > Not knowing how much ram you have, I can only speak in generalities.
> I have 128 MB of RAM.
> > But in general you reserve a part of the address space for direct mapped
> > memory and then place the kernel about that.
> > 
> > For instance, you might have 512MB of RAM which you map at 0xa000.
> > and then have the kernel's mapped va space start at 0xc000..
> So if I understand correctly, the first page of physical ram (0x4000.) is
> mapped at virtual address 0xa000. *and* at 0xc000. ?
> Isn't it a problem that a physical address is mapped twice in the same process
> (here the kernel)?
> My caches are VIPT, couldn't it generate cache aliases issues?

If the MMU is always on while the kernel is running, and covers all of the 
KVA, then you could relocate the kernel text and data segments wherever 
you want them to be.  If you want to put the kernel text and data segments 
in the direct-mapped range, you can easily do that.  If you want it 
elsewhere, that should work too.  

The cache aliasing issues in VIPT caches only occur if the cache way size 
is larger than the page size.  If you're designing your own hardware, 
don't do that.  Otherwise, remember to only access a page through a single 
mapping and you won't have aliasing issues.  And flush the page from the 
cache whenever establishing a new mapping.

Eduardo

Re: SAS tape drives

2013-12-11 Thread Eduardo Horvath
On Wed, 11 Dec 2013, Mark Davies wrote:

> On Wed, 11 Dec 2013, Eduardo Horvath wrote:
> > Last time I fiddled around with the LSI MegaRAID stack it did not
> > provide any sort of transparent access to attached devices.  Can
> > you create a LUN with the tape device?
> > 
> > You might have more success with the LSI MPT stack.  That at least
> > provides transparent access to the target devices.  I don't know
> > whether mpt hooks into NetBSD's scsipi layer in a way that it will
> > attach non-disk devices, but I suspect it would.
> 
> I used the MegaRAID card as that was what we had lying around but I 
> could buy something else.
> 
> I found this: 
> https://www.ascent.co.nz/productspecification.aspx?ItemID=413461
> but that seems to use the mps driver on FreeBSD and NetBSD doesn't 
> have it.  Any guess on how hard it would be to port?

Keep in mind that some of LSI's chips can run either MegaRAID or MPT, you 
can download whichever firmware image you want.  Since MegaRAID requires 
more resources than MPT, I think you could probably "downgrade" the card 
to mpt.

Eduardo


Re: SAS tape drives

2013-12-10 Thread Eduardo Horvath
On Wed, 11 Dec 2013, Mark Davies wrote:

> Are SAS tape drives supported in NetBSD?
> 
> I have an LSI MegaRAID SAS card with an HP LTO5 SAS drive attached.  
> The card's WebBIOS can see the tape attached and NetBSD can see the 
> LSI card but NetBSD show no evidence of seeing the tape drive (not 
> even as an unconfigured device).

Last time I fiddled around with the LSI MegaRAID stack it did not provide 
any sort of transparent access to attached devices.  Can you create a LUN 
with the tape device?  

You might have more success with the LSI MPT stack.  That at least 
provides transparent access to the target devices.  I don't know whether 
mpt hooks into NetBSD's scsipi layer in a way that it will attach non-disk 
devices, but I suspect it would.

Eduardo


Re: posix_fallocate

2013-11-19 Thread Eduardo Horvath
On Tue, 19 Nov 2013, Christoph Badura wrote:

> On Mon, Nov 18, 2013 at 12:31:41PM +1100, matthew green wrote:
> > i would buy this argument if mmap()ing a large sparse file
> > and filling it up randomly (but with relatively large chunks
> > at a time) did not lead to severely fragmented files that
> > can take 10x to read, vs one written with plain sequential
> > write() calls.
> 
> There's another option that should avoid that behaviour: make the file
> system place the sparse blocks approximately where it would place them
> were they written in sequential order.  One could do the same also
> after an lseek() that creates a hole.
> 
> Such a change should be relatively straightforward for file systems like
> UFS that limit the amount of data that is allocated in a cylinder group on
> sequential writes.

Or... LFS doesn't allocate the actual location of disk blocks until it 
starts the write operation.  ISTR FFS allocates disk locations when the 
data blocks are created.  Maybe FFS should do what LFS does.  It's easier 
to make the blocks contiguous if they're all allocated at the same time.

Eduardo


Re: mpt device shuffling

2013-10-21 Thread Eduardo Horvath
On Sat, 19 Oct 2013, Edgar Fuß wrote:

> Strictly speaking, this is not a NetBSD kernel issue.
> However, I hope that someone more familiar with mpt(4) has come accross that 
> MPT "feature" before:
> 
> One additional oddity I faced with Thursday's disc failure was that after 
> physically replacing the failed disc with a spare, the SAS controller decided 
> to assign a new pseudo target id to the new disc, skipping the old disc's 3
> and using 8 for the replacement (still at PHY 3). Since the kernel assigns 
> sd and, thus, dk numbers in the order of increasing SCSI target ids, this 
> meant 
> my RAIDframe components now became dk2, 6, 3, 4, 5 (not 2, 3, 4, 5, 6 as 
> before). Not a big deal, but somewhat confusing at midnight.
> I assume that's intentional behaviour to the benefit of some commercial 
> or GPL-licenced OS software. The renumbering was persistent accoss both a 
> hard 
> reset and a soft power cycle (I didn't try a hard power cycle).
> Is there a way of reverting to the old sequence or disabling that nonsense 
> in the first place?

Actually, it's for Windoze.

When I was at Sun we had lots of fun with this quirk of the mpt firmware.  

mpt pretends to be a SCSI HBA because most operating systems are not 
capable of handling SAS the way it's supposed to be, a fabric where you 
identify devices not by their location but by the GUID or disklabel.  In 
order to make it easier to write a driver, the mpt firmware does the job 
that the OS should be doing, keeping a table of device GUIDs and `device 
ID' mappings in PROM or NVRAM.  The `device ID' is what mpt uses to 
identify the disk to the driver.

There are several numbering schemes the firmware implements.  One of them 
assigns all the IDs based on the device's GUID.  Another one assigns the 
`device ID' based on the port number for directly attached devices.  That 
way if you swap out a broken device with a new one, the new one will have 
the same `device ID' as the one you removed.  It still falls back 
to the GUID for devices on the other side of a SAS switch.

There should be a way to alter the `device ID' assigned to a drive with 
both the mpt BIOS and the command line utility (whose name escapes me at 
the moment.)  There also should be a way to change the behavior for the 
assignment of `device ID's to directly attached devices by changing some 
values in the PROM, but this may be only possible with the command line 
utility.  And, of course, LSI kept changing how this worked across 
different versions of the MPT firmware so I can't give you details about 
how your particular setup works.

Eduardo

Re: [Milkymist-devel] [Milkymist port] virtual memory management

2013-06-05 Thread Eduardo Horvath
On Wed, 5 Jun 2013, Yann Sionneau wrote:

> But I will definitely think about adding ASID as a first improvement to the
> MMU when everything will be working with the current design :)

I don't see the point of making major architectural changes to the MMU 
incrementally.  These features affect choices made in the architecture of the 
VM subsystem.  I don't think you want to re-design and re-implement that 
code multiple times.  Better to just implement the hardware and figure 
out how to use it once, then generate the code.

Eduardo


Re: [Milkymist-devel] [Milkymist port] virtual memory management

2013-06-03 Thread Eduardo Horvath
On Mon, 3 Jun 2013, Yann Sionneau wrote:

> Thank you all for your answers :)

No prob.
 
> On 30/05/13 22:45, Eduardo Horvath wrote:
> > On Wed, 29 May 2013, Yann Sionneau wrote:
> > 
> > > Hello NetBSD fellows,
> > > 
> > > As I mentioned in my previous e-mail, I may need from time to time a
> > > little
> > > bit of help since this is my first "full featured OS" port.
> > > 
> > > I am wondering how I can manage virtual memory (especially how to avoid
> > > tlb
> > > miss, or deal with them) in exception handlers.
> > There are essentially three ways to do this.  Which one you chose depends
> > on the hardware.
> > 
> > 1) Turn off the MMU on exception
> That sounds like the better thing to do from my point of view, I don't see any
> big downside apart from having to duplicate a few pointers in a few structures
> to have both virtual and physical addresses of some data structures like PCB,
> trapframes, page tables.
> 
> Is there any big downside I don't see there?

Donno.  Depends on your hardware.  I'm aware of some hardware designs 
where you need to have the MMU enabled to use the caches.  If 
your hardware doesn't have any weird quirks like that then turning off the 
MMU may well be a good solution.

> If not, I would pick this solution to implement the virtual memory subsystem.
> > 2) Keep parts of the address space untranslated
> I could modify the MMU to do that, but I would prefer keeping the entire 4 GB
> address space for user space :)

Yeah.  Well.  Thing is, some architectures, like MIPS, implicitly assume 
this sort of thing, so their ports implement things like 
copyin()/copyout() using that feature.

> > 3) Lock important pages into the TLB
> That's pretty easy to do for locking exception vectors in ITLB since vectors
> are contiguous in memory.
> Locating every data I need to access to from exception vectors in the same
> couple of pages is not so easy I guess.
> Moreover, locking a few TLB entries would mean that those virtual addresses (
> a few pages ) would not be mappable for user space use since I could not eject
> those entries from the TLB.
> I would prefer not locking any TLB entry to have it all available for mapping
> things for user space.

So your MMU doesn't support multiple address space IDs?  That sux.  That 
means you need to blow away the entire MMU each time you switch processes.

If you do have ASIDs, I like to reserve one for the kernel.  That way you 
don't need to share the address space with userland.  

Eduardo

Re: [Milkymist port] virtual memory management

2013-05-30 Thread Eduardo Horvath
On Wed, 29 May 2013, Yann Sionneau wrote:

> Hello NetBSD fellows,
> 
> As I mentioned in my previous e-mail, I may need from time to time a little
> bit of help since this is my first "full featured OS" port.
> 
> I am wondering how I can manage virtual memory (especially how to avoid tlb
> miss, or deal with them) in exception handlers.

There are essentially three ways to do this.  Which one you chose depends 
on the hardware.

1) Turn off the MMU on exception

2) Keep parts of the address space untranslated

3) Lock important pages into the TLB


Turning off the MMU is pretty straight-forward.  ISTR the PowerPC Book E 
processors do this.  Just look up the TLB entry in the table and return 
from exception.  You just have to make sure that the kernel manages page 
tables using physical addresses.

Keeping parts of the address space untranslated is what MIPS does.  2GB 
goes through the MMU for user space, 1GB is kernel virtual addresses, and 
512MB is untranslated VA->PA (+offset).  In this scheme usually the 
kernel text is untranslated but kernel data is virtual.  If the hardware 
doesn't support untranslated parts of the address space, you can fake it 
by checking the address in the TLB miss handler and generating an entry 
based on the fault address.  (I did something similar to that for the 
PowerPC Book E port 'cause parts of the MD code make assumptions about 
accessing untranslated data.)  

This design does have some disadvantages.  User address space is reduced 
by both the kernel address space and the direct mapped address space.  You 
can't distinguish pages mapped into the kernel from random memory so 
kernel core dumps involve dumping all of RAM, not just kernel space.  It's 
also difficult to properly protect kernel text and data structures if you 
can scribble on random physical addresses.

If your MMU supports huge pages, you can lock the trap table and important 
bits of the kernel text and data segment in the TLB.  When I did the 
sparc64 port I used one 4MB TLB entry for kernel text, and another 4MB 
entry for kernel data.  After a while the kernel text overflowed the 4MB 
page and we needed to sacrifice a couple more TLB entries.  

I like this scheme since it allows 32-bit processes to use all 4GB of 
address space, but I did use some interesting features of the SPARC 
instruction set that allow load/store operations on alternate address 
spaces and physical addresses without needing to fiddle with the MMU.

So you can implement your VM subsystem several different ways.  Just 
remember that on a TLB-only machine you need to make sure the MMU handlers 
can access the page tables, either bypassing the MMU or using some trick 
that does not result in recursive TLB faults.

Eduardo




Re: Using uvmhist without ddb?

2013-02-21 Thread Eduardo Horvath
On Thu, 21 Feb 2013, Brian Buhrow wrote:

>   Hello.  I'm working on an issue with NetBSD-5 that may involve a
> problem with error paths in uvm.  I'd like to  use the uvmhist facilities
> in NetBSD to see if I can help track the issue down.  However, the machine
> on which I'm doing this work doesn't support the convenient use of ddb(4).
> Is there a way to get  uvm to print its statistical history without having
> to drop the machine into ddb?

Look at UVMHIST_PRINT.

Eduardo


Re: pmap_enter(9) rework

2013-02-01 Thread Eduardo Horvath
On Sat, 2 Feb 2013, Toru Nishimura wrote:

> I feel boring that pmap_enter(9) can not avoid to
> have goto jumps for the logic simplicity.  This indicates
> pmap_enter(9) is mistakenly designed and used for
> multiple purposes in parallel.  Rework is seriously
> requested..

I've always felt the p->v mapping should be managed by higher level code 
and that pmap(9) should be allowed to forget any mappings that aren't 
wired.  Implementing all that in VM code would be more maintainable than 
having each pmap have to manage all that.  It makes porting a pmap layer 
onto a new MMU a pain.

Eduardo


Re: uvn_fp2 [was: Help with issue with mpt(4) driver]

2013-01-29 Thread Eduardo Horvath
On Mon, 28 Jan 2013, Brian Buhrow wrote:

> (gdb) print pg
> $1 = (struct vm_page *) 0xc40c4cd0
> (gdb) print *pg
> $2 = {rb_node = {rb_nodes = {0x0, 0x0}, rb_info = 3275287704}, pageq = {
> queue = {tqe_next = 0xc338ec98, tqe_prev = 0xc1425ad4}, list = {
>   le_next = 0xc338ec98, le_prev = 0xc1425ad4}}, listq = {queue = {
>   tqe_next = 0xc338ec98, tqe_prev = 0xc24efd8c}, list = {
>   le_next = 0xc338ec98, le_prev = 0xc24efd8c}}, uanon = 0x0, 
>   uobject = 0xd37c6684, offset = 18808832, flags = 140, loan_count = 0, 
>   wire_count = 0, pqflags = 512, phys_addr = 3140771840, mdpage = {mp_pp = {
>   pp_lock = {u = {mtxa_owner = 1537}}, pp_u = {u_pte = {pte_ptp = 0x0, 
>   pte_va = 3504025600}, u_head = {pvh_list = {lh_first = 0x0}}, 
> u_link = 0x0}, pp_flags = 1 '\001', pp_attrs = 7 '\a'}}}

If I did my math right, flags of 140 is 0x8c which is PG_TABLED, PG_CLEAN, 
and PG_RDONLY.

Since the PG_BUSY bit is not set the page is not locked.
And the lack of PG_WANTED means there should be no waiters on the page.

Hm.  Is it possible we have a condition where there are multiple waiters 
for a page, but when the waiters are woken up, one of them grabs 
the page but PG_BUSY and PG_WANTED bits are cleared and the other waiters 
are forgotten?

Anyway, you may want to enable UVMHIST in the kernel and look at the logs.  
They should tell you the sequence of operations on that page, assuming the 
logs don't roll over.  (You may need to do some kernel hacking 'cause last 
time I tried UVMHIST it had initialization issues which required 
reordering things in init_main.c.)  [What you're doing now is like trying 
to reconstruct an airplane collision just from the debris left on the 
ground.  You need to enable the flight recorder.]

Eduardo


Re: Help with issue with mpt(4) driver

2013-01-28 Thread Eduardo Horvath
On Sat, 26 Jan 2013, Brian Buhrow wrote:

> Hello.  I believe Patrick may be on to something.  Further
> investigation into my mpt(4) issues reveals that while there are still some
> steps I can take to make the mpt(4) driver more robust when it comes to
> recovering from LSI errors, I believe this particular problem is, strictly
> speaking, outside the mpt(4) driver.  The work I'm doing exaserbates the
> issue, but I can say, with a high degree of confidence, that when I see
> this filesystem lock up state, it's not because the mpt(4) driver lost some
> request.  Rather, I think, requests to the  driver from the filesystem
> layer got reordered and some may have passed some time threshold, from
> which the filesystem layer never recovered.  On other NetBSD systems, I
> often see processes get stuck in uvn_fp2 wait states for long periods of
> time, when, apparently, the machine is doing nothing.  Then, after some
> indeterminate amount of time, some thread somewhere wakes up, notices the
> problem, and things take off again.  I think the trick here is to figure
> out what, exactly, is being waited on, and, once that's done, we'll be able
> to figure out what's really going on.  I suspect that once we find this
> issue, we'll actually solve a number of performance issues which haven't
> been fatal, but which have been troubling a lot of folks in one way or
> another.
> right now, I have a machine in this file system lockup state, and it
> has a fully symboled debug kernel on it.  Can anyone provide  some examples
> of how I should go about looking at the various locks using gdb?
> Specifically, how do I get a look at the lock that's being blocked in the
> uvn_fp2 state?  I saw some examples from Chuck earlier in this thread, but
> I think that was using ddb.  Some help with gdb would be helpful if someone
> has some script snippets they care to share.

Locks won't help.  They probably aren't being held.  Just dump the state 
of the page structure and look at the flags.  Then go to the containing 
vnode and associated inode and dump the state.  And if we can find an 
associated buf structure as well, that should describe the pending I/O 
operation.

Eduardo


Re: uvn_fp2 [was: Help with issue with mpt(4) driver]

2013-01-28 Thread Eduardo Horvath
On Sun, 27 Jan 2013, Patrick Welche wrote:

> 
> More details - so we now know that the page is BUSY. Sadly an
> attempt at a core dump to a separate disk failed and this is
> before I connect the serial port block to the motherboard...
> 
> Cheers,
> 
> Patrick
> 
> 
> PIDLID S CPU FLAGS  STRUCT LWP * NAME WAIT
> 133226 3   3 0   fe82 013f 42c0  vlc  uvn_fp2
> 
> vlc: proc  fe81 702c 4cf8  vmspace/map  fe81 c79f e5c8   flags 4000
> lwp 6   fe82 013f 42c0  pcb  fe81 21ff 5d80
>   stat 3 flags 0 cpu 3 pri 43
>   wmesg uvn_fp2  wchan  8000 0827 2390
> 
> PAGE 0x 8000 0827 2390:
>   flags=0x4f, pqflags=0x0, wire_count=0,
>   pa=0x4468 8000
>   uobject=0x fe81 8ef3 9270, uanon=0x0, offset=0x8ee  loan_count=0
>   [page ownership tracking disabled]
>   checking object list
>   page found on object list

Hm.  BUSY usually means I/O in progress, and FAKE indicates the page is 
still being initialized, probably reading data from disk.  

WANTED was probably set by the thread in uvn_fp2 state so it'll get woken up.  

TABLED means it's on an object's list.


> 
> OBJECT 0x fe81 8ef3 9270:
>   locked=0, pgops=0x  808f d9e0, npages=38064, refs=1
> 
> VNODE flags 0x30
> mp 0x fe81 0e37 a008  numoutput 0  size 0x1fe3d08f  writesize 0x1fe3d08f
> data 0x fe81 5ea4 4900  writecount 0  holdcnt 6
> tag VT_UFS(1) type VREG(1)
> mount0x fe81 0e37 a008
> typedata 0x fe81 8a0f 74c8
> v_lock   0x fe81 8ef3 9380

Let's see.  

VREG and VT_UFS indicate it's a normal file on a UFS filesystem.

numoutput of 0 means there are no pending writes on this vnode.

Isn't there a /f flag to dump more vnode information?

If we had the inode number, we could hunt it down from the mountpoint to 
figure out which file this is.


> 
> v_uobj.vmobjlock lock details:
> lock address : 0x fe80 5d68 6640type: sleep/adaptive
> initialized  : 0x  8046 52a5
> shared holds = shared wanted = current cpu = 0
> current lwp  : 0x fe82 1f55 3840  last held : 0   <--- LWP different...
> last locked  : 0x  8074 a6d2  unlocked*: 0x  8074 a708
> owner field  : 0 wait/spin : 0/0

Looks like the lock is free at the moment.  


> 
> Turnstile chain at 0x  80c4 7e40
> => No active turnstile for this lock.
> 
> 
> lwp_t  fe82 1f55 3840
> system: pid 0  proc   80c3 1000 vmspace/map   80c6 a980 flags 
> 20002
>   lwp 2 [idle/0]   fe82 1f55 3840  pcb  fe81 0e07 8d80
> stat 7 flags 201 cpu 0 pri 0
> 
> mount:
>   flag=MNT_LOG,MNT_LOCAL
>   iflag=IMNT_MPSAFE,IMNT_HAS_TRANS,IMNT_DTYPE
>   locked vnodes = 0x fe81 8ef3 9270
> 
> db{0}> show ncache fe818ef39270
> name not found
> 
> lock address : 0x fe81 8ef3 9380  type: sleep/adaptive
> initialized  : 0x  8077 c329
> shared holds : 1  exclusive : 0
> shared wanted: 0  exclusive : 0
> current cpu  : 0  last held : 3
> current lwp  : 0x fe82 1f55 3840  last held : 0x fe82 013f 42c0
> last locked  : 0x  802a b74c  unlocked* : 0x  802a b7bd
> owner/count  : 0x10  flags: 0
> 
> Turnstile chain at 0x  80c4 80c0
> => No active turnstile for this lock.
> 
> 
> db{0}> reboot 0x104
> usbd_do_request: not in process context
> usbd_do_request: not in process context
> usbd_do_request: not in process context
> 


Re: Help with issue with mpt(4) driver

2013-01-21 Thread Eduardo Horvath
On Mon, 21 Jan 2013, Patrick Welche wrote:

> I have just been experiencing filesystem lock-up with a process in
> uvn_fp2, so it may be unrelated to you mpt fiddling... That systems
> disks are on ahcisata.
> 
> It can withstand builds of the world, but not GraphicsMagick:
> 
> struct proc *   fe81 4c8c 1d10
> uarea * fe81 3d6d 5d80
> vmspace/vm_map  fe81 0f23 2470
> 
> PID   LID S CPU FLAGS   STRUCT LWP *   NAME WAIT
> 917 4 3   1  1080   fe8147dac180 gm psem
> 917 3 3   280   fe8147dac5a0 gm psem
> 917 2 3   3  1080   fe8147dac9c0 gm psem
> 917 1 3   4  1000   fe813b57e160 gm uvn_fp2
> 
> wmesg psem wchan fe81106a1658
> 
> lwp 1 fe813b57e160 pcb fe813d6c9d80
>   stat 3 flags 1000 cpu 1 pri 43
>   wmesg uvn_fp2 wchan 803e33f8
> 
> ? VNODE flags 0x30
> v_lock  fe81 1427 0280
> 
> but I'm not sure what to look for...

Grab the wchan for the thread waiting in uvn_fp2 and dump it as a 
vm_page_t.  
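
In gdb against the live kernel that would be something like (using the
wchan value printed above):

    (gdb) set $pg = (struct vm_page *) 0x803e33f8
    (gdb) print *$pg
    (gdb) print/x $pg->flags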

Eduardo


Re: Help with issue with mpt(4) driver

2013-01-15 Thread Eduardo Horvath
On Mon, 14 Jan 2013, Brian Buhrow wrote:

>   Hello.  I'm working on some patches to make the LSI Fusion SCSI driver
> (mpt(4)) more robust.  I'm making good progress, but I've run into an
> issue that has momentarily baffled me.  If I get a bunch of concurrent jobs
> running on a filesystem mounted on a raid set  using disks across two
> mpt(4) instances, they get into a state where they become deadlocked and
> all but one of the processes is stuck in tstile, and the other remaining
> process is in uvn_fp2.  All the processes are trying to read the same file
> in the filesystem, not write it, but read it.  I have a debug version of
> the kernel, and the machine is running, and other operations against the
> filesystem work fine and complete successfully. I'm assuming the problem is
> something I've introduced into the mpt(4) driver, though I'm not sure how
> at the moment, since I've not been able to reproduce it in an alternative
> environment.
>   When a process gets into uvn_fp2 state, it's waiting for something to
> find its pages.  Is there a way to figure out what it's waiting for and
> which underlying kernel process the uvn_fp2 call is  expecting to wake it
> up?
> 
> Any help on this issue would be greatly appreciated.  I can give a lot more
> details if someone is interested.

If you take a look in uvn_findpage() you'll see that the wait address for 
uvn_fp2 should be the page structure itself.  You can dump the page 
structure and look at the flags and the lock structure to figure out what 
state it's in.

Given that you're fiddling around with mpt, the most likely reason for 
this sort of behavior is that a disk transaction has been lost.  The 
operation may have been lost because of some locking issue in the 
completion callback, but most likely the firmware lost track of the 
operation.  

If you're writing a SCSI driver properly, you should have a list of all 
outstanding operations, and each should have a timeout associated with it 
so the driver can determine it's been dropped somewhere and can be aborted 
and retried.  The NetBSD mpt driver does not appear to do that.  This 
tends to be a problem with LSI's drivers.  They like to assume that the 
firmware is faultless, something that is usually not the case.

I generally allocate an array for outstanding commands and use the array 
index for the identifier I give to the firmware.  Of course, this does 
put a hard limit on the number of outstanding commands at any one time.  
But if the array fills up it can be reallocated on the fly without losing 
outstanding command IDs.  

You also need to be careful with command timeouts on certain devices.  
While a one or two minute timeout should be plenty for a disk type device, 
some operations on SCSI tape drives can take hours to complete.
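
A rough sketch of that bookkeeping (illustrative names only, not the
actual mpt(4) code; the 60-second watchdog assumes a disk, per the caveat
above):

    struct my_ccb {                     /* one per outstanding command */
        struct scsipi_xfer *ccb_xs;     /* the request being tracked */
        struct callout  ccb_timeout;    /* per-command watchdog */
        int             ccb_index;      /* == the ID handed to the firmware */
        bool            ccb_busy;
    };

    /* when issuing a command: */
    ccb->ccb_busy = true;
    callout_reset(&ccb->ccb_timeout, 60 * hz, my_cmd_timeout, ccb);
    /* ... hand ccb->ccb_index to the firmware ... */

    /* on completion -- or in my_cmd_timeout() if the firmware lost the
     * command and it has to be aborted and retried: */
    callout_stop(&ccb->ccb_timeout);
    ccb->ccb_busy = false;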

Eduardo


Re: "Hijacking" a PCU.

2012-12-17 Thread Eduardo Horvath
On Sat, 15 Dec 2012, Martin Husemann wrote:

> I did some statistics and did not find a single lwp exiting that had no
> fpu state allocated. Maybe some libc changes could change this (if we want),
> but on the other hand this clearly points out that the better solution for
> this instance of a per cpu unit would be to have it preallocated in every
> lwp.

That's probably due to memcpy()/memset().  Maybe the FPU in use bits 
should be cleared at the end of memcpy() and memset() so the kernel 
doesn't have to save the state.

Eduardo


Re: [RFC][PATCH] _UC_TLSBASE for all ports

2012-08-13 Thread Eduardo Horvath
On Sat, 11 Aug 2012, Matt Thomas wrote:

> On Aug 11, 2012, at 10:35 AM, Thor Lancelot Simon wrote:
> 
> > On Sat, Aug 11, 2012 at 06:45:12AM +, Christos Zoulas wrote:
> >> 
> >> It is a slippery slope, but I think in this case it is wise to bend.
> >> If we cannot reach agreement here, consult core.
> > 
> > I see no point bending NetBSD into knots in this case if the resulting
> > performance is as bad as Joerg claims it will be.  Is it actually the
> > case that our *context() functions are almost as heavy as a full
> > kernel-level thread switch?
> 
> I'm wondering if we need a new makecontext which can allocate a new
> private thread-local area.  We can set the stack via uc_stack but
> there isn't a way to allocate a new thread-local area.
> 

I think this whole thing needs a careful redesign.  IMHO the reason we 
never got scheduler activations stable across all architectures is that 
the semantics of the *context() routines were never properly specified.  

If I knew what those routines were supposed to do I might have been able 
to fix them.  But as it was implementation-defined functionality...

Eduardo


Re: SAS scsibus target numbering

2012-07-26 Thread Eduardo Horvath
On Thu, 26 Jul 2012, Mouse wrote:

> > it's usual for the SCSI HBA to assign a targetID for itself.
> 
> For real SCSI - ie, non-SAS - it's actually necessary; the protocols
> used for initiators and targets to speak with one another require a
> line for the initiator as well as for the target.  But the host is
> usually ID 7.

ISTR SPARC machines used an initiator address 6 for hysterical reasons.  I 
forget why.

> Perhaps this is a SAS difference?

SAS doesn't use target IDs on the wire.  The "target ID" is something LSI 
came up with to provide backwards compatibility with parallel SCSI.   

Eduardo


Re: SAS scsibus target numbering

2012-07-26 Thread Eduardo Horvath
On Thu, 26 Jul 2012, Edgar Fuß wrote:

> > You can change them arbitrarily by messing with mpt
> > either from BIOS or their command line utilities.
> I tried the BIOS configuration (the one you get by typing Ctrl-C 
> at the right time), but I couldn't find anything to assign target IDs.
> Do you remember where to find the relevant setting?

Sorry, I don't remember.  Most of the time I was using OpenBoot machines 
rather than BIOS.  Also this works differently on different versions of 
their firmware.

> What exactly do you refer to by "their command line utilities"?

LSI distributes some command line utilities for Solaris and Windows and I 
think they also have a GUI based config utility for Windows.

Eduardo

Re: SAS scsibus target numbering

2012-07-26 Thread Eduardo Horvath
On Thu, 26 Jul 2012, Edgar Fuß wrote:

> I have a (mpt) SAS with seven discs connected.
> The discs attach as sd0..sd6, but the SCSI target numbers are 0..5 and 7.
> It appears to me that someone is skipping ID 6 for the controller.
> It doesn't hurt too much, but it took me a while to find out why detaching 
> targets 2, 3, 4 and 5 worked and 6 didn't (of course, 7 worked).
> 
> Is there a reason for this behaviour?

"target IDs" are assigned by the mpt firmware.  You can change them 
arbitrarily by messing with mpt either from BIOS or their command line 
utilities.  The idea is that they can emulate a SCSI bus and thus boot 
Windows.

ISTR that the mpt interface is the same across their SAS and parallel SCSI 
adapters.  With parallel SCSI, the HBA driver allocates itself a SCSI ID, 
usually 6.  This is meaningless for SAS.  

Some versions of the mpt firmware also support RAID.  When a RAID volume 
is created by the firmware it is also allocated a target ID.   

The whole thing was a gigantic mess.  When I was at Sun we switched from 
using the "target ID" to identify the drives to using a combination of PHY 
numbers for directly attached devices and GUIDs for RAID volumes and 
devices on the other end of SAS switches.

I can't say I know anything about the NetBSD mpt driver, but I figure if 
you don't like the way the "target IDs" are assigned, go into the firmware 
and swap them around until you're happy.

Eduardo

Re: Syscall kill(2) called for a zombie process should return 0

2012-07-18 Thread Eduardo Horvath
On Wed, 18 Jul 2012, Mouse wrote:

> Subject: Re: Syscall kill(2) called for a zombie process should return 0
> 
> > +   if (p != NULL && P_ZOMBIE(p)) {
> > +   mutex_exit(proc_lock);
> > +   return 0;
> > +   }
> > mutex_exit(proc_lock);
> > return ESRCH;
> 
> > This is a general question, not necessarily specific to the patch.
> > Which is more costly?  Two function calls as above, or storing the
> > return value in a variable to return with just one function call to
> > mutex_exit?
> 
> "It depends."  A good optimizer could turn either one into the other,
> so it may make no real difference.  If optimization is disabled or
> limited, the version quoted above will probably be marginally larger
> and, assuming "larger code" doesn't mean more cache line fills,
> marginally faster.  Which is `more costly' depends on what costs you
> care about and to what extent.

Not necessarily.  Don't forget about the extra instructions needed to 
store and reload the register contents from the variable in RAM (unless 
the optimizer is turned on and you have a machine with register windows).

OTOH, if you really care about this level of detail you should be writing 
hand optimized assembly.  If you use a compiler you should worry more 
about maintainability than micro-optimizations that can have all sorts of 
strange side effects like moving a block of code over an instruction cache 
line.

Eduardo


Re: software interrupts & scheduling oddities

2012-07-05 Thread Eduardo Horvath
On Thu, 5 Jul 2012, David Young wrote:

> I'm using the SCHED_M2 scheduler, btw, on a uniprocessor.  SCHED_M2 is
> kind of an arbitrary choice.  I haven't tried SCHED_4BSD, yet, but I
> will.

I'd recommend you try the BSD scheduler.  It may give you better results, 
even though it has a little more overhead.

I don't think the M2 ever worked as well as was hoped.  Solaris eventually 
added an interactive scheduler to solve just these sorts of issues.

Eduardo


Re: Issues using KGDB on a Linux machine to debug NetBSD

2012-06-08 Thread Eduardo Horvath
On Fri, 8 Jun 2012, Israel Jacquez wrote:

> Hi Eduardo,
> 
> Would that command simply happen to be "kgdb"?

Sounds familiar.  It's been a while since I've done this.

Eduardo


Re: Issues using KGDB on a Linux machine to debug NetBSD

2012-06-08 Thread Eduardo Horvath
On Fri, 8 Jun 2012, Israel Jacquez wrote:

> Hello,
> 
> I'll make this short. I can't seem to get debugging support working
> even when following the guide:
> .
> 
> Target: NetBSD 5.1.2 on the i386 port
> Remote: Debian GNU/Linux
> Kernel on target: NetBSD-CURRENT
> 
> I have enabled the following options in the kernel config file:
> 
> options DDB
> options DDB_HISTORY_SIZE=512
> options KGDB
> options "KGDB_DEVNAME=\"com\"",KGDB_DEVADDR=0x3f8,KGDB_DEVRATE=115200
> makeoptions DEBUG="-g"
> 
> After compiling the kernel on the remote machine (Debian GNU/Linux), I
> copy the new kernel to / on the target machine and I see:
> 
> When I invoke dmesg(8): dmesg | grep -E '^com', I get the following:
> com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
> com0: console
> com0: kgdb
> 
> I then reboot and at the boot loader, I invoke:
> boot -d
> 
> Immediately, I get dropped into DDB. I can only see this through the
> serial console as the target machine is running headless. After, I

If you have both DDB and KGDB enabled you need to give DDB a specific 
command to drop into KGDB.  Only then can you connect the remote gdb.

Eduardo

Re: raw/block device disc troughput

2012-05-25 Thread Eduardo Horvath
On Fri, 25 May 2012, Edgar Fuß wrote:

> Thanks for the most insightful explanation!
> 
> > Also keep in mind:
> Yes, sure. That's why I would have expected the raw device to outperform even 
> at lower block sizes.

No, for small block sizes the overhead of the copyin() is more than offset 
by the larger buffercache block size.  And the I/O operation is 
asynchronous with respect to the write() system call.

With the character device the I/O operation must complete before the 
write() returns.  So the I/O operations cannot be combined and you suffer 
the overhead of each one.

Eduardo

Re: raw/block device disc troughput

2012-05-24 Thread Eduardo Horvath
On Thu, 24 May 2012, Thor Lancelot Simon wrote:

> On Thu, May 24, 2012 at 05:31:43PM +0000, Eduardo Horvath wrote:
> > 
> > With large transfers (larger than MAXPHYS) the writes are split up into 
> > MAXPHYS chunks and the disk handles them in parallel, hence the 
> > performance increase even beyond MAXPHYS.
> 
> Is this actually true?  For requests from userspace via the raw device,
> does physio actually issue the smaller chunks in parallel?

Depends... in this case it's true.  physio() breaks the iov into chunks and 
allocates a buf for each chunk and calls the strategy() routine on each 
buf without waiting for completion.  So on a controller that does tagged 
queuing they run in parallel.

Eduardo


Re: raw/block device disc troughput

2012-05-24 Thread Eduardo Horvath
On Thu, 24 May 2012, Edgar Fuß wrote:

> > Keep in mind mpt uses a rather inefficient communication protocol and does
> > tagged queuing.
> You mean the protocol the main CPU uses to communicate with an MPT adapter is
> inefficient? Or do you mean SAS is inefficient?

The protocol used to communicate between the CPU and the adapter is 
inefficient.  Not well designed.  They redesigned it for SAS2.

> > The former means the overhead for each command is not so good, but the
> > latter means it can keep lots of commands in the air at the same time. 
> I'm sorry, I'm unable to conclude why this explains my results.

dd will send the kernel individual write operations.  sd and physio() will 
break them up into MAXPHYS chunks.  Each chunk will be queued at the 
HBA.  The HBA will dispatch them all as fast as it can.  Tagged queuing 
will overlap them.  

With smaller transfers, the setup overhead becomes significant and you see 
poor performance.

With large transfers (larger than MAXPHYS) the writes are split up into 
MAXPHYS chunks and the disk handles them in parallel, hence the 
performance increase even beyond MAXPHYS.

Also keep in mind:

When using the block device the data is copied from the process buffer 
into the buffer cache and the I/O happens from the buffer cache pages.

When using the raw device the I/O happens directly from process memory, no 
copying involved.

Eduardo

Re: raw/block device disc troughput

2012-05-24 Thread Eduardo Horvath
On Thu, 24 May 2012, Edgar Fuß wrote:

> It seems that I have to update my understanding of raw and block devices
> for discs.
> 
> Using a (non-recent) 6.0_BETA INSTALL kernel and an ST9146853SS 15k SAS disc
> behind an LSI SAS 1068E (i.e. mpt(4)), I did a
>   dd if=/dev/zero od=/dev/[r]sd0b bs=nn, count=xxx.

What's "od="?

> For the raw device, the throughput dramatically increased with the block size:
>   Block size            16k  64k  256k  1M
>   Throughput (MByte/s)    4   15    49  112
> For the block device, throughput was around 81MByte/s independent of block 
> size.
> 
> This surprised me in two ways:
> 1. I would have expected the raw device to outperform the block devices
>with not too small block sizes.
> 2. I would have expected increasing the block size above MAXPHYS not
>improving the performance.
> 
> So obviously, my understanding is wrong.

Not awfully surprising given your setup.  Keep in mind mpt uses a rather 
inefficient communication protocol and does tagged queuing.  The former 
means the overhead for each command is not so good, but the latter means 
it can keep lots of commands in the air at the same time. 

> I then build a RAID 1 with SectorsPerSU=128 (e.g. a 64k stripe size) on two
> of these discs, and, after the parity initialisation was complete, wrote
> to [r]raid0b.
> On the raw device, throughput ranged from 4MByte/s to 97MByte/s depending on 
> bs.
> On the block device, it was always 3MByte/s. Furthermore, dd's WCHAN was
> "vnode" for the whole run. Why is that so and why is throughput so low?

Now you're just complicating things 8^).

Let's see, RAID 1 is striping.  That means all operations are broken at 
64K boundaries so they can be sent to different disks.  And split 
operations need to wait for all the devices to complete before the master 
operation can be completed.  I expect you would probably get some rather 
unusual non-linear behavior in this sort of setup.  

Eduardo

Re: mlockall() and small memory systems

2012-05-24 Thread Eduardo Horvath
On Thu, 24 May 2012, Martin Husemann wrote:

> On Wed, May 23, 2012 at 07:15:41PM +0100, David Laight wrote:
> > What are the default ulimit values?
> 
> Good point. Page size is 4k, with 32MB the limits are
> 
> # ulimit -a
> time  (-t seconds) unlimited
> file  (-f blocks ) unlimited
> data  (-d kbytes ) 65536
> stack (-s kbytes ) 8192
> coredump  (-c blocks ) unlimited
> memory(-m kbytes ) 10664
> locked memory (-l kbytes ) 3554
> process   (-p processes  ) 160
> nofiles   (-n descriptors) 128
> vmemory   (-v kbytes ) unlimited
> sbsize(-b bytes  ) unlimited
> 
> With 64 MB ram they are:
> 
> time  (-t seconds) unlimited
> file  (-f blocks ) unlimited
> data  (-d kbytes ) 65536
> stack (-s kbytes ) 8192
> coredump  (-c blocks ) unlimited
> memory(-m kbytes ) 42776
> locked memory (-l kbytes ) 14258
> process   (-p processes  ) 160
> nofiles   (-n descriptors) 128
> vmemory   (-v kbytes ) unlimited
> sbsize(-b bytes  ) unlimited
> 
> "Locked memory" sounds like what we run into - but shouldn't mlockall() return
> some error code if we exceed the limit?

Not necessarily.  If the code does mlockall() with the MCL_FUTURE flag 
then any *new* pages are also supposed to be locked into memory.  So if 
the process address space at the time the mlockall() is done is smaller 
than the locked memory limit, the mlockall() should succeed.  

But later, if the mmap() makes the total process address space exceed the 
locked memory limit, the mmap() call should return an error instead of 
succeeding.  I think there may be a bug in mmap().  You should check to see 
if the mmap() causes the process to cross the locked memory limit.
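
For what it's worth, a small user-level test along these lines should show
where the failure, if any, gets reported; the 32MB size is an arbitrary
value picked to exceed the locked-memory limits quoted above:

#include <sys/mman.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        err(1, "mlockall");     /* expected to succeed while still small */

    /* grow the address space well past the locked-memory limit */
    void *p = mmap(NULL, 32UL * 1024 * 1024, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        err(1, "mmap");         /* this is the error path under discussion */

    printf("mmap succeeded at %p despite the locked-memory limit\n", p);
    return 0;
}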

Eduardo


Re: RAIDframe performance vs. stripe size

2012-05-14 Thread Eduardo Horvath
On Sat, 12 May 2012, Edgar Fu? wrote:

> > In general it won't access just one filesystem block.
> > It will try to readahead 64KB
> Oh, so this declustering seems to make matters even more
> complicated^Winteresting.
> 
> Staying with my example of a 16K fsbsize FFS on a 4+1 disc Level 5
> RAIDframe with a stripe size of 4*16k=64k:
> 
> Suppose a process does something that could immediately be satisfied
> by reading one fs block (probably it matters whether that's a small file,
> a small portion of a large file, a small directory, a portion of a large
> directory, inodes, free list or whatever?). Now, if that, as I understand,
> always causes FFS to in fact issue a 64k request to RAIDframe, this would
> need to read a full stripe and so need all but one disc. So it can't be
> parallelised with another process' request, can it? Does this mean I'm better
> off with a stripe size of 4*64k if I'm after low latency for concurrent
> access?

The problem here is NFS, which requires writes to be persistent before 
returning status to the caller.

Under normal operation, ufs will attempt to use the buffer cache in the 
most efficient manner, doing readahead and delaying writes as much as 
possible to maximize the number of clustered operations it can do.

Now if NFS does not do similar clustering on writes (I don't know NFS that 
well, especially V3 and V4 which allegedly have write optimizations) then 
you get the situation where the underlying ufs will try to cluster reads 
(satisfying reads out of the buffer cache is much faster than hitting the 
platters) but write out only single filesystem blocks (to satisfy the NFS 
consistency requirements.)  My understanding is that later versions of NFS 
(v3+) have a mechanism for the client side to request writes without the 
consistency guarantee and a separate explicit sync operation.  But using 
those is the responsibility of the NFS client machine.  Of course, if all 
the files are on the order of one filesystem block, clustering won't 
happen at all.

I think you should attempt to characterize your workload here to determine 
the size of the I/O operations the clients are requesting so you can 
decide if clustering is a benefit to you, and if not, turn it off.  (I 
think it can be tweaked with tunefs(8).)

Eduardo

Re: RAIDframe performance vs. stripe size

2012-05-11 Thread Eduardo Horvath
On Fri, 11 May 2012, Edgar Fu? wrote:

> EF> I have one process doing something largely resulting in meta-data
> EF> reads (i.e. traversing a very large directory tree). Will the kernel
> EF> only issue sequential reads or will it be able to parallelise, e.g.
> EF> reading indirect blocks?
> GO> I don't know the answer to this off the top of my head... 
> Oops, any file-system experts round here?

It depends.  

To do any disk operation you need the disk address.  Those are stored in 
the inode and indirect blocks.  If you have all the relevent inodes in 
memory, the filesystem could issue all the reads in parallel.  (Unlikely 
if it's just one process, since it will request one directory or even one 
directory entry at a time.)  

However, since ffs does clustering, it will probably try to read the 
entire directory (if it's contiguous 64KB or less) in one operation.  
That's true for all files.  In general it won't access just one filesystem 
block.  It will try to readahead 64KB unless you have disabled clustering 
on that filesystem.  (Under normal circumstances it will also do 64KB 
write behind, but you're running NFS which has that restriction of hitting 
non-volatile storage before returning status, so I don't think the 
underlying filesystem can do clustering in that circumstance.)

Eduardo

Re: introduce device_is_attached()

2012-04-17 Thread Eduardo Horvath
On Tue, 17 Apr 2012, Christoph Egger wrote:

> On 04/16/12 19:37, David Young wrote:

> > I'm not sure I fully understand the purpose of amdnb_miscbus.
> > Are all of the functions that do/will attach at amdnb_miscbus
> > configuration-space only functions, or are they something else?  Please
> > explain what amdnb_miscbus is for.
> 
> Drivers attaching to amdnb_miscbus are all pci drivers which use
> different capabilities of the northbridge device.
> Their match/attach routines need the same 'aux' parameter as it
> is passed to amdnb_misc(4).
> 
> amdtemp(4) uses some PCI registers of the northbridge device to read
> the cpu temperature.
> I have a local driver which uses a different feature of the same
> northbridge device.
> To access the device I need the same chipset tag and pci tag.
> 
> Instead of making amdtemp(4) a mess I chose to implement different
> features in different small drivers.
> This also simplifies handling of erratas: If a feature doesn't work
> then just don't let match/attach the corresponding driver.

Oh that problem.  It's been what, ten years now and the design of our PCI 
attachment framework is still a giant pain that needs to be worked around.

I first suffered from this when I started working on the PCI code for 
SPARCs.  They run OFW, which should be used to probe and enumerate the PCI 
buses.  All the important information about device enumeration is encoded 
as properties on the OFW device tree, and that is what should be used to 
identify and attach drivers instead of probing the PCI config space 
registers directly.  Important interrupt routing information is encoded in 
the device tree which is not available from the PCI registers, so it is 
necessary to correlate OFW device nodes with PCI driver instances to make 
the machine work properly.  

Unfortunately, our PCI framework wants to take over all aspects of device 
identification and probing and it's very difficult for the PCI bus driver 
to manipulate attachment of or provide extra information to child devices. 
I wanted to go and rewrite all that stuff so it worked properly, but 
instead we ended up using all sorts of nasty mechanisms to pass 
information around behind the PCI framework's back.

As far as I'm concerned, device_is_attached*() is a hack to work around 
the inadequacies of certain bus frameworks.

Someone should give the PCI code an overhaul to allow the parent device to 
identify the contents of individual PCI slots and control attachment of 
drivers there, similar to the way the SBus framework operates.  That way 
you could keep track of each driver that attaches.

Eduardo


Re: making kmem more efficient

2012-03-01 Thread Eduardo Horvath
On Thu, 1 Mar 2012, Lars Heidieker wrote:

> On 03/01/2012 06:04 PM, Eduardo Horvath wrote:
> > On Thu, 1 Mar 2012, Lars Heidieker wrote:
> > 
> >> Hi,
> >> 
> >> this splits the lookup table into two parts, for smaller 
> >> allocations and larger ones this has the following advantages:
> >> 
> >> - smaller lookup tables (less cache line pollution) - makes large 
> >> kmem caches possible currently up to min(16384, 4*PAGE_SIZE) - 
> >> smaller caches allocate from larger pool-pages if that reduces the 
> >> wastage
> >> 
> >> any objections?
> > 
> > Why would you want to go larger than PAGE_SIZE?  At that point 
> > wouldn't you just want to allocate individual pages and map them
> > into the VM space?
> > 
> > Eduardo
> > 
> 
> Allocations larger then PAGE_SIZE are infrequent (at the moment) that's
> true.
> Supporting larger pool-pages makes some caches more efficient eg 320/384
> and some size possible like a byte 3072.

How does it make this more efficient?  And why would you want to have a 
3KB pool?  How many 3KB allocations are made?

> Having these allocators in place the larger then PAGE_SIZE caches are a
> trivial extension.

That's not really the issue.  It's easy to increase the kernel code size.  
The question is whether the increase in complexity and code size is offset 
by a commensurate performance improvement.

> All caches multiplies of PAGE_SIZE come for free they don't introduce
> any additional memory overhead in terms of footprint, on memory pressure
> they can always be freed from the cache and with having them in place

But you do have the overhead of the pool itself.

> you save the TLB shoot-downs because of mapping and un-mapping them and
> the allocation deallocation of the page frames.
> So they are more then a magnitude faster.

Bold claims.  Do you have numbers that show the performance improvement?

> Lars
> 
> 
> Just some stats of a system (not up long) with those changes:
> collected with "vmstat -mvWC"
> 
> > kmem-1024 102457890515863173584
> > 408   176   4096   584 0   inf3 0x800  89.6%
> > kmem-112   11217550 84790828 27 
> >  126   409626 0   inf0 0x800  95.5%

Interesting numbers.  What exactly do they mean?  Column headers would 
help decipher them.

Eduardo


Re: making kmem more efficient

2012-03-01 Thread Eduardo Horvath
On Thu, 1 Mar 2012, Lars Heidieker wrote:

> Hi,
> 
> this splits the lookup table into two parts, for smaller allocations and
> larger ones this has the following advantages:
> 
> - smaller lookup tables (less cache line pollution)
> - makes large kmem caches possible currently up to min(16384, 4*PAGE_SIZE)
> - smaller caches allocate from larger pool-pages if that reduces the wastage
> 
> any objections?

Why would you want to go larger than PAGE_SIZE?  At that point wouldn't 
you just want to allocate individual pages and map them into the VM space?
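
Something along these lines is what I mean for the large case (a sketch;
the flag combination is illustrative):

#include <sys/param.h>
#include <uvm/uvm_extern.h>

void *
big_alloc(size_t sz)
{
    /* allocate wired kernel pages and map them into kernel VA */
    return (void *)uvm_km_alloc(kernel_map, round_page(sz), 0,
        UVM_KMF_WIRED | UVM_KMF_CANFAIL);
}

void
big_free(void *p, size_t sz)
{
    uvm_km_free(kernel_map, (vaddr_t)p, round_page(sz), UVM_KMF_WIRED);
}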

Eduardo


Re: extattr namespaces

2012-02-06 Thread Eduardo Horvath
On Mon, 6 Feb 2012, Emmanuel Dreyfus wrote:

> Here is public disuccsion about extended attributs namespaces, following
> a private request from yamt@
> 
> We ahve two extended attributes API in tree: one from FreeBSD and one from 
> Linux. We are about to toss the FreeBSD one in favor of the Linux one. 
> That is easy now since we never had working extended attributes in a 
> release.

In order to have a sane conversation about what type of extended 
attributes we want to support it would be nice to know why we need them in 
the first place.  How are they going to be used?  What filesystems will be 
supporting them?  What happens when files with extended attributes get 
copied across different filesystems?

Eduardo


Re: kmem change related trouble

2012-02-03 Thread Eduardo Horvath
On Fri, 3 Feb 2012, Lars Heidieker wrote:

> On Fri, Feb 3, 2012 at 6:49 PM, Eduardo Horvath  wrote:
> > On Fri, 3 Feb 2012, Lars Heidieker wrote:
> >
> >> The code for proper kmem_arena sizing:
> >> http://www.netbsd.org/~para/kmemsizing.diff
> >>
> >> params done for i386/amd64/sparc64/arm32
> >
> > Explain this to me:
> >
> >  /*
> > - * Minimum and maximum sizes of the kernel malloc arena in PAGE_SIZE-sized
> > + * Minimum size of the kernel kmem_arena in PAGE_SIZE-sized
> >  * logical pages.
> > + * No enforced maximum on sparc64.
> >  */
> > -#define        NKMEMPAGES_MIN_DEFAULT  ((6 * 1024 * 1024) >> PAGE_SHIFT)
> > -#define        NKMEMPAGES_MAX_DEFAULT  ((128 * 1024 * 1024) >> PAGE_SHIFT)
> > +#define        NKMEMPAGES_MIN_DEFAULT  ((64 * 1024 * 1024) >> PAGE_SHIFT)
> > +#define        NKMEMPAGES_MAX_UNLIMITED 1
> >
> >
> > Does this mean a machine needs to allocate a minimum of 64MB for the
> > kernel kmem_arena or it won't boot?  What happens if a machine only has
> > 64MB of DRAM?
> >
> 
> It's not about physcial memory, it's sizing the kmem_arenas virtual size.
> It is sized by physical memory size as an aproximation, with a certain
> lower limit and if required an upper bound.
> The upper bound is only required on archs that have limited kernel
> virtual memory space (in comparison to physical memory) eg i386 as 1GB
> virtual memory kernel space but probably 2-3GB physcial memory. So the
> kmem_arena is the limited to 280MB to leave space for other maps uares
> buffers.

Interesting, but it didn't really answer the question.  Will it attempt to 
allocate NKMEMPAGES_MIN_DEFAULT pages on startup?  Will this break 
machines with 64MB of RAM?  'Cause I don't think that's something we want 
to do.

Eduardo

Re: kmem change related trouble

2012-02-03 Thread Eduardo Horvath
On Fri, 3 Feb 2012, Lars Heidieker wrote:

> The code for proper kmem_arena sizing:
> http://www.netbsd.org/~para/kmemsizing.diff
> 
> params done for i386/amd64/sparc64/arm32

Explain this to me:

 /*
- * Minimum and maximum sizes of the kernel malloc arena in PAGE_SIZE-sized
+ * Minimum size of the kernel kmem_arena in PAGE_SIZE-sized
  * logical pages.
+ * No enforced maximum on sparc64.
  */
-#define NKMEMPAGES_MIN_DEFAULT  ((6 * 1024 * 1024) >> PAGE_SHIFT)
-#define NKMEMPAGES_MAX_DEFAULT  ((128 * 1024 * 1024) >> PAGE_SHIFT)
+#define NKMEMPAGES_MIN_DEFAULT  ((64 * 1024 * 1024) >> PAGE_SHIFT)
+#define NKMEMPAGES_MAX_UNLIMITED 1
 

Does this mean a machine needs to allocate a minimum of 64MB for the 
kernel kmem_arena or it won't boot?  What happens if a machine only has 
64MB of DRAM?

Eduardo


Re: RFC: New bus_space routine: bus_space_sync

2012-01-20 Thread Eduardo Horvath
On Fri, 20 Jan 2012, Mouse wrote:

> >> Even if originally intended for something else, [...]
> 
> > Why do you think BUS_SPACE_BARRIER_SYNC was intended for something
> > else ?  I can't see how a write barrier that doesn't ensure the write
> > has reached the target (main or device memory) can be usefull.
> 
> I can't comment on why someone else thinks something.  But barriers
> that have nothing to do with write completion to the target can still
> be useful.  There are algorithms that don't require that writes
> complete on any particular schedule, but do require that _this_ write
> complete before _that_ one.  When faced with write coalescing and
> reordering, a write barrier that does nothing but enforce ordering (in
> the sequence A-barrier-B, the barrier enforces the constraint that
> there is no time at which write B has completed but write A hasn't) can
> be useful.
> 
> For example, the standard double-buffering trick of "write inactive
> copy, then write variable indicating which is the active copy" does not
> work if the indicator's write can complete before the
> (formerly-)inactive copy's writes complete - but, in many uses, there
> is no requirement that those writes, as a sequence, be pushed to their
> target at any particular time.

That's not what the manpage documenting BUS_SPACE_BARRIER_SYNC says.  Read 
the manpage.

Eduardo


Re: RFC: New bus_space routine: bus_space_sync

2012-01-20 Thread Eduardo Horvath
On Fri, 20 Jan 2012, David Young wrote:

> Date: Fri, 20 Jan 2012 12:48:48 -0600
> From: David Young 
> To: tech-kern Discussion List 
> Subject: Re: RFC: New bus_space routine: bus_space_sync
> 
> On Fri, Jan 20, 2012 at 11:18:38AM +0100, Manuel Bouyer wrote:
> > On Thu, Jan 19, 2012 at 08:45:41PM +0100, Martin Husemann wrote:
> > > Even if originally intended for something else, like Matt says, wouldn't 
> > > it
> > 
> > Why do you think BUS_SPACE_BARRIER_SYNC was intended for something else ?
> > I can't see how a write barrier that doesn't ensure the write has
> > reached the target (main or device memory) can be usefull.
> 
> My understanding of BUS_SPACE_BARRIER_SYNC is that no read issued
> before the barrier may satisfy or follow any read after the barrier,
> and no write before the barrier may follow or be combined with
> any write after the barrier.  Likewise, no read or write before
> the barrier may follow a write or read, respectively, after the
> barrier.  The reads and writes do NOT have to be completed when
> bus_space_barrier(...BUS_SPACE_BARRIER_SYNC...) returns.
> 
> My interpretation of the manual is not very literal, but I believe
> that it's a fair description of what to expect on any non-fanciful
> implementation of bus_space(9) for memory-mapped PCI space, where writes
> can be posted.


No.

Quote:

     BUS_SPACE_BARRIER_SYNC    Force all memory operations and any
                               pending exceptions to be completed
                               before any instructions after the
                               barrier may be issued.

This means that the write operation must complete before the SYNC returns.  
It means that any caches involved must have been flushed *and* the data 
must have been entered into the device registers, otherwise you may get an 
asynchronous exception from one of the pending stores.
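
To make the distinction concrete, here's a hedged sketch (the handle,
offsets and values are made up): an ordering-only barrier is enough for
the double-buffer flip quoted earlier, but SYNC must not return until
everything has actually completed:

#include <sys/param.h>
#include <sys/bus.h>

void
flip_buffers(bus_space_tag_t bst, bus_space_handle_t bsh,
    bus_size_t inactive_off, bus_size_t select_off, uint32_t which)
{
    /* ordering only: the select write must not complete first */
    bus_space_write_4(bst, bsh, inactive_off, 0xdeadbeef);
    bus_space_barrier(bst, bsh, 0, select_off + 4, BUS_SPACE_BARRIER_WRITE);
    bus_space_write_4(bst, bsh, select_off, which);

    /* completion: per the text above, all stores must have reached the
       device (and any pending exceptions been delivered) before this
       returns */
    bus_space_barrier(bst, bsh, 0, select_off + 4, BUS_SPACE_BARRIER_SYNC);
}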

> bus_space_barrier() is used so little that it may be better to document
> the semantics that are useful and feasible, and make sure that the
> implementations guarantte those semantics, than to spend a lot of time
> on the interpretation.

The semantics seem pretty clear to me.  Now, we may have a bunch of buggy 
implementations, but the man page itself is unambiguous.

Eduardo


Re: RFC: New bus_space routine: bus_space_sync

2012-01-20 Thread Eduardo Horvath
On Thu, 19 Jan 2012, Matt Thomas wrote:

> For prefetchable regions (like framebuffers) mapped by bus_space_map, there 
> is a need to able force the contents out of the cache back into memory 
> (especially when the cache is a writeback cache).
> 
> There is no MI way to do this with the bus_space framework so I'm proposing 
> we add a:
> 
>   void bus_space_sync(bus_space_tag_t bst,
>   bus_space_handle_t bsh,
>   bus_size_t offset,
>   bus_size_t length,
>   int ops);
> 
> where ops is one of:
> 
> #define   BUS_SPACE_SYNC_WB   1 // defined by MD
> #define   BUS_SPACE_SYNC_WBINV2 // defined by MD
> 
> One caveat is that though a BUS_SPACE_SYNC_WB was requested, a platform can 
> perform BUS_SPACE_SYNC_WBINV instead.  If the platform can't support just 
> writeback, it is allowed to silently do a writeback-invalidate instead.

Could you elaborate a bit more on how you plan to use this and why 
bus_space_barrier() with BUS_SPACE_BARRIER_SYNC is insufficient?

Yes, I know the current implementation of bus_space_barrier() on the 
architecture you're using doesn't do this, but why can't it be enhanced to 
do so if you do the initial mapping with BUS_SPACE_MAP_CACHEABLE?  Presumably 
if bus_space_sync() is passed a handle, offset, and length you're using the 
bus_space_{read,write}*() accessors rather than the pointer returned by 
bus_space_vaddr().

Also, how would this work if your machine has an I/O cache (separate from 
the CPU cache)?

Eduardo


Re: RFC: import of posix_spawn GSoC results

2011-12-28 Thread Eduardo Horvath
On Wed, 28 Dec 2011, Joerg Sonnenberger wrote:

> On Wed, Dec 28, 2011 at 04:07:45AM +, YAMAMOTO Takashi wrote:
> > my understanding:
> > there is no need to stop other threads as far as posix_spawn is concerned.
> > so there is no big performace problems with a vfork-based implementation.
> > because our current implementation of vfork suspends the calling threads
> > only, it can be used to implement posix_spawn as it is.  a vfork'ed child
> > should carefully avoid touching memory shared with other threads, but it's
> > doable and not too complex.
> 
> Which is exactly why vfork usage is not safe. The child has to know all
> interfaces that are possible shared, which can often happen behind your
> back in libc.

Ahem.  vfork() is dangerous because:

1) it can arbitrarily change the value of global data structures

2) it stomps on the parent's stack.

All threaded programs have problems with #1 which is why they have thread 
safe libraries and locks.  Or do locks not work after the vfork() call?

#2 is the same problem a non-threaded process has with vfork() and should 
be solvable with the same mechanism.

Why would vfork() need to do anything fancy like suspend or clone any 
other threads?  Or am I missing something obvious?
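
For reference, the constrained idiom I have in mind is the usual one (a
user-level sketch, not the proposed posix_spawn implementation): the child
touches nothing but the shared stack frame before execv()/_exit(), so the
hazards are exactly #1 and #2 above and nothing thread-specific:

#include <unistd.h>

static pid_t
spawn_ls(void)
{
    char *const argv[] = { "ls", "-l", NULL };
    pid_t pid = vfork();

    if (pid == 0) {             /* child: exec or bail, nothing else */
        execv("/bin/ls", argv);
        _exit(127);
    }
    return pid;                 /* parent resumes after exec/_exit; -1 on error */
}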

Eduardo


Re: Why is it called strategy?

2011-10-18 Thread Eduardo Horvath
On Tue, 18 Oct 2011, Emmanuel Dreyfus wrote:

> As I understand, at VFS level, VOP_STRATEGY(9) is used for I/O to block
> devices. Where does that name comes from? 

Block devices use the `strategy()' routines to schedule operations 
because, unlike character devices which typically immediately post the 
write operation to the hardware, they used to go through disksort() so the 
elevator algorithm could be used to optimize disk access.  Hence the 
strategy() routine did not necessarily start the I/O operation, it would 
usually add it to a work queue.

Eduardo


Re: Implement mmap for PUD

2011-09-12 Thread Eduardo Horvath
On Sat, 10 Sep 2011, Masao Uebayashi wrote:

> On Sat, Sep 10, 2011 at 7:24 PM, Roger Pau Monné
>  wrote:

> > PUD is a framework present in NetBSD that allows to implement
> > character and block devices in userspace. I'm trying to implement a
> > blktap [1] driver purely in userspace, and the mmap operation is
> > needed (and it would also be beneficial for PUD to have the full set
> > of operations implemented, for future uses). The implementation of
> > blktap driver using fuse was discused in the port-xen mailing list,
> > but the blktap driver needs a set of specific operations over
> > character and block devices.
> >
> > My main concern is if it is possible to pass memory form a userspace
> > program to another trough the kernel (that is mainly what my mmap
> > implementation tries to accomplish). I trough that I could accomplish
> 
> It is called pipe(2), isn't it?

Did you forget a smiley there?  No it isn't, that's page loaning.

I don't think the device mmap() infrastructure will work for you.  As I 
said before, it's designed to hand out unmanaged device physical memory 
and you're working with managed memory.  While you may be able to cobble 
together something that appears functional, it will probably not properly 
manage VM reference counts and eventually go belly up.  Keep in mind that 
the device mapping is not actually done during the mmap() call.  Instead 
nothing happens until there's a page fault, which is when the driver's mmap() 
routine is called to do the v->p mapping and insert it into the pmap().

I think you may need to create a new uvm object to hold the pages you want 
to share and attach it to the vmspaces of both the server process handing 
out the pages and the ... err ... client(?) process trying to do the mmap.
fork() does this as well as mmap() of a file and sysv_shm.  I think the 
set of operations in sysv_shm is the best bet since it's the closest to 
what you want to do.  

You will probably need to find some way to intercept the mmap() syscall 
and have it do something unique for the PUD device, maybe by fiddling with 
vnode OP vectors.  I don't know, but I don't think this will be 
straight-forward.

Eduardo

Re: Implement mmap for PUD

2011-09-09 Thread Eduardo Horvath
On Wed, 7 Sep 2011, Roger Pau Monné wrote:

> Basically we use pud_request to pass the request to the user-space
> server, and the server returns a memory address, allocated in the
> user-space memory of it's process. Then I try to read the value of the
> user space memory from the kernel, which works ok, I can fetch the
> correct value. After reading the value (that is just used for
> debugging), the physical address of the memory region is collected
> using pmap_extract and returned.

I'm not sure you can do this.  The mmap() interface in drivers is designed 
to hand out unmanaged pages, not managed page frames.  Userland processes 
use page frames to hold pages that could be paged out at any time.  You 
could have nasty problems with wiring and reference counts.  What you 
really need to do here is shared memory not a typical device mmap().

WHY do you want to do this?  What is PUD?  Why do you have kernel devices 
backed by userland daemons?  I think a filesystem may be more appropriate 
than a device in this case.

Eduardo

Re: netbsd32 emulation in driver open() or read()

2011-08-30 Thread Eduardo Horvath
On Tue, 30 Aug 2011, Manuel Bouyer wrote:

> That may be nice to have, but won't help with my problem which is
> getting a N32 mips binary to talk to a N64 kernel.

Hm, MIPS.  In this case you may need to check the struct emul to 
differentiate o32 and n32.  Or do they have the exact same structure 
layouts?  All the different MIPS ABIs make my head spin.

Eduardo


Re: netbsd32 emulation in driver open() or read()

2011-08-29 Thread Eduardo Horvath
On Mon, 29 Aug 2011, Manuel Bouyer wrote:

> So: is there a way to know if the emulation used by a userland program
> doing an open() is 32 or 64bit ?

sys/proc.h:

1.233 ad    343: /*
1.273 ad    344:  * These flags are kept in p_flag and are protected by p_lock.  Access from
1.233 ad    345:  * process context only.
            346:  */
...
            353: #define PK_32   0x0004 /* 32-bit process (used on 64-bit kernels) */

So you can check if that bit is set in the current proc's p_flag member.
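
For example (the helper name is made up, the rest is straight out of the
headers):

#include <sys/param.h>
#include <sys/proc.h>

static inline bool
proc_is_32bit(struct proc *p)
{
    return (p->p_flag & PK_32) != 0;
}

/* in the driver's open routine:  if (proc_is_32bit(curproc)) ... */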


Eduardo


Re: what to do on memory or cache errors?

2011-08-25 Thread Eduardo Horvath
On Mon, 22 Aug 2011, Matt Thomas wrote:

> besides panicing, of course.
> 
> This is going to involve a lot of help from UVM.  
> 
> It seems that uvm_fault is not the right place to handle this.  Maybe we need 
> a
> 
> void uvm_page_error(paddr_t pa, int etype);
> 
> where etype would indicate if this was a memory or cache fault, was the cache 
> line dirty, etc.  If uvm_page_error can't "correct" the error, it would panic.
> 
> Interactions with copyin/copyout will also need to be addressed.
> 
> Preemptively, we could have a thread force dirty cache lines to memory if 
> they've been in L2 "too long" (thereby reducing the problem to an ECC error 
> on a clean cache line which means you just toss the cache-line contents.)  We 
> can also have a thread that reads all of memory (slowly) thereby causing any 
> single bit errors to be corrected before they become double-bit errors.
> 
> I'm not familiar enough with UVM internals to actually know what to do but I 
> hope someone else reading this is.
> 
> Comments anyone?

(I can't believe I'm actually getting involved in this discussion.)

I would recommend against trying to add memory error recovery.

1) It doesn't happen very often.

2) It's HARD to implement.  (More on this later.)

3) It's difficult to verify correct operation because of 1.

4) It's highly machine dependent.

5) If you claim to support this and it doesn't work it may open up legal 
issues.


If you did want to do this, most of it would be in MD code.  This means 
both pmap/page fault handling code for CPU faults and on the I/O side for 
DMA issues.  

Things get really interesting (complicated) when you get a fault and try 
to determine the faulting address.  The design of the processor, cache, 
memory, and I/O subsystems is important here.  Where is ECC generated and 
checked?  In the memory controller?  The cache?  The main bus?  The CPU 
core?  How many cache levels are there?  Are they write-back or 
write-through?  Is there an I/O cache?  All these variables affect the MD 
portions of the design, which you need to get right to be able to properly 
survive an error without creating the possibility of data corruption.  We 
could discuss what steps are needed to recover from a specific type of 
memory error in a particular cache level on one model of CPU, but I don't 
think that's something that can be generalized.

Let's assume for the sake of argument that you can implement the MD parts 
of memory error handling correctly across a non-trivial set of machines.  
This means you can identify the fault and clean up the state as much as 
possible.  What do you do now?

There are two different types of faults, correctable and uncorrectable.  
Correctable faults are annoying but dealing with them is relatively 
simple, assuming the system is set up to report correctable faults.  You 
first need to determine if the fault is a hard fault or a soft fault by 
retrying the faulting operation to see if it recurs.  If a memory or cache 
location was modified by random radiation you have a soft fault that is 
unlikely to recur.  In that case you need to keep track of the fault rate 
of that device to decide if it's beginning to wear out and needs to be 
replaced.  If it doesn't need to be replaced then just go about your 
business.  

If the fault rate is too high, or the memory location has a hard fault, 
say a trace has shorted out, then you have lost some of your redundancy and 
should stop using that device.  If the memory location is in a cache, you 
need to figure out how to disable it.  If the problem is in RAM, you can 
retire the page and hope it's the only one affected or you can disable the 
entire device.  To disable a device you need to migrate all the pages off 
the device to some other location.  This brings us to an interesting 
problem of identifying the specific piece of hardware that corresponds to 
a certain memory range.  If you don't know where the memory associated 
with that device starts or ends you don't know how many pages need to be 
migrated or where are safe destinations.  And there's the problem of 
generating the error message "Please replace DIMM number 53."

If you have an uncorrectable error things get more interesting.  After 
retiring the memory you need to try to recover the system.  Obviously if 
you have a clean page you should be able to recover it either from backing 
store or ZFOD.  If it's a dirty userland page you can usually send the 
process a SIGBUS, unless the error was caused by DMA, in which case things 
get really interesting.  Do you retry the operation and attempt to correct 
it or generate an error?  Do you send a signal or return an error from an 
I/O system call?  It depends on the device and the type of I/O operation, 
synchronous, asynchronous, or memory mapped.

Finally kernel pages may not be recoverable or relocatable, depending on 
how the kernel address space is managed for a particular machine.

Anyway, I'd think the first step you'

Re: bus_dma(9) BUS_DMA_COHERENT is a hint (or not)

2011-08-24 Thread Eduardo Horvath
On Wed, 24 Aug 2011, Frank Zerangue wrote:

> bus_dma(9) specifies that for bus_dmamem_map() the flag BUS_DMA_COHERENT is a 
> hint; and that a device driver must not rely on this flag for correct 
> operation.  All calls to bus_dmamap_sync() must still be made.
> 
> But for frame buffers this seems impractical to me and it appears in 
> practice, that frame buffers that use DMA do indeed depend on this flag and 
> do not call bus_dmamap_sync() functions.  An example of this is 
> arch/arm/xcale/pxa2xo_lcd.c .
> 
> Does anyone have advice on how one should proceed when writing a driver for a 
> new graphics device?

If you want to be portable you should always insert bus_dmamap_sync() 
calls in appropriate places in the code.  On some machines you don't need 
them.  On some machines you don't even need the BUS_DMA_COHERENT flag, but 
that's not portable.  What if your processor has a relaxed memory model 
and stores to memory can be re-ordered before they ever hit the 
cache?  Or there's a cache on the I/O device that needs to be flushed?
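
A hedged example of the portable pattern for the frame buffer case (the
function and parameter names are placeholders): bracket the device's view
of the buffer with sync calls even when the mapping was requested with
BUS_DMA_COHERENT:

#include <sys/param.h>
#include <sys/bus.h>

void
fb_push_frame(bus_dma_tag_t dmat, bus_dmamap_t map,
    bus_addr_t off, bus_size_t len)
{
    /* ... CPU has finished drawing into the buffer ... */
    bus_dmamap_sync(dmat, map, off, len, BUS_DMASYNC_PREWRITE);
    /* ... now tell the device to scan out / DMA from the buffer ... */
}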

OTOH, if you don't care about portability you can do whatever you want.

Eduardo


Re: genfs_getpages vs. genfs_compat_getpages

2011-08-02 Thread Eduardo Horvath
On Tue, 2 Aug 2011, paul_kon...@dell.com wrote:

> Gentlepeople,
> 
> Some file systems use genfs_compat_getpages while others (most of them) use 
> genfs_getpages.  I'm trying to figure out the essential differences, and why 
> one would pick one over the other.
> 
> Any pointers?

genfs_vnops.c:

revision 1.43
date: 2001/12/18 07:49:36;  author: chs;  state: Exp;  lines: +137 -2
add some compatibility routines to allow mmap() to work non-UBCified
filesystems (in the same non-coherent fashion that they worked before).
=

Eduardo


Re: rfc: vmem(9) API/implementation changes

2011-07-27 Thread Eduardo Horvath
On Wed, 27 Jul 2011, David Young wrote:

> On Wed, Jul 27, 2011 at 04:58:23PM +0000, Eduardo Horvath wrote:
> > On Wed, 27 Jul 2011, David Young wrote:
> > 
> > > There are a couple of changes to the API that I would like to make.
> > > First, I don't think that vmem_addr_t 0 should be reserved for error
> > > indications (0 == VMEM_ADDR_NULL), but the API should change from
> > > this:
> > 
> > I'd recommend returning -1 on error.  0 is a valid address, but while -1 
> > is a valid address, when do you ever use this interface to allocate 
> > something that starts at address -1?  And it gets around all the noxious 
> > problems involved in returning data through reference parameters.
> 
> I don't know.  Suppose sizeof(vmem_addr_t) == sizeof(uint32_t).  Which
> of these cases should fail, and on which statement?
> 
> Case A:
> 
>  1vm = vmem_create("test", 0x, 1, 0, NULL, NULL, NULL, 1,
>  2VM_SLEEP, IPL_NONE);
>  3p = vmem_alloc(vm, 1, VM_SLEEP);
> 
> Case B:
> 
>  1vm = vmem_create("test", 0xfffe, 2, 0, NULL, NULL, NULL, 1,
>  2VM_SLEEP, IPL_NONE);
>  3p = vmem_alloc(vm, 2, VM_SLEEP);
> 
> Case C:
> 
>  1vm = vmem_create("test", 0xfffe, 2, 0, NULL, NULL, NULL, 1,
>  2VM_SLEEP, IPL_NONE);
>  3p = vmem_alloc(vm, 1, VM_SLEEP);
>  4q = vmem_alloc(vm, 1, VM_SLEEP);

All of them should fail in all the routines since you're specifying a 
quantum of 0.  This means you can only allocate multiples of 0 items from 
the list.

OTOH, how many times have you seen code like this:

void
foo(vmem_t *v)
{
    void *p;

    vmem_alloc(v, 52, 0, (vmem_addr_t *)&p);
}

which has implementation-defined functionality.
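
For comparison, the same call written with a properly typed vmem_addr_t
against the proposed error-code-plus-out-parameter interface (treat the
exact prototype as illustrative, since that is what's being discussed):

#include <sys/types.h>
#include <sys/vmem.h>

void
foo2(vmem_t *v)
{
    vmem_addr_t addr;
    void *p;

    if (vmem_alloc(v, 52, VM_SLEEP, &addr) != 0)
        return;                          /* allocation failed */
    p = (void *)(uintptr_t)addr;         /* explicit, well-defined conversion */
    (void)p;
}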

Eduardo




Re: rfc: vmem(9) API/implementation changes

2011-07-27 Thread Eduardo Horvath
On Wed, 27 Jul 2011, David Young wrote:

> There are a couple of changes to the API that I would like to make.
> First, I don't think that vmem_addr_t 0 should be reserved for error
> indications (0 == VMEM_ADDR_NULL), but the API should change from
> this:

I'd recommend returning -1 on error.  0 is a valid address, and while -1 
is technically a valid address too, when do you ever use this interface to 
allocate something that starts at address -1?  And it gets around all the 
noxious problems involved in returning data through reference parameters.

Eduardo


Re: Multiple device attachments

2011-07-22 Thread Eduardo Horvath
On Fri, 22 Jul 2011, Frank Zerangue wrote:

> I have a hardware configuration with a cmos camera sensor on an i2c bus (for 
> configuring the camera) and connected to an (ipu) image processing controller 
> that acts as a hub for all things video. I envisioned (naturally I think) a 
> camera driver inheriting from two parents 1) i2c bus driver and 2) ipu 
> controller driver. 
> 
> This does not seem like such a strange hardware configuration to me that 
> would not be found on other embedded systems. Does anyone have a suggestion 
> for an appropriate driver hierarchy for such a configuration?


Ouch.

The autoconfig infrastructure is not designed to deal with that sort of 
configuration.  That's also not a traditional multi-path arrangement where 
the same device is visible through two distinct bus controllers.

In this case you have two distinct connections to two separate devices 
that need to be managed together.  (I assume you can't send the same 
command down either the I2C or IPU bus and expect the same behavior?)

Is the I2C controller and ipu controller under the same parent device or 
are they in different parts of the device tree?  If you have e.g. a PCI 
card that provides I2C and IPU functionality it would be best to generate 
a driver for that that manages the entire camera, with possible child 
devices to handle communication over I2C and IPU separately.  However 
coordination between the two would be best handled at the parent.

If not, you need to do some nasty hacks to allow the separate driver bits 
to rendezvous.  I did a really ugly hack like that for the psycho device 
which is a single bus controller which provides two PCI child buses, and 
some registers are shared between the two buses.

But it's definitely best if you could stick to the standard tree approach 
with one parent and one or more children, at least until the config 
framework really has multipath support.

Eduardo

> On Jul 22, 2011, at 12:07 PM, Eduardo Horvath wrote:
> 
> > On Thu, 21 Jul 2011, Frank Zerangue wrote:
> > 
> >> The examples you site seem to indicate that for example the le device may 
> >> attach to many
> >> alternative devices (e.g. pci, tc, …), but only one attachment is made 
> >> when autoconf is complete. I may have 
> >> read the code examples incorrectly -- please pardon me if I did; but what 
> >> I want to know is --  can a 
> >> device have multiple attachments (more than one parent device) when 
> >> autoconf is complete. 
> > 
> > What we have is a device tree.  That means a device instance can only have 
> > one parent.  Once it has been instantiated, that instance, with its 
> > associated instance number, cannot appear anywhere else in the device 
> > tree.
> > 
> > It might be good to extend the device tree to a directed graph at some 
> > point to support multi-pathed devices on fabrics like SAS, Fibre-channel, 
> > or PCIe that allow that sort of thing, but that's a different issue.
> > 
> > Eduardo
> 
> 

Re: Multiple device attachments

2011-07-22 Thread Eduardo Horvath
On Thu, 21 Jul 2011, Frank Zerangue wrote:

> The examples you site seem to indicate that for example the le device may 
> attach to many
> alternative devices (e.g. pci, tc, …), but only one attachment is made when 
> autoconf is complete. I may have 
> read the code examples incorrectly -- please pardon me if I did; but what I 
> want to know is --  can a 
> device have multiple attachments (more than one parent device) when autoconf 
> is complete. 

What we have is a device tree.  That means a device instance can only have 
one parent.  Once it has been instantiated, that instance, with its 
associated instance number, cannot appear anywhere else in the device 
tree.

It might be good to extend the device tree to a directed graph at some 
point to support multi-pathed devices on fabrics like SAS, Fibre-channel, 
or PCIe that allow that sort of thing, but that's a different issue.

Eduardo

Re: Sun keyboard on i386?

2011-07-13 Thread Eduardo Horvath
On Wed, 13 Jul 2011, Mouse wrote:

> I have a desk on which (for reasons not immediately relevant) the main
> head is an i386 machine (4.0.1).  But this has meant I'm stuck using a
> crappy peecee keyboard.
> 
> Today, I put together the interface electronics to put one of my good
> (Sun type 3) keyboards on one of the serial ports.  It works, in that a
> program that talks to the serial port can speak the keyboard's protocol
> and get keystrokes and suchlike.
> 
> I can, if I have to, bludgeon X into being such a program.  But I
> thought I would first try to use the existing kernel code for Sun
> keyboards (which would, I would expect, have the additional advantage
> of working in the text console).  Looking at the kernel configs, I see
> that on sparc64 (and on sparc, though the comments say it's just for
> test building) kbd can attach at com, which is convenient because it's
> exactly what I want to do.  So I appended a handful of lines to my i386
> machine's kernel config, mostly lifted from sparc64:
> 
> define firm_events
> file dev/sun/event.c  firm_events needs-flag
> device kbd: firm_events, wskbddev
> file dev/sun/kbd.ckbd needs-flag
> file dev/sun/kbd_tables.c kbd
> file dev/sun/wskbdmap_sun.c   kbd & wskbd
> attach kbd at com with kbd_tty
> file dev/sun/sunkbd.c kbd_tty
> file dev/sun/kbdsun.c kbd_tty
> kbd0 at com0
> 
> I had to change an #include and remove another to get the kernel to
> compile, and rip a little code out of kbd.c and sunkbd.c to get it to
> link, but surprisingly little.  Less than I was expecting.
> (Specifically: in sunkbd.c,  -> ,
> remove , and rip out both arms of the
> if (args->kmta_consdev) test in sunkbd_attach(); in kbd.c, remove
> sunkbd_wskbd_cn{getc,pollc,bell} and sunkbd_bell_off, remove
> sunkbd_wskbd_consops and the code in kbd_enable that conditionally uses
> it.  Exact diffs available if anyone wants.)
> 
> But it doesn't work.  I added a printf to sunkbd_match, and it's never
> even getting called.  Is there some kind person here who has any idea
> why not and can point me in a useful direction?  I daresay it's
> something that will be blindingly obvious once I see it

On sparc64 the console attach stuff is based on properties it gets from 
OBP.  On x86 that's done in a completely different manner.  You will have 
to do a lot of cnattach hacking.

I'd recommend getting a newer Type 4 or Type 5 USB keyboard instead of 
trying to kluge a serial keyboard.  Those work great on x86.

Eduardo

