Re: ARM + CACHE_LINE_SIZE + DMA

2012-05-23 Thread Svatopluk Kraus
On Mon, May 21, 2012 at 6:20 PM, Ian Lepore
free...@damnhippie.dyndns.org wrote:
 ...
 Some more notes.

 SMP makes things worse, and ARM11mpcore is about SMP too. For example,
 another thread could be opened about how to flush caches (exclusive
 L1 caches) in the SMP case.

 I'm not sure how to correctly change memory attributes on a page which
 is in use. Making a new temporary mapping with different attributes is
 wrong and does not help at all. It's a question of how to do TLB and
 cache flushes on two or more processors and be sure that everything is
 OK. It could be slow, and maybe changing memory attributes on the fly
 is not a good idea at all.


 My suggestion of making a temporary writable mapping was the answer to
 how to correctly change memory attributes on a page which is in use, at
 least in the existing code, which is for a single processor.

 You don't need, and won't even use, the temporary mapping.  You would
 make it just because doing so invokes logic in arm/arm/pmap.c which will
 find all existing virtual mappings of the given physical pages, and
 disable caching in each of those existing mappings.  In effect, it makes
 all existing mappings of the physical pages DMA_COHERENT.  When you
 later free the temporary mapping, all other existing mappings are
 changed back to being cacheable (as long as no more than one of the
 mappings that remain is writable).

 I don't know that making a temporary mapping just for its side effect
 of changing other existing mappings is a good idea; it's just a quick
 and easy thing to do if you want to try changing all existing mappings to
 non-cacheable.  It could be that a better way would be to have the
 busdma_machdep code call directly to lower-level routines in pmap.c to
 change existing mappings without making a new temporary mapping in the
 kernel pmap.  The actual changes to the existing mappings are made by
 pmap_fix_cache() but that routine isn't directly callable right now.


Thanks for the explanation. In fact, I know only a little about the current
ARM pmap implementation in the FreeBSD tree. I took the i386 pmap
implementation and modified it for ARM11mpcore.
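
Just to be sure I understood your suggestion, here is a minimal sketch of
the trick, assuming the single-processor VIVT pmap in the current tree.
The temporary KVA reservation, error handling, and whether these exact
entry points reach pmap_fix_cache() are my assumptions:

/*
 * Entering a second writable mapping should make the pmap disable
 * caching on every existing mapping of the page; removing it should
 * make the remaining mappings cacheable again (via pmap_fix_cache()).
 */
static void
dma_set_uncacheable(vm_page_t m, vm_offset_t tmp_va)
{
        pmap_enter(kernel_pmap, tmp_va, VM_PROT_WRITE, m,
            VM_PROT_READ | VM_PROT_WRITE, FALSE);
}

static void
dma_set_cacheable(vm_offset_t tmp_va)
{
        pmap_remove(kernel_pmap, tmp_va, tmp_va + PAGE_SIZE);
}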

 Also, as far as I know all of this automatic disabling of cache for
 multiple writable mappings applies only to VIVT cache architectures.
 I'm not sure how the pmap code is going to change to support VIPT and
 PIPT caches, but it may no longer be true that making a second writable
 mapping of a page will lead to changing all existing mappings to
 non-cacheable.

 -- Ian


Svata


Re: ARM + CACHE_LINE_SIZE + DMA

2012-05-23 Thread Svatopluk Kraus
Hi,

with respect to your replies, and among other things, the following
summary can be made:

There are three kinds of DMA buffers according to their origin:

1. driver buffers
As Alexander wrote, the buffers should be allocated by
bus_dmamem_alloc(). The function should be implemented to allocate the
buffers correctly aligned, with the help of bus_dma_tag_t. For these
buffers, we can avoid bouncing entirely just by correct driver
implementation. For badly implemented drivers, the bouncing penalty is
paid in the case of unaligned buffers. For BUS_DMA_COHERENT allocations,
as Mark wrote, an allocation pool of coherent pages is a good
optimization.

2. well-known system buffers
Mbufs and VFS buffers. These buffers should be aligned on
CACHE_LINE_SIZE (both start and size).
That should be enough for VFS buffers, as they carry data only and
only whole buffers should be accessed by DMA. An mbuf is a structure,
and its data can be carried in three possible locations. The first one,
the external buffer, should be aligned on CACHE_LINE_SIZE. The other
two locations, which are parts of the mbuf structure, can be
unaligned in general. If we assume that no one else writes to any
part of the mbuf during DMA access, we can set a BUS_DMA_UNALIGNED_SAFE
flag in the mbuf load functions, i.e., we don't bounce unaligned buffers
if the flag is set in the dmamap (see the sketch after this list). A
tunable can be implemented to suppress the flag for debugging purposes.

3. other buffers
As we know nothing about these buffers, we must always bounce unaligned ones.
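
A sketch of the proposed check follows. BUS_DMA_UNALIGNED_SAFE and the
map layout are part of my proposal, not existing busdma API:

/*
 * Bounce an unaligned buffer only when the map was not loaded through
 * a path that guarantees the shared cache lines are private (e.g. the
 * mbuf load functions would set the proposed flag).
 */
static int
must_bounce(bus_dmamap_t map, bus_addr_t addr, bus_size_t size)
{
        /* An aligned start and size never needs to bounce. */
        if (((addr | size) & (CACHE_LINE_SIZE - 1)) == 0)
                return (0);
        /* The caller promised nobody touches the shared cache lines. */
        if (map->flags & BUS_DMA_UNALIGNED_SAFE)
                return (0);
        return (1);
}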

Just two more notes. A DMA buffer should not be accessed by anyone
(except the DMA itself) after PRESYNC and before POSTSYNC. For DMA
descriptors (for example), using bus_dmamem_alloc() with the
BUS_DMA_COHERENT flag could be inevitable.
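
For illustration, a sketch of such a descriptor allocation; 'dev', 'sc',
and 'ring_size' are illustrative names and error handling is omitted:

/*
 * The tag provides the alignment, and BUS_DMA_COHERENT removes the
 * need for partial cache line maintenance on the descriptor ring.
 */
bus_dma_tag_create(bus_get_dma_tag(dev),
    CACHE_LINE_SIZE, 0,                         /* alignment, boundary */
    BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR,       /* lowaddr, highaddr */
    NULL, NULL,                                 /* filter, filterarg */
    ring_size, 1, ring_size,            /* maxsize, nsegments, maxsegsz */
    0, NULL, NULL, &sc->ring_tag);
bus_dmamem_alloc(sc->ring_tag, &sc->ring_va,
    BUS_DMA_COHERENT | BUS_DMA_ZERO, &sc->ring_map);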

As I'm implementing bus dma for ARM11mpcore, I'm doing it with the following assumptions:
1. ARMv6k and higher
2. PIPT data cache
3. SMP ready

Svata


Re: ARM + CACHE_LINE_SIZE + DMA

2012-05-18 Thread Svatopluk Kraus
On Thu, May 17, 2012 at 10:07 PM, Ian Lepore
free...@damnhippie.dyndns.org wrote:
 On Thu, 2012-05-17 at 15:20 +0200, Svatopluk Kraus wrote:
 Hi,

 I'm working on a bus DMA implementation for the ARM11mpcore platform. I've
 looked at the implementation in the ARM tree, but IMHO it only works under
 some assumptions. There is a problem with DMA on a memory block which is
 not aligned on CACHE_LINE_SIZE (start and end) if the memory is not coherent.

 Let's have a DMA buffer which is not aligned on CACHE_LINE_SIZE.
 Then the first cache line associated with the buffer can be divided into
 two parts, A and B, where A is memory we know nothing about and B
 is buffer memory. The same holds for the last cache line associated with
 the buffer. We have no problem if the memory is coherent. Otherwise it
 depends on the memory attributes.

 1. [no cache] attribute
 No problem, as the memory is coherent.

 2. [write through] attribute
 Part A can be invalidated without loss of any data. No problem either.

 3. [write back] attribute
 In general, there is no way to keep both parts consistent. At the
 start of a DMA transaction, the cache line is written back and
 invalidated. However, as we know nothing about the memory associated
 with part A of the cache line, the line can be filled again at any
 time and, if flushed, will mess up the DMA transaction. Even if the
 cache line is only filled, but not flushed, during the DMA transaction,
 we must make it coherent with memory afterwards. There is a trick in
 the current ARM (MIPS) implementation: part A of the line is saved into
 a temporary buffer, the line is invalidated, and part A is restored.
 However, if somebody writes to the memory associated with part A of the
 line during this trick, part A will be messed up. Moreover, part A can
 be part of another DMA transaction.

 To safely use DMA with non-coherent memory, memory with the [no cache] or
 [write through] attributes can be used without problems. Memory with the
 [write back] attribute must be aligned on CACHE_LINE_SIZE.

 However, a DMA buffer can be part of a structure (an mbuf, for example)
 which is aligned on CACHE_LINE_SIZE while the buffer itself is not. We
 may know that nobody will write to the structure during the DMA
 transaction, so it's safe to use the buffer even if it's not aligned
 on CACHE_LINE_SIZE.

 So, in practice, if a DMA buffer is not aligned on CACHE_LINE_SIZE and
 we want to avoid the bounce page overhead, we must supply additional
 information with the DMA transaction. It should be easy to supply that
 information for drivers' own data buffers. However, what about OS data
 buffers like the mentioned mbufs?

 The question is the following. Is it, or can it be, guaranteed for all,
 or at least for the well-known OS data buffers which can take part in
 DMA, that buffers not aligned on CACHE_LINE_SIZE are surrounded by data
 which belongs to the same object as the buffer, and that this data is
 not written by the OS while the buffer is given to a driver?

 Any answer is appreciated. However, 'bounce pages' is not an answer.

 Thanks, Svata

 I'm adding freebsd-arm@ to the CC list; that's where this has been
 discussed before.

 Your analysis is correct... to the degree that it works at all right
 now, it's working by accident.  At work we've been making the good
 accident a bit more likely by setting the minimum allocation size to
 arm_dcache_align in kern_malloc.c.  This makes it somewhat less likely
 that unrelated objects in the kernel are sharing a cache line, but it
 also reduces the effectiveness of the cache somewhat.

 Another factor, not mentioned in your analysis, is the size of the IO
 operation.  Even if the beginning of the DMA buffer is cache-aligned, if
 the size isn't exactly a multiple of the cache line size you still have
 the partial flush situation and all of its problems.

 It's not guaranteed that data surrounding a DMA buffer will be untouched
 during the DMA, even when that surrounding data is part of the same
 conceptual object as the IO buffer.  It's most often true, but certainly
 not guaranteed.  In addition, as Mark pointed out in a prior reply,
 sometimes the DMA buffer is on the stack, and even returning from the
 function that starts the IO operation affects the cacheline associated
 with the DMA buffer.  Consider something like this:

    void do_io()
    {
        int buffer;
        start_read(&buffer);
        // maybe do other stuff here
        wait_for_read_done();
    }

 start_read() gets some IO going, so before it returns a call has been
 made to bus_dmamap_sync(..., BUS_DMASYNC_PREREAD) and an invalidate gets
 done on the cacheline containing the variable 'buffer'.  The act of
 returning from the start_read() function causes that cacheline to get
 reloaded, so now the stale pre-DMA value of the variable 'buffer' is in
 cache again.  Right after that, the DMA completes so that ram has a
 newer value that belongs in the buffer variable and the copy in the
 cacheline is stale.

 Before control gets into the wait_for_read_done() routine

ARM + CACHE_LINE_SIZE + DMA

2012-05-17 Thread Svatopluk Kraus
Hi,

I'm working on a bus DMA implementation for the ARM11mpcore platform. I've
looked at the implementation in the ARM tree, but IMHO it only works under
some assumptions. There is a problem with DMA on a memory block which is
not aligned on CACHE_LINE_SIZE (start and end) if the memory is not coherent.

Let's have a DMA buffer which is not aligned on CACHE_LINE_SIZE.
Then the first cache line associated with the buffer can be divided into
two parts, A and B, where A is memory we know nothing about and B
is buffer memory. The same holds for the last cache line associated with
the buffer. We have no problem if the memory is coherent. Otherwise it
depends on the memory attributes.

1. [no cache] attribute
No problem, as the memory is coherent.

2. [write through] attribute
Part A can be invalidated without loss of any data. No problem either.

3. [write back] attribute
In general, there is no way to keep both parts consistent. At the
start of a DMA transaction, the cache line is written back and
invalidated. However, as we know nothing about the memory associated
with part A of the cache line, the line can be filled again at any
time and, if flushed, will mess up the DMA transaction. Even if the
cache line is only filled, but not flushed, during the DMA transaction,
we must make it coherent with memory afterwards. There is a trick in
the current ARM (MIPS) implementation: part A of the line is saved into
a temporary buffer, the line is invalidated, and part A is restored.
However, if somebody writes to the memory associated with part A of the
line during this trick, part A will be messed up. Moreover, part A can
be part of another DMA transaction.

To safely use DMA with non-coherent memory, memory with the [no cache] or
[write through] attributes can be used without problems. Memory with the
[write back] attribute must be aligned on CACHE_LINE_SIZE.
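
For illustration, a sketch of that save/invalidate/restore trick, modeled
on the MIPS-style handling of a partially owned cache line; the names are
illustrative. This is exactly the code that loses if anybody writes the
surrounding bytes concurrently:

static void
partial_line_inv(vm_offset_t buf, vm_size_t len)
{
        uint8_t save[CACHE_LINE_SIZE];
        vm_offset_t line = buf & ~(CACHE_LINE_SIZE - 1);
        vm_size_t head = buf - line;    /* size of part A */

        if (head != 0) {
                memcpy(save, (void *)line, head);         /* save part A */
                cpu_dcache_inv_range(line, CACHE_LINE_SIZE);
                memcpy((void *)line, save, head);      /* restore part A */
        }
        /* ... and the same dance for the partial line at buf + len. */
}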

However, a DMA buffer can be part of a structure (an mbuf, for example)
which is aligned on CACHE_LINE_SIZE while the buffer itself is not. We
may know that nobody will write to the structure during the DMA
transaction, so it's safe to use the buffer even if it's not aligned
on CACHE_LINE_SIZE.

So, in practice, if a DMA buffer is not aligned on CACHE_LINE_SIZE and
we want to avoid the bounce page overhead, we must supply additional
information with the DMA transaction. It should be easy to supply that
information for drivers' own data buffers. However, what about OS data
buffers like the mentioned mbufs?

The question is the following. Is it, or can it be, guaranteed for all,
or at least for the well-known OS data buffers which can take part in
DMA, that buffers not aligned on CACHE_LINE_SIZE are surrounded by data
which belongs to the same object as the buffer, and that this data is
not written by the OS while the buffer is given to a driver?

Any answer is appreciated. However, 'bounce pages' is not an answer.

Thanks, Svata


Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU

2012-04-04 Thread Svatopluk Kraus
On Wed, Mar 21, 2012 at 5:55 AM, Adrian Chadd adr...@freebsd.org wrote:
 Hi,

 I'm interested in this, primarily because I'm tinkering with file
 storage stuff on my little (mostly wifi-targeted) embedded MIPS
 platforms.

 So what's the story here? How can I reproduce your issue and do some
 of my own profiling/investigation?


 Adrian

Hi,

your interest made me do a more solid and comparable investigation on
my embedded ELAN486 platform. After more test results, I made a full
trace of the related VFS, filesystem, and disk function calls. It took
some time to understand what the issue really is about.

My test case:
A single file copy (no O_FSYNC), meaning that no other filesystem
operation is served. The file size must be big enough relative to the
hidirtybuffers value. Other processes on the machine where the test was
run were almost inactive. The real copy time was profiled. In all
tests, the machine was booted, a file was copied, the file was removed,
and the machine was rebooted. Thus, the file was always copied into the
same disk layout.

The motivation is that my embedded machines mostly don't do any writing
to a disk at all. Only during a software update does a single process
write to the disk (file by file). That need not be a problem at all, but
an update must be successful even under full CPU load. So, the writing
should be tuned up greatly, so as not to affect other processes too much
and to finish in finite time.

On my embedded ELAN486 machines, a flash memory is used as the disk. That
means reading is very fast, but writing is slow. Further, the flash
memory is divided into sectors, and only a complete sector can be erased
at once. A sector erase is a very time-expensive action.

When I tried to tune up VFS by changing various parameters, I found
out that the real copy time depends on two things. Both of them are a
matter of bufdaemon, namely its feature of trying to work harder if
its buffer-flushing mission is failing. It's no surprise that the
best copy times were achieved when bufdaemon was excluded from buffer
flushing entirely (by setting VFS parameters).

This bufdaemon feature brings along (with respect to the real copy time):
1. the bufdaemon runtime itself,
2. very frequent flushing of filesystem buffers.

What really happens in the test case on my machine:

A copy program uses a buffer for copying. The default buffer size is
128 KiB in my case. The simplified sys_write() implementation for
DTYPE_VNODE and the VREG type is the following:

sys_write()
{
 dofilewrite()
 {
  bwillwrite()
  fo_write() = vn_write()
  {
   bwillwrite()
   vn_lock()
   VOP_WRITE()
   VOP_UNLOCK()
  }
 }
}

So, all 128 KiB are written under the VNODE lock. When I take the
machine defaults:

hidirtybuffers: 134
lodirtybuffers: 67
dirtybufthresh: 120
buffer size (filesystem block size): 512 bytes

and do some simple calculations:

134 * 512 = 68608  - high water bytes count
120 * 512 = 61440  - dirty threshold bytes count
67 * 512 = 34304   - low water bytes count

then it's obvious that bufdaemon has something to do during each
sys_write(). However, almost all dirty buffers belong to the new file's
VNODE, and that VNODE is locked. What remain are filesystem buffers
only, i.e., the superblock buffer and the free block bitmap buffers. So,
bufdaemon iterates over the whole dirty buffer queue, which takes a
SIGNIFICANT time on my machine, and almost all the time it does not find
any buffer it is able to flush. If bufdaemon flushes one or two
buffers, kern_yield() is called and a new iteration is started, until no
buffer is flushed. So, very often TWO full iterations over the dirty
buffer queue are done just to flush one or two filesystem buffers and
then to fail to reach the lodirtybuffers threshold. The bufdaemon
runtime grows. Moreover, the frequent filesystem buffer flushing brings
along a higher CPU load (geom down thread, geom up thread, disk thread
scheduling) and a re-ordering of disk block writes. The correct disk
block write order is important for the flash disk. Further, while the
file data buffers are aged but not flushed, the filesystem buffers are
written repeatedly and flushed.

Of course, I use a sector cache in the flash disk, but I can't cache
too many sectors because of the total memory size. So, filesystem disk
blocks are written often, and that evokes more disk sector flushes. A
sector flush really takes a long time, so the real copy time grows
beyond control. Last but not least, the flash memory ages uselessly.

Well, this is my old story. Just to be honest, I quite forgot that my
kernel was compiled with the FULL_PREEMPTION option. Things are very
much worse in that case. However, the option just makes the issue
worse; the issue doesn't disappear without it.

In this old story, I played a game with, and focused on, the
dirtybufthresh value. However, dirtybufthresh changes too much the way
how, and by whom, buffers are flushed. I recorded the disk sector flush
count and the total count of disk_strategy() calls with the BIO_WRITE
command (and the total bytes count to write). I used a file with a size
of 2235517 bytes. When I was caching

Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU

2012-04-04 Thread Svatopluk Kraus
2012/3/21 Konstantin Belousov kostik...@gmail.com:
 On Thu, Mar 15, 2012 at 08:00:41PM +0100, Svatopluk Kraus wrote:
 2012/3/15 Konstantin Belousov kostik...@gmail.com:
  On Tue, Mar 13, 2012 at 01:54:38PM +0100, Svatopluk Kraus wrote:
  On Mon, Mar 12, 2012 at 7:19 PM, Konstantin Belousov
  kostik...@gmail.com wrote:
   On Mon, Mar 12, 2012 at 04:00:58PM +0100, Svatopluk Kraus wrote:
   Hi,
  
      I have been solving the following problem. If a big file (according
   to 'hidirtybuffers') is being written, the write speed is very poor.

      It's observed on a system with an Elan 486 and 32MB RAM (i.e., a low
   speed CPU and not too much memory) running FreeBSD-9.

      Analysis: A file is being written. All or almost all dirty buffers
   belong to the file. The file vnode is locked by the writing process
   almost all the time. buf_daemon() cannot flush any dirty buffer, as the
   chance to acquire the file vnode lock is very low. The number of dirty
   buffers grows very slowly, and with each new dirty buffer more slowly,
   because buf_daemon() eats more and more CPU time by looping over the
   dirty buffer queue (with very little or no effect).

      This slowing down effect is started by buf_daemon() itself, when
   'numdirtybuffers' reaches the 'lodirtybuffers' threshold and
   buf_daemon() is woken up by its own timeout. The timeout fires with an
   'hz' period, but starts to fire at 'hz/10' as soon as buf_daemon()
   fails to reach the 'lodirtybuffers' threshold. When 'numdirtybuffers'
   (now slowly) reaches the ((lodirtybuffers + hidirtybuffers) / 2)
   threshold, buf_daemon() can be woken up from within bdwrite() too, and
   it's much worse. Finally, and very slowly, 'hidirtybuffers' or
   'dirtybufthresh' is reached, the dirty buffers are flushed, and
   everything starts from the beginning...
   Note that for some time, bufdaemon work is distributed among the
   bufdaemon thread itself and any thread that fails to allocate a buffer,
   esp. a thread that owns a vnode lock and covers a long queue of dirty
   buffers.
 
  However, the problem starts when numdirtybuffers reaches the
  lodirtybuffers count and ends around the hidirtybuffers count. There
  are still plenty of free buffers in the system.
 
  
      On the system, the buffer size is 512 bytes and the default
   thresholds are the following:

      vfs.hidirtybuffers = 134
      vfs.lodirtybuffers = 67
      vfs.dirtybufthresh = 120

      For example, a 2MB file is copied to the flash disk in about 3
   minutes and 15 seconds. If dirtybufthresh is set to 40, the copy time
   is about 20 seconds.

      My solution is a mix of three things:
      1. Suppression of the buf_daemon() wakeup by setting bd_request to 1
   in the main buf_daemon() loop.
   I cannot understand this. Please provide a patch that shows what
   you mean there.

        curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
        mtx_lock(&bdlock);
        for (;;) {
  -             bd_request = 0;
  +             bd_request = 1;
                mtx_unlock(&bdlock);
  Is this a complete patch? The change just causes lost wakeups for
  bufdaemon, nothing more.
 Yes, it's a complete patch. And exactly, it causes lost wakeups, which are:
 1. !! UNREASONABLE !!, because bufdaemon is not sleeping,
 2. not wanted, because it looks like correct behaviour for the
 sleep with the hz/10 period. However, if the sleep with the hz/10 period
 is expected to be woken up by bd_wakeup(), then bd_request should be set
 to 0 just before the sleep() call, and then the bufdaemon behaviour will
 be clear.
 No, your description is wrong.

 If bufdaemon is unable to flush enough buffers and numdirtybuffers is
 still greater than lodirtybuffers, then bufdaemon enters the qsleep state
 without resetting bd_request, with timeouts of one tenth of a second.
 Your patch will cause all wakeups for this case to be lost. This is
 exactly the situation when we want bufdaemon to run harder to avoid
 possible deadlocks, not to slow down.

OK. Let's focus on the bufdaemon implementation. Now, the qsleep state is
entered with a random bd_request value. If someone calls bd_wakeup()
during a bufdaemon iteration over the dirty buffer queues, then bd_request
is set to 1. Otherwise, bd_request remains 0. I.e., sometimes the qsleep
state can only time out, and sometimes it can be woken up by
bd_wakeup(). So, is this random behaviour what is wanted?

 All the stuff around bd_request and the bufdaemon sleep is under bd_lock,
 so if bd_request is 0 and bufdaemon is not sleeping, then all wakeups are
 unreasonable! The patch is mainly about that.
 Wakeups themselves are very cheap for the running process. Mostly, it
 comes down to locking the sleepq and waking all threads that are present
 in the sleepq blocked queue. If there are no threads in the queue,
 nothing is done.

Are you serious? Is a spin mutex really cheap? Many calls are cheap, but
they are not cheap regardless of where they are made.

Svata

Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU

2012-03-15 Thread Svatopluk Kraus
2012/3/15 Konstantin Belousov kostik...@gmail.com:
 On Tue, Mar 13, 2012 at 01:54:38PM +0100, Svatopluk Kraus wrote:
 On Mon, Mar 12, 2012 at 7:19 PM, Konstantin Belousov
 kostik...@gmail.com wrote:
  On Mon, Mar 12, 2012 at 04:00:58PM +0100, Svatopluk Kraus wrote:
  Hi,
 
     I have been solving the following problem. If a big file (according
  to 'hidirtybuffers') is being written, the write speed is very poor.

     It's observed on a system with an Elan 486 and 32MB RAM (i.e., a low
  speed CPU and not too much memory) running FreeBSD-9.

     Analysis: A file is being written. All or almost all dirty buffers
  belong to the file. The file vnode is locked by the writing process
  almost all the time. buf_daemon() cannot flush any dirty buffer, as the
  chance to acquire the file vnode lock is very low. The number of dirty
  buffers grows very slowly, and with each new dirty buffer more slowly,
  because buf_daemon() eats more and more CPU time by looping over the
  dirty buffer queue (with very little or no effect).

     This slowing down effect is started by buf_daemon() itself, when
  'numdirtybuffers' reaches the 'lodirtybuffers' threshold and
  buf_daemon() is woken up by its own timeout. The timeout fires with an
  'hz' period, but starts to fire at 'hz/10' as soon as buf_daemon()
  fails to reach the 'lodirtybuffers' threshold. When 'numdirtybuffers'
  (now slowly) reaches the ((lodirtybuffers + hidirtybuffers) / 2)
  threshold, buf_daemon() can be woken up from within bdwrite() too, and
  it's much worse. Finally, and very slowly, 'hidirtybuffers' or
  'dirtybufthresh' is reached, the dirty buffers are flushed, and
  everything starts from the beginning...
  Note that for some time, bufdaemon work is distributed among the
  bufdaemon thread itself and any thread that fails to allocate a buffer,
  esp. a thread that owns a vnode lock and covers a long queue of dirty
  buffers.

 However, the problem starts when numdirtybuffers reaches the
 lodirtybuffers count and ends around the hidirtybuffers count. There are
 still plenty of free buffers in the system.

 
     On the system, the buffer size is 512 bytes and the default
  thresholds are the following:

     vfs.hidirtybuffers = 134
     vfs.lodirtybuffers = 67
     vfs.dirtybufthresh = 120

     For example, a 2MB file is copied to the flash disk in about 3
  minutes and 15 seconds. If dirtybufthresh is set to 40, the copy time
  is about 20 seconds.

     My solution is a mix of three things:
     1. Suppression of the buf_daemon() wakeup by setting bd_request to 1
  in the main buf_daemon() loop.
  I cannot understand this. Please provide a patch that shows what
  you mean there.

       curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
       mtx_lock(&bdlock);
       for (;;) {
 -             bd_request = 0;
 +             bd_request = 1;
               mtx_unlock(&bdlock);
 Is this a complete patch? The change just causes lost wakeups for
 bufdaemon, nothing more.
Yes, it's a complete patch. And exactly, it causes lost wakeups, which are:
1. !! UNREASONABLE !!, because bufdaemon is not sleeping,
2. not wanted, because it looks like correct behaviour for the
sleep with the hz/10 period. However, if the sleep with the hz/10 period
is expected to be woken up by bd_wakeup(), then bd_request should be set
to 0 just before the sleep() call, and then the bufdaemon behaviour will
be clear.

All the stuff around bd_request and the bufdaemon sleep is under bd_lock,
so if bd_request is 0 and bufdaemon is not sleeping, then all wakeups are
unreasonable! The patch is mainly about that.



 I read the description of the bd_request variable. However, bd_request
 should serve as an indicator that buf_daemon() is asleep. I.e., the
 following paradigm should be used:

 mtx_lock(&bdlock);
 bd_request = 0;    /* now is the only time when wakeup() will be meaningful */
 sleep(&bd_request, ..., hz/10);
 bd_request = 1;    /* in case of a timeout, we must set it (bd_wakeup()
 already set it) */
 mtx_unlock(&bdlock);

 My patch follows the paradigm. What happens without the patch in the
 described problem: buf_daemon() fails in its job and goes to sleep
 with an hz/10 period. It supposes that the next early wakeup will do
 nothing too. bd_request is untouched, but buf_daemon() doesn't know if
 its last wakeup was made by bd_wakeup() or by the timeout. So, bd_request
 could be 0, and buf_daemon() can be woken up before hz/10 elapses just by
 bd_wakeup(). Moreover, setting bd_request to 0 when buf_daemon() is not
 asleep can cause time-consuming and useless wakeup() calls with no effect.


Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU

2012-03-13 Thread Svatopluk Kraus
On Mon, Mar 12, 2012 at 7:19 PM, Konstantin Belousov
kostik...@gmail.com wrote:
 On Mon, Mar 12, 2012 at 04:00:58PM +0100, Svatopluk Kraus wrote:
 Hi,

    I have been solving the following problem. If a big file (according
 to 'hidirtybuffers') is being written, the write speed is very poor.

    It's observed on a system with an Elan 486 and 32MB RAM (i.e., a low
 speed CPU and not too much memory) running FreeBSD-9.

    Analysis: A file is being written. All or almost all dirty buffers
 belong to the file. The file vnode is locked by the writing process
 almost all the time. buf_daemon() cannot flush any dirty buffer, as the
 chance to acquire the file vnode lock is very low. The number of dirty
 buffers grows very slowly, and with each new dirty buffer more slowly,
 because buf_daemon() eats more and more CPU time by looping over the
 dirty buffer queue (with very little or no effect).

    This slowing down effect is started by buf_daemon() itself, when
 'numdirtybuffers' reaches the 'lodirtybuffers' threshold and
 buf_daemon() is woken up by its own timeout. The timeout fires with an
 'hz' period, but starts to fire at 'hz/10' as soon as buf_daemon()
 fails to reach the 'lodirtybuffers' threshold. When 'numdirtybuffers'
 (now slowly) reaches the ((lodirtybuffers + hidirtybuffers) / 2)
 threshold, buf_daemon() can be woken up from within bdwrite() too, and
 it's much worse. Finally, and very slowly, 'hidirtybuffers' or
 'dirtybufthresh' is reached, the dirty buffers are flushed, and
 everything starts from the beginning...
 Note that for some time, bufdaemon work is distributed among the
 bufdaemon thread itself and any thread that fails to allocate a buffer,
 esp. a thread that owns a vnode lock and covers a long queue of dirty
 buffers.

However, the problem starts when numdirtybuffers reaches the
lodirtybuffers count and ends around the hidirtybuffers count. There are
still plenty of free buffers in the system.


    On the system, the buffer size is 512 bytes and the default
 thresholds are the following:

    vfs.hidirtybuffers = 134
    vfs.lodirtybuffers = 67
    vfs.dirtybufthresh = 120

    For example, a 2MB file is copied to the flash disk in about 3
 minutes and 15 seconds. If dirtybufthresh is set to 40, the copy time
 is about 20 seconds.

    My solution is a mix of three things:
    1. Suppression of the buf_daemon() wakeup by setting bd_request to 1 in
 the main buf_daemon() loop.
 I cannot understand this. Please provide a patch that shows what
 you mean there.

curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
mtx_lock(&bdlock);
for (;;) {
-   bd_request = 0;
+   bd_request = 1;
mtx_unlock(&bdlock);

I read the description of the bd_request variable. However, bd_request
should serve as an indicator that buf_daemon() is asleep. I.e., the
following paradigm should be used:

mtx_lock(&bdlock);
bd_request = 0;    /* now is the only time when wakeup() will be meaningful */
sleep(&bd_request, ..., hz/10);
bd_request = 1;    /* in case of a timeout, we must set it (bd_wakeup()
already set it) */
mtx_unlock(&bdlock);

My patch follows the paradigm. What happens without the patch in the
described problem: buf_daemon() fails in its job and goes to sleep
with an hz/10 period. It supposes that the next early wakeup will do
nothing too. bd_request is untouched, but buf_daemon() doesn't know if
its last wakeup was made by bd_wakeup() or by the timeout. So, bd_request
could be 0, and buf_daemon() can be woken up before hz/10 elapses just by
bd_wakeup(). Moreover, setting bd_request to 0 when buf_daemon() is not
asleep can cause time-consuming and useless wakeup() calls with no effect.
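
For clarity, here is a sketch of the main loop with this paradigm applied
(heavily abbreviated; "flush work" stands for the real queue scan and the
msleep() details are illustrative):

mtx_lock(&bdlock);
for (;;) {
        bd_request = 1;         /* running: wakeups would be wasted */
        mtx_unlock(&bdlock);

        /* ... flush work ... */

        mtx_lock(&bdlock);
        bd_request = 0;         /* only now is a wakeup() meaningful */
        msleep(&bd_request, &bdlock, PVM, "qsleep", hz / 10);
        /* The top of the loop sets bd_request back to 1. */
}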

     2. Increase of the buf_daemon() fast timeout from hz/10 to hz/4.
     3. Tuning dirtybufthresh to the (((lodirtybuffers + hidirtybuffers) /
  2) - 15) magic value.
  Even hz / 10 is an awfully long time on modern hardware.
  The dirtybufthresh is already a sysctl that you can change.

Yes, I noted it is a low-speed CPU. Don't forget that even if
buf_daemon() sleeps for an hz/4 period (and this is expected to be a rare
case), dirtybufthresh still works and helps. And I'm not pushing the
changes (except the bd_request one, a little). I'm just sharing my
experience.

 The 32MB is indeed around the lowest amount of memory where recent
 FreeBSD can make an illusion of being useful. I am not sure how much
 the system should be tuned by default for such a configuration.

Even recent FreeBSD is pretty useful on this configuration. Of course,
file operations are not the main concern... IMHO, it's always good to
know how the system (and its parts) works in various configurations.


     The mentioned copy time is about 30 seconds now.

     The described problem is just for information, for anyone who might
  be interested in it. Comments are welcome. However, the bd_request
  thing is more general.

     bd_request (despite its description) should be 0 only when
  buf_daemon() is in sleep(). Otherwise, a wakeup() on the bd_request
  channel is useless. Therefore, setting bd_request to 1 in the main
  buf_daemon() loop

[vfs] buf_daemon() slows down write() severely on low-speed CPU

2012-03-12 Thread Svatopluk Kraus
Hi,

   I have been solving the following problem. If a big file (according to
'hidirtybuffers') is being written, the write speed is very poor.

   It's observed on a system with an Elan 486 and 32MB RAM (i.e., a low
speed CPU and not too much memory) running FreeBSD-9.

   Analysis: A file is being written. All or almost all dirty buffers
belong to the file. The file vnode is locked by the writing process
almost all the time. buf_daemon() cannot flush any dirty buffer, as the
chance to acquire the file vnode lock is very low. The number of dirty
buffers grows very slowly, and with each new dirty buffer more slowly,
because buf_daemon() eats more and more CPU time by looping over the
dirty buffer queue (with very little or no effect).

   This slowing down effect is started by buf_daemon() itself, when
'numdirtybuffers' reaches the 'lodirtybuffers' threshold and buf_daemon()
is woken up by its own timeout. The timeout fires with an 'hz' period,
but starts to fire at 'hz/10' as soon as buf_daemon() fails to reach the
'lodirtybuffers' threshold. When 'numdirtybuffers' (now slowly) reaches
the ((lodirtybuffers + hidirtybuffers) / 2) threshold, buf_daemon() can
be woken up from within bdwrite() too, and it's much worse. Finally, and
very slowly, 'hidirtybuffers' or 'dirtybufthresh' is reached, the dirty
buffers are flushed, and everything starts from the beginning...

   On the system, the buffer size is 512 bytes and the default
thresholds are the following:

   vfs.hidirtybuffers = 134
   vfs.lodirtybuffers = 67
   vfs.dirtybufthresh = 120

   For example, a 2MB file is copied to the flash disk in about 3
minutes and 15 seconds. If dirtybufthresh is set to 40, the copy time
is about 20 seconds.

   My solution is a mix of three things:
   1. Suppression of the buf_daemon() wakeup by setting bd_request to 1 in
the main buf_daemon() loop.
   2. Increase of the buf_daemon() fast timeout from hz/10 to hz/4.
   3. Tuning dirtybufthresh to the (((lodirtybuffers + hidirtybuffers) /
2) - 15) magic value.

   The mentioned copy time is about 30 seconds now.

   The described problem is just for information, for anyone who might be
interested in it. Comments are welcome. However, the bd_request thing is
more general.

   bd_request (despite its description) should be 0 only when
buf_daemon() is in sleep(). Otherwise, a wakeup() on the bd_request
channel is useless. Therefore, setting bd_request to 1 in the main
buf_daemon() loop is correct and better, as it saves the time spent by
wakeup() on a nonexistent channel.

  Svata


pccbb device doesn't send device_shutdown() to children (reboot freezes)

2012-03-12 Thread Svatopluk Kraus
Hi,

   I solved a very curious problem with a system (FreeBSD-9) rarely
freezing during reboot. Finally, I found out that the system freezes in
the ep device callout. The relevant part of the device tree is the
following:

   - pccbb - pccard - ep

   The cbb_pci_shutdown() method in the pccbb device places the cards in
reset, turns off the interrupts, and powers down the socket. No child
has a chance to know about it. Thus, if the ep device callout fires
between device_shutdown() being called on the root device and the
interrupts being disabled, the callout freezes in a never-ending while
loop, which reads status from the hardware (now without power).

   I propose the following change (edited by hand) in cbb_pci_shutdown():

struct cbb_softc *sc = (struct cbb_softc *)device_get_softc(brdev);
+
+   /* Inform all children. */
+   bus_generic_shutdown(brdev);
+
/*
 * We're about to pull the rug out from the card, so mark it as
 * gone to prevent harm.
 */
sc->cardok = 0;

Furthermore, the ep device (the ed device too, ...?) has no
device_shutdown method implemented. So, fixing the pccbb device is not
enough to solve the freezing problem. I somehow patched the mentioned
devices too, but maybe someone more competent should do it for the
FreeBSD tree.
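
For illustration, a sketch of the missing method for the ep device, along
the lines of what I patched locally. The softc field names and the stop
routine are assumptions, not verified against the driver:

static int
ep_shutdown(device_t dev)
{
        struct ep_softc *sc = device_get_softc(dev);

        EP_LOCK(sc);                            /* assumed locking macro */
        epstop(sc);                        /* assumed chip-stop routine */
        callout_stop(&sc->watchdog_timer);      /* assumed callout name */
        EP_UNLOCK(sc);
        return (0);
}

/* ... plus DEVMETHOD(device_shutdown, ep_shutdown) in the method table. */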

Svata


i386 - pmap_enter() superpage promotion on kernel addresses

2011-11-09 Thread Svatopluk Kraus
Hi,

   I'm tuning the pmap code for the arm11 mpcore port, which is inspired
by the i386 one. My question is about superpage promotions on kernel
addresses in the i386 pmap code. pmap_promote_pde() is called from
pmap_enter() only, and only if the following conditions are fulfilled:

   1. promotions are enabled,
   2. all pages in a superpage are allocated (physical space condition),

   and for user addresses,

   3. all pages in a superpage are mapped (virtual space condition).

   For kernel addresses, the third condition is not checked. I
understand that it is not easy to evaluate the third condition for
kernel addresses. However, pmap_promote_pde() can now often be called
unnecessarily, and it's a rather expensive call. Or is there any
other reason for that?

   Moreover, there are many temporary mappings (pmap_qenter(), ...) in
the kernel, and if pmap_promote_pde() is called without the third
condition, the promotion can be successful. As temporary mappings know
nothing about promotions and demotions, that is a fault. Or can a
superpage with a temporary kernel mapping never be promoted because of
locking or something else?

   The third condition is evaluated on a page table basis (wire_count is
used) for user addresses. Page tables for kernel addresses have their
wire count set to 0 or 1. Page tables preallocated during boot are
post-initialized in pmap_init(), but wire_count is left untouched
(wire_count is 0). Page tables allocated in pmap_growkernel() are
allocated wired (wire_count is 1).

   [branch] If a kernel superpage is demoted in pmap_demote_pde() and the
corresponding page table's wire_count is 1, the page table is uselessly
re-initialized like a newly allocated one.

   My idea is that kernel address mappings made in pmap_enter() can be
marked 'stable' (as opposed to 'temporary') and counted by wire_count
in the same way as for user addresses; then the third condition could be
applied and would be fulfilled only for these 'stable' mappings (which
know about promotions and demotions). Is anything wrong with this
idea?
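
   In i386 terms, the check could look like the following sketch
(abbreviated and assumption-laden; the 'stable' kernel mappings would
have to maintain the page table page's wire_count for it to work):

/*
 * Attempt promotion only when the page table page backing 'va' is
 * fully populated, i.e. when its wire_count shows that all NPTEPG
 * mappings exist, exactly as the user-address path already tests.
 */
pd_entry_t *pde = pmap_pde(pmap, va);
vm_page_t mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME);

if (pg_ps_enabled && mpte->wire_count == NPTEPG)
        pmap_promote_pde(pmap, pde, va);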

   Svata


Re: mmap performance and memory use

2011-10-31 Thread Svatopluk Kraus
On Fri, Oct 28, 2011 at 7:38 AM, Alan Cox a...@rice.edu wrote:
 On 10/26/2011 06:23, Svatopluk Kraus wrote:

 Hi,

 well, I'm working on a new port (arm11 mpcore), and pmap_enter_object()
 is what I'm debugging right now. And I did not find any way in
 userland to force the kernel to call pmap_enter_object(), which makes a
 SUPERPAGE mapping without promotion. I tried to call mmap() with
 MAP_PREFAULT_READ without success. I tried to call madvise() with
 MADV_WILLNEED without success too.


 mmap() should call pmap_enter_object() if MAP_PREFAULT_READ was specified.
  I'm surprised to hear that it's not happening for you.

Yes, it's really not happening for me.

mmap() with MAP_PREFAULT_READ case:

vm_mmap() in sys/vm/vm_mmap.c (r225617)
line 1501 - if MAP_ANON then docow = 0
line 1525 - vm_map_find() is called with zeroed docow

It's propagated down the calling stack, so even vm_map_pmap_enter() is
not called in vm_map_insert(). Most likely, this is correct.
(Anonymous object -> no physical memory allocation in advance -> no
SUPERPAGE mapping without promotion.)

madvise() with MADV_WILLNEED case:
--
vm_map_pmap_enter() in sys/vm/vm_map.c (r223825)
line 1814 - vm_page_find_least() is called

During madvise(), vm_map_pmap_enter() is called. However, in the
call, vm_page_find_least() returns NULL. It returns NULL if no page
is allocated in the object with a pindex greater than or equal to the
pindex parameter. The loop following the call says that if no page is
allocated for the SUPERPAGE (i.e., for the given region),
pmap_enter_object() is not called, and this is correct.
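
For completeness, a usage sketch of the file-backed path that does reach
vm_map_pmap_enter(): mapping a large, memory-resident file read-only with
MAP_PREFAULT_READ, which may pre-map it with superpages. As traced above,
MAP_ANON does not take this path:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

static void *
map_prefault(const char *path, size_t len)
{
        int fd;
        void *p;

        if ((fd = open(path, O_RDONLY)) == -1)
                return (NULL);
        p = mmap(NULL, len, PROT_READ, MAP_SHARED | MAP_PREFAULT_READ,
            fd, 0);
        close(fd);
        return (p == MAP_FAILED ? NULL : p);
}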

snip

  Moreover, the SUPERPAGE mapping is made read-only at first. So, even if
  I have a SUPERPAGE mapping without promotion, the mapping is demoted
  after the first write, and promoted again after all underlying pages
  are accessed by writes. There is no 4K page table saving any longer.


 Yes, that is all true.  It is possible to change things so that the page
 table pages are reclaimed after a time, and not kept around indefinitely.
  However, this not high on my personal priority list.  Before that, it is
 more likely that I will add an option to avoid the demotion on write, if we
 don't have to copy the entire superpage to do so.

Well, I just wanted to remark that there is no 4K page table saving
now. However, there is still a big saving of TLB entries with SUPERPAGE
promotions. I'm not pushing you to do anything.

I understand that allocating physical pages in advance is not a good
idea and goes against the great copy-on-write feature. However,
something like MAP_PREFAULT_WRITE on MAP_ANON, which would allocate all
physical pages in advance and do a SUPERPAGE mapping without promotion,
sounds like a good-but-really-specific feature which could be utilized
sometimes. Nevertheless, IMHO, it's not worth doing such a specific
feature.

Svata


Re: mmap performance and memory use

2011-10-26 Thread Svatopluk Kraus
Hi,

well, I'm working on a new port (arm11 mpcore), and pmap_enter_object()
is what I'm debugging right now. And I did not find any way in
userland to force the kernel to call pmap_enter_object(), which makes a
SUPERPAGE mapping without promotion. I tried to call mmap() with
MAP_PREFAULT_READ without success. I tried to call madvise() with
MADV_WILLNEED without success too.

To make a SUPERPAGE mapping, it's obvious that all physical pages under
the SUPERPAGE must be allocated in the vm_object. And the SUPERPAGE
mapping must be done before the first access to them; otherwise a
promotion is on the way. MAP_PREFAULT_READ does nothing about that. If
madvise() is used, vm_object_madvise() is called, but only cached pages
are allocated in advance. Of course, allocating all the physical memory
behind a virtual address space in advance is not preferred in most
situations.

For example, I want to do some computation on a 4M memory space (I know
that each byte will be accessed) and want to utilize a SUPERPAGE mapping
without promotion, to save a 4K page table (on an i386 machine). However,
malloc() leads to promotion; mmap() with MAP_PREFAULT_READ doesn't do
anything, so the SUPERPAGE mapping is promoted; and madvise() with
MADV_WILLNEED calls vm_object_madvise(), but because the pages are not
cached (how could they be for anonymous memory), it does not work
without promotion either.

So, a SUPERPAGE mapping without promotions is fine, but it can be done
only if the physical memory being mapped is already allocated. Is it
really possible to force that from userland?

Moreover, the SUPERPAGE mapping is made read-only at first. So, even if
I have a SUPERPAGE mapping without promotion, the mapping is demoted
after the first write, and promoted again after all underlying pages are
accessed by writes. There is no 4K page table saving any longer.

   Svata

On Wed, Oct 26, 2011 at 1:35 AM, Alan Cox a...@rice.edu wrote:
 On 10/10/2011 4:28 PM, Wojciech Puchar wrote:

 Notice that vm.pmap.pde.promotions increased by 31.  This means that 31
 superpage mappings were created by promotion from small page mappings.

  thank you. i looked at .mappings as it seemed logical to me that it
  shows the total.

 In contrast, vm.pmap.pde.mappings counts superpage mappings that are
 created directly and not by promotion from small page mappings.  For
 example, if a large executable, such as gcc, is resident in memory, the text
 segment will be pre-mapped using superpage mappings, avoiding soft fault and
 promotion overhead.  Similarly, mmap(..., MAP_PREFAULT_READ) on a large,
 memory resident file may pre-map the file using superpage mappings.

  your options are not described in the mmap manpage, nor in madvise
  (MAP_PREFAULT_READ).

  when can i find the up-to-date manpage or description?


 A few minutes ago, I merged the changes to support and document
 MAP_PREFAULT_READ into 8-STABLE.  So, now it exists in HEAD, 9.0, and
 8-STABLE.

 Alan





threads runtime value is incorrect (tc_cpu_ticks() problem)

2011-06-22 Thread Svatopluk Kraus
Hi,

  I've tested FreeBSD-current from June 16, 2011 on x86 (an AMD Elan
SC400). I found out that the sum of the runtimes of all threads is about
120 minutes after 180 minutes of system uptime, and the difference is
getting worse with time. The problem is in the tc_cpu_ticks()
implementation, which takes into account just one timecounter overflow,
but on the tested BSP (a 16-bit hardware counter) more than one overflow
very often occurs between two tc_cpu_ticks() calls.
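
  For reference, this is the implementation in question (quoted from
sys/kern/kern_tc.c, with my comment added):

static uint64_t
tc_cpu_ticks(void)
{
        static uint64_t base;
        static unsigned last;
        unsigned u;
        struct timecounter *tc;

        tc = timehands->th_counter;
        u = tc->tc_get_timecount(tc) & tc->tc_counter_mask;
        if (u < last)
                /* One wrap is assumed; a second wrap is silently lost. */
                base += (uint64_t)tc->tc_counter_mask + 1;
        last = u;
        return (u + base);
}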

  I understand that a 16-bit timecounter is a real relic nowadays, but
I would like to solve the problem somehow reasonably. I have a few
questions.

  According to the description in the definition of the timecounter
structure (sys/timetc.h), tc_get_timecount() should read the counter and
tc_counter_mask should mask off any unimplemented bits. In
tc_cpu_ticks(), if the tick count returned from tc_get_timecount()
overflows, then (tc_counter_mask + 1) is added to the result.

  However, the timecounter hardware can be initialized to a value from
the interval (0, tc_counter_mask>, so if the description of
tc_get_timecount() doesn't lie, then adding (tc_counter_mask + 1) every
time is not correct. A better description, which satisfies the
tc_cpu_ticks() implementation, is that tc_get_timecount() should count
the ticks in the interval <0, tc_counter_mask>. That's what
i8254_get_timecount() (in sys/x86/isa/clock.c) really does. However,
if tc_get_timecount() should count the ticks (and not just read the
counter), then could it count the ticks in the full uint64_t range? The
tc_cpu_ticks() implementation could then be very simple (no masking, no
overflow checking). In i8254_get_timecount(), it is enough to change the
global variable 'i8254_offset' and the local variable 'count' from
uint16_t to uint64_t.

  Now, cpu_ticks() (which points to tc_cpu_ticks() by default) is
called from mi_switch(), which must be called often enough to satisfy
the tc_cpu_ticks() implementation (so it sees at most one timecounter
overflow). That limits some system parameters (at least the hz
selection).

  It looks like tc_counter_mask is a little bit misused?

  Maybe tc_cpu_ticks() is only used for backward compatibility, and a
new system should use set_cputicker() to change this default?

  Thanks for some help to understand this better.

  Svata


Re: threads runtime value is incorrect (tc_cpu_ticks() problem)

2011-06-22 Thread Svatopluk Kraus
On Wed, Jun 22, 2011 at 1:40 PM, Uffe Jakobsen u...@uffe.org wrote:


 On 2011-06-22 12:33, Svatopluk Kraus wrote:

 Hi,

   I've tested FreeBSD-current from June 16, 2011 on x86 (an AMD Elan
 SC400). I found out that the sum of the runtimes of all threads is about
 120 minutes after 180 minutes of system uptime, and the difference is
 getting worse with time. The problem is in the tc_cpu_ticks()
 implementation, which takes into account just one timecounter overflow,
 but on the tested BSP (a 16-bit hardware counter) more than one overflow
 very often occurs between two tc_cpu_ticks() calls.

   I understand that a 16-bit timecounter is a real relic nowadays, but
 I would like to solve the problem somehow reasonably. I have a few
 questions.

   According to the description in the definition of the timecounter
 structure (sys/timetc.h), tc_get_timecount() should read the counter and
 tc_counter_mask should mask off any unimplemented bits. In
 tc_cpu_ticks(), if the tick count returned from tc_get_timecount()
 overflows, then (tc_counter_mask + 1) is added to the result.

   However, the timecounter hardware can be initialized to a value from
 the interval (0, tc_counter_mask>, so if the description of
 tc_get_timecount() doesn't lie, then adding (tc_counter_mask + 1) every
 time is not correct. A better description, which satisfies the
 tc_cpu_ticks() implementation, is that tc_get_timecount() should count
 the ticks in the interval <0, tc_counter_mask>. That's what
 i8254_get_timecount() (in sys/x86/isa/clock.c) really does. However,
 if tc_get_timecount() should count the ticks (and not just read the
 counter), then could it count the ticks in the full uint64_t range? The
 tc_cpu_ticks() implementation could then be very simple (no masking, no
 overflow checking). In i8254_get_timecount(), it is enough to change the
 global variable 'i8254_offset' and the local variable 'count' from
 uint16_t to uint64_t.

   Now, cpu_ticks() (which points to tc_cpu_ticks() by default) is
 called from mi_switch(), which must be called often enough to satisfy
 the tc_cpu_ticks() implementation (so it sees at most one timecounter
 overflow). That limits some system parameters (at least the hz
 selection).

   It looks like tc_counter_mask is a little bit misused?

   Maybe tc_cpu_ticks() is only used for backward compatibility, and a
 new system should use set_cputicker() to change this default?

   Thanks for some help to understand this better.


 I'm by no means an expert in this field - but your mentioning of the AMD
 Elan SC400 triggered some old knowledge about the AMD Elan SC520.

 If you have a look at sys/i386/i386/elan-mmcr.c, the function
 init_AMD_Elan_sc520() addresses the fact that the i8254 has a
 nonstandard frequency, with the AMD Elan SC520 at least - could it be
 the same with the SC400?

You are correct, the AMD Elan SC400 i8254 has a nonstandard frequency, but
that's not the problem. After system startup, no new threads start and
no threads exit, but the sum of the runtimes of all existing threads is
much, much less than the system uptime, and the difference gets worse with
time. There is only one timecounter in the system. The system uptime is
correct and corresponds to the time measured by my watch.

Svata


Re: ichsmb - correct locking strategy?

2011-02-23 Thread Svatopluk Kraus
On Tue, Feb 22, 2011 at 3:37 PM, John Baldwin j...@freebsd.org wrote:
 On Friday, February 18, 2011 9:10:47 am Svatopluk Kraus wrote:
 Hi,

   I'm trying to figure out the locking strategy in FreeBSD and found the
 'ichsmb' device. There is a mutex which protects the SMB bus (the ichsmb
 device). For example, in ichsmb_readw() in sys/dev/ichsmb/ichsmb.c, the
 mutex is locked and a command is written to the bus, then an unbounded
 (but with timeout) sleep is done (the mutex is unlocked during the
 sleep). After the sleep, a word is read from the bus and the mutex is
 unlocked.

   1. If use of the device IS NOT serialized by the layers around, then
 more calls to this function (or others) can be started, or even
 completed, before the first one is finished. The mutex doesn't protect
 the SMB bus.

   2. If use of the device IS serialized by the layers around, then the
 mutex is useless.

   Moreover, I haven't even mentioned the interrupt routine, which uses
 the mutex and the SMB bus too.

   Am I right? Or did I miss something?

 Hmm, the mutex could be useful if you have an smb controller with an interrupt
 handler (I think ichsmb or maybe intpm can support an interrupt handler) to
 prevent concurrent access to device registers.  That is the purpose of the
 mutex at least.  There is a separate locking layer in smbus itself (see
 smbus_request_bus(), etc.).

 --
 John Baldwin


I see. So, multiple accesses to the bus are protected by the upper smbus
layer itself. And the mutex encloses each single access with respect to
the interrupt. I.e., an interrupt can be matched to a command (the bus is
either processing a command or idle), and waiting for a command result
can be done atomically (no wakeup is missed). Am I right?

BTW, mutex priority propagation isn't exploited too much here.
Maybe it will be better for me not to take this feature into account
when thinking about a locking strategy, and just take a mutex in most
cases as the low-level locking primitive it indeed is. Well, it seems
that things are becoming clearer.


Re: ichsmb - correct locking strategy?

2011-02-21 Thread Svatopluk Kraus
On Fri, Feb 18, 2011 at 4:09 PM, Hans Petter Selasky hsela...@c2i.net wrote:
 On Friday 18 February 2011 15:10:47 Svatopluk Kraus wrote:
 Hi,

   I'm trying to figure out the locking strategy in FreeBSD and found the
 'ichsmb' device. There is a mutex which protects the SMB bus (the ichsmb
 device). For example, in ichsmb_readw() in sys/dev/ichsmb/ichsmb.c, the
 mutex is locked and a command is written to the bus, then an unbounded
 (but with timeout) sleep is done (the mutex is unlocked during the
 sleep). After the sleep, a word is read from the bus and the mutex is
 unlocked.

   1. If use of the device IS NOT serialized by the layers around, then
 more calls to this function (or others) can be started, or even
 completed, before the first one is finished. The mutex doesn't protect
 the SMB bus.

   2. If use of the device IS serialized by the layers around, then the
 mutex is useless.

   Moreover, I haven't even mentioned the interrupt routine, which uses
 the mutex and the SMB bus too.

   Am I right? Or did I miss something?

 man sx ?

 struct sx ?

 --HPS


Thanks for your reply. It seems that everybody knows that the ichsmb
driver is not in good shape, but nobody cares about it ...

Svata


ichsmb - correct locking strategy?

2011-02-18 Thread Svatopluk Kraus
Hi,

  I'm trying to figure out the locking strategy in FreeBSD and found the
'ichsmb' device. There is a mutex which protects the SMB bus (the ichsmb
device). For example, in ichsmb_readw() in sys/dev/ichsmb/ichsmb.c, the
mutex is locked and a command is written to the bus, then an unbounded
(but with timeout) sleep is done (the mutex is unlocked during the
sleep). After the sleep, a word is read from the bus and the mutex is
unlocked.
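
  A sketch of the pattern in question (abbreviated, not a verbatim copy
of ichsmb_readw()):

mtx_lock(&sc->mutex);
/* ... program the command registers ... */
error = msleep(sc, &sc->mutex, PZERO, "ichsmb", hz / 4);
/* The mutex was dropped across msleep(): another command could have
   been issued meanwhile. */
/* ... read the response registers ... */
mtx_unlock(&sc->mutex);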

  1. If use of the device IS NOT serialized by the layers around, then
more calls to this function (or others) can be started, or even
completed, before the first one is finished. The mutex doesn't protect
the SMB bus.

  2. If use of the device IS serialized by the layers around, then the
mutex is useless.

  Moreover, I haven't even mentioned the interrupt routine, which uses
the mutex and the SMB bus too.

  Am I right? Or did I miss something?

   Svata


Sleepable locks with priority propagation?

2011-02-18 Thread Svatopluk Kraus
Hi,

I deal with devices (an i2c bus, flash memory) which are quite slow,
i.e. some or mostly all operations on them are quite slow. One must wait
on them for a rather long time. Use of DELAY() is too expensive, and an
inactive (so-called unbounded) wait isn't permitted with mutexes. So, no
locking primitive with priority propagation can be exploited.

 A typical simple operation (which must be locked) is the following:

  HW_LOCK
  write request (fast)
  wait for processing (quite slow)
  read response (fast)
  HW_UNLOCK

Here, use of a mutex with mtx_sleep() is impossible, as the mutex is
unlocked during the (unbound) sleep and somebody can start a new
operation. A response which is read after the sleep could be incorrect.
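
One idiom that would serialize this correctly is a busy flag in the softc
guarded by the mutex (a sketch with hypothetical names -- 'sc->busy',
write_request(), read_response() -- not taken from any real driver):

	mtx_lock(&sc->mtx);
	while (sc->busy)	/* wait for our turn on the bus */
		mtx_sleep(&sc->busy, &sc->mtx, 0, "hwbusy", 0);
	sc->busy = 1;		/* the bus is ours, even across sleeps */
	write_request(sc);				/* fast */
	error = mtx_sleep(sc, &sc->mtx, 0, "hwcmd", hz); /* quite slow */
	read_response(sc);	/* fast; nobody else could have started */
	sc->busy = 0;
	wakeup(&sc->busy);	/* let the next caller in */
	mtx_unlock(&sc->mtx);

This keeps the response correct, but as noted below it still gives no
priority propagation while the owner of the busy flag sleeps.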

Well, I deal with hardware, so all sleeps on it could in principle be
infinite. It's the driver writer's responsibility to ensure that all
situations which can lead to infinite waits are handled correctly and
can't happen. The waits (if no error happens on the hardware, which must
be handled anyway) could be rather long, but with known limits from the
specification. Long, but not unbounded.

I lack a locking primitive with priority propagation on which inactive
waits are permitted. I'm not very familiar with the locking
implementation strategy in FreeBSD, but am I the only one who needs such
a lock? Or is such a lock not permitted for really good reasons?

Well, I know that only the KASSERT in mtx_init() (a mutex can't be made
SLEEPABLE) and witness in mtx_sleep() (can't sleep on UNSLEEPABLE locks)
guard the whole thing. But should I hack around it and use a mutex as the
locking primitive I need? (Of course, I would always use it with a
timeout.)
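
For completeness, the sx(9) route (a sketch; write_request() and
read_response() are hypothetical): an sx lock may be held across an
inactive wait, at the cost of having no priority propagation at all:

	struct sx hw_lock;

	sx_init(&hw_lock, "hwbus");

	sx_xlock(&hw_lock);		/* may be held while sleeping */
	write_request();		/* fast */
	pause("hwwait", hz / 100);	/* inactive wait, lock stays held */
	read_response();		/* fast */
	sx_xunlock(&hw_lock);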

 Thanks for any response,

Svata
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


A question about WARNING: attempt to domain_add(xyz) after domainfinalize()

2011-01-12 Thread Svatopluk Kraus
Hi,

I'd like to add a new network domain into the kernel (and never remove
it) from a loadable module. In fact, I did it, but I got the following
warning from domain_add(): WARNING: attempt to domain_add(xyz) after
domainfinalize(). Now, I'm trying to figure out what is behind the
warning, which seems destined to become a KASSERT (currently sitting in a
'notyet' section of the code, which is 6 years old).
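
For reference, the registration looks roughly like this (a sketch with
hypothetical names; the protosw contents and the family number are made
up, and real protocol entries are elided):

	#include <sys/param.h>
	#include <sys/kernel.h>
	#include <sys/protosw.h>
	#include <sys/domain.h>

	static struct protosw xyz_protosw[1];	/* real entries elided */

	static struct domain xyzdomain = {
		.dom_family = 123,		/* hypothetical AF_* value */
		.dom_name = "xyz",
		.dom_protosw = xyz_protosw,
		.dom_protoswNPROTOSW = &xyz_protosw[1],
	};

	/*
	 * Expands to SYSINITs that call domain_add(); when the module is
	 * loaded after boot, this runs after domainfinalize() and prints
	 * the warning quoted above.
	 */
	DOMAIN_SET(xyz);

Individual protocols can also be slotted into an existing domain's
PROTO_SPACER entries via pf_proto_register(), as I mention below.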

I found a few iterations over the domains list and over each domain's
protosw table which are not protected by any lock. OK, that is a problem,
but when I only add a domain (it's added at the head of the domains list)
and never remove it, then it could be safe. Moreover, it seems that
without any limits it is possible to add a new protocol into a domain at
a reserved place labeled PROTO_SPACER via the pf_proto_register()
function. Well, it's not a list, so that's a different case (but the copy
into the spacer isn't an atomic operation).

I found two global variables (max_hdr, max_datalen) which are recomputed
in each domain_init() from other variables (max_linkhdr, max_protohdr),
and a global variable (max_keylen) which is computed from all known
domains (the dom_maxrtkey entry). These variables are used in other parts
of the kernel. Further, I know about the 'dom_ifattach' and
'dom_ifdetach' function pointers defined on each domain, which are
responsible for the 'if_afdata' entry in the ifnet structure.

Is there something more I didn't find in the current kernel?
Will there be something more in future kernels that would legitimize a
KASSERT in domain_add()?

My network domain doesn't influence any of the mentioned global
variables, doesn't define dom_ifattach() and dom_ifdetach() functions,
and should only be added from a loadable module and never removed. So, I
think it's safe. But I'm a little bit nervous because of the planned
KASSERT in domain_add().

Well, I could implement an empty domain with some spacers for protocols,
link it into the kernel (thus silencing the warning), and make a loadable
module in which I only register the protocols I want on the already-added
domain, but why should I do it in that (for me, good-for-nothing) way?

 Thanks for any response, Svata
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: page table fault, which should map kernel virtual address space

2010-10-05 Thread Svatopluk Kraus
On Mon, Oct 4, 2010 at 2:03 AM, Alan Cox alan.l@gmail.com wrote:
 On Thu, Sep 30, 2010 at 6:28 AM, Svatopluk Kraus onw...@gmail.com wrote:

 On Tue, Sep 21, 2010 at 7:38 PM, Alan Cox alan.l@gmail.com wrote:
  On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com
  wrote:
  Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
  pager_map, ...) exist as a result of calling the 'kmem_suballoc'
  function. When these submaps are used (for example by the
  'kmem_alloc_nofault' function) and their virtual address subspace lies
  at the current end of the used kernel virtual address space (and above
  the 'NKPT' preallocation), then the missing page tables are not
  allocated and a double fault can happen.
 
 
  No, the page tables are allocated.  If you create a submap X of the
  kernel
  map using kmem_suballoc(), then a vm_map_findspace() is performed by
  vm_map_find() on the kernel map to find space for the submap X.  As you
  note
  above, the call to vm_map_findspace() on the kernel map will call
  pmap_growkernel() if needed to extend the kernel page table.
 
  If you create another submap X' of X, then that submap X' can only map
  addresses that fall within the range for X.  So, any necessary page
  table
  pages were allocated when X was created.

 You are right. Mea culpa. I was focused on a solution and jumped to a
 conclusion too quickly. The page table fault hit in 'pager_map', which
 is a submap of 'clean_map', and when I debugged the problem I didn't
 see the submap structure as a whole.

  That said, there may actually be a problem with the implementation of
  the
  superpage_align parameter to kmem_suballoc().  If a submap is created
  with
  superpage_align equal to TRUE, but the submap's size is not a multiple
  of
  the superpage size, then vm_map_find() may not allocate a page table
  page
  for the last megabyte or so of the submap.
 
  There are only a few places where kmem_suballoc() is called with
  superpage_align set to TRUE.  If you changed them to FALSE, that is an
  easy
  way to test this hypothesis.

 Yes, it helps.

 My story is that the problem showed up when I updated a project (a
 'coldfire' port) based on FreeBSD 8.0 to the current FreeBSD version.
 In the current version the 'clean_map' submap is created with
 superpage_align set to TRUE.

 I have looked at vm_map_find() and debugged the page table fault once
 again. IMO, the do-while loop in that function does not work as
 intended. vm_map_findspace() finds a space and calls pmap_growkernel()
 if needed. pmap_align_superpage() then adjusts the space but never
 calls pmap_growkernel(). Finally, vm_map_insert() inserts the aligned
 space into the map without error; it never calls pmap_growkernel() and
 so never triggers another loop iteration.

 I don't know much about how the virtual memory model is implemented
 and used in other modules. But it seems that it could be more
 reasonable to align the address space inside vm_map_findspace() than
 to loop around it externally.

 I have tried to add a check in vm_map_insert() that checks the 'end'
 parameter against the 'kernel_vm_end' variable and returns
 KERN_NO_SPACE if needed. In this case the loop in vm_map_find() works
 and I have no problem with the page table fault. But the
 'kernel_vm_end' variable must be initialized properly before the first
 use of vm_map_insert(). The 'kernel_vm_end' variable can
 self-initialize in pmap_growkernel() in FreeBSD 8.0 (which is too
 late), but that was changed in the current version (the 'i386' port).

 Thanks for your answer, but I'm still looking for a permanent and
 approved solution.

 I have a patch that implements one possible fix for this problem.  I'll
 probably commit that patch in the next day or two.

 Regards,
 Alan

I tried your patch and it works. Many thanks.

Regards, Svata
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: page table fault, which should map kernel virtual address space

2010-09-30 Thread Svatopluk Kraus
On Tue, Sep 21, 2010 at 7:38 PM, Alan Cox alan.l@gmail.com wrote:
 On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com wrote:
 Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
 pager_map, ...) exist as a result of calling the 'kmem_suballoc'
 function. When these submaps are used (for example by the
 'kmem_alloc_nofault' function) and their virtual address subspace lies
 at the current end of the used kernel virtual address space (and above
 the 'NKPT' preallocation), then the missing page tables are not
 allocated and a double fault can happen.


 No, the page tables are allocated.  If you create a submap X of the kernel
 map using kmem_suballoc(), then a vm_map_findspace() is performed by
 vm_map_find() on the kernel map to find space for the submap X.  As you note
 above, the call to vm_map_findspace() on the kernel map will call
 pmap_growkernel() if needed to extend the kernel page table.

 If you create another submap X' of X, then that submap X' can only map
 addresses that fall within the range for X.  So, any necessary page table
 pages were allocated when X was created.

You are right. Mea culpa. I was focused on a solution and jumped to a
conclusion too quickly. The page table fault hit in 'pager_map', which is
a submap of 'clean_map', and when I debugged the problem I didn't see the
submap structure as a whole.

 That said, there may actually be a problem with the implementation of the
 superpage_align parameter to kmem_suballoc().  If a submap is created with
 superpage_align equal to TRUE, but the submap's size is not a multiple of
 the superpage size, then vm_map_find() may not allocate a page table page
 for the last megabyte or so of the submap.

 There are only a few places where kmem_suballoc() is called with
 superpage_align set to TRUE.  If you changed them to FALSE, that is an easy
 way to test this hypothesis.

Yes, it helps.

My story is that the problem showed up when I updated a project (a
'coldfire' port) based on FreeBSD 8.0 to the current FreeBSD version. In
the current version the 'clean_map' submap is created with
superpage_align set to TRUE.
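
(The call I mean is the clean_map creation in vm_ksubmap_init(); from
memory, roughly -- the exact size expression may differ:

	clean_map = kmem_suballoc(kernel_map, &kmi->clean_sva,
	    &kmi->clean_eva,
	    (long)nbuf * BKVASIZE + (long)nswbuf * MAXPHYS, TRUE);

where the final TRUE is the superpage_align argument.)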

I have looked at vm_map_find() and debugged the page table fault once
again. IMO, the do-while loop in that function does not work as intended.
vm_map_findspace() finds a space and calls pmap_growkernel() if needed.
pmap_align_superpage() then adjusts the space but never calls
pmap_growkernel(). Finally, vm_map_insert() inserts the aligned space
into the map without error; it never calls pmap_growkernel() and so never
triggers another loop iteration.
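
Condensed, the loop I'm describing looks like this (not the verbatim
source, just its shape):

	do {
		if (vm_map_findspace(map, start, length, addr)) {
			vm_map_unlock(map);
			return (KERN_NO_SPACE);
		}
		/* vm_map_findspace() grew the page tables for *addr... */
		if (find_space == VMFS_ALIGNED_SPACE)
			/* ...but this may move *addr past them... */
			pmap_align_superpage(object, offset, addr, length);
		start = *addr;
		/* ...and this succeeds anyway, so there is no retry. */
		result = vm_map_insert(map, object, offset, start,
		    start + length, prot, max, cow);
	} while (result == KERN_NO_SPACE &&
	    find_space == VMFS_ALIGNED_SPACE);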

I don't know much about how the virtual memory model is implemented and
used in other modules. But it seems that it could be more reasonable to
align the address space inside vm_map_findspace() than to loop around it
externally.

I have tried to add a check in vm_map_insert() that checks the 'end'
parameter against the 'kernel_vm_end' variable and returns KERN_NO_SPACE
if needed. In this case the loop in vm_map_find() works and I have no
problem with the page table fault. But the 'kernel_vm_end' variable must
be initialized properly before the first use of vm_map_insert(). The
'kernel_vm_end' variable can self-initialize in pmap_growkernel() in
FreeBSD 8.0 (which is too late), but that was changed in the current
version (the 'i386' port).

Thanks for your answer, but I'm still looking for a permanent and
approved solution.

 Regards, Svata
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


page table fault, which should map kernel virtual address space

2010-09-20 Thread Svatopluk Kraus

Hello,

this is about the 'NKPT' definition, the 'kernel_map' submaps,
and the 'vm_map_findspace' function.

The variable 'kernel_map' is used to manage the kernel virtual address
space. When the 'vm_map_findspace' function deals with 'kernel_map',
the 'pmap_growkernel' function is called.

At least on the 'i386' architecture, the pmap implementation uses the
'pmap_growkernel' function to allocate missing page tables. Missing page
tables are a problem, because no one checks the 'pte' pointer for
validity after use of the 'vtopte' macro.
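
To illustrate the unchecked pattern (a sketch, not a quote from the pmap
source):

	pt_entry_t *pte;

	pte = vtopte(va);	/* pure address arithmetic into the
				   recursive page table mapping */
	*pte = pa | PG_V | PG_RW;	/* faults if no page table page
					   backs 'va' yet */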

The 'NKPT' definition sets the number of page tables preallocated
during system boot.
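
(On i386 it is a compile-time constant in the pmap header; from memory,
roughly:

	#ifndef NKPT
	#define	NKPT	30	/* boot-time kernel page tables */
	#endif

so everything above those preallocated tables depends on
pmap_growkernel() being called.)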

Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
pager_map, ...) exist as a result of calling the 'kmem_suballoc'
function. When these submaps are used (for example by the
'kmem_alloc_nofault' function) and their virtual address subspace lies
at the current end of the used kernel virtual address space (and above
the 'NKPT' preallocation), then the missing page tables are not
allocated and a double fault can happen.

I have met this scenario and worked around it by increasing the page
table preallocation count (the 'NKPT' definition). It's a temporary
solution which works for the present.

Can someone more advanced and better versed in the virtual memory module
solve it (in the 'vm_map_findspace' function, for example)? Or tell
me that the problem is elsewhere...

  Thanks, Svata

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org