Re: ARM + CACHE_LINE_SIZE + DMA
On Mon, May 21, 2012 at 6:20 PM, Ian Lepore free...@damnhippie.dyndns.org wrote: ... Some more notes. SMP makes things worse, and ARM11mpcore is about SMP too. For example, a separate thread could be opened about how to flush (exclusive L1) caches in the SMP case. I'm not sure how to correctly change memory attributes on a page which is in use. Making a new temporary mapping with different attributes is wrong and does not help at all. The question is how to do TLB and cache flushes on two or more processors and be sure that everything is OK. It could be slow, and maybe changing memory attributes on the fly is not a good idea at all. My suggestion of making a temporary writable mapping was the answer to how to correctly change memory attributes on a page which is in use, at least in the existing code, which is for a single processor. You don't need, and won't even use, the temporary mapping. You would make it just because doing so invokes logic in arm/arm/pmap.c which will find all existing virtual mappings of the given physical pages and disable caching in each of those existing mappings. In effect, it makes all existing mappings of the physical pages DMA_COHERENT. When you later free the temporary mapping, all other existing mappings are changed back to being cacheable (as long as no more than one of the mappings that remain is writable). I don't know that making a temporary mapping just for its side effect of changing other existing mappings is a good idea; it's just a quick and easy thing to do if you want to try changing all existing mappings to non-cacheable. It could be that a better way would be to have the busdma_machdep code call directly into lower-level routines in pmap.c to change existing mappings without making a new temporary mapping in the kernel pmap. The actual changes to the existing mappings are made by pmap_fix_cache(), but that routine isn't directly callable right now. Thanks for the explanation.
In fact, I know only a little about the current ARM pmap implementation in the FreeBSD tree. I took the i386 pmap implementation and modified it according to arm11mpcore. Also, as far as I know, all of this automatic disabling of cache for multiple writable mappings applies only to VIVT cache architectures. I'm not sure how the pmap code is going to change to support VIPT and PIPT caches, but it may no longer be true that making a second writable mapping of a page will lead to changing all existing mappings to non-cacheable. -- Ian Svata ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: ARM + CACHE_LINE_SIZE + DMA
Hi, with respect to your replies, and among other things, the following summary can be made. There are three kinds of DMA buffers according to their origin:

1. driver buffers
As Alexander wrote, these buffers should be allocated by bus_dmamap_alloc(). The function should be implemented to allocate the buffers correctly aligned, with the help of bus_dma_tag_t. For these buffers, we can avoid bouncing entirely just by correct driver implementation. For badly implemented drivers, the bouncing penalty is paid in the case of unaligned buffers. For BUS_DMA_COHERENT allocations, as Mark wrote, an allocation pool of coherent pages is a good optimization.

2. well-known system buffers
Mbufs and vfs buffers. These buffers should be aligned on CACHE_LINE_SIZE (start and size). That should be enough for vfs buffers, as they carry data only and only whole buffers should be accessed by DMA. The mbuf is a structure, and data can be carried in three possible locations. The first one, the external buffer, should be aligned on CACHE_LINE_SIZE. The other two locations, which are parts of the mbuf structure itself, can be unaligned in general. If we assume that no one else writes to any part of the mbuf during DMA access, we can set a BUS_DMA_UNALIGNED_SAFE flag in the mbuf load functions; i.e., we don't bounce unaligned buffers if the flag is set in the dmamap. A tunable can be implemented to suppress the flag for debugging purposes.

3. other buffers
As we know nothing about these buffers, we must always bounce unaligned ones.

Just two more notes. The DMA buffer should not be accessed by anyone (except the DMA itself) after PRESYNC and before POSTSYNC. For DMA descriptors (for example), using bus_dmamap_alloc() with the BUS_DMA_COHERENT flag could be inevitable. As I'm implementing bus dma for ARM11mpcore, I'm doing it with the following assumptions: 1. ARMv6k and higher, 2. PIPT data cache, 3. SMP ready. Svata
Re: ARM + CACHE_LINE_SIZE + DMA
On Thu, May 17, 2012 at 10:07 PM, Ian Lepore free...@damnhippie.dyndns.org wrote: On Thu, 2012-05-17 at 15:20 +0200, Svatopluk Kraus wrote: Hi, I'm working on the DMA bus implementation for the ARM11mpcore platform. I've looked at the implementation in the ARM tree, but IMHO it only works under some assumptions. There is a problem with DMA on a memory block which is not aligned on CACHE_LINE_SIZE (start and end) if the memory is not coherent.

Let's have a buffer for DMA which is not aligned on CACHE_LINE_SIZE. Then the first cache line associated with the buffer can be divided into two parts, A and B, where A is memory we know nothing about and B is buffer memory. The same holds for the last cache line associated with the buffer. We have no problem if the memory is coherent. Otherwise, it depends on the memory attributes.

1. [no cache] attribute
No problem, as the memory is coherent.

2. [write-through] attribute
Part A can be invalidated without loss of any data, so this is not a problem either.

3. [write back] attribute
In general, there is no way to keep both parts consistent. At the start of a DMA transaction, the cache line is written back and invalidated. However, as we know nothing about the memory associated with part A of the cache line, the cache line can be filled again at any time and, if flushed, can mess up the DMA transaction. Even if the cache line is only filled but not flushed during the DMA transaction, we must make it coherent with memory afterwards. There is a trick in the current ARM (MIPS) implementation: part A of the line is saved into a temporary buffer, the line is invalidated, and part A is restored. However, if somebody writes to the memory associated with part A of the line during this trick, part A will be corrupted. Moreover, part A can be part of another DMA transaction.

To safely use DMA with non-coherent memory, memory with the [no cache] or [write-through] attributes can be used without problems. Memory with the [write back] attribute must be aligned on CACHE_LINE_SIZE. However — mbufs, for example — a buffer for DMA can be part of a structure which is aligned on CACHE_LINE_SIZE, while the buffer itself is not. We may know that nobody will write to the structure during the DMA transaction, so it's safe to use the buffer even if it's not aligned on CACHE_LINE_SIZE. So, in practice, if a DMA buffer is not aligned on CACHE_LINE_SIZE and we want to avoid the bounce-page overhead, we must supply additional information to the DMA transaction. It should be easy to supply this information for drivers' own data buffers. However, what about OS data buffers like the mentioned mbufs? The question is the following: is it, or can it be, guaranteed for all (or at least the well-known) OS data buffers involved in DMA that buffers not aligned on CACHE_LINE_SIZE are surrounded by data which belongs to the same object as the buffer, and that this data is not written by the OS while the buffer is given to a driver? Any answer is appreciated. However, 'bounce pages' is not an answer. Thanks, Svata

I'm adding freebsd-arm@ to the CC list; that's where this has been discussed before. Your analysis is correct... to the degree that it works at all right now, it's working by accident. At work we've been making the good accident a bit more likely by setting the minimum allocation size to arm_dcache_align in kern_malloc.c. This makes it somewhat less likely that unrelated objects in the kernel are sharing a cache line, but it also reduces the effectiveness of the cache somewhat. Another factor, not mentioned in your analysis, is the size of the IO operation. Even if the beginning of the DMA buffer is cache-aligned, if the size isn't exactly a multiple of the cache line size you still have the partial flush situation and all of its problems. It's not guaranteed that data surrounding a DMA buffer will be untouched during the DMA, even when that surrounding data is part of the same conceptual object as the IO buffer. It's most often true, but certainly not guaranteed.

In addition, as Mark pointed out in a prior reply, sometimes the DMA buffer is on the stack, and even returning from the function that starts the IO operation affects the cache line associated with the DMA buffer. Consider something like this:

    void do_io()
    {
        int buffer;

        start_read(&buffer);
        // maybe do other stuff here
        wait_for_read_done();
    }

start_read() gets some IO going, so before it returns, a call has been made to bus_dmamap_sync(..., BUS_DMASYNC_PREREAD) and an invalidate gets done on the cache line containing the variable 'buffer'. The act of returning from the start_read() function causes that cache line to get reloaded, so now the stale pre-DMA value of the variable 'buffer' is in cache again. Right after that, the DMA completes, so that RAM has a newer value that belongs in the buffer variable and the copy in the cache line is stale. Before control gets into the wait_for_read_done() routine
ARM + CACHE_LINE_SIZE + DMA
Hi, I'm working on the DMA bus implementation for the ARM11mpcore platform. I've looked at the implementation in the ARM tree, but IMHO it only works under some assumptions. There is a problem with DMA on a memory block which is not aligned on CACHE_LINE_SIZE (start and end) if the memory is not coherent.

Let's have a buffer for DMA which is not aligned on CACHE_LINE_SIZE. Then the first cache line associated with the buffer can be divided into two parts, A and B, where A is memory we know nothing about and B is buffer memory. The same holds for the last cache line associated with the buffer. We have no problem if the memory is coherent. Otherwise, it depends on the memory attributes.

1. [no cache] attribute
No problem, as the memory is coherent.

2. [write-through] attribute
Part A can be invalidated without loss of any data, so this is not a problem either.

3. [write back] attribute
In general, there is no way to keep both parts consistent. At the start of a DMA transaction, the cache line is written back and invalidated. However, as we know nothing about the memory associated with part A of the cache line, the cache line can be filled again at any time and, if flushed, can mess up the DMA transaction. Even if the cache line is only filled but not flushed during the DMA transaction, we must make it coherent with memory afterwards. There is a trick in the current ARM (MIPS) implementation: part A of the line is saved into a temporary buffer, the line is invalidated, and part A is restored. However, if somebody writes to the memory associated with part A of the line during this trick, part A will be corrupted. Moreover, part A can be part of another DMA transaction.

To safely use DMA with non-coherent memory, memory with the [no cache] or [write-through] attributes can be used without problems. Memory with the [write back] attribute must be aligned on CACHE_LINE_SIZE. However — mbufs, for example — a buffer for DMA can be part of a structure which is aligned on CACHE_LINE_SIZE, while the buffer itself is not. We may know that nobody will write to the structure during the DMA transaction, so it's safe to use the buffer even if it's not aligned on CACHE_LINE_SIZE. So, in practice, if a DMA buffer is not aligned on CACHE_LINE_SIZE and we want to avoid the bounce-page overhead, we must supply additional information to the DMA transaction. It should be easy to supply this information for drivers' own data buffers. However, what about OS data buffers like the mentioned mbufs? The question is the following: is it, or can it be, guaranteed for all (or at least the well-known) OS data buffers involved in DMA that buffers not aligned on CACHE_LINE_SIZE are surrounded by data which belongs to the same object as the buffer, and that this data is not written by the OS while the buffer is given to a driver? Any answer is appreciated. However, 'bounce pages' is not an answer. Thanks, Svata
Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU
On Wed, Mar 21, 2012 at 5:55 AM, Adrian Chadd adr...@freebsd.org wrote: Hi, I'm interested in this, primarily because I'm tinkering with file storage stuff on my little (mostly wifi-targeted) embedded MIPS platforms. So what's the story here? How can I reproduce your issue and do some of my own profiling/investigation? Adrian

Hi, your interest prompted me to do a more solid/comparable investigation on my embedded ELAN486 platform. With more test results, I made a full trace of the related VFS, filesystem, and disk function calls. It took some time to understand what the issue is really about.

My test case: a single file copy (no O_FSYNC). This means that no other filesystem operation is served. The file size must be big enough relative to the hidirtybuffers value. Other processes on the machine where the test was run were almost inactive. The real copy time was profiled. In all tests, the machine was booted, a file was copied, the file was removed, and the machine was rebooted. Thus, the file was always copied into the same disk layout.

The motivation is that my embedded machines mostly don't write to disk at all. Only during a software update is a single process writing to disk (file by file). That need not be a problem at all, but an update must succeed even under full CPU load. So, the writing should be tuned so that it does not affect other processes too much and finishes in finite time. On my embedded ELAN486 machines, a flash memory is used as the disk. This means that reading is very fast, but writing is slow. Further, the flash memory is divided into sectors, and only a complete sector can be erased at once. A sector erasure is a very time-expensive action. When I tried to tune VFS by changing various parameters, I found out that the real copy time depends on two things. Both of them are tied to bufdaemon — namely, its feature of trying to work harder if its buffer-flushing mission is failing.
It's no surprise that the best copy times were achieved when bufdaemon was excluded from buffer flushing entirely (by VFS parameter settings). This bufdaemon feature brings along (with respect to the real copy time): 1. the bufdaemon runtime itself, and 2. very frequent flushing of filesystem buffers.

What really happens in the test case on my machine: the copy program uses a buffer for copying; the default buffer size is 128 KiB in my case. The simplified sys_write() implementation for DTYPE_VNODE and the VREG type is the following:

    sys_write() {
        dofilewrite() {
            bwillwrite()
            fo_write() = vn_write() {
                bwillwrite()
                vn_lock()
                VOP_WRITE()
                VOP_UNLOCK()
            }
        }
    }

So, all 128 KiB is written under the vnode lock. Taking the machine defaults:

    hidirtybuffers: 134
    lodirtybuffers: 67
    dirtybufthresh: 120
    buffer size (filesystem block size): 512 bytes

and doing some simple calculations:

    134 * 512 = 68608 - high-water byte count
    120 * 512 = 61440
    67 * 512 = 34304 - low-water byte count

it's obvious that bufdaemon has something to do during each sys_write(). However, almost all dirty buffers belong to the new file's vnode, and that vnode is locked. What remains are filesystem buffers only, i.e., the superblock buffer and the free-block bitmap buffers. So bufdaemon iterates over the whole dirty-buffer queue — which takes SIGNIFICANT time on my machine — and almost every time finds no buffer that it is able to flush. If bufdaemon flushes one or two buffers, kern_yield() is called and a new iteration is started, until no buffer is flushed. So, very often TWO full iterations over the dirty-buffer queue are done only to flush one or two filesystem buffers, and the lodirtybuffers threshold is still not reached. The bufdaemon runtime grows. Moreover, the frequent flushing of filesystem buffers brings along higher CPU load (geom down thread, geom up thread, disk thread scheduling) and a reordering of disk block writes. The correct disk block write order is important for the flash disk.
Further, while the file data buffers are aged but not flushed, the filesystem buffers are written repeatedly but flushed. Of course, I use a sector cache in the flash disk, but I can't cache too many sectors because of the total memory size. So, filesystem disk blocks are written often, and that evokes more disk sector flushes. A sector flush really takes a long time, so the real copy time grows beyond control. Last but not least, the flash memory is aged uselessly.

Well, this is my old story. Just to be honest, I quite forgot that my kernel was compiled with the FULL_PREEMPTION option. Things are very much worse in that case. However, the option just makes the issue worse; the issue doesn't disappear without it. In this old story, I focused on the dirtybufthresh value and experimented with it. However, dirtybufthresh changes too much of how, and by whom, buffers are flushed. I recorded the disk sector flush count and the total count of disk_strategy() calls with the BIO_WRITE command (and the total byte count to write). I used a file with a size of 2235517 bytes. When I was caching
Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU
2012/3/21 Konstantin Belousov kostik...@gmail.com: On Thu, Mar 15, 2012 at 08:00:41PM +0100, Svatopluk Kraus wrote: 2012/3/15 Konstantin Belousov kostik...@gmail.com: On Tue, Mar 13, 2012 at 01:54:38PM +0100, Svatopluk Kraus wrote: On Mon, Mar 12, 2012 at 7:19 PM, Konstantin Belousov kostik...@gmail.com wrote: On Mon, Mar 12, 2012 at 04:00:58PM +0100, Svatopluk Kraus wrote: Hi, I have run into the following problem. If a big file (relative to 'hidirtybuffers') is being written, the write speed is very poor. This is observed on a system with an ELAN486 and 32MB RAM (i.e., a low-speed CPU and not too much memory) running FreeBSD-9.

Analysis: A file is being written. All or almost all dirty buffers belong to the file. The file vnode is locked by the writing process almost all the time. buf_daemon() cannot flush any dirty buffer, as the chance to acquire the file vnode lock is very low. The number of dirty buffers grows very slowly, and with each new dirty buffer even more slowly, because buf_daemon() eats more and more CPU time by looping over the dirty-buffer queue (with very little or no effect). This slow-down effect is started by buf_daemon() itself, when 'numdirtybuffers' reaches the 'lodirtybuffers' threshold and buf_daemon() is woken up by its own timeout. The timeout fires with an 'hz' period, but starts to fire with 'hz/10' immediately, as buf_daemon() fails to get back under the 'lodirtybuffers' threshold. When 'numdirtybuffers' (now slowly) reaches the ((lodirtybuffers + hidirtybuffers) / 2) threshold, buf_daemon() can be woken up from within bdwrite() too, and it's much worse. Finally, and very slowly, 'hidirtybuffers' or 'dirtybufthresh' is reached, the dirty buffers are flushed, and everything starts from the beginning...

Note that for some time, bufdaemon work is distributed among the bufdaemon thread itself and any thread that fails to allocate a buffer, esp. a thread that owns a vnode lock and covers a long queue of dirty buffers.

However, the problem starts when numdirtybuffers reaches the lodirtybuffers count and ends around the hidirtybuffers count. There are still plenty of free buffers in the system. On the system, the buffer size is 512 bytes and the default thresholds are the following:

    vfs.hidirtybuffers = 134
    vfs.lodirtybuffers = 67
    vfs.dirtybufthresh = 120

For example, a 2MB file is copied to the flash disk in about 3 minutes and 15 seconds. If dirtybufthresh is set to 40, the copy time is about 20 seconds. My solution is a mix of three things: 1. Suppression of the buf_daemon() wakeup by setting bd_request to 1 in the main buf_daemon() loop.

I cannot understand this. Please provide a patch that shows what you mean there.

        curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
        mtx_lock(&bdlock);
        for (;;) {
    -           bd_request = 0;
    +           bd_request = 1;
                mtx_unlock(&bdlock);

Is this a complete patch? The change just causes lost wakeups for bufdaemon, nothing more.

Yes, it's a complete patch. And exactly — it causes lost wakeups, which are: 1. !! UNREASONABLE !!, because bufdaemon is not sleeping, and 2. not wanted, because this looks like the correct behaviour for the sleep with the hz/10 period. However, if the sleep with the hz/10 period is expected to be woken up by bd_wakeup(), then bd_request should be set to 0 just before the sleep() call, and then the bufdaemon behaviour will be clear.

No, your description is wrong. If bufdaemon is unable to flush enough buffers and numdirtybuffers is still greater than lodirtybuffers, then bufdaemon enters the qsleep state without resetting bd_request, with timeouts of one tenth of a second. Your patch will cause all wakeups for this case to be lost. This is exactly the situation when we want bufdaemon to run harder to avoid possible deadlocks, not to slow down.

OK. Let's focus on the bufdaemon implementation. Now, the qsleep state is entered with a random bd_request value. If someone calls bd_wakeup() during a bufdaemon iteration over the dirty-buffer queues, then bd_request is set to 1. Otherwise, bd_request remains 0. I.e., sometimes the qsleep state can only time out, and sometimes it can be woken up by bd_wakeup(). So, is this random behaviour what is wanted? All the bd_request and bufdaemon sleep handling is under bdlock, so if bd_request is 0 and bufdaemon is not sleeping, then all wakeups are unreasonable! The patch is mainly about that.

Wakeups themselves are very cheap for the running process. Mostly, it comes down to locking the sleepq and waking all threads that are present in the sleepq blocked queue. If there are no threads in the queue, nothing is done.

Are you serious? Is a spin mutex really cheap? Many calls may be cheap, but not everywhere. Svata
Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU
2012/3/15 Konstantin Belousov kostik...@gmail.com: On Tue, Mar 13, 2012 at 01:54:38PM +0100, Svatopluk Kraus wrote: On Mon, Mar 12, 2012 at 7:19 PM, Konstantin Belousov kostik...@gmail.com wrote: On Mon, Mar 12, 2012 at 04:00:58PM +0100, Svatopluk Kraus wrote: Hi, I have run into the following problem. If a big file (relative to 'hidirtybuffers') is being written, the write speed is very poor. This is observed on a system with an ELAN486 and 32MB RAM (i.e., a low-speed CPU and not too much memory) running FreeBSD-9.

Analysis: A file is being written. All or almost all dirty buffers belong to the file. The file vnode is locked by the writing process almost all the time. buf_daemon() cannot flush any dirty buffer, as the chance to acquire the file vnode lock is very low. The number of dirty buffers grows very slowly, and with each new dirty buffer even more slowly, because buf_daemon() eats more and more CPU time by looping over the dirty-buffer queue (with very little or no effect). This slow-down effect is started by buf_daemon() itself, when 'numdirtybuffers' reaches the 'lodirtybuffers' threshold and buf_daemon() is woken up by its own timeout. The timeout fires with an 'hz' period, but starts to fire with 'hz/10' immediately, as buf_daemon() fails to get back under the 'lodirtybuffers' threshold. When 'numdirtybuffers' (now slowly) reaches the ((lodirtybuffers + hidirtybuffers) / 2) threshold, buf_daemon() can be woken up from within bdwrite() too, and it's much worse. Finally, and very slowly, 'hidirtybuffers' or 'dirtybufthresh' is reached, the dirty buffers are flushed, and everything starts from the beginning...

Note that for some time, bufdaemon work is distributed among the bufdaemon thread itself and any thread that fails to allocate a buffer, esp. a thread that owns a vnode lock and covers a long queue of dirty buffers.

However, the problem starts when numdirtybuffers reaches the lodirtybuffers count and ends around the hidirtybuffers count. There are still plenty of free buffers in the system. On the system, the buffer size is 512 bytes and the default thresholds are the following:

    vfs.hidirtybuffers = 134
    vfs.lodirtybuffers = 67
    vfs.dirtybufthresh = 120

For example, a 2MB file is copied to the flash disk in about 3 minutes and 15 seconds. If dirtybufthresh is set to 40, the copy time is about 20 seconds. My solution is a mix of three things: 1. Suppression of the buf_daemon() wakeup by setting bd_request to 1 in the main buf_daemon() loop.

I cannot understand this. Please provide a patch that shows what you mean there.

        curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
        mtx_lock(&bdlock);
        for (;;) {
    -           bd_request = 0;
    +           bd_request = 1;
                mtx_unlock(&bdlock);

Is this a complete patch? The change just causes lost wakeups for bufdaemon, nothing more.

Yes, it's a complete patch. And exactly — it causes lost wakeups, which are: 1. !! UNREASONABLE !!, because bufdaemon is not sleeping, and 2. not wanted, because this looks like the correct behaviour for the sleep with the hz/10 period. However, if the sleep with the hz/10 period is expected to be woken up by bd_wakeup(), then bd_request should be set to 0 just before the sleep() call, and then the bufdaemon behaviour will be clear. All the bd_request and bufdaemon sleep handling is under bdlock, so if bd_request is 0 and bufdaemon is not sleeping, then all wakeups are unreasonable! The patch is mainly about that.

I read the description of the bd_request variable. However, bd_request should serve as an indicator that buf_daemon() is asleep. I.e., the following paradigm should be used:

    mtx_lock(&bdlock);
    bd_request = 0;	/* now is the only time when wakeup() will be meaningful */
    sleep(&bd_request, ..., hz/10);
    bd_request = 1;	/* on timeout, we must set it (bd_wakeup() already set it) */
    mtx_unlock(&bdlock);

My patch follows this paradigm. What happens without the patch in the described problem: buf_daemon() fails in its job and goes to sleep with the hz/10 period. It supposes that the next early wakeup will do nothing too. bd_request is untouched, but buf_daemon() doesn't know whether its last wakeup was made by bd_wakeup() or by the timeout. So, bd_request could be 0, and buf_daemon() can be woken up before hz/10 just by bd_wakeup(). Moreover, setting bd_request to 0 when buf_daemon() is not asleep can cause time-consuming and useless wakeup() calls with no effect.
Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU
On Mon, Mar 12, 2012 at 7:19 PM, Konstantin Belousov kostik...@gmail.com wrote: On Mon, Mar 12, 2012 at 04:00:58PM +0100, Svatopluk Kraus wrote: Hi, I have run into the following problem. If a big file (relative to 'hidirtybuffers') is being written, the write speed is very poor. This is observed on a system with an ELAN486 and 32MB RAM (i.e., a low-speed CPU and not too much memory) running FreeBSD-9.

Analysis: A file is being written. All or almost all dirty buffers belong to the file. The file vnode is locked by the writing process almost all the time. buf_daemon() cannot flush any dirty buffer, as the chance to acquire the file vnode lock is very low. The number of dirty buffers grows very slowly, and with each new dirty buffer even more slowly, because buf_daemon() eats more and more CPU time by looping over the dirty-buffer queue (with very little or no effect). This slow-down effect is started by buf_daemon() itself, when 'numdirtybuffers' reaches the 'lodirtybuffers' threshold and buf_daemon() is woken up by its own timeout. The timeout fires with an 'hz' period, but starts to fire with 'hz/10' immediately, as buf_daemon() fails to get back under the 'lodirtybuffers' threshold. When 'numdirtybuffers' (now slowly) reaches the ((lodirtybuffers + hidirtybuffers) / 2) threshold, buf_daemon() can be woken up from within bdwrite() too, and it's much worse. Finally, and very slowly, 'hidirtybuffers' or 'dirtybufthresh' is reached, the dirty buffers are flushed, and everything starts from the beginning...

Note that for some time, bufdaemon work is distributed among the bufdaemon thread itself and any thread that fails to allocate a buffer, esp. a thread that owns a vnode lock and covers a long queue of dirty buffers.

However, the problem starts when numdirtybuffers reaches the lodirtybuffers count and ends around the hidirtybuffers count. There are still plenty of free buffers in the system. On the system, the buffer size is 512 bytes and the default thresholds are the following:

    vfs.hidirtybuffers = 134
    vfs.lodirtybuffers = 67
    vfs.dirtybufthresh = 120

For example, a 2MB file is copied to the flash disk in about 3 minutes and 15 seconds. If dirtybufthresh is set to 40, the copy time is about 20 seconds. My solution is a mix of three things: 1. Suppression of the buf_daemon() wakeup by setting bd_request to 1 in the main buf_daemon() loop.

I cannot understand this. Please provide a patch that shows what you mean there.

        curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
        mtx_lock(&bdlock);
        for (;;) {
    -           bd_request = 0;
    +           bd_request = 1;
                mtx_unlock(&bdlock);

I read the description of the bd_request variable. However, bd_request should serve as an indicator that buf_daemon() is asleep. I.e., the following paradigm should be used:

    mtx_lock(&bdlock);
    bd_request = 0;	/* now is the only time when wakeup() will be meaningful */
    sleep(&bd_request, ..., hz/10);
    bd_request = 1;	/* on timeout, we must set it (bd_wakeup() already set it) */
    mtx_unlock(&bdlock);

My patch follows this paradigm. What happens without the patch in the described problem: buf_daemon() fails in its job and goes to sleep with the hz/10 period. It supposes that the next early wakeup will do nothing too. bd_request is untouched, but buf_daemon() doesn't know whether its last wakeup was made by bd_wakeup() or by the timeout. So, bd_request could be 0, and buf_daemon() can be woken up before hz/10 just by bd_wakeup(). Moreover, setting bd_request to 0 when buf_daemon() is not asleep can cause time-consuming and useless wakeup() calls with no effect.

2. Increasing the buf_daemon() fast timeout from hz/10 to hz/4. 3. Tuning dirtybufthresh to the (((lodirtybuffers + hidirtybuffers) / 2) - 15) magic value.

Even hz/10 is an awfully long time on modern hardware. dirtybufthresh is already a sysctl that you can change.

Yes, I noted the low-speed CPU. Don't forget that even if buf_daemon() sleeps for an hz/4 period (and this is expected to be a rare case), dirtybufthresh still works and helps. And I don't push the changes (except, a little, the bd_request one). I'm just sharing my experience.

The 32MB is indeed around the lowest amount of memory where recent FreeBSD can keep up an illusion of being useful. I am not sure how much the system should be tuned by default for such a configuration.

Even recent FreeBSD is quite useful on this configuration. Of course, file operations are not the main concern... IMHO, it's always good to know how the system (and its parts) works in various configurations. The mentioned copy time is about 30 seconds now. The described problem is just for information, for anyone who may be interested. Comments are welcome. However, the bd_request thing is more general. bd_request (despite its description) should be 0 only when buf_daemon() is in sleep(). Otherwise, a wakeup() on the bd_request channel is useless. Therefore, setting bd_request to 1 in the main buf_daemon() loop
[vfs] buf_daemon() slows down write() severely on low-speed CPU
Hi, I have solved a following problem. If a big file (according to 'hidirtybuffers') is being written, the write speed is very poor. It's observed on system with elan 486 and 32MB RAM (i.e., low speed CPU and not too much memory) running FreeBSD-9. Analysis: A file is being written. All or almost all dirty buffers belong to the file. The file vnode is almost all time locked by writing process. The buf_daemon() can not flush any dirty buffer as a chance to acquire the file vnode lock is very low. A number of dirty buffers grows up very slow and with each new dirty buffer slower, because buf_daemon() eats more and more CPU time by looping on dirty buffers queue (with very low or no effect). This slowing down effect is started by buf_daemon() itself, when 'numdirtybuffers' reaches 'lodirtybuffers' threshold and buf_daemon() is waked up by own timeout. The timeout fires at 'hz' period, but starts to fire at 'hz/10' immediately as buf_daemon() fails to reach 'lodirtybuffers' threshold. When 'numdirtybuffers' (now slowly) reaches ((lodirtybuffers + hidirtybuffers) / 2) threshold, the buf_daemon() can be waked up within bdwrite() too and it's much worse. Finally and with very slow speed, the 'hidirtybuffers' or 'dirtybufthresh' is reached, the dirty buffers are flushed, and everything starts from beginning... On the system, a buffer size is 512 bytes and the default thresholds are following: vfs.hidirtybuffers = 134 vfs.lodirtybuffers = 67 vfs.dirtybufthresh = 120 For example, a 2MB file is copied into flash disk in about 3 minutes and 15 second. If dirtybufthresh is set to 40, the copy time is about 20 seconds. My solution is a mix of three things: 1. Suppresion of buf_daemon() wakeup by setting bd_request to 1 in the main buf_daemon() loop. 2. Increment of buf_daemon() fast timeout from hz/10 to hz/4. 3. Tuning dirtybufthresh to (((lodirtybuffers + hidirtybuffers) / 2) - 15) magic. The mention copy time is about 30 seconds now. 
The described problem is presented just for information, for anyone who may be interested. Comments are welcome. However, the bd_request thing is more general: bd_request (despite its description) should be 0 only when buf_daemon() is in sleep(). Otherwise, a wakeup() on the bd_request channel is useless. Therefore, setting bd_request to 1 in the main buf_daemon() loop is correct and better, as it saves the time spent by wakeup() on a channel nobody is sleeping on.

Svata

___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
pccbb device doesn't send device_shutdown() to children (reboot freezes)
Hi, I solved a very curious problem with a system (FreeBSD-9) rarely freezing during reboot. Finally, I found out that the system freezes in an ep device callout. The relevant part of the device tree is the following:

- pccbb
  - pccard
    - ep

The cbb_pci_shutdown() method in the pccbb device places the cards in reset, turns off the interrupts, and powers down the socket. No child has a chance to know about it. Thus, if the ep device callout fires between device_shutdown() being called on the root device and interrupts being disabled, the callout freezes in a never-ending while loop, which reads status from the hardware (now without power). I propose the following change (edited by hand) in cbb_pci_shutdown():

 	struct cbb_softc *sc = (struct cbb_softc *)device_get_softc(brdev);
 
+	/* Inform all children. */
+	bus_generic_shutdown(brdev);
+
 	/*
 	 * We're about to pull the rug out from the card, so mark it as
 	 * gone to prevent harm.
 	 */
 	sc->cardok = 0;

Furthermore, the ep device (the ed device too, ... ?) does not implement the device_shutdown method. So, fixing the pccbb device is not enough to solve the freezing problem. I somehow patched the mentioned devices too, but maybe someone more competent should do it for the FreeBSD tree.

Svata
i386 - pmap_enter() superpage promotion on kernel addresses
Hi, I'm tuning pmap code for an arm11 mpcore port, which is inspired by the i386 one. My question is about superpage promotions on kernel addresses in the i386 pmap code.

pmap_promote_pde() is called from pmap_enter() only, and only if the following conditions are fulfilled: 1. promotions are enabled, 2. all pages in the superpage are allocated (a physical space condition), and, for user addresses, 3. all pages in the superpage are mapped (a virtual space condition).

For kernel addresses, the third condition is not checked. I understand that it is not easy to evaluate the third condition for kernel addresses. However, pmap_promote_pde() can now often be called unnecessarily, and it's a rather expensive call. Or is there some other reason for this? Moreover, there are many temporary mappings (pmap_qenter(), ...) in the kernel, and if pmap_promote_pde() is called without the third condition, the promotion can be successful. Since temporary mappings know nothing about promotions and demotions, this is a bug. Or can a superpage with a temporary kernel mapping never be promoted, because of locking or something else?

The third condition is evaluated on page table pages (wire_count is used) for user addresses. Page tables for kernel addresses have their wire count set to 0 or 1. Page tables preallocated during boot are post-initialized in pmap_init(), but wire_count is left untouched (wire_count is 0). Page tables allocated in pmap_growkernel() are allocated wired (wire_count is 1). If a kernel superpage is demoted in pmap_demote_pde() and the corresponding page table's wire_count is 1, the page table is uselessly re-initialized as if it were newly allocated.

My idea is that kernel address mappings made in pmap_enter() could be marked 'stable' (as opposed to 'temporary') and counted in wire_count in the same way as for user addresses; then the third condition could be applied and would be fulfilled only for these 'stable' mappings (which know about promotions and demotions). Is anything wrong with this idea?
Svata
Re: mmap performance and memory use
On Fri, Oct 28, 2011 at 7:38 AM, Alan Cox a...@rice.edu wrote:

On 10/26/2011 06:23, Svatopluk Kraus wrote:

Hi, well, I'm working on a new port (arm11 mpcore) and pmap_enter_object() is what I'm debugging right now. And I did not find any way in userland to force the kernel to call pmap_enter_object(), which makes a SUPERPAGE mapping without promotion. I tried to call mmap() with MAP_PREFAULT_READ without success. I tried to call madvise() with MADV_WILLNEED without success too.

mmap() should call pmap_enter_object() if MAP_PREFAULT_READ was specified. I'm surprised to hear that it's not happening for you.

Yes, it's really not happening for me.

mmap() with MAP_PREFAULT_READ case: in vm_mmap() in sys/vm/vm_mmap.c (r225617), line 1501 - if MAP_ANON then docow = 0; line 1525 - vm_map_find() is called with zeroed docow. It's propagated down the calling stack, so even vm_map_pmap_enter() is not called in vm_map_insert(). Most likely, this is correct. (Anonymous object - no physical memory allocation in advance - no SUPERPAGE mapping without promotion.)

madvise() with MADV_WILLNEED case: in vm_map_pmap_enter() in sys/vm/vm_map.c (r223825), line 1814 - vm_page_find_least() is called. During madvise(), vm_map_pmap_enter() is called. However, in the call, vm_page_find_least() returns NULL. It returns NULL if no page is allocated in the object with pindex greater than or equal to the parameter pindex. The loop following the call means that if no page is allocated for the SUPERPAGE (i.e., for the given region), pmap_enter_object() is not called, and this is correct.

[snip]

Moreover, the SUPERPAGE mapping is made read-only at first. So, even if I have a SUPERPAGE mapping without promotion, the mapping is demoted after the first write, and promoted again after all underlying pages have been written to. There is no 4K page table saving any longer.

Yes, that is all true. It is possible to change things so that the page table pages are reclaimed after a time, and not kept around indefinitely.
However, this is not high on my personal priority list. Before that, it is more likely that I will add an option to avoid the demotion on write, if we don't have to copy the entire superpage to do so.

Well, I just wanted to remark that there is no 4K page table saving now. However, there is still a big TLB entry saving with SUPERPAGE promotions. I'm not pushing you to do anything. I understand that allocating physical pages in advance is not a good idea and goes against the great copy-on-write feature. However, something like MAP_PREFAULT_WRITE on MAP_ANON, which allocates all physical pages in advance and makes a SUPERPAGE mapping without promotion, sounds like a good-but-really-specific feature which could be utilized sometimes. Nevertheless, IMHO, it's not worth doing such a specific feature.

Svata
Re: mmap performance and memory use
Hi, well, I'm working on a new port (arm11 mpcore) and pmap_enter_object() is what I'm debugging right now. And I did not find any way in userland to force the kernel to call pmap_enter_object(), which makes a SUPERPAGE mapping without promotion. I tried to call mmap() with MAP_PREFAULT_READ without success. I tried to call madvise() with MADV_WILLNEED without success too.

To make a SUPERPAGE mapping, it's obvious that all physical pages under the SUPERPAGE must be allocated in the vm_object. And the SUPERPAGE mapping must be made before the first access to them, otherwise a promotion is on the way. MAP_PREFAULT_READ does nothing about this. If madvise() is used, vm_object_madvise() is called, but only cached pages are allocated in advance.

Of course, allocating all the physical memory behind a virtual address space in advance is not preferred in most situations. But consider an example: I want to do some computation on a 4M memory space (I know that each byte will be accessed) and want to utilize a SUPERPAGE mapping without promotion, and so save a 4K page table (on an i386 machine). However, malloc() leads to promotion; mmap() with MAP_PREFAULT_READ does nothing, so the SUPERPAGE mapping is created by promotion; and madvise() with MADV_WILLNEED calls vm_object_madvise(), but because the pages are not cached (how could they be for anonymous memory), it does not work without promotion either.

So, a SUPERPAGE mapping without promotions is fine, but it can be made only if the physical memory being mapped is already allocated. Is it really possible to force that from userland?

Moreover, the SUPERPAGE mapping is made read-only at first. So, even if I have a SUPERPAGE mapping without promotion, the mapping is demoted after the first write, and promoted again after all underlying pages have been written to. There is no 4K page table saving any longer.

Svata

On Wed, Oct 26, 2011 at 1:35 AM, Alan Cox a...@rice.edu wrote:

On 10/10/2011 4:28 PM, Wojciech Puchar wrote: Notice that vm.pmap.pde.promotions increased by 31.
This means that 31 superpage mappings were created by promotion from small page mappings.

Thank you. I looked at .mappings, as it seemed logical to me that it shows the total.

In contrast, vm.pmap.pde.mappings counts superpage mappings that are created directly and not by promotion from small page mappings. For example, if a large executable, such as gcc, is resident in memory, the text segment will be pre-mapped using superpage mappings, avoiding soft fault and promotion overhead. Similarly, mmap(..., MAP_PREFAULT_READ) on a large, memory-resident file may pre-map the file using superpage mappings.

Your options are not described in the mmap manpage nor the madvise one (MAP_PREFAULT_READ). Where can I find the up-to-date manpage or description?

A few minutes ago, I merged the changes to support and document MAP_PREFAULT_READ into 8-STABLE. So, now it exists in HEAD, 9.0, and 8-STABLE.

Alan
threads runtime value is incorrect (tc_cpu_ticks() problem)
Hi, I've tested FreeBSD-current from June 16, 2011 on x86 (AMD Elan SC400). I found out that the sum of the runtimes of all threads is about 120 minutes after 180 minutes of system uptime, and the difference gets worse with time. The problem is in the tc_cpu_ticks() implementation, which takes into account just one timecounter overflow, but on the tested BSP (16-bit hardware counter) more than one overflow very often occurs between two tc_cpu_ticks() calls.

I understand that a 16-bit timecounter is a real relict nowadays, but I would like to solve the problem somehow reasonably. I have a few questions.

According to the description in the definition of the timecounter structure (sys/timetc.h), tc_get_timecount() should read the counter and tc_counter_mask should mask off any unimplemented bits. In tc_cpu_ticks(), if the tick count returned from tc_get_timecount() overflows, then (tc_counter_mask + 1) is added to the result. However, the timecounter hardware can be initialized to a value from the interval (0, tc_counter_mask], so if the description of tc_get_timecount() doesn't lie, then always adding the (tc_counter_mask + 1) value is not correct. A better description, which satisfies the tc_cpu_ticks() implementation, is that tc_get_timecount() should count ticks in the interval [0, tc_counter_mask]. That's what i8254_get_timecount() (in sys/x86/isa/clock.c) really does. However, if tc_get_timecount() should count ticks (and not just read the counter), then could it count ticks in the full uint64_t range? Then the tc_cpu_ticks() implementation could be very simple (no masking, no overflow checking). In i8254_get_timecount(), it would be enough to change the global variable 'i8254_offset' and the local variable 'count' from uint16_t to uint64_t.

Now, cpu_ticks() (which points to tc_cpu_ticks() by default) is called from mi_switch(), which must be called often enough to satisfy the tc_cpu_ticks() implementation (which can recognize just one timecounter overflow). That limits some system parameters (at least the hz selection).
It looks like tc_counter_mask is a little bit misused? Maybe tc_cpu_ticks() is only used for backward compatibility, and a new system should use set_cputicker() to change this default? Thanks for any help to better understand this.

Svata
Re: threads runtime value is incorrect (tc_cpu_ticks() problem)
On Wed, Jun 22, 2011 at 1:40 PM, Uffe Jakobsen u...@uffe.org wrote:

On 2011-06-22 12:33, Svatopluk Kraus wrote:

Hi, I've tested FreeBSD-current from June 16, 2011 on x86 (AMD Elan SC400). I found out that the sum of the runtimes of all threads is about 120 minutes after 180 minutes of system uptime, and the difference gets worse with time. The problem is in the tc_cpu_ticks() implementation, which takes into account just one timecounter overflow, but on the tested BSP (16-bit hardware counter) more than one overflow very often occurs between two tc_cpu_ticks() calls. I understand that a 16-bit timecounter is a real relict nowadays, but I would like to solve the problem somehow reasonably. I have a few questions. According to the description in the definition of the timecounter structure (sys/timetc.h), tc_get_timecount() should read the counter and tc_counter_mask should mask off any unimplemented bits. In tc_cpu_ticks(), if the tick count returned from tc_get_timecount() overflows, then (tc_counter_mask + 1) is added to the result. However, the timecounter hardware can be initialized to a value from the interval (0, tc_counter_mask], so if the description of tc_get_timecount() doesn't lie, then always adding the (tc_counter_mask + 1) value is not correct. A better description, which satisfies the tc_cpu_ticks() implementation, is that tc_get_timecount() should count ticks in the interval [0, tc_counter_mask]. That's what i8254_get_timecount() (in sys/x86/isa/clock.c) really does. However, if tc_get_timecount() should count ticks (and not just read the counter), then could it count ticks in the full uint64_t range? Then the tc_cpu_ticks() implementation could be very simple (no masking, no overflow checking). In i8254_get_timecount(), it would be enough to change the global variable 'i8254_offset' and the local variable 'count' from uint16_t to uint64_t.
Now, cpu_ticks() (which points to tc_cpu_ticks() by default) is called from mi_switch(), which must be called often enough to satisfy the tc_cpu_ticks() implementation (which can recognize just one timecounter overflow). That limits some system parameters (at least the hz selection). It looks like tc_counter_mask is a little bit misused? Maybe tc_cpu_ticks() is only used for backward compatibility, and a new system should use set_cputicker() to change this default? Thanks for any help to better understand this.

I'm by no means an expert in this field - but your mentioning of the AMD Elan SC400 triggered some old knowledge about the AMD Elan SC520. If you have a look at sys/i386/i386/elan-mmcr.c, the function init_AMD_Elan_sc520() addresses the fact that the i8254 has a nonstandard frequency, with the AMD Elan SC520 at least - could it be the same with the SC400?

You are correct, the AMD Elan SC400 i8254 has a nonstandard frequency, but that's not the problem. After system startup, no new threads start and no threads exit, but the sum of the runtimes of all existing threads is much, much less than the system uptime, and the difference gets worse with time. There is only one timecounter in the system. System uptime is correct and corresponds to time measured by my watch.

Svata
Re: ichsmb - correct locking strategy?
On Tue, Feb 22, 2011 at 3:37 PM, John Baldwin j...@freebsd.org wrote:

On Friday, February 18, 2011 9:10:47 am Svatopluk Kraus wrote:

Hi, I'm trying to figure out locking strategy in FreeBSD and found the 'ichsmb' device. There is a mutex which protects the smb bus (ichsmb device). For example, in ichsmb_readw() in sys/dev/ichsmb/ichsmb.c, the mutex is locked and a command is written to the bus, then an unbounded (but with timeout) sleep is done (the mutex is unlocked during the sleep). After the sleep, a word is read from the bus and the mutex is unlocked. 1. If use of the device IS NOT serialized by the layers around it, then more calls to this function (or others) can be started, or even completed, before the first one is finished. The mutex doesn't protect the smb bus. 2. If use of the device IS serialized by the layers around it, then the mutex is useless. Moreover, I haven't even mentioned the interrupt routine, which uses the mutex and the smb bus too. Am I right? Or did I miss something?

Hmm, the mutex could be useful if you have an smb controller with an interrupt handler (I think ichsmb or maybe intpm can support an interrupt handler) to prevent concurrent access to device registers. That is the purpose of the mutex at least. There is a separate locking layer in smbus itself (see smbus_request_bus(), etc.). -- John Baldwin

I see. So, multiple accesses to the bus are protected by the upper smbus layer itself. And the mutex encloses each single access with respect to the interrupt. I.e., an interrupt can be assigned to a command (the bus is either processing a command or idle) and a wait for the command result can be done atomically (no wakeup is missed). Am I right? BTW, a mutex's priority propagation isn't much exploited here. Maybe it will be better for me not to take this feature into account when thinking about locking strategy, and just take a mutex, in most cases, as the low-level locking primitive it indeed is. Well, it seems that things are becoming clearer.
Re: ichsmb - correct locking strategy?
On Fri, Feb 18, 2011 at 4:09 PM, Hans Petter Selasky hsela...@c2i.net wrote:

On Friday 18 February 2011 15:10:47 Svatopluk Kraus wrote:

Hi, I'm trying to figure out locking strategy in FreeBSD and found the 'ichsmb' device. There is a mutex which protects the smb bus (ichsmb device). For example, in ichsmb_readw() in sys/dev/ichsmb/ichsmb.c, the mutex is locked and a command is written to the bus, then an unbounded (but with timeout) sleep is done (the mutex is unlocked during the sleep). After the sleep, a word is read from the bus and the mutex is unlocked. 1. If use of the device IS NOT serialized by the layers around it, then more calls to this function (or others) can be started, or even completed, before the first one is finished. The mutex doesn't protect the smb bus. 2. If use of the device IS serialized by the layers around it, then the mutex is useless. Moreover, I haven't even mentioned the interrupt routine, which uses the mutex and the smb bus too. Am I right? Or did I miss something?

man sx ? struct sx ? --HPS

Thanks for your reply. It seems that everybody knows that the ichsmb driver is not in good shape, but nobody cares for it...

Svata
ichsmb - correct locking strategy?
Hi, I'm trying to figure out locking strategy in FreeBSD and found the 'ichsmb' device. There is a mutex which protects the smb bus (ichsmb device). For example, in ichsmb_readw() in sys/dev/ichsmb/ichsmb.c, the mutex is locked and a command is written to the bus, then an unbounded (but with timeout) sleep is done (the mutex is unlocked during the sleep). After the sleep, a word is read from the bus and the mutex is unlocked.

1. If use of the device IS NOT serialized by the layers around it, then more calls to this function (or others) can be started, or even completed, before the first one is finished. The mutex doesn't protect the smb bus.
2. If use of the device IS serialized by the layers around it, then the mutex is useless.

Moreover, I haven't even mentioned the interrupt routine, which uses the mutex and the smb bus too. Am I right? Or did I miss something?

Svata
Sleepable locks with priority propagation?
Hi, I deal with devices (an i2c bus, flash memory) on which some or most operations are quite slow; one must wait for them for a rather long time. Use of DELAY() is too expensive, and an inactive (so-called unbounded) wait isn't permitted with mutexes. So, no locking primitive with priority propagation can be exploited. A typical simple operation (which must be locked) is the following:

HW_LOCK
write request (fast)
wait for processing (quite slow)
read response (fast)
HW_UNLOCK

Here, use of a mutex with mtx_sleep() is impossible, as the mutex is unlocked during the (unbounded) sleep and somebody can start a new operation. A response which is read after the sleep could be incorrect. Well, I deal with hardware, so in principle any sleep on it could be infinite. It's the driver writer's responsibility to ensure that all situations which can lead to infinite waits are treated correctly and can't happen. The waits (if no error happens on the hardware, which must be handled anyway) could be rather long (but with known limits from the specification). Long, but not unbounded.

I lack a locking primitive with priority propagation on which inactive waits are permitted. I'm not too familiar with the locking implementation strategy in FreeBSD, but am I the only one who needs such a lock? Or is such a lock not permitted for really good reasons? Well, I know that only a KASSERT in mtx_init() (a mutex can't be made SLEEPABLE) and witness in mtx_sleep() (can't sleep on UNSLEEPABLE locks) guard the whole thing. But should I hack it and use a mutex as the locking primitive I need? (Of course, I would always use it with a timeout.)

Thanks for any response,
Svata
A question about WARNING: attempt to domain_add(xyz) after domainfinalize()
Hi, I'd like to add a new network domain into the kernel (and never remove it) from a loadable module. In fact, I did it, but I got the following warning from domain_add(): WARNING: attempt to domain_add(xyz) after domainfinalize(). Now, I'm trying to figure out what is behind the warning, which seems destined to become a KASSERT (it now sits in a 'notyet' section of the code, which is 6 years old).

I found a few iterations over the domains list and over each domain's protosw table which are not protected by any lock. OK, that is a problem, but when I only add a domain (it's added at the head of the domains list) and never remove it, that could be safe. Moreover, it seems that, without any limits, it is possible to add a new protocol into a domain on a reserved place labeled PROTO_SPACER via the pf_proto_register() function. Well, that's not a list, so it's a different case (but the copy into the spacer isn't an atomic operation).

I found two global variables (max_hdr, max_datalen) which are evaluated in each domain_init() from other variables (max_linkhdr, max_protohdr), and a global variable (max_keylen) which is evaluated from all known domains (the dom_maxrtkey entry). These variables are used in other parts of the kernel. Further, I know about the 'dom_ifattach' and 'dom_ifdetach' function pointers defined on each domain, which are responsible for the 'if_afdata' entry in the ifnet structure.

Is there something more that I didn't find in the current kernel? Will there be something more in future kernels that legitimizes a KASSERT in domain_add()? My network domain doesn't influence any of the mentioned global variables, doesn't define dom_ifattach() and dom_ifdetach() functions, and should only be added from a loadable module and never removed. So, I think it's safe. But I'm a little bit nervous because of the planned KASSERT in domain_add().
Well, I could implement an empty domain with some spacers for protocols, link it into the kernel (thus silencing the warning), and make a loadable module in which I only register protocols I want on the already-added domain, but why should I do it in that (for me, good-for-nothing) way?

Thanks for any response,
Svata
Re: page table fault, which should map kernel virtual address space
On Mon, Oct 4, 2010 at 2:03 AM, Alan Cox alan.l@gmail.com wrote:

On Thu, Sep 30, 2010 at 6:28 AM, Svatopluk Kraus onw...@gmail.com wrote:

On Tue, Sep 21, 2010 at 7:38 PM, Alan Cox alan.l@gmail.com wrote:

On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com wrote:

Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map, pager_map, ...) exist as the result of 'kmem_suballoc' function calls. When these submaps are used (for example by the 'kmem_alloc_nofault' function) and their virtual address subspace is at the end of the used kernel virtual address space at the moment (and above the 'NKPT' preallocation), then the missing page tables are not allocated and a double fault can happen.

No, the page tables are allocated. If you create a submap X of the kernel map using kmem_suballoc(), then a vm_map_findspace() is performed by vm_map_find() on the kernel map to find space for the submap X. As you note above, the call to vm_map_findspace() on the kernel map will call pmap_growkernel() if needed to extend the kernel page table. If you create another submap X' of X, then that submap X' can only map addresses that fall within the range for X. So, any necessary page table pages were allocated when X was created.

You are right. Mea culpa. I was focused on a solution and jumped to a conclusion too quickly. The page table fault hit in 'pager_map', which is a submap of 'clean_map', and when I debugged the problem I didn't see the submap stuff as a whole.

That said, there may actually be a problem with the implementation of the superpage_align parameter to kmem_suballoc(). If a submap is created with superpage_align equal to TRUE, but the submap's size is not a multiple of the superpage size, then vm_map_find() may not allocate a page table page for the last megabyte or so of the submap. There are only a few places where kmem_suballoc() is called with superpage_align set to TRUE. If you changed them to FALSE, that is an easy way to test this hypothesis.

Yes, it helps.
My story is that the problem started when I updated a project (a 'coldfire' port) based on FreeBSD 8.0 to the current FreeBSD version. In the current version, the 'clean_map' submap is created with superpage_align set to TRUE.

I have looked at vm_map_find() and debugged the page table fault once again. IMO, it looks like the do-while loop in the function does not work as intended. vm_map_findspace() finds a space and calls pmap_growkernel() if needed. pmap_align_superpage() arranges the space but never calls pmap_growkernel(). vm_map_insert() inserts the aligned space into the map without error, never calls pmap_growkernel(), and does not trigger another loop iteration.

I don't know too much about how the virtual memory model is implemented and used in other modules. But it seems that it could be more reasonable to align the address space inside vm_map_findspace() and not to loop externally. I have tried adding a check in vm_map_insert() that tests the 'end' parameter against the 'kernel_vm_end' variable and returns a KERN_NO_SPACE error if needed. In this case the loop in vm_map_find() works and I have no problem with the page table fault. But the 'kernel_vm_end' variable must be initialized properly before the first use of vm_map_insert(). The 'kernel_vm_end' variable could self-initialize in pmap_growkernel() in FreeBSD 8.0 (too late), but that was changed in the current version ('i386' port). Thanks for your answer, but I'm still looking for a permanent and approved solution.

I have a patch that implements one possible fix for this problem. I'll probably commit that patch in the next day or two.

Regards, Alan

I tried your patch and it works. Many thanks.

Regards, Svata
Re: page table fault, which should map kernel virtual address space
On Tue, Sep 21, 2010 at 7:38 PM, Alan Cox alan.l@gmail.com wrote:

On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com wrote:

Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map, pager_map, ...) exist as the result of 'kmem_suballoc' function calls. When these submaps are used (for example by the 'kmem_alloc_nofault' function) and their virtual address subspace is at the end of the used kernel virtual address space at the moment (and above the 'NKPT' preallocation), then the missing page tables are not allocated and a double fault can happen.

No, the page tables are allocated. If you create a submap X of the kernel map using kmem_suballoc(), then a vm_map_findspace() is performed by vm_map_find() on the kernel map to find space for the submap X. As you note above, the call to vm_map_findspace() on the kernel map will call pmap_growkernel() if needed to extend the kernel page table. If you create another submap X' of X, then that submap X' can only map addresses that fall within the range for X. So, any necessary page table pages were allocated when X was created.

You are right. Mea culpa. I was focused on a solution and jumped to a conclusion too quickly. The page table fault hit in 'pager_map', which is a submap of 'clean_map', and when I debugged the problem I didn't see the submap stuff as a whole.

That said, there may actually be a problem with the implementation of the superpage_align parameter to kmem_suballoc(). If a submap is created with superpage_align equal to TRUE, but the submap's size is not a multiple of the superpage size, then vm_map_find() may not allocate a page table page for the last megabyte or so of the submap. There are only a few places where kmem_suballoc() is called with superpage_align set to TRUE. If you changed them to FALSE, that is an easy way to test this hypothesis.

Yes, it helps. My story is that the problem started when I updated a project (a 'coldfire' port) based on FreeBSD 8.0 to the current FreeBSD version.
In the current version, the 'clean_map' submap is created with superpage_align set to TRUE.

I have looked at vm_map_find() and debugged the page table fault once again. IMO, it looks like the do-while loop in the function does not work as intended. vm_map_findspace() finds a space and calls pmap_growkernel() if needed. pmap_align_superpage() arranges the space but never calls pmap_growkernel(). vm_map_insert() inserts the aligned space into the map without error, never calls pmap_growkernel(), and does not trigger another loop iteration.

I don't know too much about how the virtual memory model is implemented and used in other modules. But it seems that it could be more reasonable to align the address space inside vm_map_findspace() and not to loop externally. I have tried adding a check in vm_map_insert() that tests the 'end' parameter against the 'kernel_vm_end' variable and returns a KERN_NO_SPACE error if needed. In this case the loop in vm_map_find() works and I have no problem with the page table fault. But the 'kernel_vm_end' variable must be initialized properly before the first use of vm_map_insert(). The 'kernel_vm_end' variable could self-initialize in pmap_growkernel() in FreeBSD 8.0 (too late), but that was changed in the current version ('i386' port). Thanks for your answer, but I'm still looking for a permanent and approved solution.

Regards, Svata
page table fault, which should map kernel virtual address space
Hello, this is about the 'NKPT' definition, 'kernel_map' submaps, and the 'vm_map_findspace' function.

The 'kernel_map' variable is used to manage kernel virtual address space. When the 'vm_map_findspace' function deals with 'kernel_map', the 'pmap_growkernel' function is called. At least on the 'i386' architecture, the pmap implementation uses the 'pmap_growkernel' function to allocate missing page tables. Missing page tables are a problem, because no one checks the 'pte' pointer for validity after use of the 'vtopte' macro. The 'NKPT' definition defines the number of page tables preallocated during system boot.

Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map, pager_map, ...) exist as the result of 'kmem_suballoc' function calls. When these submaps are used (for example by the 'kmem_alloc_nofault' function) and their virtual address subspace is at the end of the used kernel virtual address space at the moment (and above the 'NKPT' preallocation), then the missing page tables are not allocated and a double fault can happen.

I have met this scenario and worked around it by increasing the page table preallocation count (the 'NKPT' definition). It's a temporary solution which works for the present. Can someone more advanced and better versed in the virtual memory module solve it (in the 'vm_map_findspace' function, for example)? Or tell me that the problem is elsewhere...

Thanks,
Svata