Re: [gsoc2012] Port NetBSD's UDF implementation
Hi YongCon,

The project would be very interesting for us. I am pretty sure you will not have problems finding a mentor. That said, let me point out an old thread:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-May/042565.html

I think the biggest problem is that you will have to get acquainted with FreeBSD's virtual memory subsystem, which is different from NetBSD's. In that same thread you will find some comments by Matt Dillon (no idea how up to date those are). It will not be an easy task, but people find such challenges very rewarding.

Pedro.

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: Starvation of realtime priority threads
On 2012/4/5 11:56, Konstantin Belousov wrote:
> On Wed, Apr 04, 2012 at 06:54:06PM -0700, Sushanth Rai wrote:
> > I have a multithreaded user space program that basically runs at realtime
> > priority. Synchronization between threads is done using a spinlock. When
> > running this program on an SMP system under heavy memory pressure, I see
> > that the thread holding the spinlock is starved out of CPU. The CPUs are
> > effectively consumed by other threads that are spinning for the lock to
> > become available.
> >
> > After instrumenting the kernel a little bit, what I found was that under
> > memory pressure, when the user thread holding the spinlock traps into the
> > kernel due to a page fault, that thread sleeps until free pages are
> > available. The thread sleeps at PUSER priority (within vm_waitpfault()).
> > When it is ready to run, it is queued at PUSER priority even though its
> > base priority is realtime. The other sibling threads that are spinning at
> > realtime priority to acquire the spinlock starve the owner of the spinlock.
> >
> > I was wondering if the sleep in vm_waitpfault() should be at
> > MAX(td_user_pri, PUSER) instead of just PUSER. I'm running on 7.2 and it
> > looks like this logic is the same in the trunk.
>
> It just so happens that your program stumbles upon a single sleep point in
> the kernel. If for whatever reason the thread in the kernel is put off the
> CPU due to failure to acquire any resource without priority propagation,
> you would get the same effect. Only blockable primitives do priority
> propagation, which are mutexes and rwlocks, AFAIR. In other words, any
> sx/lockmgr/sleep points are vulnerable to the same issue.

This is why I suggested that POSIX realtime priority should not be boosted; it should only be higher than PRI_MIN_TIMESHARE but lower than any priority the msleep() callers provide. The problem is that a userland realtime thread's busy-looping code can starve a thread in the kernel which holds a critical resource. In the kernel we can avoid writing busy-loop code, but userland code is not trustworthy.

If you search for "Realtime thread priorities" in December 2010 on the @arch list, you may find the argument.

> Speaking of exactly your problem, did you consider wiring the memory of
> your realtime process? This is a common practice, used e.g. by ntpd.
Re: Starvation of realtime priority threads
On 2012/4/5 9:54, Sushanth Rai wrote:
> I have a multithreaded user space program that basically runs at realtime
> priority. Synchronization between threads is done using a spinlock. When
> running this program on an SMP system under heavy memory pressure, I see
> that the thread holding the spinlock is starved out of CPU. The CPUs are
> effectively consumed by other threads that are spinning for the lock to
> become available.
>
> After instrumenting the kernel a little bit, what I found was that under
> memory pressure, when the user thread holding the spinlock traps into the
> kernel due to a page fault, that thread sleeps until free pages are
> available. The thread sleeps at PUSER priority (within vm_waitpfault()).
> When it is ready to run, it is queued at PUSER priority even though its
> base priority is realtime. The other sibling threads that are spinning at
> realtime priority to acquire the spinlock starve the owner of the spinlock.
>
> I was wondering if the sleep in vm_waitpfault() should be at
> MAX(td_user_pri, PUSER) instead of just PUSER. I'm running on 7.2 and it
> looks like this logic is the same in the trunk.
>
> Thanks,
> Sushanth

I think 7.2 still has libkse, which supports static priority scheduling. If correctness matters more than performance, you may try libkse with process-scope threads and use a priority-inheritance mutex to do the locking. The kernel is known to be weak at supporting userland realtime threads; I think not every locking primitive can support priority propagation, and this is an issue. In userland, the internal library mutexes are not priority-inheriting, so starvation may happen there too. If you know what you are doing, avoid calling functions which use internal mutexes, but this is rather difficult.

Regards,
David Xu
Re: CAM disk I/O starvation
On Tue, 3 Apr 2012 14:27:43 -0700 Jerry Toung wrote:
> On 4/3/12, Gary Jennejohn wrote:
> > It would be interesting to see your patch. I always run HEAD but
> > maybe I could use it as a base for my own mods/tests.
>
> Here is the patch

This looks fair if all your disks are working at the same time (e.g. a RAID-only setup), but if you have a setup with multiple disks where only one is doing something, you limit the number of tags which can be used. No idea what kind of performance impact this would have. What about the case where you have more disks than tags?

I also noticed that you do a strncmp for "da". What about "ada" (available in 9 and 10)? I would assume it suffers from the same problem.

Bye,
Alexander.

--
http://www.Leidinger.net    Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org      netchild @ FreeBSD.org    : PGP ID = 72077137
Re: Starvation of realtime priority threads
On Wed, Apr 04, 2012 at 06:54:06PM -0700, Sushanth Rai wrote:
> I have a multithreaded user space program that basically runs at realtime
> priority. Synchronization between threads is done using a spinlock. When
> running this program on an SMP system under heavy memory pressure, I see
> that the thread holding the spinlock is starved out of CPU. The CPUs are
> effectively consumed by other threads that are spinning for the lock to
> become available.
>
> After instrumenting the kernel a little bit, what I found was that under
> memory pressure, when the user thread holding the spinlock traps into the
> kernel due to a page fault, that thread sleeps until free pages are
> available. The thread sleeps at PUSER priority (within vm_waitpfault()).
> When it is ready to run, it is queued at PUSER priority even though its
> base priority is realtime. The other sibling threads that are spinning at
> realtime priority to acquire the spinlock starve the owner of the spinlock.
>
> I was wondering if the sleep in vm_waitpfault() should be at
> MAX(td_user_pri, PUSER) instead of just PUSER. I'm running on 7.2 and it
> looks like this logic is the same in the trunk.

It just so happens that your program stumbles upon a single sleep point in the kernel. If for whatever reason the thread in the kernel is put off the CPU due to failure to acquire any resource without priority propagation, you would get the same effect. Only blockable primitives do priority propagation, which are mutexes and rwlocks, AFAIR. In other words, any sx/lockmgr/sleep points are vulnerable to the same issue.

Speaking of exactly your problem, did you consider wiring the memory of your realtime process? This is a common practice, used e.g. by ntpd.
GSoC call for mentor
Hi,

I've been communicating with the FreeBSD GSoC admins list for a few months now, not realizing only 4 people are on there. I have spoken with Ben Laurie (affiliated with OpenSSL) and Robert Watson regarding my GSoC idea for software implementations of SHA-3 hash algorithms for the purpose of inclusion within FreeBSD or OpenSSL. The timeline for applications is now almost upon us, so I would like to finalize my plan as soon as possible to allow me time to create a good proposal in time to submit it.

It seems clear that the implementation and performance analysis of the SHA-3 candidate algorithm(s) is the interesting part of what I discussed in that earlier correspondence. Whether the code is written for FreeBSD's or OpenSSL's specific framework is not interesting and more a strategic/political decision than a technical one. After pondering the previous suggestions, I think that my project proposal should be roughly as follows:

- C implementations of all 5 SHA-3 hash algorithm candidates. These implementations will operate in a standalone manner, with a reasonable interface such that the implementation of the algorithm NIST selects for SHA-3 could be easily adapted to work within OpenSSL or FreeBSD.
- Alternate implementations will be explored to determine possible performance tradeoffs and optimal implementations.
- Formalized analysis and discussion, formatted as a conference-quality paper.

My motivation for this work is that I am currently working on PhD research in the field of cryptographic engineering, I recently completed my MS CpE research on hardware FPGA implementations of SHA-3 candidates, and my undergraduate degree and personal experience are in computer science (C, C++, UNIX), so this project is very interesting to me and I feel I have the skills and experience to obtain meaningful results. I desire a mentor at this point because I understand that it would give my project proposal a better chance of being accepted and successfully executed.
My hope is that one of you will agree to be my mentor, at which point I will create a detailed project proposal to submit to GSoC. If I do not have a willing mentor, I do not intend to submit a proposal. Ben and Robert seemed enthusiastic about my idea (Ben commented that the current AES implementation began in this way) but are too busy or lack the interest to become my mentor.

I see that FreeBSD has been accepted as an organization for GSoC 2012; OpenSSL is not listed. Therefore my assumption is that my proposed project would be done under the FreeBSD program, even if eventually this code ends up in OpenSSL and flows downstream to FreeBSD. If this assumption does not satisfy you, can you please suggest a modification to my proposal that would make it eligible for sponsorship under the FreeBSD GSoC program?

If anyone is willing to take me on for this, please send me a response. I am very easy to work with :)

Thanks,
Robert Lorentz
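For concreteness, the "reasonable standalone interface" the proposal mentions usually means the classic init/update/final shape used by OpenSSL's EVP layer and FreeBSD's libmd. The sketch below shows only that shape; the compression step is a deliberately trivial placeholder (a running XOR/rotate), not any SHA-3 candidate, and all the `toy_*` names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder streaming-hash context. A real candidate would carry
 * its full state array here instead of a single 64-bit word. */
typedef struct {
    uint64_t state;
    size_t   total;   /* bytes processed so far */
} toy_ctx;

void toy_init(toy_ctx *c)
{
    c->state = 0x6a09e667f3bcc908ULL;  /* arbitrary IV */
    c->total = 0;
}

/* update() may be called any number of times with arbitrary chunking;
 * the result must not depend on how the input was split. */
void toy_update(toy_ctx *c, const void *data, size_t len)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) {
        c->state ^= p[i];
        c->state = (c->state << 7) | (c->state >> 57);  /* rotl64 by 7 */
    }
    c->total += len;
}

void toy_final(toy_ctx *c, uint8_t out[8])
{
    c->state ^= c->total;              /* stand-in for length padding */
    memcpy(out, &c->state, 8);
}
```

An implementation exposing exactly this shape can be dropped behind an EVP_MD or libmd wrapper with a few lines of glue, which is the portability property the proposal is after.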
Starvation of realtime priority threads
I have a multithreaded user space program that basically runs at realtime priority. Synchronization between threads is done using a spinlock. When running this program on an SMP system under heavy memory pressure, I see that the thread holding the spinlock is starved out of CPU. The CPUs are effectively consumed by other threads that are spinning for the lock to become available.

After instrumenting the kernel a little bit, what I found was that under memory pressure, when the user thread holding the spinlock traps into the kernel due to a page fault, that thread sleeps until free pages are available. The thread sleeps at PUSER priority (within vm_waitpfault()). When it is ready to run, it is queued at PUSER priority even though its base priority is realtime. The other sibling threads that are spinning at realtime priority to acquire the spinlock starve the owner of the spinlock.

I was wondering if the sleep in vm_waitpfault() should be at MAX(td_user_pri, PUSER) instead of just PUSER. I'm running on 7.2 and it looks like this logic is the same in the trunk.

Thanks,
Sushanth
Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU
2012/3/21 Konstantin Belousov:
> On Thu, Mar 15, 2012 at 08:00:41PM +0100, Svatopluk Kraus wrote:
>> 2012/3/15 Konstantin Belousov:
>> > On Tue, Mar 13, 2012 at 01:54:38PM +0100, Svatopluk Kraus wrote:
>> >> On Mon, Mar 12, 2012 at 7:19 PM, Konstantin Belousov wrote:
>> >> > On Mon, Mar 12, 2012 at 04:00:58PM +0100, Svatopluk Kraus wrote:
>> >> >> Hi,
>> >> >>
>> >> >> I have been working on the following problem. If a big file (big according to 'hidirtybuffers') is being written, the write speed is very poor.
>> >> >>
>> >> >> It's observed on a system with an ELAN486 and 32MB RAM (i.e., a low-speed CPU and not too much memory) running FreeBSD-9.
>> >> >>
>> >> >> Analysis: A file is being written. All or almost all dirty buffers belong to the file. The file vnode is locked by the writing process almost all the time. buf_daemon() cannot flush any dirty buffer, as the chance to acquire the file vnode lock is very low. The number of dirty buffers grows very slowly, and with each new dirty buffer more slowly still, because buf_daemon() eats more and more CPU time by looping over the dirty buffers queue (with very little or no effect).
>> >> >>
>> >> >> This slowdown is started by buf_daemon() itself, when 'numdirtybuffers' reaches the 'lodirtybuffers' threshold and buf_daemon() is woken up by its own timeout. The timeout fires at 'hz' period, but starts to fire at 'hz/10' immediately as buf_daemon() fails to reach the 'lodirtybuffers' threshold. When 'numdirtybuffers' (now slowly) reaches the ((lodirtybuffers + hidirtybuffers) / 2) threshold, buf_daemon() can be woken up within bdwrite() too, and it gets much worse. Finally, very slowly, 'hidirtybuffers' or 'dirtybufthresh' is reached, the dirty buffers are flushed, and everything starts from the beginning...
>> >> > Note that for some time, bufdaemon work is distributed among the bufdaemon thread itself and any thread that fails to allocate a buffer, esp. a thread that owns a vnode lock and covers a long queue of dirty buffers.
>> >>
>> >> However, the problem starts when numdirtybuffers reaches the lodirtybuffers count and ends around the hidirtybuffers count. There are still plenty of free buffers in the system.
>> >>
>> >> >> On the system, a buffer size is 512 bytes and the default thresholds are the following:
>> >> >>
>> >> >> vfs.hidirtybuffers = 134
>> >> >> vfs.lodirtybuffers = 67
>> >> >> vfs.dirtybufthresh = 120
>> >> >>
>> >> >> For example, a 2MB file is copied onto a flash disk in about 3 minutes and 15 seconds. If dirtybufthresh is set to 40, the copy time is about 20 seconds.
>> >> >>
>> >> >> My solution is a mix of three things:
>> >> >> 1. Suppression of buf_daemon() wakeup by setting bd_request to 1 in the main buf_daemon() loop.
>> >> > I cannot understand this. Please provide a patch that shows what you mean there.
>> >>
>> >>         curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
>> >>         mtx_lock(&bdlock);
>> >>         for (;;) {
>> >> -               bd_request = 0;
>> >> +               bd_request = 1;
>> >>                 mtx_unlock(&bdlock);
>> > Is this a complete patch? The change just causes lost wakeups for bufdaemon, nothing more.
>> Yes, it's a complete patch. And exactly, it causes lost wakeups, which are:
>> 1. !! UNREASONABLE !!, because bufdaemon is not sleeping,
>> 2. not wanted, because it looks like that is the correct behaviour for the sleep with hz/10 period. However, if the sleep with hz/10 period is expected to be woken up by bd_wakeup(), then bd_request should be set to 0 just before the sleep() call, and then the bufdaemon behaviour will be clear.
> No, your description is wrong.
> If bufdaemon is unable to flush enough buffers and numdirtybuffers is still
> greater than lodirtybuffers, then bufdaemon enters the qsleep state without
> resetting bd_request, with timeouts of one tenth of a second. Your patch
> will cause all wakeups for this case to be lost. This is exactly the
> situation when we want bufdaemon to run harder to avoid possible deadlocks,
> not to slow down.

OK. Let's focus on the bufdaemon implementation. Now, the qsleep state is entered with a random bd_request value. If someone calls bd_wakeup() during a bufdaemon iteration over the dirty buffers queues, then bd_request is set to 1. Otherwise, bd_request remains 0. I.e., sometimes the qsleep state can only be timed out, and sometimes it can be woken up by bd_wakeup(). So, is this random behaviour what is wanted?

>> All the stuff around bd_request and the bufdaemon sleep is under bd_lock,
>> so if bd_request is 0 and bufdaemon is not sleeping, then all wakeups are
>> unreasonable! The patch is mainly about that.
> Wakeups themselves are very cheap for the running process. Mostly, it comes
> down to locking the sleepq and waking all threads that are present in the
> sleepq
Re: question about amd64 pagetable page allocation.
Hello Mark,

From what I understand, the virtual address of a given page table entry should not differ between vtopte() and pmap_pte(). However, with a small code change in pmap_remove_pages(), I was able to print the values returned by these two functions:

pte1 0x806432e0 pa1 346941425 m1 0xff04291cf600, pte2 0xff03463032e0 pa2 346941425 m2 0xff04291cf600
pte1 0x806432e8 pa1 346842425 m1 0xff04291c7e78, pte2 0xff03463032e8 pa2 346842425 m2 0xff04291c7e78
pte1 0x806432f0 pa1 346863425 m1 0xff04291c8df0, pte2 0xff03463032f0 pa2 346863425 m2 0xff04291c8df0

In the above result, pte1 is the result of vtopte() and pte2 is the result of pmap_pte(). Interestingly, both virtual addresses pte1 and pte2 resolve to the same physical address (pa1 == pa2). If I am not wrong, the page tables are now mapped at two different virtual addresses? Could you please clarify this?

Thanks,
Vasanth

On Tue, Apr 3, 2012 at 3:18 PM, Mark Tinguely wrote:
> On Tue, Apr 3, 2012 at 1:52 PM, vasanth rao naik sabavat wrote:
> > Hello Mark,
> >
> > I think pmap_remove_pages() is executed only for the current process.
> >
> > 2549 #ifdef PMAP_REMOVE_PAGES_CURPROC_ONLY
> > 2550     if (pmap != vmspace_pmap(curthread->td_proc->p_vmspace)) {
> > 2551         printf("warning: pmap_remove_pages called with non-current pmap\n");
> > 2552         return;
> > 2553     }
> > 2554 #endif
> >
> > I still don't get why this was removed?
> >
> > Thanks,
> > Vasanth
>
> That is pretty old code. Newer code does not make that assumption.
>
> Without the assumption that the pages are from the current pmap, you have
> to use the direct physical -> virtual mapping:
>
> 2547 #ifdef PMAP_REMOVE_PAGES_CURPROC_ONLY
> 2548     pte = vtopte(pv->pv_va);
> 2549 #else
> 2550     pte = pmap_pte(pmap, pv->pv_va);
> 2551 #endif
>
> --Mark Tinguely.
Re: Is there any modern alternative to pstack?
On 4/4/12 5:44 AM, Eitan Adler wrote:
> On 4 April 2012 01:41, Julian Elischer wrote:
> > should be in ports?
>
> Not unless someone decides to become the new upstream and make a
> release. We do not maintain software in ports.

but we do add patches to make things work on FreeBSD.
Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU
On Wed, Mar 21, 2012 at 5:55 AM, Adrian Chadd wrote:
> Hi,
>
> I'm interested in this, primarily because I'm tinkering with file
> storage stuff on my little (mostly wifi-targeted) embedded MIPS
> platforms.
>
> So what's the story here? How can I reproduce your issue and do some
> of my own profiling/investigation?
>
> Adrian

Hi,

your interest made me do a more solid, comparable investigation on my embedded ELAN486 platform. With more test results, I made a full trace of the related VFS, filesystem, and disk function calls. It took some time to understand what the issue really is about.

My test case: a single file copy (no O_FSYNC). It means that no other filesystem operation is served. The file size must be big enough with respect to the hidirtybuffers value. Other processes on the machine where the test was run were almost inactive. The real copy time was measured. In each test, the machine was booted, the file was copied, the file was removed, and the machine was rebooted. Thus, the file was always copied into the same disk layout.

The motivation is that my embedded machines mostly don't write to a disk at all. Only during a software update is a single process writing to the disk (file by file). It doesn't need to be a problem at all, but an update must succeed even under full CPU load. So, the writing should be tuned up greatly so as not to affect other processes too much and to finish in finite time.

On my embedded ELAN486 machines, a flash memory is used as a disk. It means that reading is very fast, but writing is slow. Further, the flash memory is divided into sectors, and only a complete sector can be erased at once. A sector erasure is a very time-expensive action.

When I tried to tune VFS via various parameter changes, I found out that the real copy time depends on two things. Both of them are a subject of bufdaemon, namely its feature of trying to work harder if its buffer-flushing mission is failing.
It's no surprise that the best copy times were achieved when bufdaemon was excluded from buffer flushing entirely (by VFS parameter settings). This bufdaemon feature brings along (with respect to the real copy time):

1. the bufdaemon runtime itself,
2. very frequent flushing of filesystem buffers.

What really happens in the test case on my machine: The copy program uses a buffer for copying. The default buffer size is 128 KiB in my case. The simplified sys_write() implementation for DTYPE_VNODE and VREG type is the following:

sys_write() {
  dofilewrite() {
    bwillwrite()
    fo_write() => vn_write() {
      bwillwrite()
      vn_lock()
      VOP_WRITE()
      VOP_UNLOCK()
    }
  }
}

So, the whole 128 KiB is written under the vnode lock. When I take the machine defaults:

hidirtybuffers: 134
lodirtybuffers: 67
dirtybufthresh: 120
buffer size (filesystem block size): 512 bytes

and do some simple calculations:

134 * 512 = 68608 -> high water byte count
120 * 512 = 61440
 67 * 512 = 34304 -> low water byte count

then it's obvious that bufdaemon has something to do during each sys_write(). However, almost all dirty buffers belong to the new file's vnode, and that vnode is locked. What remains are filesystem buffers only, i.e., the superblock buffer and the free-block bitmap buffers. So, bufdaemon iterates over the whole dirty buffers queue, which takes a SIGNIFICANT time on my machine, and almost every time does not find any buffer it is able to flush. If bufdaemon flushes one or two buffers, kern_yield() is called and a new iteration is started, until no buffer is flushed. So, very often TWO full iterations over the dirty buffers queue are done only to flush one or two filesystem buffers and to fail to reach the lodirtybuffers threshold. The bufdaemon runtime grows. Moreover, the frequent filesystem buffer flushing brings along a higher CPU load (geom down thread, geom up thread, disk thread scheduling) and a re-ordering of disk block writes. The correct disk block write order is important for the flash disk.
Further, while the file data buffers are aged but not flushed, the filesystem buffers are written repeatedly and flushed. Of course, I use a sector cache in the flash disk, but I can't cache too many sectors because of the total memory size. So, the filesystem disk blocks are written often, and that evokes more disk sector flushes. A sector flush really takes a long time, so the real copy time grows beyond control. Last but not least, the flash memory is aged uselessly.

Well, this is my old story. Just to be honest, I quite forgot that my kernel was compiled with the FULL_PREEMPTION option. Things are very much worse in that case. However, the option just makes the issue worse; the issue doesn't disappear without it.

In this old story, I played a game with, and focused on, the dirtybufthresh value. However, dirtybufthresh changes too much how and by whom buffers are flushed. I recorded the disk sector flush count and the total count of disk_strategy() calls with the BIO_WRITE command (and the total byte count to write). I used a file with a size of 2235517 bytes. When I was caching 4 dis
Re: Is there any modern alternative to pstack?
There are plenty of patches in the ports tree. At which point do you call it maintaining within the ports tree?

8 files changed, 796 insertions(+), 233 deletions(-)

is hardly what someone should call maintaining, considering the size of some of the other patches. And besides, someone was willing to contribute the patch... no sense in degrading their work if they were willing to put it up for consumption.

On Wed, Apr 04, 2012 at 08:44:31AM -0400, Eitan Adler wrote:
> On 4 April 2012 01:41, Julian Elischer wrote:
> > should be in ports?
>
> Not unless someone decides to become the new upstream and make a
> release. We do not maintain software in ports.
>
> --
> Eitan Adler

--
;s =;
Re: Is there any modern alternative to pstack?
On 04/04/2012 05:44, Eitan Adler wrote:
> Not unless someone decides to become the new upstream and make a
> release. We do not maintain software in ports.

But the upstream is on SourceForge. Even though there has been no activity there for a long while, it is easy to join that project, commit the change, and make a release. That's better than keeping private patches.

Yuri
Re: Is there any modern alternative to pstack?
On 4 April 2012 01:41, Julian Elischer wrote:
> should be in ports?

Not unless someone decides to become the new upstream and make a release. We do not maintain software in ports.

--
Eitan Adler
Re: problems with mmap() and disk caching
I forgot to attach my test program.

On 04.04.2012 13:36, Andrey Zonov wrote:
On 04.04.2012 11:17, Konstantin Belousov wrote:
> Calling madvise(MADV_RANDOM) fixes the issue, because the code to
> deactivate/cache the pages is turned off. On the other hand, it also
> turns off read-ahead for faulting, and the first loop becomes eternally
> long.

Now it takes 5 times longer. Anyway, thanks for the explanation.

> Doing MADV_WILLNEED does not fix the problem indeed, since willneed
> reactivates the pages of the object at the time of the call. To use
> MADV_WILLNEED, you would need to call it between the faults/memcpy.

I played with it, but no luck so far.

> > I've also never seen super pages; how do I make them work?
>
> They just work, at least for me. Look at the output of procstat -v
> after enough loops have finished to not cause disk activity.

The problem was in my test program. I fixed it; now I see super pages, but I'm still not satisfied. There are several tests below:

1. With madvise(MADV_RANDOM) I see almost all super pages:

$ ./mmap /mnt/random-1024 5
mmap: 1 pass took: 26.438535 (none: 0; res: 262144; super: 511; other: 0)
mmap: 2 pass took: 0.187311 (none: 0; res: 262144; super: 511; other: 0)
mmap: 3 pass took: 0.184953 (none: 0; res: 262144; super: 511; other: 0)
mmap: 4 pass took: 0.186007 (none: 0; res: 262144; super: 511; other: 0)
mmap: 5 pass took: 0.185790 (none: 0; res: 262144; super: 511; other: 0)

Should it be 512?

2. Without madvise(MADV_RANDOM):

$ ./mmap /mnt/random-1024 50
mmap: 1 pass took: 7.629745 (none: 262112; res: 32; super: 0; other: 0)
mmap: 2 pass took: 7.301720 (none: 261202; res: 942; super: 0; other: 0)
mmap: 3 pass took: 7.261416 (none: 260226; res: 1918; super: 1; other: 0)
[skip]
mmap: 49 pass took: 0.155368 (none: 0; res: 262144; super: 323; other: 0)
mmap: 50 pass took: 0.155438 (none: 0; res: 262144; super: 323; other: 0)

Only 323 pages.

3. If I just re-run the test, I don't see super pages with any size of "block".
$ ./mmap /mnt/random-1024 5 $((1<<30)) mmap: 1 pass took: 1.013939 (none: 0; res: 262144; super: 0; other: 0) mmap: 2 pass took: 0.267082 (none: 0; res: 262144; super: 0; other: 0) mmap: 3 pass took: 0.270711 (none: 0; res: 262144; super: 0; other: 0) mmap: 4 pass took: 0.268940 (none: 0; res: 262144; super: 0; other: 0) mmap: 5 pass took: 0.269634 (none: 0; res: 262144; super: 0; other: 0) 4. If I activate madvise(MADV_WILLNEDD) in the copy loop and re-run test then I see super pages only if I use "block" greater than 2Mb. $ ./mmap /mnt/random-1024 1 $((1<<21)) mmap: 1 pass took: 0.299722 (none: 0; res: 262144; super: 0; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<22)) mmap: 1 pass took: 0.271828 (none: 0; res: 262144; super: 170; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<23)) mmap: 1 pass took: 0.333188 (none: 0; res: 262144; super: 258; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<24)) mmap: 1 pass took: 0.339250 (none: 0; res: 262144; super: 303; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<25)) mmap: 1 pass took: 0.418812 (none: 0; res: 262144; super: 324; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<26)) mmap: 1 pass took: 0.360892 (none: 0; res: 262144; super: 335; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<27)) mmap: 1 pass took: 0.401122 (none: 0; res: 262144; super: 342; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<28)) mmap: 1 pass took: 0.478764 (none: 0; res: 262144; super: 345; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<29)) mmap: 1 pass took: 0.607266 (none: 0; res: 262144; super: 346; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<30)) mmap: 1 pass took: 0.901269 (none: 0; res: 262144; super: 347; other: 0) 5. If I activate madvise(MADV_WILLNEED) immediately after mmap() then I see some number of super pages (the number from test #2). 
$ ./mmap /mnt/random-1024 5 mmap: 1 pass took: 0.178666 (none: 0; res: 262144; super: 323; other: 0) mmap: 2 pass took: 0.158889 (none: 0; res: 262144; super: 323; other: 0) mmap: 3 pass took: 0.157229 (none: 0; res: 262144; super: 323; other: 0) mmap: 4 pass took: 0.156895 (none: 0; res: 262144; super: 323; other: 0) mmap: 5 pass took: 0.162938 (none: 0; res: 262144; super: 323; other: 0) 6. If I read file manually before test then I don't see super pages with any size of "block" and madvise(MADV_WILLNEED) doesn't help. $ ./mmap /mnt/random-1024 5 $((1<<30)) mmap: 1 pass took: 0.996767 (none: 0; res: 262144; super: 0; other: 0) mmap: 2 pass took: 0.311129 (none: 0; res: 262144; super: 0; other: 0) mmap: 3 pass took: 0.317430 (none: 0; res: 262144; super: 0; other: 0) mmap: 4 pass took: 0.314437 (none: 0; res: 262144; super: 0; other: 0) mmap: 5 pass took: 0.310757 (none: 0; res: 262144; super: 0; other: 0) -- Andrey Zonov /*_ * Andrey Zonov (c) 2011 */ #include #include #include #include #include #include #include #include #include int main(int argc, char **argv) { int i; int fd; int num; int block; int pagesize; size_t size; size_t none, incore, super, oth
Re: Is there any modern alternative to pstack?
On 4 Apr 2012 06:41, "Julian Elischer" wrote:
> On 4/2/12 10:12 AM, John Baldwin wrote:
>> On Monday, April 02, 2012 12:39:26 pm Yuri wrote:
>>> On 04/02/2012 05:31, John Baldwin wrote:
>>>> Hmm, I don't know if the port has it, but I did some work on pstack
>>>> a while ago to make it work with libthread_db so it at least handles
>>>> i386 ok. It needs to be modified to use something like libunwind
>>>> though or some other unwinder. And possibly it should use libelf
>>>> instead of its own ELF-parsing code.
>>>
>>> I see pstack-1.2_1 failing even on i386:
>>>
>>> pstack: cannot read context for thread 0x1879f
>>> pstack: failed to read more threads
>>
>> Yes, threads don't work for modern binaries (newer than 4.x) without
>> my changes to make it use libthread_db. You can find the patch I used
>> for this at http://www.freebsd.org/~jhb/patches/pstack_threads.patch
>
> should be in ports?

I'm on it.

Chris
Re: problems with mmap() and disk caching
On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
> Hi,
>
> I open the file, then call mmap() on the whole file and get a pointer,
> then I work with this pointer. I expect that a page should be touched
> only once to get it into memory (the disk cache?), but this doesn't
> work!
>
> I wrote the test (attached) and ran it for a 1G file generated from
> /dev/random; the result is the following:
>
> Prepare file:
> # swapoff -a
> # newfs /dev/ada0b
> # mount /dev/ada0b /mnt
> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
>
> Purge cache:
> # umount /mnt
> # mount /dev/ada0b /mnt
>
> Run test:
> $ ./mmap /mnt/random-1024 30
> mmap:  1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0)
> mmap:  2 pass took: 7.356670 (none: 261648; res: 496; super: 0; other: 0)
> mmap:  3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0)
> mmap:  4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0)
> mmap:  5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0)
> mmap:  6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0)
> mmap:  7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0)
> mmap:  8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0)
> mmap:  9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0)
> mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0)
> mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0)
> mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0)
> mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0)
> mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0)
> mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0)
> mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super: 0; other: 0)
> mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super: 0; other: 0)
> mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; other: 0)
> mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0)
> mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0)
> mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0)
> mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0)
> mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0)
> mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0)
> mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0)
> mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0)
> mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0)
> mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0)
> mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0)
> mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0)
>
> If I run this:
> $ cat /mnt/random-1024 > /dev/null
> before the test, then the result is the following:
>
> $ ./mmap /mnt/random-1024 5
> mmap:  1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0)
> mmap:  2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0)
> mmap:  3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0)
> mmap:  4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0)
> mmap:  5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0)
>
> This is what I expect. But why doesn't this work without reading the
> file manually?

The issue seems to be in some change in the behaviour of the reserv or phys allocator. I Cc:ed Alan.

What happens is that the fault handler deactivates or caches the pages preceding the one which would satisfy the fault; see the if() statement starting at line 463 of vm/vm_fault.c. Since all pages of the object in your test are clean, the pages are cached.

The next fault would need to allocate some more pages for a different index of the same object. What I see is that vm_reserv_alloc_page() returns a page that is from the cache for the same object, but for a different pindex. As an obvious result, the page is invalidated and repurposed. When the next loop starts, the page is no longer resident, so it has to be re-read from disk. The behaviour of the allocator is not consistent, so some pages are not reused, allowing the test to converge and to collect all pages of the object eventually.

Calling madvise(MADV_RANDOM) fixes the issue, because the code to deactivate/cache the pages is turned off. On the other hand, it also turns off read-ahead for faulting, and the first loop becomes eternally long.