Re: [gsoc2012] Port NetBSD's UDF implementation

2012-04-04 Thread Pedro Giffuni

Hi YongCon;

The project would be very interesting for us. I am pretty sure you will
not have problems finding a mentor.

That said, let me point out an old thread:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-May/042565.html

I think the biggest problem is that you will have to get acquainted
with FreeBSD's Virtual Memory which is different from NetBSD's.
In that same thread you will find some comments by Matt Dillon
(no idea how up to date those are).

It will not be an easy task but people find such challenges very
rewarding.

Pedro.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Starvation of realtime priority threads

2012-04-04 Thread David Xu

On 2012/4/5 11:56, Konstantin Belousov wrote:

> On Wed, Apr 04, 2012 at 06:54:06PM -0700, Sushanth Rai wrote:
> > I have a multithreaded user space program that basically runs at realtime
> > priority. Synchronization between threads is done using a spinlock. When
> > running this program on an SMP system under heavy memory pressure I see
> > that the thread holding the spinlock is starved of CPU. The CPUs are
> > effectively consumed by other threads that are spinning for the lock to
> > become available.
> >
> > After instrumenting the kernel a little bit, what I found was that under
> > memory pressure, when the user thread holding the spinlock traps into the
> > kernel due to a page fault, that thread sleeps until free pages are
> > available. The thread sleeps at PUSER priority (within vm_waitpfault()).
> > When it is ready to run, it is queued at PUSER priority even though its
> > base priority is realtime. The other sibling threads that are spinning at
> > realtime priority to acquire the spinlock starve the owner of the spinlock.
> >
> > I was wondering if the sleep in vm_waitpfault() should be a
> > MAX(td_user_pri, PUSER) instead of just PUSER. I'm running on 7.2 and it
> > looks like this logic is the same in the trunk.
>
> It just so happens that your program stumbles upon a single sleep point in
> the kernel. If for whatever reason the thread in the kernel is put off CPU
> due to failure to acquire any resource without priority propagation,
> you would get the same effect. Only blockable primitives do priority
> propagation, that is, mutexes and rwlocks, AFAIR. In other words, any
> sx/lockmgr/sleep points are vulnerable to the same issue.

This is why I suggested that POSIX realtime priority should not be boosted;
it should only be higher than PRI_MIN_TIMESHARE but lower than any priority
that msleep() callers provide. The problem is that a userland realtime
thread's busy-looping code can starve a thread in the kernel which is
holding a critical resource. In the kernel we can avoid writing dead-loop
code, but userland code cannot be trusted.

If you search for "Realtime thread priorities" in December 2010 on the
@arch list, you may find the argument.

> Speaking of exactly your problem, did you consider wiring the memory
> of your realtime process? This is a common practice, used e.g. by ntpd.




Re: Starvation of realtime priority threads

2012-04-04 Thread David Xu

On 2012/4/5 9:54, Sushanth Rai wrote:

> I have a multithreaded user space program that basically runs at realtime
> priority. Synchronization between threads is done using a spinlock. When
> running this program on an SMP system under heavy memory pressure I see
> that the thread holding the spinlock is starved of CPU. The CPUs are
> effectively consumed by other threads that are spinning for the lock to
> become available.
>
> After instrumenting the kernel a little bit, what I found was that under
> memory pressure, when the user thread holding the spinlock traps into the
> kernel due to a page fault, that thread sleeps until free pages are
> available. The thread sleeps at PUSER priority (within vm_waitpfault()).
> When it is ready to run, it is queued at PUSER priority even though its
> base priority is realtime. The other sibling threads that are spinning at
> realtime priority to acquire the spinlock starve the owner of the spinlock.
>
> I was wondering if the sleep in vm_waitpfault() should be a
> MAX(td_user_pri, PUSER) instead of just PUSER. I'm running on 7.2 and it
> looks like this logic is the same in the trunk.
>
> Thanks,
> Sushanth


I think 7.2 still has libkse, which supports static priority scheduling.
If correctness matters more than performance, you may try libkse with
process-scope threads and use priority-inherit mutexes for locking.

The kernel is known to be vulnerable when supporting user realtime threads.
I think not every locking primitive can support priority propagation, and
that is an issue.

In userland, internal library mutexes are not priority-inherit, so
starvation may happen there too. If you know what you are doing, you can
avoid calling functions that use internal mutexes, but this is rather
difficult.

Regards,
David Xu



Re: CAM disk I/O starvation

2012-04-04 Thread Alexander Leidinger
On Tue, 3 Apr 2012 14:27:43 -0700 Jerry Toung wrote:

> On 4/3/12, Gary Jennejohn  wrote:
> 
> > It would be interesting to see your patch.  I always run HEAD but
> > maybe I could use it as a base for my own mods/tests.
> >
> 
> Here is the patch

This looks fair if all your disks are working at the same time (e.g.
RAID only setup), but if you have a setup where you have multiple
disks and only one is doing something, you limit the amount of tags
which can be used. No idea what kind of performance impact this would
have.

What about the case where you have more disks than tags?

I also noticed that you do a strncmp for "da". What about
"ada" (available in 9 and 10), I would assume it suffers from the same
problem.
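
The patch itself is not quoted in this message, so the following is only an
illustration of the point about the name comparison. Both function names
are hypothetical; "da" and "ada" are the peripheral driver names mentioned
above. A two-character prefix test for "da" does not match "ada", so both
names have to be checked explicitly.

```c
#include <stdbool.h>
#include <string.h>

/* The kind of check described above: a prefix comparison against "da". */
static bool
matches_da_prefix(const char *name)
{
	return (strncmp(name, "da", 2) == 0);
}

/* Hypothetical alternative that accepts both disk peripheral names. */
static bool
is_disk_periph(const char *name)
{
	return (strcmp(name, "da") == 0 || strcmp(name, "ada") == 0);
}
```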

Bye,
Alexander.

-- 
http://www.Leidinger.net    Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org      netchild @ FreeBSD.org   : PGP ID = 72077137


Re: Starvation of realtime priority threads

2012-04-04 Thread Konstantin Belousov
On Wed, Apr 04, 2012 at 06:54:06PM -0700, Sushanth Rai wrote:
> I have a multithreaded user space program that basically runs at realtime
> priority. Synchronization between threads is done using a spinlock. When
> running this program on an SMP system under heavy memory pressure I see
> that the thread holding the spinlock is starved of CPU. The CPUs are
> effectively consumed by other threads that are spinning for the lock to
> become available.
>
> After instrumenting the kernel a little bit, what I found was that under
> memory pressure, when the user thread holding the spinlock traps into the
> kernel due to a page fault, that thread sleeps until free pages are
> available. The thread sleeps at PUSER priority (within vm_waitpfault()).
> When it is ready to run, it is queued at PUSER priority even though its
> base priority is realtime. The other sibling threads that are spinning at
> realtime priority to acquire the spinlock starve the owner of the spinlock.
>
> I was wondering if the sleep in vm_waitpfault() should be a
> MAX(td_user_pri, PUSER) instead of just PUSER. I'm running on 7.2 and it
> looks like this logic is the same in the trunk.

It just so happens that your program stumbles upon a single sleep point in
the kernel. If for whatever reason the thread in the kernel is put off CPU
due to failure to acquire any resource without priority propagation,
you would get the same effect. Only blockable primitives do priority
propagation, that is, mutexes and rwlocks, AFAIR. In other words, any
sx/lockmgr/sleep points are vulnerable to the same issue.

Speaking of exactly your problem, did you consider wiring the memory
of your realtime process? This is a common practice, used e.g. by ntpd.
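
A minimal sketch of what wiring the process memory could look like, using
the standard mlockall() call. wire_process_memory() is a hypothetical
helper name; the call typically needs sufficient privilege or a large
enough locked-memory limit, so a failure has to be handled.

```c
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: wire all current and future pages of the calling
 * process so that they cannot be paged out.
 * Returns 0 on success, or the errno value reported by mlockall().
 */
static int
wire_process_memory(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
		int error = errno;

		fprintf(stderr, "mlockall: %s\n", strerror(error));
		return (error);
	}
	return (0);
}
```

Called once at startup, before the realtime threads are created, this
keeps the lock holder from sleeping in the page fault path for pages that
are already mapped.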




GSoC call for mentor

2012-04-04 Thread Robert Lorentz
Hi,

I've been communicating with the FreeBSD GSoC admins list for a few months now, 
not realizing only 4 people are on there.  I have spoken with Ben Laurie 
(affiliated with OpenSSL) and Robert Watson regarding my GSoC idea for software 
implementations of SHA-3 hash algorithms for the purpose of inclusion within 
FreeBSD or OpenSSL.  The timeline for applications is now almost upon us, so I 
would like to finalize my plan as soon as possible to allow me time to create a 
good proposal in time to submit it. 

It seems clear that the implementation and performance analysis of the SHA-3 
candidate algorithm(s) is the interesting part of what I discussed in that 
earlier correspondence.  Whether the code is written for FreeBSD or OpenSSL's 
specific framework is not interesting and more a strategic/political decision 
than a technical one.  

After pondering the previous suggestions, I think that my project proposal 
should be roughly as follows:

- C Implementations of all 5 SHA-3 hash algorithm candidates. These 
implementations will operate in a standalone manner, with a reasonable 
interface such that the NIST SHA-3 selected algorithm's implementation could be 
easily adapted to work within OpenSSL or FreeBSD.
- Expect that alternate implementations will be explored to determine possible 
performance tradeoffs and optimal implementations. 
- Formalized analysis and discussion, formatted in a conference-quality paper  

My motivation for this work is that I am currently working on PhD research in 
the field of cryptographic engineering, recently completed my MS CpE research 
on hardware FPGA implementations of SHA-3 candidates, and my undergraduate 
degree and personal experience is in computer science (C, C++, UNIX) so this 
project is very interesting to me and I feel I have the skills and experience 
to obtain meaningful results.

I desire a mentor at this point in time because I understand that it would give 
me a better chance of my project proposal being accepted and successfully 
executed.  My hope is that one of you will agree to be my mentor, at which 
point I will create a detailed project proposal to submit to the GSoC. If I do 
not have a willing mentor I do not intend to submit a proposal. Ben and Robert 
seemed enthusiastic regarding my idea (Ben commented that the current AES 
implementation began in this way) but are too busy or lack the interest to 
become my mentor.   

I see that FreeBSD has been accepted as a program to GSoC 2012; OpenSSL is not 
listed.  Therefore it is my assumption that my proposed project would be done 
under the FreeBSD program - even if eventually this code ends up in OpenSSL and 
flows downstream to FreeBSD.  If this assumption does not satisfy you, can you 
please suggest a modification to my proposal that would make it become eligible 
for sponsorship under the FreeBSD GSoC program?

If anyone is willing to take me on for this, please send me a response. I am 
very easy to work with :)

Thanks,

Robert Lorentz


Starvation of realtime priority threads

2012-04-04 Thread Sushanth Rai
I have a multithreaded user space program that basically runs at realtime
priority. Synchronization between threads is done using a spinlock. When
running this program on an SMP system under heavy memory pressure I see
that the thread holding the spinlock is starved of CPU. The CPUs are
effectively consumed by other threads that are spinning for the lock to
become available.

After instrumenting the kernel a little bit, what I found was that under
memory pressure, when the user thread holding the spinlock traps into the
kernel due to a page fault, that thread sleeps until free pages are
available. The thread sleeps at PUSER priority (within vm_waitpfault()).
When it is ready to run, it is queued at PUSER priority even though its
base priority is realtime. The other sibling threads that are spinning at
realtime priority to acquire the spinlock starve the owner of the spinlock.

I was wondering if the sleep in vm_waitpfault() should be a
MAX(td_user_pri, PUSER) instead of just PUSER. I'm running on 7.2 and it
looks like this logic is the same in the trunk.

Thanks,
Sushanth


Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU

2012-04-04 Thread Svatopluk Kraus
2012/3/21 Konstantin Belousov :
> On Thu, Mar 15, 2012 at 08:00:41PM +0100, Svatopluk Kraus wrote:
>> 2012/3/15 Konstantin Belousov :
>> > On Tue, Mar 13, 2012 at 01:54:38PM +0100, Svatopluk Kraus wrote:
>> >> On Mon, Mar 12, 2012 at 7:19 PM, Konstantin Belousov
>> >>  wrote:
>> >> > On Mon, Mar 12, 2012 at 04:00:58PM +0100, Svatopluk Kraus wrote:
>> >> >> Hi,
>> >> >>
>> >> >>    I have solved a following problem. If a big file (according to
>> >> >> 'hidirtybuffers') is being written, the write speed is very poor.
>> >> >>
>> >> >>    It's observed on system with elan 486 and 32MB RAM (i.e., low speed
>> >> >> CPU and not too much memory) running FreeBSD-9.
>> >> >>
>> >> >>    Analysis: A file is being written. All or almost all dirty buffers
>> >> >> belong to the file. The file vnode is almost all time locked by
>> >> >> writing process. The buf_daemon() can not flush any dirty buffer as a
>> >> >> chance to acquire the file vnode lock is very low. A number of dirty
>> >> >> buffers grows up very slow and with each new dirty buffer slower,
>> >> >> because buf_daemon() eats more and more CPU time by looping on dirty
>> >> >> buffers queue (with very low or no effect).
>> >> >>
>> >> >>    This slowing down effect is started by buf_daemon() itself, when
>> >> >> 'numdirtybuffers' reaches 'lodirtybuffers' threshold and buf_daemon()
>> >> >> is waked up by own timeout. The timeout fires at 'hz' period, but
>> >> >> starts to fire at 'hz/10' immediately as buf_daemon() fails to reach
>> >> >> 'lodirtybuffers' threshold. When 'numdirtybuffers' (now slowly)
>> >> >> reaches ((lodirtybuffers + hidirtybuffers) / 2) threshold, the
>> >> >> buf_daemon() can be waked up within bdwrite() too and it's much worse.
>> >> >> Finally and with very slow speed, the 'hidirtybuffers' or
>> >> >> 'dirtybufthresh' is reached, the dirty buffers are flushed, and
>> >> >> everything starts from beginning...
>> >> > Note that for some time, bufdaemon work is distributed among bufdaemon
>> >> > thread itself and any thread that fails to allocate a buffer, esp.
>> >> > a thread that owns vnode lock and covers long queue of dirty buffers.
>> >>
>> >> However, the problem starts when numdirtybuffers reaches
>> >> lodirtybuffers count and ends around hidirtybuffers count. There are
>> >> still plenty of free buffers in system.
>> >>
>> >> >>
>> >> >>    On the system, a buffer size is 512 bytes and the default
>> >> >> thresholds are following:
>> >> >>
>> >> >>    vfs.hidirtybuffers = 134
>> >> >>    vfs.lodirtybuffers = 67
>> >> >>    vfs.dirtybufthresh = 120
>> >> >>
>> >> >>    For example, a 2MB file is copied into flash disk in about 3
>> >> >> minutes and 15 second. If dirtybufthresh is set to 40, the copy time
>> >> >> is about 20 seconds.
>> >> >>
>> >> >>    My solution is a mix of three things:
>> >> >>    1. Suppression of buf_daemon() wakeup by setting bd_request to 1 in
>> >> >> the main buf_daemon() loop.
>> >> > I cannot understand this. Please provide a patch that shows what do
>> >> > you mean there.
>> >> >
>> >>       curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
>> >>       mtx_lock(&bdlock);
>> >>       for (;;) {
>> >> -             bd_request = 0;
>> >> +             bd_request = 1;
>> >>               mtx_unlock(&bdlock);
>> > Is this a complete patch ? The change just causes lost wakeups for 
>> > bufdaemon,
>> > nothing more.
>> Yes, it's a complete patch. And exactly, it causes lost wakeups which are:
>> 1. !! UNREASONABLE !!, because bufdaemon is not sleeping,
>> 2. not wanted, because it looks that it's correct behaviour for the
>> sleep with hz/10 period. However, if the sleep with hz/10 period is
>> expected to be waked up by bd_wakeup(), then bd_request should be set
>> to 0 just before sleep() call, and then bufdaemon behaviour will be
>> clear.
> No, your description is wrong.
>
> If bufdaemon is unable to flush enough buffers and numdirtybuffers is still
> greater than lodirtybuffers, then bufdaemon enters the qsleep state
> without resetting bd_request, with timeouts of one tenth of a second.
> Your patch will cause all wakeups for this case to be lost. This is
> exactly the situation when we want bufdaemon to run harder to avoid
> possible deadlocks, not to slow down.

OK. Let's focus on the bufdaemon implementation. Currently, the qsleep
state is entered with an arbitrary bd_request value. If someone calls
bd_wakeup() while bufdaemon iterates over the dirty buffer queues, then
bd_request is set to 1. Otherwise, bd_request remains 0. I.e., sometimes
the qsleep state can only time out, and sometimes it can be woken up by
bd_wakeup(). So, is this random behaviour what is wanted?
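
The flag behaviour discussed here can be modelled in a few lines of
single-threaded C. This is a toy model, not the kernel code: the struct
and function names are hypothetical, locking is left out, and it assumes
that a bd_wakeup()-style call only issues a wakeup while the request flag
is still 0.

```c
#include <stdbool.h>

/* Toy model of the daemon state; names are illustrative only. */
struct daemon_model {
	int request;		/* models bd_request */
	int wakeups_sent;	/* wakeups actually issued */
};

/*
 * Models a bd_wakeup()-style call: only the first request after the
 * flag was cleared issues a wakeup, later ones are suppressed.
 * Returns true if a wakeup was issued.
 */
static bool
model_wakeup(struct daemon_model *d)
{
	if (d->request != 0)
		return (false);
	d->request = 1;
	d->wakeups_sent++;
	return (true);
}

/* Models the top of the daemon loop, where the flag is reset. */
static void
model_loop_top(struct daemon_model *d, int reset_value)
{
	d->request = reset_value;
}
```

With reset_value 0, the state of the flag at sleep time depends on whether
model_wakeup() ran during the iteration. With reset_value 1, as in the
patch, every call is suppressed until the flag is cleared again.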

>> All stuff around bd_request and bufdaemon sleep is under bd_lock, so
>> if bd_request is 0 and bufdaemon is not sleeping, then all wakeups are
>> unreasonable! The patch is about that mainly.
> Wakeups itself are very cheap for the running process. Mostly, it comes
> down to locking sleepq and waking all threads that are present in the
> sleepq

Re: question about amd64 pagetable page allocation.

2012-04-04 Thread vasanth rao naik sabavat
Hello Mark,

From what I understand, the virtual address of a given page table should
not change whether it is accessed through vtopte() or pmap_pte().

However, with small code change in pmap_remove_pages(), I was able to print
the values returned by these two functions.

vtopte() and pmap_pte(),

pte1 0x806432e0 pa1 346941425 m1 0xff04291cf600, pte2
0xff03463032e0 pa2 346941425 m2 0xff04291cf600
pte1 0x806432e8 pa1 346842425 m1 0xff04291c7e78, pte2
0xff03463032e8 pa2 346842425 m2 0xff04291c7e78
pte1 0x806432f0 pa1 346863425 m1 0xff04291c8df0, pte2
0xff03463032f0 pa2 346863425 m2 0xff04291c8df0

In the above result, pte1 is the result of vtopte() and pte2 is the
result of pmap_pte().

Interestingly, these two virtual addresses, pte1 and pte2, resolve to the
same physical address: pa1 == pa2.

If I am not wrong, the page tables are now mapped at two different virtual
addresses?

Could you please clarify this?
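
As a user-space analogy only (this is not pmap code), one physical page
can be visible at two virtual addresses at the same time. The sketch below
maps the same page of a temporary file twice; alias_demo() and the
temporary path are hypothetical. In the kernel case, the assumption is
that one address comes from the page table mapping used by vtopte() and
the other from the direct physical -> virtual mapping mentioned in the
quoted reply.

```c
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Map the first page of one temporary file at two virtual addresses.
 * Returns 1 if the addresses differ and a store through one mapping is
 * visible through the other, 0 if not, -1 on error.
 */
static int
alias_demo(void)
{
	char path[] = "/tmp/aliasXXXXXX";
	long pagesize = sysconf(_SC_PAGESIZE);
	int fd = mkstemp(path);
	int result = -1;

	if (fd == -1)
		return (-1);
	unlink(path);		/* the file lives on through the open fd */

	if (ftruncate(fd, pagesize) == 0) {
		char *a = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
		char *b = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);

		if (a != MAP_FAILED && b != MAP_FAILED) {
			a[0] = 'x';	/* store through the first alias */
			result = (a != b && b[0] == 'x');
		}
		if (a != MAP_FAILED)
			munmap(a, pagesize);
		if (b != MAP_FAILED)
			munmap(b, pagesize);
	}
	close(fd);
	return (result);
}
```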

Thanks,
Vasanth

On Tue, Apr 3, 2012 at 3:18 PM, Mark Tinguely wrote:

> On Tue, Apr 3, 2012 at 1:52 PM, vasanth rao naik sabavat
>  wrote:
> > Hello Mark,
> >
> > I think pmap_remove_pages() is executed only for the current process.
> >
> > #ifdef PMAP_REMOVE_PAGES_CURPROC_ONLY
> >         if (pmap != vmspace_pmap(curthread->td_proc->p_vmspace)) {
> >                 printf("warning: pmap_remove_pages called with non-current pmap\n");
> >                 return;
> >         }
> > #endif
> >
> > I still don't get why this was removed?
> >
> > Thanks,
> > Vasanth
>
>
> That is pretty old code. Newer code does not make that assumption.
>
> Without the assumption that the pages are from the current map, then you
> have to use the direct physical -> virtual mapping:
>
> #ifdef PMAP_REMOVE_PAGES_CURPROC_ONLY
>         pte = vtopte(pv->pv_va);
> #else
>         pte = pmap_pte(pmap, pv->pv_va);
> #endif
>
> --Mark Tinguely.
>


Re: Is there any modern alternative to pstack?

2012-04-04 Thread Julian Elischer

On 4/4/12 5:44 AM, Eitan Adler wrote:

> On 4 April 2012 01:41, Julian Elischer  wrote:
> > should be in ports?
>
> Not unless someone decides to become the new upstream and make a
> release. We do not maintain software in ports.

but we do add patches to make things work on FreeBSD.




Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU

2012-04-04 Thread Svatopluk Kraus
On Wed, Mar 21, 2012 at 5:55 AM, Adrian Chadd  wrote:
> Hi,
>
> I'm interested in this, primarily because I'm tinkering with file
> storage stuff on my little (most wifi targetted) embedded MIPS
> platforms.
>
> So what's the story here? How can I reproduce your issue and do some
> of my own profiling/investigation?
>
>
> Adrian

Hi,

your interest has made me do a more solid and comparable investigation on
my embedded ELAN486 platform. With more test results, I made a full
trace of the related VFS, filesystem, and disk function calls. It took
some time to understand what the issue really is about.

My test case:
Single file copy (no O_FSYNC). It means that no other filesystem
operation is served. The file size must be big enough with respect to the
hidirtybuffers value. Other processes on the machine where the test was
run were almost inactive. The real copy time was profiled. In all
tests, the machine was booted, a file was copied, the file was removed,
and the machine was rebooted. Thus, the file was always copied into the
same disk layout.

The motivation is that my embedded machines mostly don't do any writing to
disk. Only during a software update does a single process write to the
disk (file by file). That need not be a problem at all, but an update must
be successful even under full CPU load. So, the writing should be tuned so
that it does not affect other processes too much and finishes in finite
time.

On my embedded ELAN486 machines, flash memory is used as a disk. It
means that reading is very fast, but writing is slow. Further, flash
memory is divided into sectors and only a complete sector can be
erased at once. A sector erase is a very time-expensive action.

When I tried to tune VFS by changing various parameters, I found
out that the real copy time depends on two things. Both of them are
related to bufdaemon, namely its feature of trying to work harder if
its buffer flushing mission is failing. It's no surprise that the
best copy times were achieved when bufdaemon was excluded from buffer
flushing altogether (by VFS parameter settings).

With respect to the real copy time, this bufdaemon feature brings along:
1. bufdaemon runtime itself,
2. very frequent flushing of filesystem buffers.

What really happens in the test case on my machine:

A copy program uses a buffer for copying. The default buffer size is
128 KiB in my case. The simplified sys_write() implementation for
DTYPE_VNODE and VREG type is as follows:

sys_write()
{
 dofilewrite()
 {
  bwillwrite()
  fo_write() => vn_write()
  {
   bwillwrite()
   vn_lock()
   VOP_WRITE()
   VOP_UNLOCK()
  }
 }
}

So, all 128 KiB is written under the VNODE lock. When I take the
machine defaults again:

hidirtybuffers: 134
lodirtybuffers: 67
dirtybufthresh: 120
buffer size (filesystem block size): 512 bytes

and do some simple calculations:

134 * 512 = 68608  -> high water byte count
120 * 512 = 61440  -> dirtybufthresh byte count
67 * 512 = 34304   -> low water byte count

then it's obvious that bufdaemon has something to do during each
sys_write(). However, almost all dirty buffers belong to the new file's
VNODE, and the VNODE is locked. What remains are filesystem buffers
only, i.e., the superblock buffer and free block bitmap buffers. So,
bufdaemon iterates over the whole dirty buffers queue, which takes a
SIGNIFICANT time on my machine, and almost never finds a buffer it is
able to flush. If bufdaemon flushes one or two buffers, kern_yield() is
called, and a new iteration is started until no buffer is flushed. So,
very often TWO full iterations over the dirty buffers queue are done to
flush only one or two filesystem buffers and still fail to reach the
lodirtybuffers threshold. The bufdaemon runtime keeps growing. Moreover,
the frequent filesystem buffer flushing brings along higher CPU load
(geom down thread, geom up thread, disk thread scheduling) and
re-ordering of disk block writes. The correct disk block write order is
important for the flash disk. Further, while the file data buffers age
but are not flushed, filesystem buffers are written repeatedly and
flushed each time.
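
The arithmetic behind this can be written down as a small sketch. The
constants are the values quoted in this message; the helper names are made
up for illustration. A single 128 KiB write() dirties 256 buffers of
512 bytes, which is well above all three thresholds.

```c
#include <stddef.h>

/* Values quoted in this message; the enum names are illustrative. */
enum {
	BUF_SIZE = 512,			/* filesystem block size in bytes */
	HIDIRTYBUFFERS = 134,
	LODIRTYBUFFERS = 67,
	DIRTYBUFTHRESH = 120,
	COPY_CHUNK = 128 * 1024		/* bytes written per write() call */
};

/* Number of buffers needed to hold the given number of bytes. */
static size_t
buffers_for(size_t bytes)
{
	return ((bytes + BUF_SIZE - 1) / BUF_SIZE);
}

/* Number of bytes covered by the given number of buffers. */
static size_t
threshold_bytes(size_t nbufs)
{
	return (nbufs * BUF_SIZE);
}
```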

Of course, I use a sector cache in the flash disk, but I can't cache
too many sectors because of the total memory size. So, filesystem disk
blocks are written often, and that causes more disk sector flushes. A
sector flush takes a really long time, so the real copy time grows beyond
control. Last but not least, the flash memory is worn out needlessly.

Well, this is my old story. Just to be honest, I quite forgot that my
kernel was compiled with the FULL_PREEMPTION option. Things are very
much worse in this case. However, the option just makes the issue
worse; the issue doesn't disappear without it.

In this old story, I played with and focused on the dirtybufthresh
value. However, dirtybufthresh changes too much how and by whom
buffers are flushed. I recorded the disk sector flush count and
the total count of disk_strategy() calls with the BIO_WRITE command (and
the total byte count to write). I used a file with a size of 2235517
bytes. When I
was caching 4 dis

Re: Is there any modern alternative to pstack?

2012-04-04 Thread Jason Hellenthal

There are plenty of patches in the ports tree. At which point do you
call it maintaining within the ports tree ?

8 files changed, 796 insertions(+), 233 deletions(-)

Is hardly what someone should call maintaining considering the size of
some of the other patches. And besides someone was willing to contribute
the patch... no sense in degrading their work if they were willing to
put it up for consumption.

On Wed, Apr 04, 2012 at 08:44:31AM -0400, Eitan Adler wrote:
> On 4 April 2012 01:41, Julian Elischer  wrote:
> > should be in ports?
> 
> Not unless someone decides to become the new upstream and make a
> release. We do not maintain software in ports.
> 
> -- 
> Eitan Adler



Re: Is there any modern alternative to pstack?

2012-04-04 Thread Yuri

On 04/04/2012 05:44, Eitan Adler wrote:

> Not unless someone decides to become the new upstream and make a
> release. We do not maintain software in ports.
>
> -- Eitan Adler


But upstream is on SourceForge. Even though there has been no activity
there for a long while, it is easy to join that project, commit the change
and make a release.

That is better than keeping private patches.

Yuri


Re: Is there any modern alternative to pstack?

2012-04-04 Thread Eitan Adler
On 4 April 2012 01:41, Julian Elischer  wrote:
> should be in ports?

Not unless someone decides to become the new upstream and make a
release. We do not maintain software in ports.

-- 
Eitan Adler


Re: problems with mmap() and disk caching

2012-04-04 Thread Andrey Zonov

I forgot to attach my test program.

On 04.04.2012 13:36, Andrey Zonov wrote:

On 04.04.2012 11:17, Konstantin Belousov wrote:


Calling madvise(MADV_RANDOM) fixes the issue, because the code to
deactivate/cache the pages is turned off. On the other hand, it also
turns off read-ahead for faulting, and the first loop becomes eternally
long.


Now it takes 5 times longer. Anyway, thanks for explanation.



Doing MADV_WILLNEED does not fix the problem indeed, since willneed
reactivates the pages of the object at the time of call. To use
MADV_WILLNEED, you would need to call it between faults/memcpy.



I played with it, but no luck so far.



I've also never seen super pages, how to make them work?

They just work, at least for me. Look at the output of procstat -v
after enough loops finished to not cause disk activity.



The problem was in my test program. I fixed it, now I see super pages
but I'm still not satisfied. There are several tests below:

1. With madvise(MADV_RANDOM) I see almost all super pages:
$ ./mmap /mnt/random-1024 5
mmap: 1 pass took: 26.438535 (none: 0; res: 262144; super: 511; other: 0)
mmap: 2 pass took: 0.187311 (none: 0; res: 262144; super: 511; other: 0)
mmap: 3 pass took: 0.184953 (none: 0; res: 262144; super: 511; other: 0)
mmap: 4 pass took: 0.186007 (none: 0; res: 262144; super: 511; other: 0)
mmap: 5 pass took: 0.185790 (none: 0; res: 262144; super: 511; other: 0)

Should it be 512?
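
For reference, the arithmetic behind the expected numbers can be checked with a few lines of C. This assumes 4 KiB base pages and 2 MiB super pages, which is consistent with "res: 262144" for the 1G file and with the 2Mb block size mentioned in test 4, but is not stated explicitly in the thread. The helper name units_in() is made up for this illustration.

```c
#include <stddef.h>

/* How many whole units of size "unit" fit in "bytes". */
static size_t
units_in(size_t bytes, size_t unit)
{
	return (bytes / unit);
}
```

Under those assumptions, units_in((size_t)1 << 30, 4096) gives 262144 base pages and units_in((size_t)1 << 30, (size_t)2 << 20) gives 512 super pages, so 512 is the upper bound for the "super" counter on this file.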

2. Without madvise(MADV_RANDOM):
$ ./mmap /mnt/random-1024 50
mmap: 1 pass took: 7.629745 (none: 262112; res: 32; super: 0; other: 0)
mmap: 2 pass took: 7.301720 (none: 261202; res: 942; super: 0; other: 0)
mmap: 3 pass took: 7.261416 (none: 260226; res: 1918; super: 1; other: 0)
[skip]
mmap: 49 pass took: 0.155368 (none: 0; res: 262144; super: 323; other: 0)
mmap: 50 pass took: 0.155438 (none: 0; res: 262144; super: 323; other: 0)

Only 323 pages.

3. If I just re-run test I don't see super pages with any size of "block".

$ ./mmap /mnt/random-1024 5 $((1<<30))
mmap: 1 pass took: 1.013939 (none: 0; res: 262144; super: 0; other: 0)
mmap: 2 pass took: 0.267082 (none: 0; res: 262144; super: 0; other: 0)
mmap: 3 pass took: 0.270711 (none: 0; res: 262144; super: 0; other: 0)
mmap: 4 pass took: 0.268940 (none: 0; res: 262144; super: 0; other: 0)
mmap: 5 pass took: 0.269634 (none: 0; res: 262144; super: 0; other: 0)

4. If I activate madvise(MADV_WILLNEED) in the copy loop and re-run test
then I see super pages only if I use "block" greater than 2Mb.

$ ./mmap /mnt/random-1024 1 $((1<<21))
mmap: 1 pass took: 0.299722 (none: 0; res: 262144; super: 0; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<22))
mmap: 1 pass took: 0.271828 (none: 0; res: 262144; super: 170; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<23))
mmap: 1 pass took: 0.333188 (none: 0; res: 262144; super: 258; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<24))
mmap: 1 pass took: 0.339250 (none: 0; res: 262144; super: 303; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<25))
mmap: 1 pass took: 0.418812 (none: 0; res: 262144; super: 324; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<26))
mmap: 1 pass took: 0.360892 (none: 0; res: 262144; super: 335; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<27))
mmap: 1 pass took: 0.401122 (none: 0; res: 262144; super: 342; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<28))
mmap: 1 pass took: 0.478764 (none: 0; res: 262144; super: 345; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<29))
mmap: 1 pass took: 0.607266 (none: 0; res: 262144; super: 346; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<30))
mmap: 1 pass took: 0.901269 (none: 0; res: 262144; super: 347; other: 0)

5. If I activate madvise(MADV_WILLNEED) immediately after mmap() then I
see some number of super pages (the number from test #2).

$ ./mmap /mnt/random-1024 5
mmap: 1 pass took: 0.178666 (none: 0; res: 262144; super: 323; other: 0)
mmap: 2 pass took: 0.158889 (none: 0; res: 262144; super: 323; other: 0)
mmap: 3 pass took: 0.157229 (none: 0; res: 262144; super: 323; other: 0)
mmap: 4 pass took: 0.156895 (none: 0; res: 262144; super: 323; other: 0)
mmap: 5 pass took: 0.162938 (none: 0; res: 262144; super: 323; other: 0)

6. If I read file manually before test then I don't see super pages with
any size of "block" and madvise(MADV_WILLNEED) doesn't help.

$ ./mmap /mnt/random-1024 5 $((1<<30))
mmap: 1 pass took: 0.996767 (none: 0; res: 262144; super: 0; other: 0)
mmap: 2 pass took: 0.311129 (none: 0; res: 262144; super: 0; other: 0)
mmap: 3 pass took: 0.317430 (none: 0; res: 262144; super: 0; other: 0)
mmap: 4 pass took: 0.314437 (none: 0; res: 262144; super: 0; other: 0)
mmap: 5 pass took: 0.310757 (none: 0; res: 262144; super: 0; other: 0)

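
The none/res/super counters in the outputs above look like per-page residency statistics. A minimal sketch of how such counters could be gathered with mincore(2) follows. The attached program is truncated in this archive, so the function name count_pages(), the struct layout, and the per-base-page counting are assumptions and may differ from what the real test does. MINCORE_SUPER is a FreeBSD flag, so the code only uses it where the header defines it.

```c
#define _DEFAULT_SOURCE	/* exposes mincore() on glibc with strict -std */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MINCORE_INCORE
#define MINCORE_INCORE 0x1	/* Linux documents only this low bit */
#endif

/* Hypothetical counters, loosely modelled on the test output fields. */
struct page_counts {
	size_t none;	/* pages not resident */
	size_t incore;	/* pages resident */
	size_t super;	/* resident pages flagged as part of a super page */
};

/*
 * Classify every base page of [addr, addr + len). "addr" must be
 * page-aligned, as returned by mmap(). Returns 0 on success, -1 on error.
 */
static int
count_pages(void *addr, size_t len, struct page_counts *pc)
{
	size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = (len + pagesize - 1) / pagesize;
	unsigned char *vec;
	size_t i;

	pc->none = pc->incore = pc->super = 0;
	vec = malloc(npages);
	if (vec == NULL)
		return (-1);
	/* The vector type is char * on FreeBSD and unsigned char * on Linux. */
	if (mincore(addr, len, (void *)vec) == -1) {
		free(vec);
		return (-1);
	}
	for (i = 0; i < npages; i++) {
		if ((vec[i] & MINCORE_INCORE) == 0) {
			pc->none++;
			continue;
		}
		pc->incore++;
#ifdef MINCORE_SUPER
		if (vec[i] & MINCORE_SUPER)
			pc->super++;
#endif
	}
	free(vec);
	return (0);
}
```

Calling count_pages() on the mapping after each pass would give one line of statistics per pass, similar in spirit to the output shown in the tests.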



--
Andrey Zonov
/*_
 * Andrey Zonov (c) 2011
 */

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int
main(int argc, char **argv)
{
int i;
int fd;
int num;
int block;
int pagesize;
size_t size;
size_t none, incore, super, oth

Re: problems with mmap() and disk caching

2012-04-04 Thread Andrey Zonov

On 04.04.2012 11:17, Konstantin Belousov wrote:

> Calling madvise(MADV_RANDOM) fixes the issue, because the code to
> deactivate/cache the pages is turned off. On the other hand, it also
> turns off read-ahead for faulting, and the first loop becomes eternally
> long.

Now it takes 5 times longer.  Anyway, thanks for explanation.

> Doing MADV_WILLNEED does not fix the problem indeed, since willneed
> reactivates the pages of the object at the time of call. To use
> MADV_WILLNEED, you would need to call it between faults/memcpy.

I played with it, but no luck so far.

>> I've also never seen super pages, how to make them work?
>
> They just work, at least for me. Look at the output of procstat -v
> after enough loops finished to not cause disk activity.

The problem was in my test program.  I fixed it, now I see super pages
but I'm still not satisfied.  There are several tests below:


1. With madvise(MADV_RANDOM) I see almost all super pages:
$ ./mmap /mnt/random-1024 5
mmap: 1 pass took: 26.438535 (none: 0; res: 262144; super: 511; other: 0)
mmap: 2 pass took: 0.187311 (none: 0; res: 262144; super: 511; other: 0)
mmap: 3 pass took: 0.184953 (none: 0; res: 262144; super: 511; other: 0)
mmap: 4 pass took: 0.186007 (none: 0; res: 262144; super: 511; other: 0)
mmap: 5 pass took: 0.185790 (none: 0; res: 262144; super: 511; other: 0)


Should it be 512?

2. Without madvise(MADV_RANDOM):
$ ./mmap /mnt/random-1024 50
mmap: 1 pass took: 7.629745 (none: 262112; res: 32; super: 0; other: 0)
mmap: 2 pass took: 7.301720 (none: 261202; res: 942; super: 0; other: 0)
mmap: 3 pass took: 7.261416 (none: 260226; res: 1918; super: 1; other: 0)
[skip]
mmap: 49 pass took: 0.155368 (none: 0; res: 262144; super: 323; other: 0)
mmap: 50 pass took: 0.155438 (none: 0; res: 262144; super: 323; other: 0)


Only 323 pages.

3. If I just re-run test I don't see super pages with any size of "block".

$ ./mmap /mnt/random-1024 5 $((1<<30))
mmap: 1 pass took: 1.013939 (none: 0; res: 262144; super: 0; other: 0)
mmap: 2 pass took: 0.267082 (none: 0; res: 262144; super: 0; other: 0)
mmap: 3 pass took: 0.270711 (none: 0; res: 262144; super: 0; other: 0)
mmap: 4 pass took: 0.268940 (none: 0; res: 262144; super: 0; other: 0)
mmap: 5 pass took: 0.269634 (none: 0; res: 262144; super: 0; other: 0)


4. If I activate madvise(MADV_WILLNEED) in the copy loop and re-run test
then I see super pages only if I use "block" greater than 2Mb.

$ ./mmap /mnt/random-1024 1 $((1<<21))
mmap: 1 pass took: 0.299722 (none: 0; res: 262144; super: 0; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<22))
mmap: 1 pass took: 0.271828 (none: 0; res: 262144; super: 170; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<23))
mmap: 1 pass took: 0.333188 (none: 0; res: 262144; super: 258; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<24))
mmap: 1 pass took: 0.339250 (none: 0; res: 262144; super: 303; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<25))
mmap: 1 pass took: 0.418812 (none: 0; res: 262144; super: 324; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<26))
mmap: 1 pass took: 0.360892 (none: 0; res: 262144; super: 335; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<27))
mmap: 1 pass took: 0.401122 (none: 0; res: 262144; super: 342; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<28))
mmap: 1 pass took: 0.478764 (none: 0; res: 262144; super: 345; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<29))
mmap: 1 pass took: 0.607266 (none: 0; res: 262144; super: 346; other: 0)
$ ./mmap /mnt/random-1024 1 $((1<<30))
mmap: 1 pass took: 0.901269 (none: 0; res: 262144; super: 347; other: 0)


5. If I activate madvise(MADV_WILLNEED) immediately after mmap() then I 
see some number of super pages (the number from test #2).


$ ./mmap /mnt/random-1024 5
mmap: 1 pass took: 0.178666 (none: 0; res: 262144; super: 323; other: 0)
mmap: 2 pass took: 0.158889 (none: 0; res: 262144; super: 323; other: 0)
mmap: 3 pass took: 0.157229 (none: 0; res: 262144; super: 323; other: 0)
mmap: 4 pass took: 0.156895 (none: 0; res: 262144; super: 323; other: 0)
mmap: 5 pass took: 0.162938 (none: 0; res: 262144; super: 323; other: 0)


6. If I read file manually before test then I don't see super pages with 
any size of "block" and madvise(MADV_WILLNEED) doesn't help.


$ ./mmap /mnt/random-1024 5 $((1<<30))
mmap: 1 pass took: 0.996767 (none: 0; res: 262144; super: 0; other: 0)
mmap: 2 pass took: 0.311129 (none: 0; res: 262144; super: 0; other: 0)
mmap: 3 pass took: 0.317430 (none: 0; res: 262144; super: 0; other: 0)
mmap: 4 pass took: 0.314437 (none: 0; res: 262144; super: 0;

Re: Is there any modern alternative to pstack?

2012-04-04 Thread Chris Rees
On 4 Apr 2012 06:41, "Julian Elischer"  wrote:
>
> On 4/2/12 10:12 AM, John Baldwin wrote:
>>
>> On Monday, April 02, 2012 12:39:26 pm Yuri wrote:
>>>
>>> On 04/02/2012 05:31, John Baldwin wrote:

>>>> Hmm, I don't know if the port has it, but I did some work on pstack a while
>>>> ago to make it work with libthread_db so it at least handles i386 ok. It
>>>> needs to be modified to use something like libunwind though or some other
>>>> unwinder.  And possibly it should use libelf instead of its own ELF-parsing
>>>> code.
>>>
>>> I see pstack -1.2_1 failing even on i386:
>>>
>>> pstack: cannot read context for thread 0x1879f
>>> pstack: failed to read more threads
>>
>> Yes, threads don't work for modern binaries (newer than 4.x) without my changes
>> to make it use libthread_db.  You can find the patch I used for this at
>> http://www.freebsd.org/~jhb/patches/pstack_threads.patch
>
>
> should be in ports?
>
>

I'm on it.

Chris


Re: problems with mmap() and disk caching

2012-04-04 Thread Konstantin Belousov
On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
> Hi,
> 
> I open the file, then call mmap() on the whole file and get pointer, 
> then I work with this pointer.  I expect that page should be only once 
> touched to get it into the memory (disk cache?), but this doesn't work!
> 
> I wrote the test (attached) and ran it for the 1G file generated from 
> /dev/random, the result is the following:
> 
> Prepare file:
> # swapoff -a
> # newfs /dev/ada0b
> # mount /dev/ada0b /mnt
> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
> 
> Purge cache:
> # umount /mnt
> # mount /dev/ada0b /mnt
> 
> Run test:
> $ ./mmap /mnt/random-1024 30
> mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0)
> mmap: 2 pass took: 7.356670 (none: 261648; res: 496; super: 0; other: 0)
> mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0)
> mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0)
> mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0)
> mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0)
> mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0)
> mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0)
> mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0)
> mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0)
> mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0)
> mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0)
> mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0)
> mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0)
> mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0)
> mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super: 0; other: 0)
> mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super: 0; other: 0)
> mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; other: 0)
> mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0)
> mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0)
> mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0)
> mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0)
> mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0)
> mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0)
> mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0)
> mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0)
> mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0)
> mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0)
> mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0)
> mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0)
> 
> If I run this:
> $ cat /mnt/random-1024 > /dev/null
> before the test, then the result is the following:
> 
> $ ./mmap /mnt/random-1024 5
> mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0)
> mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0)
> mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0)
> mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0)
> mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0)
> 
> This is what I expect.  But why this doesn't work without reading file 
> manually?
The issue seems to be in some change of the behaviour of the reserv or
phys allocator. I Cc:ed Alan.

What happens is that the fault handler deactivates or caches the pages
previous to the one which would satisfy the fault. See the if()
statement starting at line 463 of vm/vm_fault.c. Since all pages
of the object in your test are clean, the pages are cached.

The next fault needs to allocate some more pages for a different index
of the same object. What I see is that vm_reserv_alloc_page() returns a
page that is from the cache for the same object, but a different pindex.
As an obvious result, the page is invalidated and repurposed. When the
next loop starts, the page is not resident anymore, so it has to be
re-read from disk.

The behaviour of the allocator is not consistent, so some pages are not
reused, allowing the test to converge and to collect all pages of the
object eventually.

Calling madvise(MADV_RANDOM) fixes the issue, because the code to
deactivate/cache the pages is turned off. On the oth