Re: No bus_space_read_8 on x86 ?

2012-10-10 Thread Carl Delsey
Sorry for the slow response. I was dealing with a bit of a family 
emergency. Responses inline below.


On 10/09/12 08:54, John Baldwin wrote:

On Monday, October 08, 2012 4:59:24 pm Warner Losh wrote:

On Oct 5, 2012, at 10:08 AM, John Baldwin wrote:



I think cxgb* already have an implementation.  For amd64 we should certainly
have bus_space_*_8(), at least for SYS_RES_MEMORY.  I think they should fail
for SYS_RES_IOPORT.  I don't think we can force a compile-time error, though;
we would just have to return -1 on reads or some such?


Yes. Exactly what I was thinking.


I believe it was because bus reads weren't guaranteed to be atomic on i386.
I don't know if that's still the case or a concern, but it was an intentional
omission.

True.  If you are on a 32-bit system you can read two 4-byte values and
then build a 64-bit value.  For 64-bit platforms we should offer
bus_read_8(), however.


I believe there is still no way to perform a 64-bit read on an i386 (at
least not without messing with SSE instructions), so if you have to read
a 64-bit register, you are stuck with doing two 32-bit reads and
concatenating them. I figure we may as well provide an implementation
for those who have to do that, as well as the implementation for 64-bit
platforms.
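
As a rough illustration of such a 32-bit fallback (an untested sketch with a
made-up helper name, not the actual patch), the 64-bit value can be composed
from two bus_space_read_4() calls; note that the two halves are not read
atomically and a little-endian register layout is assumed:

#include <sys/param.h>
#include <machine/bus.h>

/* Illustrative only: build a 64-bit value from two 32-bit bus reads. */
static __inline uint64_t
bus_space_read_8_fallback(bus_space_tag_t tag, bus_space_handle_t handle,
    bus_size_t offset)
{
    uint32_t lo, hi;

    lo = bus_space_read_4(tag, handle, offset);
    hi = bus_space_read_4(tag, handle, offset + 4);
    return (((uint64_t)hi << 32) | lo);
}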


Anyhow, it sounds like we are basically in agreement. I'll put together 
a patch and send it out for review.


Thanks,
Carl

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: SMP Version of tar

2012-10-10 Thread Wojciech Puchar


Tim is correct in that the gzip data stream allows for concatenation of
compressed blocks of data, so you might break the input stream into
a bunch of blocks [A, B, C, etc], compress them, and then append the
results together as [A.gz, B.gz, C.gz, etc]; when uncompressed, you
will get the original input stream.

I think that Wojciech's point is that the compressed data stream for
the single data stream is different from the compressed data stream
of [A.gz, B.gz, C.gz, etc].  Both will decompress to the same thing,
but the intermediate compressed representation will be different.


So, after your response, it is clear that a parallel-generated tar.gz will
be different and have slightly worse compression (which can be ignored),
but WILL be compatible with standard gzip, since gzip can decompress
concatenated streams, which I wasn't aware of.


That's good. At the same time, a parallel tar will fall back to a single
thread when unpacking a standard .tar.gz. That's not a big deal, as gzip
decompression is ultra-fast and I/O is usually the limit.



___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Kernel memory usage

2012-10-10 Thread Ryan Stone
On Mon, Oct 8, 2012 at 9:26 PM, Sushanth Rai  wrote:
> I was trying to correlate the output from "top" with what I get from vmstat -z.
> I don't have any user programs that wire memory. Given that, I'm assuming the
> wired memory count shown by "top" is memory used by the kernel. Now I would like
> to find out how the kernel is using this "wired" memory. So, I look at dynamic
> memory allocated by the kernel using "vmstat -z". I think memory allocated via
> malloc() is serviced by zones if the allocation size is <4k, so I'm not sure
> how useful "vmstat -m" is. I also add up memory used by the buffer cache. Is
> there any other significant chunk I'm missing? Does vmstat -m show memory
> that is not accounted for in vmstat -z?

All allocations by malloc that are larger than a single page are
served by uma_large_malloc, and as far as I can tell these allocations
will not be accounted for in vmstat -z (they will, of course, be
accounted for in vmstat -m).  Similarly, all allocations through
contigmalloc will not be accounted for in vmstat -z.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: NFS server bottlenecks

2012-10-10 Thread Nikolay Denev

On Oct 10, 2012, at 3:18 AM, Rick Macklem  wrote:

> Nikolay Denev wrote:
>> On Oct 4, 2012, at 12:36 AM, Rick Macklem 
>> wrote:
>> 
>>> Garrett Wollman wrote:
 
>> Simple: just use a separate mutex for each list that a cache
>> entry
>> is on, rather than a global lock for everything. This would
>> reduce
>> the mutex contention, but I'm not sure how significantly since I
>> don't have the means to measure it yet.
>> 
> Well, since the cache trimming is removing entries from the lists,
> I
> don't
> see how that can be done with a global lock for list updates?
 
 Well, the global lock is what we have now, but the cache trimming
 process only looks at one list at a time, so not locking the list
 that
 isn't being iterated over probably wouldn't hurt, unless there's
 some
 mechanism (that I didn't see) for entries to move from one list to
 another. Note that I'm considering each hash bucket a separate
 "list". (One issue to worry about in that case would be cache-line
 contention in the array of hash buckets; perhaps
 NFSRVCACHE_HASHSIZE
 ought to be increased to reduce that.)
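
For illustration, a hypothetical per-bucket layout could look like the
sketch below (the names are made up and this is not the actual NFS server
code, just the locking pattern being discussed):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define RC_HASHSIZE 500                 /* assumed size, cf. NFSRVCACHE_HASHSIZE */

struct rc_entry {
    LIST_ENTRY(rc_entry) re_hash;       /* per-bucket hash chain */
    /* ... request identity and cached reply would go here ... */
};

static struct rc_bucket {
    struct mtx              rb_lock;
    LIST_HEAD(, rc_entry)   rb_head;
} rc_table[RC_HASHSIZE];

static void
rc_init(void)
{
    int i;

    for (i = 0; i < RC_HASHSIZE; i++) {
        mtx_init(&rc_table[i].rb_lock, "rcbucket", NULL, MTX_DEF);
        LIST_INIT(&rc_table[i].rb_head);
    }
}

static void
rc_insert(struct rc_entry *re, uint32_t hash)
{
    struct rc_bucket *rb = &rc_table[hash % RC_HASHSIZE];

    /* Only this bucket is serialized; other buckets stay uncontended. */
    mtx_lock(&rb->rb_lock);
    LIST_INSERT_HEAD(&rb->rb_head, re, re_hash);
    mtx_unlock(&rb->rb_lock);
}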
 
>>> Yea, a separate mutex for each hash list might help. There is also
>>> the
>>> LRU list that all entries end up on, that gets used by the trimming
>>> code.
>>> (I think? I wrote this stuff about 8 years ago, so I haven't looked
>>> at
>>> it in a while.)
>>> 
>>> Also, increasing the hash table size is probably a good idea,
>>> especially
>>> if you reduce how aggressively the cache is trimmed.
>>> 
> Only doing it once/sec would result in a very large cache when
> bursts of
> traffic arrive.
 
 My servers have 96 GB of memory so that's not a big deal for me.
 
>>> This code was originally "production tested" on a server with
>>> 1Gbyte,
>>> so times have changed a bit;-)
>>> 
> I'm not sure I see why doing it as a separate thread will improve
> things.
> There are N nfsd threads already (N can be bumped up to 256 if you
> wish)
> and having a bunch more "cache trimming threads" would just
> increase
> contention, wouldn't it?
 
 Only one cache-trimming thread. The cache trim holds the (global)
 mutex for much longer than any individual nfsd service thread has
 any
 need to, and having N threads doing that in parallel is why it's so
 heavily contended. If there's only one thread doing the trim, then
 the nfsd service threads aren't spending time contending on the
 mutex (it will be held less frequently and for shorter periods).
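
A rough sketch of that single-trimmer approach (purely illustrative, with
made-up names; not an actual patch) could use one kernel process that the
nfsd threads merely wake when the cache grows:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/kthread.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>

static struct mtx drc_lock;             /* stand-in for the global DRC mutex */
static u_int drc_size;                  /* entry count, updated under drc_lock */
static u_int drc_highwater = 4096;      /* hypothetical tunable */

static void
drc_evict_one(void)
{
    /* Placeholder: unlink the LRU entry, free it, and drop the count. */
    drc_size--;
}

/* The single trimming thread: all trimming happens here, in one place. */
static void
drc_trimmer(void *arg __unused)
{
    mtx_lock(&drc_lock);
    for (;;) {
        while (drc_size <= drc_highwater)
            msleep(&drc_size, &drc_lock, PVM, "drctrim", hz);
        while (drc_size > drc_highwater)
            drc_evict_one();
    }
}

/* nfsd threads call this after inserting an entry; no mutex needed. */
static void
drc_poke_trimmer(void)
{
    if (drc_size > drc_highwater)
        wakeup(&drc_size);
}

/* Run once at startup, e.g. from the nfsd initialization path. */
static void
drc_trimmer_start(void)
{
    mtx_init(&drc_lock, "drcsketch", NULL, MTX_DEF);
    kproc_create(drc_trimmer, NULL, NULL, 0, 0, "drctrimmer");
}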
 
>>> I think the little drc2.patch, which will keep the nfsd threads from
>>> acquiring the mutex and doing the trimming most of the time, might be
>>> sufficient. I still don't see why a separate trimming thread would be
>>> an advantage. I'd also be worried that the one cache-trimming thread
>>> won't get the job done soon enough.
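
The idea being described might look roughly like the following sketch
(illustrative, with made-up names; not the actual drc2.patch): each nfsd
thread tests an approximate size without the lock and only acquires the
mutex when trimming is actually needed:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx cache_lock;           /* assumed global mutex, mtx_init()ed elsewhere */
static volatile u_int cache_size;       /* approximate entry count */
static u_int cache_highwater = 4096;    /* hypothetical tunable */

static void
cache_evict_lru_entry(void)
{
    /* Placeholder: unlink the LRU entry, free it, and drop the count. */
    cache_size--;
}

/*
 * Called by an nfsd thread after it finishes an RPC.  In the common case
 * only the unlocked comparison runs, so most RPCs never touch the mutex
 * here at all.
 */
static void
cache_trim_maybe(void)
{
    if (cache_size <= cache_highwater)
        return;

    mtx_lock(&cache_lock);
    while (cache_size > cache_highwater)
        cache_evict_lru_entry();
    mtx_unlock(&cache_lock);
}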
>>> 
>>> When I did production testing on a 1Gbyte server that saw a peak
>>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>> longer relevant, I recall someone recently trying to run FreeBSD
>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>> 
> The only negative effect I can think of w.r.t. having the nfsd
> threads doing it would be a (I believe negligible) increase in RPC
> response times (the time the nfsd thread spends trimming the
> cache).
> As noted, I think this time would be negligible compared to disk
> I/O
> and network transit times in the total RPC response time?
 
 With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
 network connectivity, spinning on a contended mutex takes a
 significant amount of CPU time. (For the current design of the NFS
 server, it may actually be a win to turn off adaptive mutexes -- I
 should give that a try once I'm able to do more testing.)
 
>>> Have fun with it. Let me know when you have what you think is a good
>>> patch.
>>> 
>>> rick
>>> 
 -GAWollman
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to
 "freebsd-hackers-unsubscr...@freebsd.org"
>>> ___
>>> freebsd...@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>> To unsubscribe, send any mail to
>>> "freebsd-fs-unsubscr...@freebsd.org"
>> 
>> My quest for IOPS over NFS continues :)
>> So far I'm not able to achieve more than about 3000 8K read requests
>> per second over NFS, while the server delivers far more locally.
>> And this is all from a file that is completely in the ARC cache, so
>> no disk I/O is involved.
>> 
> Just out of curiosity, why do you use 8K reads i

Re: SMP Version of tar

2012-10-10 Thread Kurt Lidl
On Tue, Oct 09, 2012 at 09:54:03PM -0700, Tim Kientzle wrote:
> 
> On Oct 8, 2012, at 3:21 AM, Wojciech Puchar wrote:
> 
> >> Not necessarily.  If I understand correctly what Tim means, he's talking
> >> about an in-memory compression of several blocks by several separate
> >> threads, and then - after all the threads have compressed their
> > 
> > > but the gzip format is a single stream. The dictionary IMHO is not
> > > reset every X kilobytes.
> > > 
> > > Parallel gzip is possible, but not with the same data format.
> 
> Yes, it is.
> 
> The following creates a compressed file that
> is completely compatible with the standard
> gzip/gunzip tools:
> 
>* Break file into blocks
>* Compress each block into a gzip file (with gzip header and trailer 
> information)
>* Concatenate the result.
> 
> This can be correctly decoded by gunzip.
> 
> In theory, you get slightly worse compression.  In practice, if your blocks 
> are reasonably large (a megabyte or so each), the difference is negligible.

I am not sure, but I think this conversation might involve a slight
misunderstanding due to imprecisely specified language, even though the
technical points are in agreement.

Tim is correct in that the gzip data stream allows for concatenation of
compressed blocks of data, so you might break the input stream into
a bunch of blocks [A, B, C, etc], compress them, and then append the
results together as [A.gz, B.gz, C.gz, etc]; when uncompressed, you
will get the original input stream.

I think that Wojciech's point is that the compressed data stream for
the single data stream is different from the compressed data stream
of [A.gz, B.gz, C.gz, etc].  Both will decompress to the same thing,
but the intermediate compressed representation will be different.
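
To make the concatenation property concrete, here is a small userland
sketch using zlib (the file name and block contents are made up): each
block is written as an independent gzip member appended to the same file,
and gzread(), like gunzip, walks across the member boundaries and returns
the original data:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int
main(void)
{
    const char *blocks[] = { "first block of data\n", "second block of data\n" };
    char buf[256];
    gzFile gz;
    int i, n;

    /* Compress each block as its own gzip member, appended to one file. */
    for (i = 0; i < 2; i++) {
        gz = gzopen("parallel.gz", i == 0 ? "wb" : "ab");
        gzwrite(gz, blocks[i], (unsigned)strlen(blocks[i]));
        gzclose(gz);
    }

    /* Reading back crosses the member boundaries transparently. */
    gz = gzopen("parallel.gz", "rb");
    while ((n = gzread(gz, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    gzclose(gz);
    return (0);
}

Running plain gunzip on the resulting file likewise yields the two blocks
joined back together, which is the compatibility property in question.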

-Kurt
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: NFS server bottlenecks

2012-10-10 Thread Garrett Wollman
Rick Macklem wrote:

> And, although this experiment seems useful for testing patches that try
> and reduce DRC CPU overheads, most "real" NFS servers will be doing disk
> I/O.

We don't always have control over what the user does.  I think the
worst-case for my users involves a third-party program (that they're
not willing to modify) that does line-buffered writes in append mode.
This uses nearly all of the CPU on per-RPC overhead (each write is
three RPCs: GETATTR, WRITE, COMMIT).
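
For reference, the client behaviour described above can be reproduced with
something as simple as the sketch below (hypothetical path, not the real
application): every newline flushes a tiny write, and per the description
above each such write turns into several RPCs on an NFS mount.

#include <stdio.h>

int
main(void)
{
    FILE *fp = fopen("/mnt/nfs/app.log", "a");  /* append mode on an NFS mount */
    char line[256];

    if (fp == NULL)
        return (1);
    setvbuf(fp, NULL, _IOLBF, BUFSIZ);          /* line buffered: flush per newline */
    while (fgets(line, sizeof(line), stdin) != NULL)
        fputs(line, fp);                        /* each line becomes a small write */
    fclose(fp);
    return (0);
}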

-GAWollman

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: NFS server bottlenecks

2012-10-10 Thread Rick Macklem
Garrett Wollman wrote:
> Rick Macklem wrote:
> 
> > And, although this experiment seems useful for testing patches that
> > try
> > and reduce DRC CPU overheads, most "real" NFS servers will be doing
> > disk
> > I/O.
> 
> We don't always have control over what the user does. I think the
> worst-case for my users involves a third-party program (that they're
> not willing to modify) that does line-buffered writes in append mode.
> This uses nearly all of the CPU on per-RPC overhead (each write is
> three RPCs: GETATTR, WRITE, COMMIT).
> 
Yes. My comment was simply meant to imply that his testing isn't a
realistic load for most NFS servers. It was not meant to imply that
reducing the CPU overhead/lock contention of the DRC is a useless
exercise.

rick

> -GAWollman
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"