Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-06-01 Thread Iwan Aucamp

On 06/01/2012 02:33 PM, Jeff Bacon wrote:

>> I'd be interested in the results of such tests. You can change the primarycache
>> parameter on the fly, so you could test it in less time than it
>> takes for me to type this email :-)
>>   -- Richard
>
> Tried that. Performance headed south like a cat with its tail on fire. We
> didn't bother quantifying, it was just that hideous.
>
> (You know, us northern-hemisphere people always use "south" as a "down"
> direction. Is it different for people in the southern hemisphere? :) )
>
> There's just too many _other_ little things running around a normal system for
> which NOT having primarycache is just too painful to contemplate (even with
> L2ARC) that, while I can envisage situations where one might want to do that,
> they're very very few and far between.


Thanks for the valuable feedback Jeff, though I think you might
misunderstand - the idea is to make a ZFS filesystem just for the files
being mmap()ed by mongo, i.e. to disable ARC data caching only where
double caching is involved (the mmap()ed files), leaving the rest of the
system with the ARC and taking the ARC out of the picture for MongoDB only.
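
For example, a minimal sketch of such a dedicated dataset (pool name, dataset
name and mount point are placeholders; the property can also be flipped back
at any time):

    # Dedicated dataset for the mmap()ed MongoDB data files
    zfs create -o mountpoint=/mongo/data tank/mongo-data

    # Keep only metadata in the ARC for this dataset; the file data is then
    # cached once, in the VM page cache, via the mmap() mappings
    zfs set primarycache=metadata tank/mongo-data
    zfs get primarycache tank/mongo-data

mongod would then be pointed at /mongo/data (e.g. via its dbpath setting),
while every other dataset keeps the default primarycache=all.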





Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-06-01 Thread Jeff Bacon
> Anybody who has worked on a SPARC system for the past 15 years is well
> aware of NUMAness. We've been living in a NUMA world for a very long time,
> a world where the processors were slow and far memory latency is much, much
> worse than we see in the x86 world.
> 
> I look forward to seeing the results of your analysis and
> experiments.
>  -- Richard

like, um, seconded. Please.

I'm very curious to learn of a "VM2" effort. (Sadly, I spend more time nowadays
with my nose stuck into Cisco kit than into Solaris - well, not sadly, they're
both interesting - but I'm out of touch with much of what's going on in the
Solaris world anymore.) It makes sense though, and perhaps it's well overdue. The
basic notions of the VM subsystem haven't changed in what, 15 years?
Ain't-broke-don't-fix, sure, but ...

-bacon


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-06-01 Thread Jeff Bacon
> I'd be interested in the results of such tests. You can change the 
> primarycache
> parameter on the fly, so you could test it in less time than it
> takes for me to type this email :-)
>  -- Richard

Tried that. Performance headed south like a cat with its tail on fire. We 
didn't bother quantifying, it was just that hideous.

(You know, us northern-hemisphere people always use "south" as a "down" 
direction. Is it different for people in the southern hemisphere? :) )

There are just too many _other_ little things running around a normal system for
which NOT having primarycache is just too painful to contemplate (even with
L2ARC). While I can envisage situations where one might want to do that, they're
very, very few and far between.

-bacon


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-06-01 Thread Jeff Bacon
> I'm getting sub-optimal performance with an mmap based database
> (mongodb) which is running on zfs of Solaris 10u9.
> 
> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 *
> 4GB)
> ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
> 
>   - a few mongodb instances are running with with moderate IO and total
> rss of 50 GB
>   - a service which logs quite excessively (5GB every 20 mins) is also
> running (max 2GB ram use) - log files are compressed after some
> time to bzip2.
> 
> Database performance is quite horrid though - it seems that zfs does not
> know how to manage allocation between page cache and arc cache - and it
> seems arc cache wins most of the time.

Or to be more accurate, there is no coordination that I am aware of between the 
VM page cache and the ARC. Which, for all the glories of ZFS, strikes me as a 
*doh*face-in-palm* how-did-we-miss-this sorta thing. One of these days I need 
to ask Jeff and Bill what they were thinking. 

We went through this 9 months ago - we wrote MongoDB, which attempted to mmap() 
whole database files for the purpose of skimming back and forth through them 
quickly (think column-oriented database). Performance, um, sucked. 

There is a practical limit to the amount of RAM you can shove into a machine -
and said RAM gets slower as you have to go to quad-rank DIMMs, which Nehalem
can't run at full speed - for the sort of box you speak of, the top end at
1333MHz is 96GB, last I checked. (We're at 192GB in most cases.) So while yes,
copying the data around between the VM page cache and the ARC is doable, in
quantities large enough to invariably blow the CPU's L3 cache it may not be the
most practical answer.

It didn't help, of course, that
a) said DB was implemented in Java - _please_ don't ask - which is hardly a
poster child for implementing any form of mmap(), not to mention that it spins
up a ton of threads;
b) said machine _started_ with 72 2TB Constellations and a pack of Cheetahs
arranged in 7 pools, resulting in ~700 additional kernel threads roaming
around, all of which got woken up on any heavy disk access (yes, they could
have all been in one pool - and yes, there is a specific reason for not doing so).

but and still. 

We managed to break ZFS as a result. There are a couple of cases filed. One is 
semi-patched, the other we're told simply can't be fixed in Solaris 10. 
Fortunately we understand the conditions that create the breakage, and work 
around it by Just Not Doing That(tm). In your configuration, I can almost 
guarantee you will not run into them. 


> 
> I'm thinking of doing the following:
>   - relocating mmaped (mongo) data to a zfs filesystem with only
> metadata cache
>   - reducing zfs arc cache to 16 GB
> 
> Is there any other recommendations - and is above likely to improve
> performance.

Well... we ended up
(a) rewriting MongoDB to use in-process "buffer workspaces" and read()/write()
to fill/dump the buffers to disk (essentially, giving up on mmap());
(b) moving most of the workload to CentOS and using the Solaris boxes as big,
fast NFSv3 fileservers (NFSv4 didn't work out so well for us) over 10G - for
most workloads it runs 5-8% faster on CentOS than on Solaris, and we're
primarily a CentOS shop anyway, so it was just easier for everyone to deal with
- but that has little to do with the mmap() difficulties.
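
If it helps, one hedged way to sanity-check a change like (a) from the outside
is to watch the process's system calls while it works (the process name below
is a placeholder):

    # After the rewrite you would expect pread()/pwrite() (or read()/write())
    # traffic against the data files, and no fresh mmap()s of them
    truss -t mmap,read,write,pread,pwrite -p $(pgrep -o -f mydb)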

Given what I know of the Solaris VM, VFS and of ZFS as implemented - admittedly 
incomplete, and my VM knowledge is based mostly on SVR4 - it would seem to me 
that it is going to take some Really Creative Thinking to work around the 
mmap() problem - a tweak or two ain't gonna cut it. 

-bacon


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-30 Thread Bob Friesenhahn

On Tue, 29 May 2012, Iwan Aucamp wrote:

> - Is there a parameter similar to /proc/sys/vm/swappiness that can control
> how long unused pages in page cache stay in physical ram if there is no
> shortage of physical ram ? And if not how long will unused pages stay in page
> cache stay in physical ram given there is no shortage of physical ram ?


Absent memory pressure, pages that are no longer referenced will stay in
memory forever.  They can then be re-referenced without further disk I/O.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-29 Thread Iwan Aucamp

On 05/29/2012 03:29 AM, Daniel Carosone wrote:
> For the mmap case: does the ARC keep a separate copy, or does the vm
> system map the same page into the process's address space? If a
> separate copy is made, that seems like a potential source of many
> kinds of problems - if it's the same page then the whole premise is
> essentially moot and there's no "double caching".


As far as I understand it, for the mmap case the page cache is distinct from
the ARC (i.e. the normal, simplified flow for reading from disk with mmap is
DSK -> ARC -> page cache), and only the page cache gets mapped into the
process's address space - which is what results in the double caching.
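
One rough way to see both caches at once, if you want to confirm the double
caching on a live box (run as root; the exact ::memstat categories vary by
Solaris/illumos release):

    # Current ARC size in bytes
    kstat -p zfs:0:arcstats:size

    # Kernel's overall memory breakdown, including the page cache
    echo ::memstat | mdb -k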


I have two other general questions regarding the page cache with ZFS + Solaris:
 - Does anything else except mmap still use the page cache?
 - Is there a parameter similar to /proc/sys/vm/swappiness that can control
how long unused pages in the page cache stay in physical RAM if there is no
shortage of physical RAM? And if not, how long will unused pages in the page
cache stay in physical RAM given there is no shortage of physical RAM?


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Daniel Carosone
On Mon, May 28, 2012 at 01:34:18PM -0700, Richard Elling wrote:
> I'd be interested in the results of such tests. 

Me too, especially for databases like postgresql where there's a
complementary cache-size tunable within the db that often needs to be
turned up, since they implicitly rely on some filesystem caching as an L2.

That's where this gets tricky: L2ARC has the opportunity to make a big
difference, where the entire db won't all fit in memory (regardless of
which subsystem has jurisdiction over that memory).  If you exclude
data from ARC, you can't spill it to L2ARC.
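
(As a small illustration of that interplay: the L2ARC is fed from buffers on
their way out of the ARC, so a dataset with primarycache=metadata keeps its
file data out of the ARC and therefore out of the L2ARC as well, whatever
secondarycache is set to. The dataset name below is a placeholder.)

    # primarycache governs the ARC, secondarycache governs L2ARC
    # eligibility - but only for data that passed through the ARC
    zfs get primarycache,secondarycache tank/db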

For the mmap case: does the ARC keep a separate copy, or does the vm
system map the same page into the process's address space?  If a
separate copy is made, that seems like a potential source of many
kinds of problems - if it's the same page then the whole premise is
essentially moot and there's no "double caching".

--
Dan.



Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Jim Klimov

2012-05-29 0:34, Richard Elling wrote:

> I'd be interested in the results of such tests. You can change the
> primarycache parameter on the fly, so you could test it in less time
> than it takes for me to type this email :-)



I believe it would also take some time for memory distribution
to settle, expiring ARC data pages and actually claiming the
RAM for the application... Right? ;)
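
(One way to watch that settling happen, rather than judging the result
immediately - sample the ARC size and its current target every 10 seconds:)

    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c 10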

//Jim


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Richard Elling
[Apologies to the list - this has expanded past ZFS; if someone complains, we can
move the thread to another illumos dev list]

On May 28, 2012, at 2:18 PM, Lionel Cons wrote:

> On 28 May 2012 22:10, Richard Elling  wrote:
>> The only recommendation which will lead to results is to use a
>> different OS or filesystem. Your choices are
>> - FreeBSD with ZFS
>> - Linux with BTRFS
>> - Solaris with QFS
>> - Solaris with UFS
>> - Solaris with NFSv4, use ZFS on independent fileserver machines
>> 
>> There's a rather mythical rewrite of the Solaris virtual memory
>> subsystem called VM2 in progress but it will still take a long time
>> until this will become available for customers and there are no real
>> data yet whether this will help with mmap performance. It won't be
>> available for Opensolaris successors like Illumos available either
>> (likely never, at least the Illumos leadership doesn't see the need
>> for this and instead recommends to rewrite the applications to not use
>> mmap).
>> 
>> 
>> This is a mischaracterization of the statements given. The illumos team
>> says they will not implement Oracle's VM2 for valid, legal reasons.
>> That does not mean that mmap performance improvements for ZFS
>> cannot be implemented via other methods.
> 
> I'd like to hear what the other methods should be. The lack of mmap
> performance is only a symptom of a more severe disease. Just doing
> piecework and alter the VFS API to integrate ZFS/ARC/VM with each
> other doesn't fix the underlying problems.
> 
> I've assigned two of my staff, one familiar with the FreeBSD VM and
> one familiar with the Linux VM, to look at the current VM subsystem
> and their preliminary reports point to disaster. If Illumos does not
> initiate a VM rewrite project of it's own which will make the VM aware
> of NUMA, power management and other issues then I predict nothing less
> than the downfall of Illumos within a couple of years because the
> performance impact is dramatic and makes the Illumos kernel no longer
> competitive.
> Despite these findings, of which Sun was aware for a long time, and
> the number of ex-Sun employees working on Illumos, I miss the
> commitment to launch such a project. That's why I said "likely never",
> unless of course someone slams Garrett's head with sufficient force on
> a wooden table to make him see the reality.
> 
> The reality is:
> - The modern x86 server platforms are now all NUMA or NUMA-like. Lack
> of NUMA support leads to bad performance

SPARC has been NUMA since 1997 and Solaris changed the scheduler
long ago.

> - They all use some kind of serialized link between CPU nodes, let it
> be Hypertransport or Quickpath, with power management. If power
> management is active and has reduced the number of active links
> between nodes and the OS doesn't manage this correctly you'll get bad
> performance. Illumo's VM isn't even remotely aware of this fact
> - Based on simulator testing we see that in a simulated environment
> with 8 sockets almost 40% of kernel memory accesses are _REMOTE_
> accesses, i.e. it's not local to the node accessing it
> That are all preliminary results, I expect that the remainder of the
> analysis will take another 4-5 weeks until we present the findings to
> the Illumos community. But I can say already it will be a faceslap for
> those who think that Illumos doesn't need a better VM system.

Nobody said illumos doesn't need a better VM system. The statement was that 
illumos is not going to reverse-engineer Oracle's VM2.

>> The primary concern for mmap files is that the RAM footprint is doubled.
> 
> It's not only that RAM is doubled, the data are copied between both
> ARC and page cache multiple times. You can say memory and the in
> memory copy operation are cheap, but this and the lack of NUMA
> awareness is a real performance killer.

Anybody who has worked on a SPARC system for the past 15 years is well
aware of NUMAness. We've been living in a NUMA world for a very long time -
a world where the processors were slow and far-memory latency was much, much
worse than what we see in the x86 world.

I look forward to seeing the results of your analysis and experiments.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422









Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Lionel Cons
On 28 May 2012 22:10, Richard Elling  wrote:
> The only recommendation which will lead to results is to use a
> different OS or filesystem. Your choices are
> - FreeBSD with ZFS
> - Linux with BTRFS
> - Solaris with QFS
> - Solaris with UFS
> - Solaris with NFSv4, use ZFS on independent fileserver machines
>
> There's a rather mythical rewrite of the Solaris virtual memory
> subsystem called VM2 in progress but it will still take a long time
> until this will become available for customers and there are no real
> data yet whether this will help with mmap performance. It won't be
> available for Opensolaris successors like Illumos available either
> (likely never, at least the Illumos leadership doesn't see the need
> for this and instead recommends to rewrite the applications to not use
> mmap).
>
>
> This is a mischaracterization of the statements given. The illumos team
> says they will not implement Oracle's VM2 for valid, legal reasons.
> That does not mean that mmap performance improvements for ZFS
> cannot be implemented via other methods.

I'd like to hear what those other methods should be. The lack of mmap
performance is only a symptom of a more severe disease. Just doing
piecework and altering the VFS API to integrate ZFS/ARC/VM with each
other doesn't fix the underlying problems.

I've assigned two of my staff - one familiar with the FreeBSD VM and
one familiar with the Linux VM - to look at the current VM subsystem,
and their preliminary reports point to disaster. If Illumos does not
initiate a VM rewrite project of its own, one which makes the VM aware
of NUMA, power management and other issues, then I predict nothing less
than the downfall of Illumos within a couple of years, because the
performance impact is dramatic and makes the Illumos kernel no longer
competitive.
Despite these findings, of which Sun was aware for a long time, and
the number of ex-Sun employees working on Illumos, I don't see the
commitment to launch such a project. That's why I said "likely never" -
unless of course someone slams Garrett's head with sufficient force on
a wooden table to make him see the reality.

The reality is:
- The modern x86 server platforms are now all NUMA or NUMA-like. Lack
of NUMA support leads to bad performance.
- They all use some kind of serialized link between CPU nodes, be it
HyperTransport or QuickPath, with power management. If power
management is active and has reduced the number of active links
between nodes, and the OS doesn't manage this correctly, you'll get bad
performance. Illumos's VM isn't even remotely aware of this.
- Based on simulator testing we see that, in a simulated environment
with 8 sockets, almost 40% of kernel memory accesses are _REMOTE_
accesses, i.e. not local to the node doing the accessing.
These are all preliminary results; I expect the remainder of the
analysis will take another 4-5 weeks before we present the findings to
the Illumos community. But I can already say it will be a slap in the
face for those who think that Illumos doesn't need a better VM system.
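
(For reference, Solaris and illumos do expose their NUMA view through locality
groups; the tools below only report placement - they don't change VM policy -
so they are a starting point for this kind of analysis rather than a fix:)

    # lgroup (NUMA node) hierarchy, with the CPUs and memory in each group
    lgrpinfo

    # Home lgroup of a process's threads - here, the current shell
    plgrp $$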

> The primary concern for mmap files is that the RAM footprint is doubled.

It's not only that the RAM footprint is doubled - the data is copied between
the ARC and the page cache multiple times. You can say that memory and
in-memory copy operations are cheap, but this, together with the lack of NUMA
awareness, is a real performance killer.

Lionel


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Richard Elling
question below...

On May 28, 2012, at 1:25 PM, Iwan Aucamp wrote:

> On 05/28/2012 10:12 PM, Andrew Gabriel wrote:
>>  On 05/28/12 20:06, Iwan Aucamp wrote:
>>> I'm thinking of doing the following:
>>>  - relocating mmaped (mongo) data to a zfs filesystem with only
>>> metadata cache
>>>  - reducing zfs arc cache to 16 GB
>>> 
>>> Is there any other recommendations - and is above likely to improve
>>> performance.
>> 1. Upgrade to S10 Update 10 - this has various performance improvements,
>> in particular related to database type loads (but I don't know anything
>> about mongodb).
>> 
>> 2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
>> I assume the RSS include's whatever caching the database does. In
>> theory, a database should be able to work out what's worth caching
>> better than any filesystem can guess from underneath it, so you want to
>> configure more memory in the DB's cache than in the ARC. (The default
>> ARC tuning is unsuitable for a database server.)
>> 
>> 3. If the database has some concept of blocksize or recordsize that it
>> uses to perform i/o, make sure the filesystems it is using configured to
>> be the same recordsize. The ZFS default recordsize (128kB) is usually
>> much bigger than database blocksizes. This is probably going to have
>> less impact with an mmaped database than a read(2)/write(2) database,
>> where it may prove better to match the filesystem's record size to the
>> system's page size (4kB, unless it's using some type of large pages). I
>> haven't tried playing with recordsize for memory mapped i/o, so I'm
>> speculating here.
>> 
>> Blocksize or recordsize may apply to the log file writer too, and it may
>> be that this needs a different recordsize and therefore has to be in a
>> different filesystem. If it uses write(2) or some variant rather than
>> mmap(2) and doesn't document this in detail, Dtrace is your friend.
>> 
>> 4. Keep plenty of free space in the zpool if you want good database
>> performance. If you're more than 60% full (S10U9) or 80% full (S10U10),
>> that could be a factor.
>> 
>> Anyway, there are a few things to think about.
> 
> Thanks for the Feedback, I cannot really do 1, but will look into points 3 
> and 4 - in addition to 2 - which is what I desire to achieve with my second 
> point - but I would still like to know if it is recommended to only do 
> metadata caching for mmaped files (mongodb data files) - the way I see it 
> this should get rid of the double caching which is being done for mmaped 
> files.

I'd be interested in the results of such tests. You can change the primarycache
parameter on the fly, so you could test it in less time than it takes for me to
type this email :-)
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422









[zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Iwan Aucamp

On 05/28/2012 10:12 PM, Andrew Gabriel wrote:

> On 05/28/12 20:06, Iwan Aucamp wrote:
>> I'm thinking of doing the following:
>>  - relocating mmaped (mongo) data to a zfs filesystem with only
>> metadata cache
>>  - reducing zfs arc cache to 16 GB
>>
>> Is there any other recommendations - and is above likely to improve
>> performance.
>
> 1. Upgrade to S10 Update 10 - this has various performance improvements,
> in particular related to database type loads (but I don't know anything
> about mongodb).
>
> 2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
> I assume the RSS include's whatever caching the database does. In
> theory, a database should be able to work out what's worth caching
> better than any filesystem can guess from underneath it, so you want to
> configure more memory in the DB's cache than in the ARC. (The default
> ARC tuning is unsuitable for a database server.)
>
> 3. If the database has some concept of blocksize or recordsize that it
> uses to perform i/o, make sure the filesystems it is using configured to
> be the same recordsize. The ZFS default recordsize (128kB) is usually
> much bigger than database blocksizes. This is probably going to have
> less impact with an mmaped database than a read(2)/write(2) database,
> where it may prove better to match the filesystem's record size to the
> system's page size (4kB, unless it's using some type of large pages). I
> haven't tried playing with recordsize for memory mapped i/o, so I'm
> speculating here.
>
> Blocksize or recordsize may apply to the log file writer too, and it may
> be that this needs a different recordsize and therefore has to be in a
> different filesystem. If it uses write(2) or some variant rather than
> mmap(2) and doesn't document this in detail, Dtrace is your friend.
>
> 4. Keep plenty of free space in the zpool if you want good database
> performance. If you're more than 60% full (S10U9) or 80% full (S10U10),
> that could be a factor.
>
> Anyway, there are a few things to think about.


Thanks for the feedback. I cannot really do 1, but will look into points
3 and 4, in addition to 2 - which is what I intend to achieve with my
second point. I would still like to know, though, whether it is recommended
to do only metadata caching for mmap()ed files (the mongodb data files) -
the way I see it, this should get rid of the double caching which is
currently being done for mmap()ed files.





Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Richard Elling
On May 28, 2012, at 12:46 PM, Lionel Cons wrote:

> On Mon, May 28, 2012 at 9:06 PM, Iwan Aucamp  wrote:
>> I'm getting sub-optimal performance with an mmap based database (mongodb)
>> which is running on zfs of Solaris 10u9.
>> 
>> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) ram
>> (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
>> 
>> - a few mongodb instances are running with with moderate IO and total rss
>> of 50 GB
>> - a service which logs quite excessively (5GB every 20 mins) is also
>> running (max 2GB ram use) - log files are compressed after some time to
>> bzip2.
>> 
>> Database performance is quite horrid though - it seems that zfs does not
>> know how to manage allocation between page cache and arc cache - and it
>> seems arc cache wins most of the time.
>> 
>> I'm thinking of doing the following:
>> - relocating mmaped (mongo) data to a zfs filesystem with only metadata
>> cache
>> - reducing zfs arc cache to 16 GB
>> 
>> Is there any other recommendations - and is above likely to improve
>> performance.
> 
> The only recommendation which will lead to results is to use a
> different OS or filesystem. Your choices are
> - FreeBSD with ZFS
> - Linux with BTRFS
> - Solaris with QFS
> - Solaris with UFS
> - Solaris with NFSv4, use ZFS on independent fileserver machines
> 
> There's a rather mythical rewrite of the Solaris virtual memory
> subsystem called VM2 in progress but it will still take a long time
> until this will become available for customers and there are no real
> data yet whether this will help with mmap performance. It won't be
> available for Opensolaris successors like Illumos available either
> (likely never, at least the Illumos leadership doesn't see the need
> for this and instead recommends to rewrite the applications to not use
> mmap).

This is a mischaracterization of the statements given. The illumos team
says they will not implement Oracle's VM2 for valid, legal reasons. 
That does not mean that mmap performance improvements for ZFS 
cannot be implemented via other methods.

The primary concern for mmap files is that the RAM footprint is doubled.
If you do not manage this via limits, there can be a fight between the 
page cache and ARC over a constrained RAM resource.
 -- richard
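
A common way to impose such a limit on Solaris 10 is to cap the ARC via
/etc/system - a sketch only; 0x400000000 bytes is 16 GiB, matching the figure
proposed earlier in the thread, and the change requires a reboot:

    # Cap the ZFS ARC so the page cache has headroom for the mmap()ed files
    echo 'set zfs:zfs_arc_max = 0x400000000' >> /etc/system

    # After the reboot, confirm the effective ceiling
    kstat -p zfs:0:arcstats:c_max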

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422









Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Lionel Cons
On Mon, May 28, 2012 at 9:06 PM, Iwan Aucamp  wrote:
> I'm getting sub-optimal performance with an mmap based database (mongodb)
> which is running on zfs of Solaris 10u9.
>
> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) ram
> (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
>
>  - a few mongodb instances are running with with moderate IO and total rss
> of 50 GB
>  - a service which logs quite excessively (5GB every 20 mins) is also
> running (max 2GB ram use) - log files are compressed after some time to
> bzip2.
>
> Database performance is quite horrid though - it seems that zfs does not
> know how to manage allocation between page cache and arc cache - and it
> seems arc cache wins most of the time.
>
> I'm thinking of doing the following:
>  - relocating mmaped (mongo) data to a zfs filesystem with only metadata
> cache
>  - reducing zfs arc cache to 16 GB
>
> Is there any other recommendations - and is above likely to improve
> performance.

The only recommendation which will lead to results is to use a
different OS or filesystem. Your choices are
- FreeBSD with ZFS
- Linux with BTRFS
- Solaris with QFS
- Solaris with UFS
- Solaris with NFSv4, use ZFS on independent fileserver machines

There's a rather mythical rewrite of the Solaris virtual memory
subsystem called VM2 in progress, but it will still take a long time
until it becomes available to customers, and there is no real data yet
on whether it will help with mmap performance. It won't be available
for OpenSolaris successors like Illumos either (likely never; at least
the Illumos leadership doesn't see the need for it and instead
recommends rewriting applications to not use mmap).

Lionel


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Andrew Gabriel

On 05/28/12 20:06, Iwan Aucamp wrote:
> I'm getting sub-optimal performance with an mmap based database
> (mongodb) which is running on zfs of Solaris 10u9.
>
> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB)
> ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
>
>  - a few mongodb instances are running with with moderate IO and total
> rss of 50 GB
>  - a service which logs quite excessively (5GB every 20 mins) is also
> running (max 2GB ram use) - log files are compressed after some time
> to bzip2.
>
> Database performance is quite horrid though - it seems that zfs does
> not know how to manage allocation between page cache and arc cache -
> and it seems arc cache wins most of the time.
>
> I'm thinking of doing the following:
>  - relocating mmaped (mongo) data to a zfs filesystem with only
> metadata cache
>  - reducing zfs arc cache to 16 GB
>
> Is there any other recommendations - and is above likely to improve
> performance.


1. Upgrade to S10 Update 10 - this has various performance improvements, 
in particular related to database type loads (but I don't know anything 
about mongodb).


2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
I assume the RSS includes whatever caching the database does. In 
theory, a database should be able to work out what's worth caching 
better than any filesystem can guess from underneath it, so you want to 
configure more memory in the DB's cache than in the ARC. (The default 
ARC tuning is unsuitable for a database server.)


3. If the database has some concept of blocksize or recordsize that it 
uses to perform I/O, make sure the filesystems it uses are configured with 
the same recordsize. The ZFS default recordsize (128kB) is usually 
much bigger than database blocksizes. This is probably going to have 
less impact with an mmaped database than a read(2)/write(2) database, 
where it may prove better to match the filesystem's record size to the 
system's page size (4kB, unless it's using some type of large pages). I 
haven't tried playing with recordsize for memory mapped i/o, so I'm 
speculating here.


Blocksize or recordsize may apply to the log file writer too, and it may 
be that this needs a different recordsize and therefore has to be in a 
different filesystem. If it uses write(2) or some variant rather than 
mmap(2) and doesn't document this in detail, DTrace is your friend.


4. Keep plenty of free space in the zpool if you want good database 
performance. If you're more than 60% full (S10U9) or 80% full (S10U10), 
that could be a factor.


Anyway, there are a few things to think about.
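
To make points 3 and 4 concrete, a rough sketch (the dataset name is a
placeholder, and the 4kB record size is the speculation above, so it is worth
benchmarking before committing to it):

    # Point 3: recordsize only affects files written after the change,
    # so set it before loading the data (or copy the files afterwards)
    zfs set recordsize=4k tank/mongo-data
    zfs get recordsize tank/mongo-data

    # Point 4: keep an eye on how full the pool is getting
    zpool list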

--
Andrew


[zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Iwan Aucamp
I'm getting sub-optimal performance with an mmap-based database 
(mongodb) which is running on ZFS on Solaris 10u9.


System is a Sun Fire X4270 M2 with 2x X5680 and 72GB (6 * 8GB + 6 * 4GB) 
RAM (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks.


 - a few mongodb instances are running with moderate IO and a total 
RSS of 50 GB
 - a service which logs quite excessively (5GB every 20 mins) is also 
running (max 2GB RAM use) - log files are compressed to bzip2 after 
some time.


Database performance is quite horrid though - it seems that ZFS does not 
know how to manage the allocation between the page cache and the ARC - 
and it seems the ARC wins most of the time.


I'm thinking of doing the following:
 - relocating the mmap()ed (mongo) data to a ZFS filesystem with only 
metadata caching
 - reducing the ZFS ARC to 16 GB

Are there any other recommendations - and is the above likely to improve 
performance?


--
Iwan Aucamp