Just a note, I see zfs-0.6.3 has just been announced:
https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-announce/Lj7xHtRVOM4
I also see it is upgraded in the zfs/lustre repo.
The changelog notes the default arc_meta_limit has changed to 3/4 of
arc_c_max, along with a variety of other fixes, many focused on
performance.
So Anjana, this is probably worth testing, especially if
you're considering drastic measures.
We upgraded our MDS, so this file create issue is harder for us to
test now (we literally started testing writes this afternoon, and
it's not degraded yet, so far at 20 million writes). Since your
problem still happens fairly quickly, I'm sure any information you
have will be very helpful to add to LU-2476. And if it helps, it may
save you some pain.
We will likely install the upgrade but may not be able to
test millions of writes any time soon, as the filesystem
is needed for production.
Regards,
Scott
On Thu, 12 Jun 2014 16:41:14 +0000
"Dilger, Andreas" <andreas.dil...@intel.com> wrote:
It looks like you've already increased arc_meta_limit
beyond the default, which is c_max / 4. That was critical
to performance in our testing.
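A quick way to tell whether the ARC metadata limit is what's biting is to compare arc_meta_used against arc_meta_limit (and c_max) in the SPL kstats. A minimal sketch, assuming a ZFS-on-Linux host with the usual /proc/spl/kstat/zfs/arcstats layout:

#!/usr/bin/env python
# Sketch: compare ARC metadata usage to its limit on a ZFS-on-Linux host.
# Assumes /proc/spl/kstat/zfs/arcstats exists and uses the usual
# "name  type  data" layout with two header lines.

def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:      # skip the kstat header lines
            fields = line.split()
            if len(fields) == 3:
                stats[fields[0]] = int(fields[2])
    return stats

if __name__ == "__main__":
    s = read_arcstats()
    gib = 1024.0 ** 3
    print("c_max          : %6.1f GiB" % (s["c_max"] / gib))
    print("arc_meta_limit : %6.1f GiB" % (s["arc_meta_limit"] / gib))
    print("arc_meta_used  : %6.1f GiB" % (s["arc_meta_used"] / gib))
    if s["arc_meta_used"] >= 0.95 * s["arc_meta_limit"]:
        print("ARC metadata is pinned at its limit; raising "
              "zfs_arc_meta_limit may help create rates.")

If arc_meta_used sits pinned at arc_meta_limit while create rates fall off, that is the case where raising the zfs_arc_meta_limit module parameter (or moving to 0.6.3, where the default changed) should help.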
There is also a patch from Brian that should help
performance in your case:
http://review.whamcloud.com/10237
Cheers, Andreas
On Jun 11, 2014, at 12:53, "Scott Nolin"
<scott.no...@ssec.wisc.edu>
wrote:
We tried a few arc tunables as noted here:
https://jira.hpdd.intel.com/browse/LU-2476
However, I didn't find any clear benefit in the long
term. We were just trying a few things without a lot of
insight.
Scott
On 6/9/2014 12:37 PM, Anjana Kar wrote:
Thanks for all the input.
Before we move away from a zfs MDT, I was wondering if we can try
setting zfs tunables to test the performance. Basically, what value
can we use for arc_meta_limit on our system? Are there any other
settings that can be changed?
Generating small files on our current system, things started off at
500 files/sec, then declined to about 1/20th of that after 2.45
million files.
-Anjana
On 06/09/2014 10:27 AM, Scott Nolin wrote:
We ran some scrub performance tests, and even without tunables set
it wasn't too bad for our specific configuration. The main thing we
did was verify it made sense to scrub all OSTs simultaneously.
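In case it helps anyone repeat that test, it amounts to kicking off scrubs on every OST pool at once and waiting for them to finish. A minimal sketch, assuming one zpool per OST; the pool names are hypothetical:

#!/usr/bin/env python
# Sketch: start scrubs on several OST pools at once and poll their status.
# Pool names below are hypothetical; substitute your own.
import subprocess, time

POOLS = ["ost0pool", "ost1pool", "ost2pool"]   # assumption: one zpool per OST

def scrub_all(pools):
    for pool in pools:
        # "zpool scrub" starts the scrub and returns immediately
        subprocess.check_call(["zpool", "scrub", pool])

def still_scrubbing(pool):
    out = subprocess.check_output(["zpool", "status", pool])
    return b"scrub in progress" in out

if __name__ == "__main__":
    scrub_all(POOLS)
    while any(still_scrubbing(p) for p in POOLS):
        time.sleep(60)
    print("all scrubs complete")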
Anyway, indeed scrub and resilver aren't about defrag. Further, the
MDS performance issues aren't about fragmentation.
As a side note, it's probably ideal to stay below 80% usage for
ldiskfs too, or performance degrades due to fragmentation.
Sean, note I am dealing with specific issues for a very
create-intensive workload, and the MDS is the only place where we
may change. The data integrity features of ZFS make it very
attractive too. I fully expect things will improve with ZFS.
If you want a lot of certainty in your choices, you may want to
consult various vendors of Lustre systems.
Scott
On June 8, 2014 11:42:15 AM CDT, "Dilger, Andreas"
<andreas.dil...@intel.com>
wrote:
Scrub and resilver have nothing to do with defrag.
Scrub is scanning of all the data blocks in the pool
to verify their checksums and parity to detect silent
data corruption, and rewrite the bad blocks if necessary.
Resilver is reconstructing a failed disk onto a new
disk using parity or mirror copies of all the blocks on
the failed disk. This is similar to scrub.
Both scrub and resilver can be done online, though
resilver of course requires a spare disk to rebuild onto,
which may not be possible to add to a running system if
your hardware does not support it.
Neither of them "improves" the performance or layout of
data on disk. They do impact performance because they
cause a lot of random IO to the disks, though this impact
can be limited by tunables on the pool.
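For what it's worth, on ZFS on Linux those throttles show up as zfs kernel module parameters; here's a minimal sketch for checking the ones usually mentioned for scrub/resilver pacing (the parameter names assume a 0.6.x-era ZoL and may differ on your version):

#!/usr/bin/env python
# Sketch: print the scrub/resilver throttling knobs on a ZFS-on-Linux host.
# Parameter names are the commonly documented 0.6.x ones; verify against
# your version's documentation before tuning anything.
import os

PARAM_DIR = "/sys/module/zfs/parameters"
KNOBS = ["zfs_scrub_delay", "zfs_resilver_delay",
         "zfs_top_maxinflight", "zfs_resilver_min_time_ms"]

for knob in KNOBS:
    path = os.path.join(PARAM_DIR, knob)
    try:
        with open(path) as f:
            print("%-26s %s" % (knob, f.read().strip()))
    except IOError:
        print("%-26s (not present on this ZFS version)" % knob)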
Cheers, Andreas
On Jun 8, 2014, at 4:21, "Sean Brisbane"
<s.brisba...@physics.ox.ac.uk>
wrote:
Hi Scott,
We are considering running ZFS-backed Lustre, and the
factor-of-10ish performance hit you see worries me. I know ZFS can
splurge bits of files all over the place by design. The Oracle docs
do recommend scrubbing the volumes and keeping usage below 80% for
maintenance and performance reasons; I'm going to call it 'defrag',
but I'm sure someone who knows better will correct me as to why it
is not the same.
So, are these performance issues seen after scrubbing, and is it
possible to scrub online, i.e. is some reasonable level of
performance maintained while the scrub happens?
Resilvering is also recommended; not sure if that is for
performance reasons.
http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html
Sent from my HTC Desire C on Three
----- Reply message -----
From: "Scott Nolin"
<scott.no...@ssec.wisc.edu>
To: "Anjana Kar" <k...@psc.edu>,
"lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [Lustre-discuss] number of inodes in zfs MDT
Date: Fri, Jun 6, 2014 3:23 AM
Looking at some of our existing zfs filesystems, we have a couple
with zfs MDTs. One has 103M inodes and uses 152G of MDT space,
another 12M and 19G. I'd plan for less than that, I guess, as
Mr. Dilger suggests. What will work depends on your expected average
file size and number of files.
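As a rough rule of thumb, those two filesystems work out to roughly 1.5-2 KB of MDT space per inode, which is enough for back-of-the-envelope sizing. A small sketch; the 500M-file target and the "keep it no more than half full" factor are purely illustrative assumptions:

#!/usr/bin/env python
# Back-of-the-envelope MDT sizing from the two filesystems above.
GIB = 1024 ** 3

examples = [(103e6, 152 * GIB),   # 103M inodes using 152G of MDT space
            (12e6,   19 * GIB)]   #  12M inodes using  19G of MDT space

for inodes, used in examples:
    print("%.0fM inodes -> %.0f bytes per inode" % (inodes / 1e6, used / inodes))

# Hypothetical target: 500M files at ~2 KB per inode, with 2x headroom
# so the MDT stays around half full (given the fill-up slowdown below).
target_files = 500e6
per_inode = 2048.0
print("suggested MDT size: ~%.0f GiB" % (target_files * per_inode * 2 / GIB))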
We have run into some unpleasant surprises with zfs
for the MDT, I believe mostly documented in bug reports,
or at least hinted at.
A serious issue we have is performance of the zfs arc
cache over time. This is something we didn’t see in early
testing, but with enough use it grinds things to a crawl.
I believe this may be addressed in the newer version of
ZFS, which we’re hopefully awaiting.
Another thing we've seen, which is mysterious to me: it appears
that as the MDT begins to fill up, file create rates go down. We
don't really have a strong handle on this (not enough for a bug
report, I think), but we see this:
1.
The aforementioned 104M inode / 152GB MDT system has 4 SAS drives in
raid10. On initial testing, file creates ran at about 2500 to 3000
IOPS. Follow-up testing in its current state (about half full) shows
them at about 500 IOPS, but with a few iterations of mdtest those
IOPS quickly plummet to unbearable levels (like 30…).
2.
We took a snapshot of the filesystem and sent it to the backup MDS,
this time with the MDT built on 4 SAS drives in a raid0 - really not
for performance so much as "extra headroom", if that makes any
sense. Testing this, the IOPS started higher, at maybe 800 or 1000
(this is from memory, I don't have my data in front of me). That
initial faster speed could just be writing to 4 spindles, I suppose,
but surprisingly the performance also degraded at a slower rate. It
took much longer to get painfully slow. It still got there, but the
same number of writes on the smaller/slower MDT degraded the
performance more quickly. My guess is that has something to do with
the total space available. Who knows. I believe restarting Lustre
(and certainly rebooting) "resets the clock" on the file create
performance degradation.
For that problem we're just going to try adding 4 SSDs, but it's an
ugly problem. We are also once again hopeful the new ZFS version
addresses it.
And finally, we've got a real concern with snapshot backups of the
MDT, which my colleague posted about - the problem we see manifests
as an essentially read-only recovered file system, so it's a concern
but not quite terrifying.
All in all, for the next Lustre file system we bring up (in a couple
of weeks) we are very strongly considering going with ldiskfs for
the MDT this time.
Scott
From: Anjana Kar <k...@psc.edu>
Sent: Tuesday, June 3, 2014 7:38 PM
To: lustre-discuss@lists.lustre.org
Is there a way to set the number of inodes for a zfs MDT?
I've tried using --mkfsoptions="-N value" mentioned in the Lustre
2.0 manual, but it fails to accept it. We are mirroring 2 80GB SSDs
for the MDT, but the number of inodes is getting set to 7 million,
which is not enough for a 100TB filesystem.
Thanks in advance.
-Anjana Kar
Pittsburgh Supercomputing Center
k...@psc.edu
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss