We just finished testing ZFS 0.6.3 and Lustre 2.5.2 with a ZFS MDT and OSTs, but
still ran into the problem of running out of inodes on the MDT. The number
started at about 7.9M and grew to 8.9M, but not beyond that, with the MDT
on a mirrored zpool of two 80GB SSD drives. The filesystem size was 97TB,
with eight 13TB raidz2 OSTs on a shared MDS/OSS node, and a second OSS.
It took ~5000 seconds to run out of inodes in our empty-file test, but
during that time it averaged about 1650 creates/sec, which is the best we've seen.
I'm not sure why the inodes have been an issue, but we ran out of time to
pursue this further.
Instead we have moved to an ldiskfs MDT and ZFS OSTs, with the same Lustre/ZFS
versions, and have a lot more inodes available:
Filesystem             Inodes   IUsed    IFree IUse% Mounted on
x.x.x.x@o2ib:/iconfs 39049920 7455386 31594534   20% /iconfs
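For anyone sizing an ldiskfs MDT, the inode count can be set at format time. This is only a sketch with placeholder device, index, and values, not our exact command:

    # Sketch only: format an ldiskfs MDT with an explicit inode count.
    # The device, index, MGS NID, and the -N value are placeholders.
    mkfs.lustre --mdt --fsname=iconfs --index=0 \
        --mgsnode=x.x.x.x@o2ib \
        --mkfsoptions="-N 40000000" \
        /dev/md0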
Performance has reportedly been better, but one problem was that when
the OSS nodes went down before the OSTs could be taken offline (as would
happen during a power outage), the OSTs failed to mount after the reboot.
To get around that, we added a "zpool import -f" line after the message
"Unexpected return code from import of pool $pool" in the Lustre startup
script so the pools import, then ran the Lustre startup script to start the
OSTs. If there is a better way to handle this, please let me know.
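Paraphrased from memory, the edit amounts to something like this; the stock script's exact structure differs:

    # Paraphrase of the workaround in the Lustre init script; the
    # surrounding code in the stock script differs in detail.
    if ! zpool import "$pool"; then
        echo "Unexpected return code from import of pool $pool"
        # Added workaround: force-import pools that were not exported
        # cleanly (e.g. after a power outage took the OSS down hard).
        zpool import -f "$pool"
    fi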
Another problem we ran into is that our 1.8.9 clients could not write
into the new filesystem with Lustre 2.5.60, which came from
git.hpdd.intel.com/fs/lustre-release.git. Things worked after checking
out the b2_5 branch ("git checkout --track -b b2_5 origin/b2_5") and
rebuilding the kernel for ldiskfs. The OS on the Lustre servers is
CentOS 6.5, kernel 2.6.32-431.17.1.
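For the record, the steps were roughly the following; exact configure flags depend on the environment, so treat this as a sketch:

    # Roughly the steps we followed; configure flags depend on your
    # tree and environment.
    git clone git://git.hpdd.intel.com/fs/lustre-release.git
    cd lustre-release
    git checkout --track -b b2_5 origin/b2_5    # 2.5 maintenance branch
    sh autogen.sh
    ./configure --with-linux=/usr/src/kernels/2.6.32-431.17.1.el6.x86_64
    make rpms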
Thanks again for all the responses.
-Anjana
On 06/12/2014 09:43 PM, Scott Nolin wrote:
Just a note, I see zfs-0.6.3 has just been announced:
https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-announce/Lj7xHtRVOM4
I also see it is upgraded in the zfs/lustre repo.
The changelog notes the default arc_meta_limit has changed to 3/4 of
arc_c_max, along with a variety of other fixes, many focused on performance.
So, Anjana, this is probably worth testing, especially if you're
considering drastic measures.
We upgraded ZFS on our MDS, so this file create issue is harder for us to
test now (we literally started testing writes this afternoon, and it
hasn't degraded yet; so far at 20 million writes). Since your problem
still happens fairly quickly, I'm sure any information you have will be
very helpful to add to LU-2476. And if it helps, it may save you some
pain.
We will likely install the upgrade but may not be able to test
millions of writes any time soon, as the filesystem is needed for
production.
Regards,
Scott
On Thu, 12 Jun 2014 16:41:14 +0000
"Dilger, Andreas" <andreas.dil...@intel.com> wrote:
It looks like you've already increased arc_meta_limit beyond the
default, which is c_max / 4. That was critical to performance in our
testing.
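For reference, the current values can be checked in arcstats and the limit raised at runtime via the module parameter; the 12 GiB figure below is only a placeholder:

    # Check the current ARC metadata numbers (ZFS on Linux):
    grep -E '^arc_meta_(used|limit|max)' /proc/spl/kstat/zfs/arcstats

    # Raise arc_meta_limit at runtime; 12884901888 (12 GiB) is only a
    # placeholder value, size it to your MDS RAM:
    echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_meta_limit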
There is also a patch from Brian that should help performance in your
case:
http://review.whamcloud.com/10237
Cheers, Andreas
On Jun 11, 2014, at 12:53, "Scott Nolin" <scott.no...@ssec.wisc.edu> wrote:
We tried a few arc tunables as noted here:
https://jira.hpdd.intel.com/browse/LU-2476
However, I didn't find any clear benefit in the long term. We were
just trying a few things without a lot of insight.
Scott
On 6/9/2014 12:37 PM, Anjana Kar wrote:
Thanks for all the input.
Before we move away from a ZFS MDT, I was wondering if we can try
setting ZFS tunables to test the performance. Basically, what value can
we use for arc_meta_limit on our system? Are there any other settings
that can be changed?
Generating small files on our current system, things started off at 500
files/sec, then declined to about 1/20th of that after 2.45 million files.
-Anjana
On 06/09/2014 10:27 AM, Scott Nolin wrote:
We ran some scrub performance tests, and even without tunables set it
wasn't too bad for our specific configuration. The main thing we did
was verify it made sense to scrub all OSTs simultaneously; a one-liner
for that is sketched below.
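Something like this kicks them all off at once:

    # Start a scrub on every imported pool on this server; pool names
    # come straight from 'zpool list'.
    for pool in $(zpool list -H -o name); do
        zpool scrub "$pool"
    done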
Anyway, indeed scrub and resilver aren't about defrag. Further, the MDS
performance issues aren't about fragmentation.
A side note: it's probably ideal to stay below 80% usage for ldiskfs
too, or performance degrades due to fragmentation.
Sean, note I am dealing with specific issues for a very create-intensive
workload, and this is on the MDS only, where we may change. The data
integrity features of ZFS make it very attractive too. I fully expect
things will improve with ZFS.
If you want a lot of certainty in your choices, you may want to
consult various vendors of Lustre systems.
Scott
On June 8, 2014 11:42:15 AM CDT, "Dilger, Andreas" <andreas.dil...@intel.com> wrote:
Scrub and resilver have nothing to do with defrag.
Scrub is a scan of all the data blocks in the pool that verifies their
checksums and parity to detect silent data corruption, and rewrites the
bad blocks if necessary.
Resilver is reconstructing a failed disk onto a new disk using
parity or mirror copies of all the blocks on the failed disk. This is
similar to scrub.
Both scrub and resilver can be done online, though resilver of
course requires a spare disk to rebuild onto, which may not be
possible to add to a running system if your hardware does not support
it.
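Starting and monitoring an online scrub is straightforward ("tank" below is a placeholder pool name):

    # Start and monitor a scrub on a live pool; "tank" is a placeholder.
    zpool scrub tank
    zpool status -v tank    # shows scan progress and speed
    zpool scrub -s tank     # stop the scrub if the impact is too high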
Neither of them "improves" the performance or layout of data on
disk. They do impact performance because they cause a lot of random
I/O to the disks, though this impact can be limited by tunables on the
pool; see the sketch below.
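On ZFS on Linux 0.6.x the relevant knobs are module parameters; these values are only illustrative, not recommendations:

    # Illustrative throttle settings on ZFS on Linux 0.6.x; higher
    # delays mean less impact on foreground I/O.
    echo 8  > /sys/module/zfs/parameters/zfs_scrub_delay      # default 4
    echo 4  > /sys/module/zfs/parameters/zfs_resilver_delay   # default 2
    echo 16 > /sys/module/zfs/parameters/zfs_top_maxinflight  # default 32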
Cheers, Andreas
On Jun 8, 2014, at 4:21, "Sean Brisbane" <s.brisba...@physics.ox.ac.uk> wrote:
Hi Scott,
We are considering running ZFS-backed Lustre, and the factor-of-10ish
performance hit you see worries me. I know ZFS can splurge bits
of files all over the place by design. The Oracle docs do recommend
scrubbing the volumes and keeping usage below 80% for maintenance and
performance reasons. I'm going to call it 'defrag', though I'm sure
someone who knows better will correct me as to why it is not the same.
So are these performance issues present after scrubbing, and is it
possible to scrub online, i.e. is some reasonable level of performance
maintained while the scrub happens?
Resilvering is also recommended; I'm not sure if that is for
performance reasons.
http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html
----- Reply message -----
From: "Scott Nolin" <scott.no...@ssec.wisc.edu>
To: "Anjana Kar" <k...@psc.edu>, "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [Lustre-discuss] number of inodes in zfs MDT
Date: Fri, Jun 6, 2014 3:23 AM
Looking at some of our existing ZFS filesystems, we have a couple
with ZFS MDTs. One has 103M inodes and uses 152G of MDT space; another,
12M and 19G. I'd plan for less than that, I guess, as Mr. Dilger
suggests. It all depends on your expected average file size and number
of files for what will work.
We have run into some unpleasant surprises with ZFS for the MDT, mostly
documented in bug reports I believe, or at least hinted at.
A serious issue we have is the performance of the ZFS ARC cache over
time. This is something we didn't see in early testing, but with
enough use it grinds things to a crawl. I believe this may be
addressed in the newer version of ZFS, which we're hopefully awaiting.
Another thing we’ve seen, which is mysterious to me is this it
appears hat as the MDT begins to fill up file create rates go down.
We don’t really have a strong handle on this (not enough for a bug
report I think), but we see this:
1. The aforementioned 104M inode / 152GB MDT system has 4 SAS drives in
raid10. On initial testing, file creates ran at about 2500 to 3000 per
second. Follow-up testing in its current state (about half full..)
shows them at about 500 creates/sec, but with a few iterations of
mdtest those rates plummet quickly to unbearable levels (like 30…).
(See the mdtest sketch after this list.)
2. We took a snapshot of the filesystem and sent it to the backup MDS,
this time with the MDT built on 4 SAS drives in a raid0 - really not
for performance so much as "extra headroom", if that makes any sense.
Testing this, the create rates started higher, at maybe 800 or 1000
(this is from memory, I don't have my data in front of me). That
initial faster speed could just be writing to 4 spindles I suppose,
but, surprisingly to me, the performance degraded at a slower rate. It
took much longer to get painfully slow. It still got there. The
performance didn't degrade at the same rate, if that makes sense - the
same number of writes on the smaller/slower MDT degraded the
performance more quickly. My guess is that had something to do with
the total space available. Who knows. I believe restarting Lustre
(and certainly rebooting) 'resets the clock' on the file create
performance degradation.
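The mdtest runs were along these lines; the mount point, task count, and per-task file count here are placeholders, not our exact parameters:

    # Roughly how the create workload was driven with mdtest.
    #   -F  operate on files (not directories)
    #   -C  run the create phase only
    #   -n  number of items per task
    #   -i  number of test iterations
    mpirun -np 16 mdtest -F -C -n 100000 -i 3 -d /mnt/lustre/mdtest.out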
For that problem we’re just going to try adding 4 SSD’s, but it’s
an ugly problem. Also are once again hopeful new zfs version
addresses it.
And finally, we’ve got a real concern with snapshot backups of the
MDT that my colleague posted about - the problem we see manifests in
essentially a read-only recovered file system, so it’s a concern and
not quite terrifying.
All in all, for the next Lustre filesystem we bring up (in a couple
of weeks), we are very strongly considering going with ldiskfs for the
MDT this time.
Scott
From: Anjana Kar <k...@psc.edu>
Sent: Tuesday, June 3, 2014 7:38 PM
To: lustre-discuss@lists.lustre.org
Is there a way to set the number of inodes for a ZFS MDT?
I've tried using --mkfsoptions="-N value" as mentioned in the Lustre
2.0 manual, but it fails to accept it. We are mirroring two 80GB SSDs
for the MDT, but the number of inodes is getting set to 7 million,
which is not enough for a 100TB filesystem.
Thanks in advance.
-Anjana Kar
Pittsburgh Supercomputing Center
k...@psc.edu