We just finished testing ZFS 0.6.3 and Lustre 2.5.2 with a ZFS MDT and OSTs, but
still ran into the problem of running out of inodes on the MDT. The number
started at about 7.9M and grew to 8.9M, but not beyond that, with the MDT
on a mirrored zpool of two 80GB SSD drives. The filesystem size was 97TB,
with eight 13TB raidz2 OSTs on a shared MDS/OSS node, and a second OSS.
It took ~5000 seconds to run out of inodes in our empty-file test, but
during that time it averaged about 1650 creates/sec, which is the best we've seen.
I'm not sure why the inodes have been an issue, but we ran out of time to
pursue this further.
Instead we have moved to an ldiskfs MDT and ZFS OSTs, with the same Lustre/ZFS
versions, and have a lot more inodes available:
Filesystem             Inodes   IUsed    IFree IUse% Mounted on
x.x.x.x@o2ib:/iconfs 39049920 7455386 31594534   20% /iconfs
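For anyone sizing an ldiskfs MDT, the inode count can be set at format time. This is only a sketch with placeholder device, index, and values, not our exact command:

    # Sketch only: format an ldiskfs MDT with an explicit inode count.
    # The device, index, MGS NID, and the -N value are placeholders.
    mkfs.lustre --mdt --fsname=iconfs --index=0 \
        --mgsnode=x.x.x.x@o2ib \
        --mkfsoptions="-N 40000000" \
        /dev/md0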
Performance has reportedly been better, but one problem was that when
the OSS nodes went down before the OSTs could be taken offline (as would
happen during a power outage), the OSTs failed to mount after the reboot.
To get around that, we added a "zpool import -f" line after the message
"Unexpected return code from import of pool $pool" in the Lustre startup
script so the pools import, then ran the Lustre startup script to start the
OSTs. If there is a better way to handle this, please let me know.
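Paraphrased from memory, the edit amounts to something like this; the stock script's exact structure differs:

    # Paraphrase of the workaround in the Lustre init script; the
    # surrounding code in the stock script differs in detail.
    if ! zpool import "$pool"; then
        echo "Unexpected return code from import of pool $pool"
        # Added workaround: force-import pools that were not exported
        # cleanly (e.g. after a power outage took the OSS down hard).
        zpool import -f "$pool"
    fi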
Another problem we ran into is that our 1.8.9 clients could not write
into the new filesystem with Lustre 2.5.60, which came from
git.hpdd.intel.com/fs/lustre-release.git. Things worked after checking
out the b2_5 branch ("git checkout --track -b b2_5 origin/b2_5") and
rebuilding the kernel for ldiskfs. The OS on the Lustre servers is
CentOS 6.5, kernel 2.6.32-431.17.1.
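For the record, the steps were roughly the following; exact configure flags depend on the environment, so treat this as a sketch:

    # Roughly the steps we followed; configure flags depend on your
    # tree and environment.
    git clone git://git.hpdd.intel.com/fs/lustre-release.git
    cd lustre-release
    git checkout --track -b b2_5 origin/b2_5    # 2.5 maintenance branch
    sh autogen.sh
    ./configure --with-linux=/usr/src/kernels/2.6.32-431.17.1.el6.x86_64
    make rpms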
Thanks again for all the responses.
-Anjana
On 06/12/2014 09:43 PM, Scott Nolin wrote:
Just a note, I see zfs-0.6.3 has just been announced:
https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-announce/Lj7xHtRVOM4
I also see it is upgraded in the zfs/lustre repo.
The changelog notes the default arc_meta_limit has changed to 3/4 of
arc_c_max, along with a variety of other fixes, many focused on performance.
So, Anjana, this is probably worth testing, especially if you're
considering drastic measures.
We upgraded ZFS on our MDS, so this file create issue is harder for us to
test now (we literally started testing writes this afternoon, and it
hasn't degraded yet; so far at 20 million writes). Since your problem
still happens fairly quickly, I'm sure any information you have will be
very helpful to add to LU-2476. And if it helps, it may save you some
pain.
We will likely install the upgrade but may not be able to test
millions of writes any time soon, as the filesystem is needed for
production.
Regards,
Scott
On Thu, 12 Jun 2014 16:41:14 +0000
"Dilger, Andreas" <andreas.dil...@intel.com> wrote:
It looks like you've already increased arc_meta_limit beyond the
default, which is c_max / 4. That was critical to performance in our
testing.
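For reference, the current values can be checked in arcstats and the limit raised at runtime via the module parameter; the 12 GiB figure below is only a placeholder:

    # Check the current ARC metadata numbers (ZFS on Linux):
    grep -E '^arc_meta_(used|limit|max)' /proc/spl/kstat/zfs/arcstats

    # Raise arc_meta_limit at runtime; 12884901888 (12 GiB) is only a
    # placeholder value, size it to your MDS RAM:
    echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_meta_limit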
There is also a patch from Brian that should help performance in your
case:
http://review.whamcloud.com/10237
Cheers, Andreas
On Jun 11, 2014, at 12:53, "Scott Nolin" <scott.no...@ssec.wisc.edu> wrote:
We tried a few arc tunables as noted here:
https://jira.hpdd.intel.com/browse/LU-2476
However, I didn't find any clear benefit in the long term. We were
just trying a few things without a lot of insight.
Scott
On 6/9/2014 12:37 PM, Anjana Kar wrote:
Thanks for all the input.
Before we move away from a ZFS MDT, I was wondering if we can try
setting ZFS tunables to test the performance. Basically, what value can
we use for arc_meta_limit on our system? Are there any other settings
that can be changed?
Generating small files on our current system, things started off at 500
files/sec, then declined to about 1/20th of that after 2.45 million files.
-Anjana
On 06/09/2014 10:27 AM, Scott Nolin wrote:
We ran some scrub performance tests, and even without tunables set it
wasn't too bad for our specific configuration. The main thing we did
was verify it made sense to scrub all OSTs simultaneously; a one-liner
for that is sketched below.
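Something like this kicks them all off at once:

    # Start a scrub on every imported pool on this server; pool names
    # come straight from 'zpool list'.
    for pool in $(zpool list -H -o name); do
        zpool scrub "$pool"
    done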
Anyway, indeed scrub and resilver aren't about defrag. Further, the MDS
performance issues aren't about fragmentation.
A side note: it's probably ideal to stay below 80% usage for ldiskfs
too, or performance degrades due to fragmentation.
Sean, note I am dealing with specific issues for a very create-intensive
workload, and this is on the MDS only, where we may change. The data
integrity features of ZFS make it very attractive too. I fully expect
things will improve with ZFS.
If you want a lot of certainty in your choices, you may want to
consult various vendors of Lustre systems.
Scott
On June 8, 2014 11:42:15 AM CDT, "Dilger, Andreas" <andreas.dil...@intel.com> wrote:
Scrub and resilver have nothing to do with defrag.
Scrub is a scan of all the data blocks in the pool that verifies their
checksums and parity to detect silent data corruption, and rewrites the
bad blocks if necessary.
Resilver is reconstructing a failed disk onto a new disk using
parity or mirror copies of all the blocks on the failed disk. This is
similar to scrub.
Both scrub and resilver can be done online, though resilver of
course requires a spare disk to rebuild onto, which may not be
possible to add to a running system if your hardware does not support
it.
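Starting and monitoring an online scrub is straightforward ("tank" below is a placeholder pool name):

    # Start and monitor a scrub on a live pool; "tank" is a placeholder.
    zpool scrub tank
    zpool status -v tank    # shows scan progress and speed
    zpool scrub -s tank     # stop the scrub if the impact is too high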
Neither of them "improves" the performance or layout of data on
disk. They do impact performance because they cause a lot of random
I/O to the disks, though this impact can be limited by tunables on the
pool; see the sketch below.
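On ZFS on Linux 0.6.x the relevant knobs are module parameters; these values are only illustrative, not recommendations:

    # Illustrative throttle settings on ZFS on Linux 0.6.x; higher
    # delays mean less impact on foreground I/O.
    echo 8  > /sys/module/zfs/parameters/zfs_scrub_delay      # default 4
    echo 4  > /sys/module/zfs/parameters/zfs_resilver_delay   # default 2
    echo 16 > /sys/module/zfs/parameters/zfs_top_maxinflight  # default 32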
Cheers, Andreas
On Jun 8, 2014, at 4:21, "Sean Brisbane" <s.brisba...@physics.ox.ac.uk> wrote:
Hi Scott,
We are considering running ZFS-backed Lustre, and the factor-of-10ish
performance hit you see worries me. I know ZFS can splurge bits
of files all over the place by design. The Oracle docs do recommend
scrubbing the volumes and keeping usage below 80% for maintenance and
performance reasons. I'm going to call it 'defrag', though I'm sure
someone who knows better will correct me as to why it is not the same.
So are these performance issues present after scrubbing, and is it
possible to scrub online, i.e. is some reasonable level of performance
maintained while the scrub happens?
Resilvering is also recommended; I'm not sure if that is for
performance reasons.
http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html
----- Reply message -----
From: "Scott Nolin" <scott.no...@ssec.wisc.edu>
To: "Anjana Kar" <k...@psc.edu>, "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [Lustre-discuss] number of inodes in zfs MDT
Date: Fri, Jun 6, 2014 3:23 AM
Looking at some of our existing ZFS filesystems, we have a couple
with ZFS MDTs. One has 103M inodes and uses 152G of MDT space; another,
12M and 19G. I'd plan for less than that, I guess, as Mr. Dilger
suggests. It all depends on your expected average file size and number
of files for what will work.
We have run into some unpleasant surprises with ZFS for the MDT, mostly
documented in bug reports I believe, or at least hinted at.
A serious issue we have is the performance of the ZFS ARC cache over
time. This is something we didn't see in early testing, but with
enough use it grinds things to a crawl. I believe this may be
addressed in the newer version of ZFS, which we're hopefully awaiting.
Another thing we’ve seen, which is mysterious to me is this it
appears hat as the MDT begins to fill up file create rates go down.
We don’t really have a strong handle on this (not enough for a bug
report I think), but we see this:
1. The aforementioned 104M inode / 152GB MDT system has 4 SAS drives in
raid10. On initial testing, file creates ran at about 2500 to 3000 per
second. Follow-up testing in its current state (about half full..)
shows them at about 500 creates/sec, but with a few iterations of
mdtest those rates plummet quickly to unbearable levels (like 30…).
(See the mdtest sketch after this list.)
2. We took a snapshot of the filesystem and sent it to the backup MDS,
this time with the MDT built on 4 SAS drives in a raid0 - really not
for performance so much as "extra headroom", if that makes any sense.
Testing this, the create rates started higher, at maybe 800 or 1000
(this is from memory, I don't have my data in front of me). That
initial faster speed could just be writing to 4 spindles I suppose,
but, surprisingly to me, the performance degraded at a slower rate. It
took much longer to get painfully slow. It still got there. The
performance didn't degrade at the same rate, if that makes sense - the
same number of writes on the smaller/slower MDT degraded the
performance more quickly. My guess is that had something to do with
the total space available. Who knows. I believe restarting Lustre
(and certainly rebooting) 'resets the clock' on the file create
performance degradation.
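The mdtest runs were along these lines; the mount point, task count, and per-task file count here are placeholders, not our exact parameters:

    # Roughly how the create workload was driven with mdtest.
    #   -F  operate on files (not directories)
    #   -C  run the create phase only
    #   -n  number of items per task
    #   -i  number of test iterations
    mpirun -np 16 mdtest -F -C -n 100000 -i 3 -d /mnt/lustre/mdtest.out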
For that problem we’re just going to try adding 4 SSD’s, but it’s
an ugly problem. Also are once again hopeful new zfs version
addresses it.
And finally, we’ve got a real concern with snapshot backups of the
MDT that my colleague posted about - the problem we see manifests in
essentially a read-only recovered file system, so it’s a concern and
not quite terrifying.
All in all, for the next Lustre filesystem we bring up (in a couple
of weeks), we are very strongly considering going with ldiskfs for the
MDT this time.
Scott
From: Anjana Kar <k...@psc.edu>
Sent: Tuesday, June 3, 2014 7:38 PM
To: lustre-discuss@lists.lustre.org
Is there a way to set the number of inodes for a ZFS MDT?
I've tried using --mkfsoptions="-N value" as mentioned in the Lustre
2.0 manual, but it fails to accept it. We are mirroring two 80GB SSDs
for the MDT, but the number of inodes is getting set to 7 million,
which is not enough for a 100TB filesystem.
Thanks in advance.
-Anjana Kar
Pittsburgh Supercomputing Center
k...@psc.edu