Just a note, I see zfs-0.6.3 has just been announced:
https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-announce/Lj7xHtRVOM4
I also see it is upgraded in the zfs/lustre repo.
The changelog notes the default arc_meta_limit has changed to 3/4 of
arc_c_max, along with a variety of other fixes, many focused on
performance.
So Anjana, this is probably worth testing, especially if
you're considering drastic measures.
We upgraded our MDS, so this file create issue is harder for us to
test now (we literally started testing writes this afternoon, and
it's not degraded yet, so far at 20 million writes). Since your
problem still happens fairly quickly, I'm sure any information you
have will be very helpful to add to LU-2476. And if it helps, it may
save you some pain.
We will likely install the upgrade but may not be able to
test millions of writes any time soon, as the filesystem
is needed for production.
Regards,
Scott
On Thu, 12 Jun 2014 16:41:14 +0000
"Dilger, Andreas" <andreas.dil...@intel.com> wrote:
It looks like you've already increased arc_meta_limit
beyond the default, which is c_max / 4. That was critical
to performance in our testing.
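A quick way to tell whether the ARC metadata limit is what's biting is to compare arc_meta_used against arc_meta_limit (and c_max) in the SPL kstats. A minimal sketch, assuming a ZFS-on-Linux host with the usual /proc/spl/kstat/zfs/arcstats layout:

#!/usr/bin/env python
# Sketch: compare ARC metadata usage to its limit on a ZFS-on-Linux host.
# Assumes /proc/spl/kstat/zfs/arcstats exists and uses the usual
# "name  type  data" layout with two header lines.

def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:      # skip the kstat header lines
            fields = line.split()
            if len(fields) == 3:
                stats[fields[0]] = int(fields[2])
    return stats

if __name__ == "__main__":
    s = read_arcstats()
    gib = 1024.0 ** 3
    print("c_max          : %6.1f GiB" % (s["c_max"] / gib))
    print("arc_meta_limit : %6.1f GiB" % (s["arc_meta_limit"] / gib))
    print("arc_meta_used  : %6.1f GiB" % (s["arc_meta_used"] / gib))
    if s["arc_meta_used"] >= 0.95 * s["arc_meta_limit"]:
        print("ARC metadata is pinned at its limit; raising "
              "zfs_arc_meta_limit may help create rates.")

If arc_meta_used sits pinned at arc_meta_limit while create rates fall off, that is the case where raising the zfs_arc_meta_limit module parameter (or moving to 0.6.3, where the default changed) should help.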
There is also a patch from Brian that should help
performance in your case:
http://review.whamcloud.com/10237
Cheers, Andreas
On Jun 11, 2014, at 12:53, "Scott Nolin"
<scott.no...@ssec.wisc.edu>
wrote:
We tried a few arc tunables as noted here:
https://jira.hpdd.intel.com/browse/LU-2476
However, I didn't find any clear benefit in the long
term. We were just trying a few things without a lot of
insight.
Scott
On 6/9/2014 12:37 PM, Anjana Kar wrote:
Thanks for all the input.
Before we move away from a zfs MDT, I was wondering if we can try
setting zfs tunables to test the performance. Basically, what value
can we use for arc_meta_limit on our system? Are there any other
settings that can be changed?
Generating small files on our current system, things started off at
500 files/sec, then declined to about 1/20th of that after 2.45
million files.
-Anjana
On 06/09/2014 10:27 AM, Scott Nolin wrote:
We ran some scrub performance tests, and even without tunables set
it wasn't too bad for our specific configuration. The main thing we
did was verify it made sense to scrub all OSTs simultaneously.
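In case it helps anyone repeat that test, it amounts to kicking off scrubs on every OST pool at once and waiting for them to finish. A minimal sketch, assuming one zpool per OST; the pool names are hypothetical:

#!/usr/bin/env python
# Sketch: start scrubs on several OST pools at once and poll their status.
# Pool names below are hypothetical; substitute your own.
import subprocess, time

POOLS = ["ost0pool", "ost1pool", "ost2pool"]   # assumption: one zpool per OST

def scrub_all(pools):
    for pool in pools:
        # "zpool scrub" starts the scrub and returns immediately
        subprocess.check_call(["zpool", "scrub", pool])

def still_scrubbing(pool):
    out = subprocess.check_output(["zpool", "status", pool])
    return b"scrub in progress" in out

if __name__ == "__main__":
    scrub_all(POOLS)
    while any(still_scrubbing(p) for p in POOLS):
        time.sleep(60)
    print("all scrubs complete")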
Anyway, indeed scrub and resilver aren't about defrag. Further, the
MDS performance issues aren't about fragmentation.
As a side note, it's probably ideal to stay below 80% usage for
ldiskfs too, or performance degrades due to fragmentation.
Sean, note I am dealing with specific issues for a very
create-intensive workload, and the MDS is the only place where we
may change. The data integrity features of ZFS make it very
attractive too. I fully expect things will improve with ZFS.
If you want a lot of certainty in your choices, you may want to
consult various vendors of Lustre systems.
Scott
On June 8, 2014 11:42:15 AM CDT, "Dilger, Andreas"
<andreas.dil...@intel.com>
wrote:
Scrub and resilver have nothing to do with defrag.
Scrub is scanning of all the data blocks in the pool
to verify their checksums and parity to detect silent
data corruption, and rewrite the bad blocks if necessary.
Resilver is reconstructing a failed disk onto a new
disk using parity or mirror copies of all the blocks on
the failed disk. This is similar to scrub.
Both scrub and resilver can be done online, though
resilver of course requires a spare disk to rebuild onto,
which may not be possible to add to a running system if
your hardware does not support it.
Neither of them "improves" the performance or layout of
data on disk. They do impact performance because they
cause a lot of random IO to the disks, though this impact
can be limited by tunables on the pool.
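For what it's worth, on ZFS on Linux those throttles show up as zfs kernel module parameters; here's a minimal sketch for checking the ones usually mentioned for scrub/resilver pacing (the parameter names assume a 0.6.x-era ZoL and may differ on your version):

#!/usr/bin/env python
# Sketch: print the scrub/resilver throttling knobs on a ZFS-on-Linux host.
# Parameter names are the commonly documented 0.6.x ones; verify against
# your version's documentation before tuning anything.
import os

PARAM_DIR = "/sys/module/zfs/parameters"
KNOBS = ["zfs_scrub_delay", "zfs_resilver_delay",
         "zfs_top_maxinflight", "zfs_resilver_min_time_ms"]

for knob in KNOBS:
    path = os.path.join(PARAM_DIR, knob)
    try:
        with open(path) as f:
            print("%-26s %s" % (knob, f.read().strip()))
    except IOError:
        print("%-26s (not present on this ZFS version)" % knob)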
Cheers, Andreas
On Jun 8, 2014, at 4:21, "Sean Brisbane"
<s.brisba...@physics.ox.ac.uk>
wrote:
Hi Scott,
We are considering running ZFS-backed Lustre, and the
factor-of-10ish performance hit you see worries me. I know ZFS can
splurge bits of files all over the place by design. The Oracle docs
do recommend scrubbing the volumes and keeping usage below 80% for
maintenance and performance reasons; I'm going to call it 'defrag',
but I'm sure someone who knows better will correct me as to why it
is not the same.
So, are these performance issues seen after scrubbing, and is it
possible to scrub online, i.e. is some reasonable level of
performance maintained while the scrub happens?
Resilvering is also recommended; not sure if that is for
performance reasons.
http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html
Sent from my HTC Desire C on Three
----- Reply message -----
From: "Scott Nolin"
<scott.no...@ssec.wisc.edu>
To: "Anjana Kar" <k...@psc.edu>,
"lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [Lustre-discuss] number of inodes in zfs MDT
Date: Fri, Jun 6, 2014 3:23 AM
Looking at some of our existing zfs filesystems, we have a couple
with zfs MDTs. One has 103M inodes and uses 152G of MDT space,
another 12M and 19G. I'd plan for less than that, I guess, as
Mr. Dilger suggests. What will work depends on your expected average
file size and number of files.
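As a rough rule of thumb, those two filesystems work out to roughly 1.5-2 KB of MDT space per inode, which is enough for back-of-the-envelope sizing. A small sketch; the 500M-file target and the "keep it no more than half full" factor are purely illustrative assumptions:

#!/usr/bin/env python
# Back-of-the-envelope MDT sizing from the two filesystems above.
GIB = 1024 ** 3

examples = [(103e6, 152 * GIB),   # 103M inodes using 152G of MDT space
            (12e6,   19 * GIB)]   #  12M inodes using  19G of MDT space

for inodes, used in examples:
    print("%.0fM inodes -> %.0f bytes per inode" % (inodes / 1e6, used / inodes))

# Hypothetical target: 500M files at ~2 KB per inode, with 2x headroom
# so the MDT stays around half full (given the fill-up slowdown below).
target_files = 500e6
per_inode = 2048.0
print("suggested MDT size: ~%.0f GiB" % (target_files * per_inode * 2 / GIB))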
We have run into some unpleasant surprises with zfs
for the MDT, I believe mostly documented in bug reports,
or at least hinted at.
A serious issue we have is performance of the zfs arc
cache over time. This is something we didn’t see in early
testing, but with enough use it grinds things to a crawl.
I believe this may be addressed in the newer version of
ZFS, which we’re hopefully awaiting.
Another thing we've seen, which is mysterious to me: it appears
that as the MDT begins to fill up, file create rates go down. We
don't really have a strong handle on this (not enough for a bug
report, I think), but we see this:
1.
The aforementioned 104M inode / 152GB MDT system has 4 SAS drives in
raid10. On initial testing, file creates ran at about 2500 to 3000
IOPS. Follow-up testing in its current state (about half full) shows
them at about 500 IOPS, but with a few iterations of mdtest those
IOPS quickly plummet to unbearable levels (like 30…).
2.
We took a snapshot of the filesystem and sent it to the backup MDS,
this time with the MDT built on 4 SAS drives in a raid0 - really not
for performance so much as "extra headroom", if that makes any
sense. Testing this, the IOPS started higher, at maybe 800 or 1000
(this is from memory, I don't have my data in front of me). That
initial faster speed could just be writing to 4 spindles, I suppose,
but surprisingly the performance also degraded at a slower rate. It
took much longer to get painfully slow. It still got there, but the
same number of writes on the smaller/slower MDT degraded the
performance more quickly. My guess is that has something to do with
the total space available. Who knows. I believe restarting Lustre
(and certainly rebooting) "resets the clock" on the file create
performance degradation.
For that problem we're just going to try adding 4 SSDs, but it's an
ugly problem. We are also once again hopeful the new ZFS version
addresses it.
And finally, we've got a real concern with snapshot backups of the
MDT, which my colleague posted about - the problem we see manifests
as an essentially read-only recovered file system, so it's a concern
but not quite terrifying.
All in all, for the next Lustre file system we bring up (in a couple
of weeks) we are very strongly considering going with ldiskfs for
the MDT this time.
Scott
From: Anjana Kar <k...@psc.edu>
Sent: Tuesday, June 3, 2014 7:38 PM
To: lustre-discuss@lists.lustre.org
Is there a way to set the number of inodes for a zfs MDT?
I've tried using --mkfsoptions="-N value" mentioned in the Lustre
2.0 manual, but it fails to accept it. We are mirroring 2 80GB SSDs
for the MDT, but the number of inodes is getting set to 7 million,
which is not enough for a 100TB filesystem.
Thanks in advance.
-Anjana Kar
Pittsburgh Supercomputing Center
k...@psc.edu
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss