[lustre-discuss] IOR input for pathologic file system abuse
Hi all,

I am looking for IOR scripts that represent pathological use cases for file systems, for example shared-file access with a small, unaligned block size, or random I/O to a shared file. Does anyone have some input that he or she is willing to share?

Regards, Michael

--
Dr.-Ing. Michael Kluge
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
D-01062 Dresden, Germany
Contact: Falkenbrunnen, Room 240
Phone: (+49) 351 463-34217, Fax: (+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW: http://www.tu-dresden.de/zih
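As a starting point, two IOR command lines that approximate such pathological patterns; the MPI launcher, process count, paths and sizes are placeholders rather than anything from the original post:

  # small, unaligned transfers from all ranks into one shared file (1000-byte transfers)
  mpirun -np 16 IOR -a POSIX -w -r -e -t 1000 -b 1000000 -o /lustre/scratch/shared_unaligned
  # small random-offset I/O from all ranks to one shared file
  mpirun -np 16 IOR -a POSIX -w -r -e -z -t 4k -b 64m -o /lustre/scratch/shared_random

Both runs hit a single shared file because -F (file per process) is omitted; -z randomizes the offsets, and -t/-b set the transfer size and the per-rank block size.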
Re: [lustre-discuss] What happens if my stripe count is set to more than my number of stripes
Hi Oleg,

I tried it, and it looks like Lustre actually stores the stripe count of 128 (at least for directories). lfs getstripe tells me that my directory is now striped over 128 OSTs, although I only have 48:

[/scratch/mkluge] lctl dl | grep osc | wc -l
48
[/scratch/mkluge] mkdir p
[/scratch/mkluge] lfs setstripe -c 128 p
[/scratch/mkluge] lfs getstripe p
p
stripe_count: 128 stripe_size: 1048576 stripe_offset: -1

Regards, Michael

Am 20.04.2015 um 18:44 schrieb Drokin, Oleg:
> Hello!
> Current allocator behaviour is such that when you specify more stripes than you have OSTs, it'll treat it the same as if you set the stripe count to -1 (that is, the maximum possible number of stripes).
> Bye, Oleg
> On Apr 20, 2015, at 4:47 AM, wrote:
>> Hi, I have a question regarding the Lustre file system. If I have a file of size 64 GB and I set the stripe size to 1 GB, my number of stripes becomes 64. But if I set my stripe count to 128, what does Lustre do in that case?
>> Thanks and Regards, Prakrati
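A quick way to see what the allocator actually does with such a directory default is to create a file inside it and look at the file's own layout (paths are placeholders):

  lfs setstripe -c 128 /scratch/mkluge/p
  dd if=/dev/zero of=/scratch/mkluge/p/testfile bs=1M count=16
  lfs getstripe /scratch/mkluge/p/testfile
  # the file's lmm_stripe_count should come back as 48 here (the number of
  # available OSTs), even though the directory default still reads 128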
Re: [lustre-discuss] New community release model and 2.5.3 (and 2.x.0) patch lists?
On Wed, Apr 15, 2015 at 11:44 AM, Scott Nolin wrote:
> Since Intel will not be making community releases for 2.5.4 or 2.x.0 releases now, it seems the community will need to maintain some sort of patch list against these releases.

I don't think this is how I understood it at LUG. What I took away with me: Intel will make 2.x.0 releases every 6 months, including fixes. New releases may or may not have new features, but there will be a regular release cycle.

Michael
Re: [Lustre-discuss] [HPDD-discuss] will obdfilter-survey destroy an already formatted file system
Hi Cory,

I have been running this for a few weeks now. Only a few users are using the file system so far. Either I was lucky or Andreas is right: no one has complained yet that data got lost. I am running integrity checks in parallel and they have not found anything yet. So we can say it is "most probably safe" :)

Regards, Michael

> Michael,
>
> Unfortunately, the current Lustre Ops Manual indicates the opposite. From section 24.3 "Testing OST Performance (obdfilter_survey)":
>
> "The obdfilter_survey script is destructive and should not be run on devices that contain existing data that needs to be preserved. Thus, tests using obdfilter_survey should be run before the Lustre file system is placed in production."
>
> I opened LUDOC-146 to track the issue previously and updated the details to include Andreas' explanation.
>
> Thanks,
> -Cory
>
> On 3/21/13 7:18 PM, "Dilger, Andreas" wrote:
>
>> On 2013/21/03 4:09 AM, "Michael Kluge" wrote:
>>> I have read through the documentation for obdfilter-survey but could not find any information on how invasive the test is. Will it destroy an already formatted OST or render user data unusable?
>>
>> It shouldn't - the obdfilter-survey uses a different object sequence (2) compared to normal filesystem objects (currently always 0), so the two do not collide.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Software Architect
>> Intel High Performance Data Division
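For anyone who wants to repeat this measurement, a typical invocation of the survey against local disks looks roughly like the following; the parameter values are only illustrative and should be checked against the lustre-iokit documentation for the installed version:

  # on an OSS, exercise the obdfilter layer directly
  nobjhi=2 thrhi=32 size=1024 case=disk sh obdfilter-survey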
Re: [Lustre-discuss] Lustre On Two Clusters
Hi Mark,

I remember that the NRL used them. They had a couple of presentations at the Lustre User Group. Here is some pretty old stuff: http://wiki.lustre.org/images/3/3a/JamesHoffman.pdf

Regards, Michael

Am 09.05.2013 17:15, schrieb Mr. Mark L. Dotson (Contractor):
> Thanks, Lee.
>
> Has anyone done any work with Lustre and IB WAN extenders? I need help with my configuration.
>
> Thanks,
> Mark
>
> On 05/08/13 11:03, Lee, Brett wrote:
>>> From: Mr. Mark L. Dotson (Contractor)
>>> Sent: Tuesday, May 07, 2013 9:16 AM
>>> To: lustre-discuss@lists.lustre.org
>>> Subject: [Lustre-discuss] Lustre On Two Clusters
>>>
>>> I have Lustre installed and working on 1 cluster. Everything is IB. I can mount clients in this cluster with no problems. I want to mount this Lustre FS on another cluster that is attached to a separate IB switch. What's the best way to do this? Does it require a separate subnet for the IB interfaces, or does it matter?
>>
>> Hi Mark,
>>
>> Good to hear from you on the list.
>>
>> Regarding your question, a couple of options jump out at me.
>>
>> 1. Add additional interfaces to the servers. This will allow the Lustre servers to be on both IB networks and able to directly serve the file system to the clients.
>> 2. Use LNet router(s), the basics of which are documented in the operations manual.
>>
>> Either way, you'll need to perform some network configuration in (at least) the servers' "lustre.conf".
>>
>> -Brett
>>
>>> Currently, my /etc/modprobe.d/lustre.conf has the following:
>>>
>>> options lnet networks="o2ib0(ib0)"
>>>
>>> Lustre version is 2.3. OSs are CentOS 6.4.
>>>
>>> Any help would be much appreciated. Thanks.
>>>
>>> Mark
>>>
>>> --
>>> Mark Dotson, Systems Administrator, Lockheed-Martin, dotsonml@afrl.hpc.mil
>>
>> --
>> Brett Lee, Sr. Systems Engineer, Intel High Performance Data Division
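To make the two options concrete, the lustre.conf fragments could look roughly like this; network names, interfaces and the router NID are placeholders:

  # option 1: multi-homed servers with one NID per IB fabric
  options lnet networks="o2ib0(ib0),o2ib1(ib1)"

  # option 2: a dedicated LNet router between the two fabrics
  # on the router node:
  options lnet networks="o2ib0(ib0),o2ib1(ib1)" forwarding=enabled
  # on the clients of the second fabric:
  options lnet networks="o2ib1(ib0)" routes="o2ib0 10.2.0.1@o2ib1"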
[Lustre-discuss] will obdfilter-survey destroy an already formatted file system
Hi,

I have read through the documentation for obdfilter-survey but could not find any information on how invasive the test is. Will it destroy an already formatted OST or render user data unusable?

Regards, Michael
[Lustre-discuss] df -h question
Dear list,

we are in the process of copying the whole content of a 1.6.7 Lustre FS to a 1.8.7 Lustre FS. For this I precreated all individual directories on the new FS to set the striping information based on the #bytes/#files ratio. Then we used a parallel rsync to copy all directories over. All of this worked fine. Now, on the old FS the user data consumed 63 TB, while on the new FS 'df -h' reports only 56 TB as used. I am sure we copied all directories and all rsyncs finished successfully. Is this difference expected if one moves from 1.6 to 1.8? Or did I miss something?

Regards, Michael
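One way to check whether data is really missing, rather than just being accounted differently, is to compare file counts and apparent sizes instead of allocated blocks (paths are placeholders):

  lfs find /old_fs -type f | wc -l
  lfs find /new_fs -type f | wc -l
  du -s --apparent-size /old_fs /new_fs
  # plain 'du -s' counts allocated blocks, which can legitimately differ between
  # the two file systems (sparse files, allocation overhead, different striping)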
[Lustre-discuss] wrong free inode count on the client side with 1.8.7
Hi list,

the number of free inodes seems to be reported wrongly on the client side. If I create files, the number of free inodes does not change. If I delete the files, the number of free inodes increases. So, from a client perspective, if I repeatedly create and remove files, I can get more and more free inodes. I tried to find a bug for this in Whamcloud's database but could not find one. 'df -i' for the MDT on the MDS looks OK. I think the behaviour is the one described here: http://lists.lustre.org/pipermail/lustre-discuss/2011-July/015789.html

Right now I don't think this is a big problem. Can this turn into a real problem, for example when the number of free inodes as seen by the client exceeds 2^64 or whatever the limit is there?

Michael
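For comparison, the per-target counters can be read directly from a client, which avoids the aggregated numbers that drift (the mount point is a placeholder):

  lfs df -i /lustre    # free/used inodes per MDT and OST
  df -i /lustre        # the aggregated view that shows the drifting free-inode count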
Re: [Lustre-discuss] sgpdd-survey and /dev/dm-0
Hi Frank,

thanks a lot, that helped.

Regards, Michael

Am Dienstag, 12. Juni 2012, 14:24:27 schrieb Frank Riley:
> Mount your OSTs as raw devices using raw. Do a "man raw". I can't remember if you create the raw device from the /dev/mapper/* device or the /dev/dm-N device, but one of those works. Then run sgpdd_survey on the /dev/rawN devices.
>
> From: Michael Kluge
> Sent: Tuesday, June 12, 2012 5:51 AM
> To: lustre-discuss
> Subject: [Lustre-discuss] sgpdd-survey and /dev/dm-0
>
> Hi list,
>
> is there a way to run sgpdd-survey on device mapper disks?
>
> Regards, Michael
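Put together, the sequence might look like this; the device names are placeholders, the sgpdd-survey parameter names can differ slightly between lustre-iokit versions, and the survey is destructive, so the binding should be triple-checked first:

  raw /dev/raw/raw1 /dev/mapper/mpatha        # or the matching /dev/dm-N device
  raw -qa                                     # verify the binding
  size=8192 crglo=1 crghi=4 thrlo=4 thrhi=16 scsidevs="/dev/raw/raw1" sh sgpdd-survey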
[Lustre-discuss] sgpdd-survey and /dev/dm-0
Hi list,

is there a way to run sgpdd-survey on device mapper disks?

Regards, Michael
Re: [Lustre-discuss] performance: hard vs. soft links
> Hard links are only directory entries with refcounts on the target inode, so that when the last link to an inode is removed the inode will be deleted.
>
> Symlinks are inodes with a string that points to the original name. They are not refcounted on the target, but require a new inode to be allocated for each one.
>
> It isn't obvious which one would be slower, since they both have some overhead.
>
> Is your sample size large enough? 1000 may only take 1 s to complete and may not provide consistent results.

The 1000 creates need between 2.9 and 3.0 s (3 runs) for the hard links and 2.2-2.3 s (3 runs as well) for the soft links. I think the numbers are "not so bad" in terms of accuracy.

Thanks for the explanation.

Michael
[Lustre-discuss] performance: hard vs. soft links
Hi list,

for creating hard links instead of soft links (1.6.7, 1000 links created by one process, all in the same subdirectory, the node is behind one LNet router) I see about 25% overhead in time on the client side. Is this OK/normal/expected? Lustre probably needs to increment some reference counter on the link target if hard links are used?

Michael
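A minimal reproduction of this measurement could look like the following; the directory is a placeholder and the absolute numbers will of course depend on the setup:

  cd /lustre/linktest && touch target
  time for i in $(seq 1 1000); do ln target hard_$i; done      # hard links
  time for i in $(seq 1 1000); do ln -s target soft_$i; done   # soft links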
Re: [Lustre-discuss] Most recent Linux Kernel on the client for a 1.8.7 server
Hi Adrian,

OK, thanks. Then the state is the same as I remember it.

Regards, Michael

On 16.05.2012 20:14, Adrian Ulrich wrote:
>> could someone please tell me what the most recent kernel version (and Lustre version) is on the client side, if I have to stick to 1.8.7 on the server side?
>
> 2.x clients will refuse to talk to 1.8.x servers.
>
> You can build the 1.8.x client with a few patches on CentOS 6 (2.6.32), but you should really consider upgrading to 2.x in the future.
>
> Regards,
> Adrian
[Lustre-discuss] Most recent Linux Kernel on the client for a 1.8.7 server
Hi list,

could someone please tell me what the most recent kernel version (and Lustre version) is on the client side, if I have to stick to 1.8.7 on the server side? I think Lustre 2.1 is not compatible, and the 1.8.8 client can be compiled with 2.6.32, but I do not know how 2.0 is doing ...

Regards, Michael
Re: [Lustre-discuss] IOR writing to a shared file, performance does not scale
Hi Kshitij,

I would recommend running sgpdd-survey on the servers for one and for multiple disks, and then obdfilter-survey. Then you know what your storage can deliver. Then you could do LNet tests as well to see whether the network works fine. If the disks and the network deliver the expected performance, IOR will most probably run with good performance as well. Please see: http://wiki.lustre.org/images/4/40/Wednesday_shpc-2009-benchmarking.pdf

Regards, Michael

On 10.02.2012 23:27, Kshitij Mehta wrote:
> We have Lustre 1.6.7 configured using 64 OSTs. I am testing the performance using IOR, which is a file system benchmark.
>
> When I run IOR using MPI such that processes write to a shared file, performance does not scale. I tested with 1, 2 and 4 processes, and the performance remains constant at 230 MBps.
>
> When processes write to separate files, performance improves greatly, reaching 475 MBps.
>
> Note that all processes are spawned on a single node.
>
> Here is the output, writing to a shared file:
>
>> Command line used: ./IOR -a POSIX -b 2g -e -t 32m -w -o /fastfs/gabriel/ss_64/km_ior.out
>> Machine: Linux deimos102
>>
>> Summary:
>> api = POSIX
>> test filename = /fastfs/gabriel/ss_64/km_ior.out
>> access = single-shared-file
>> ordering in a file = sequential offsets
>> ordering inter file = no tasks offsets
>> clients = 4 (4 per node)
>> repetitions = 1
>> xfersize = 32 MiB
>> blocksize = 2 GiB
>> aggregate filesize = 8 GiB
>>
>> Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev  Mean (s)
>> write      233.61     233.61     233.61      0.00     7.30       7.30       7.30        0.00     35.06771  EXCEL
>>
>> Max Write: 233.61 MiB/sec (244.95 MB/sec)
>
> Writing to separate files:
>
>> Command line used: ./IOR -a POSIX -b 2g -e -t 32m -w -o /fastfs/gabriel/ss_64/km_ior.out -F
>> Machine: Linux deimos102
>>
>> Summary:
>> api = POSIX
>> test filename = /fastfs/gabriel/ss_64/km_ior.out
>> access = file-per-process
>> ordering in a file = sequential offsets
>> ordering inter file = no tasks offsets
>> clients = 4 (4 per node)
>> repetitions = 1
>> xfersize = 32 MiB
>> blocksize = 2 GiB
>> aggregate filesize = 8 GiB
>>
>> Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev  Mean (s)
>> write      475.95     475.95     475.95      0.00     14.87      14.87      14.87       0.00     17.21191  EXCEL
>>
>> Max Write: 475.95 MiB/sec (499.07 MB/sec)
>
> I am trying to understand where the bottleneck is when processes write to a shared file. Your help is appreciated.
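For the network part, lnet_selftest can measure the raw LNet throughput between a client and a server independently of the disks. A minimal session might look like this (NIDs are placeholders):

  modprobe lnet_selftest        # on every node involved
  export LST_SESSION=$$
  lst new_session rw_test
  lst add_group clients 10.0.0.10@o2ib
  lst add_group servers 10.0.0.20@o2ib
  lst add_batch bulk
  lst add_test --batch bulk --from clients --to servers brw write size=1M
  lst run bulk
  lst stat clients servers      # Ctrl-C to stop the statistics output
  lst end_session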
[Lustre-discuss] 1.8 client loses contact to 1.6 router
Hi list,

we have a 1.6.7 FS running which still works nicely. One node exports this FS (via 10GE) to another cluster that has some 1.8.5 patchless clients. These clients at some point (randomly, I think) mark the router as down (lctl show_route). It is always a different client, and usually a few clients each week do this. Although we configured the clients to ping the router again from time to time, the route never comes back. On these clients I can still "ping" the IP of the router, but "lctl ping" gives me an Input/Output error. If I do something like:

lctl --net o2ib set_route 172.30.128.241@tcp1 down
sleep 45
lctl --net o2ib del_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib add_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib set_route 172.30.128.241@tcp1 up

the route comes back; sometimes the client works again, but sometimes the clients issue an "unexpected aliveness of peer .." and need a reboot. I looked around and could not find a note on whether 1.8 clients and 1.6 routers will work together as expected. Has anyone experience with this kind of setup or an idea for further debugging?

Regards, Michael

modprobe.d/lustre.conf on the 1.8.5 clients:
-8<--
options lnet networks=tcp1(eth0)
options lnet routes="o2ib 172.30.128.241@tcp1;"
options lnet dead_router_check_interval=60 router_ping_timeout=30
-8<--
Re: [Lustre-discuss] MDS failover: SSD+DRBD or shared 15K-SAS-Storage RAID with approx. 10 disks
Hi Carlos,

> In my experience SSDs didn't help much, since the MDS bottleneck is not only a disk problem but rather the entire Lustre metadata mechanism.

Yes, but one does not need much space on the MDS, and four SSDs (as MDT) are way cheaper than a RAID controller with ten 15K disks. So the question is basically how the DRBD latency will influence the MDT performance. I know sync/async makes a big difference here, but I have no idea about the performance impact of either or how the reliability is influenced.

> One remark about DRBD: I've seen customers using it, but IMHO an active/standby HA type configuration would be more reliable and will provide you better resilience. Again, I don't know about your uptime and reliability needs, but the customers I've worked with that require minimum downtime in production always go for RAID controllers rather than DRBD replication.

OK, thanks. That is good information. So SSD+DRBD is considered to be the "cheap" solution, even for small clusters?

Regards, Michael

> Regards,
> Carlos.
>
> --
> Carlos Thomaz | Systems Architect
> Mobile: +1 (303) 519-0578
> ctho...@ddn.com | Skype ID: carlosthomaz
> DataDirect Networks, Inc.
> 9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
> ddn.com | Twitter: @ddn_limitless | 1.800.TERABYTE
>
> On 1/22/12 12:04 PM, "Michael Kluge" wrote:
>
>> Hi,
>>
>> I have been asked which one of the two I would choose for two MDS servers (active/passive): whether I would like to have SSDs, maybe two (mirrored) in both servers and DRBD for syncing, or a RAID controller with 15K disks. I have not done benchmarks on this topic myself and would like to ask if anyone has an idea or numbers? The cluster will be pretty small, about 50 clients.
>>
>> Regards, Michael
[Lustre-discuss] MDS failover: SSD+DRBD or shared 15K-SAS-Storage RAID with approx. 10 disks
Hi,

I have been asked which one of the two I would choose for two MDS servers (active/passive): whether I would like to have SSDs, maybe two (mirrored) in both servers and DRBD for syncing, or a RAID controller with 15K disks. I have not done benchmarks on this topic myself and would like to ask if anyone has an idea or numbers? The cluster will be pretty small, about 50 clients.

Regards, Michael
Re: [Lustre-discuss] Client behind Router can't mount with failover mgs
Hi Colin,

>> our mgs server (Lustre 1.6.7) failed and we mounted it on the failover node. Our clients (1.6.7) on the same IB network are still functional.
>
> Ok.. Well aside from the fact that 1.6.7 is long since deprecated, what else isn't functional after failover?

Nothing. Everything is fine. Just the 1.8.5 clients behind an IB<->10GE router can't mount anymore.

>> We have exported the fs via a Lustre/10GE router to another cluster with a patchless 1.8.5. The router works, we can ping around and get the usual protocol errors. But mounting the fs from the failover node does not work on these clients. Is this expected or is this supposed to work?
>
> Sorry, what are you actually trying to do here???

We have a (pretty old) SDR IB based cluster with ~700 nodes and 10 Lustre servers. We use an IB<->10GE router to attach this Lustre FS to another cluster. This works pretty well, but only when the MGS is mounted on the primary node, not when the MGS is mounted on the failover node. I just want to know whether this is expected behaviour or not.

Regards, Michael
[Lustre-discuss] Client behind Router can't mount with failover mgs
Hi list,

our MGS server (Lustre 1.6.7) failed and we mounted it on the failover node. Our clients (1.6.7) on the same IB network are still functional. We have exported the FS via a Lustre/10GE router to another cluster with patchless 1.8.5 clients. The router works, we can ping around and get the usual protocol errors. But mounting the FS from the failover node does not work on these clients. Is this expected, or is this supposed to work?

Regards, Michael
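One thing worth checking in such a setup is whether the clients mount with both MGS NIDs, so they can reach the MGS on either node. The client mount syntax for a failover MGS pair is (NIDs and fsname are placeholders):

  mount -t lustre 10.0.0.1@o2ib:10.0.0.2@o2ib:/lustre /mnt/lustre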
[Lustre-discuss] most recent kernel version for patchless client?
Hi list,

I am looking for information on the most recent kernel version that I can use to build a patchless client against. OFED, for example, refuses to build on kernels >3.0.0. Has anyone recently tried newer kernels with 1.8.7?

Regards, Michael
Re: [Lustre-discuss] Interpreting iozone measurements
Hi Jeremy,

>> I write a single 64 byte file (with one I/O request), iozone tells me something like '295548', which means ~295 MB/s. Dividing the file size by the bandwidth, I get the time that was needed to create the file and write the 64 bytes (with a single request). In this case, the time is about 0.2 microseconds, which is way below the RTT.
>
> That seems oddly fast for such a small file over any latency, since you shouldn't even be able to lock the file in that time.

OK, thanks. So I have to see at least a latency of one RTT. I think I need to dig through the data again. I might have made a mistake in one of the formulas ...

>> That means for a Lustre file system, if I create a file and write 64 bytes, the client sends two(?) RPCs to the server and does not wait for the completion. Is this correct? But it will wait for the completion of both RPCs when the file is closed?
>
> You can see what Lustre is doing if the client isn't doing any other activity and you enable some tracing. "sysctl -w lnet.debug='+rpctrace vfstrace'" should allow you to see the VFS ops ll_file_open, ll_file_writev/ll_file_aio_write, ll_file_release, along with any RPCs generated by them. You should see an RPC for the file open, which will be a 101 opcode for requesting the lock, and you should see a reply AFAIK before the client actually attempts to write any data. So that should bring your time up to at least 4 ms for 1 RTT. The initial write should request a lock from the first stripe, followed by an OST write RPC (opcode 4), followed by a file close (opcode 35). I ran a test over 4 ms latency so you can see what I'm referring to. I thought that there was a patch in Lustre a few months back that forced a flush before a file close, but this is from a 1.8.5 client, so I'm guessing that isn't how it works, because between when I closed the file and the end I had to "sync" for the OST write to show up.

OK, understood.

>> The numbers look different when I disable the client side cache by setting max_dirty_mb to 0.
>
> Without any grant I think all RPCs have to be synchronous, so you'll see a huge performance hit over latency.

These numbers look different :) I'm still trying to make sense of a couple of measurements and to put some useful data into some charts for the LUG this year.

Michael
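The same tracing can also be driven through lctl, including dumping the kernel debug buffer afterwards; a small sketch with placeholder paths:

  lctl set_param debug="+rpctrace +vfstrace"
  lctl clear                        # empty the debug buffer
  # ... run the small iozone test ...
  lctl dk /tmp/lustre_debug.log     # dump the collected RPC/VFS trace for inspection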
[Lustre-discuss] Interpreting iozone measurements
Hi all,

we have a testbed running with Lustre 1.8.3 and an RTT of ~4 ms (10GE network cards everywhere) for a ping between client and servers. If I have read the iozone source code correctly, iozone reports bandwidth in KB/s and includes the time for the open() call, but not for close(). If I write a single 64 byte file (with one I/O request), iozone tells me something like '295548', which means ~295 MB/s. Dividing the file size by the bandwidth, I get the time that was needed to create the file and write the 64 bytes (with a single request). In this case, the time is about 0.2 microseconds, which is way below the RTT.

That means for a Lustre file system, if I create a file and write 64 bytes, the client sends two(?) RPCs to the server and does not wait for the completion. Is this correct? But it will wait for the completion of both RPCs when the file is closed? The numbers look different when I disable the client side cache by setting max_dirty_mb to 0.

Regards, Michael
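For reference, the client-side dirty cache mentioned above is controlled per OSC and can be toggled with lctl; 32 MB is the usual per-OSC default and is used here only as an assumption:

  lctl set_param osc.*.max_dirty_mb=0    # force writes to become synchronous
  lctl get_param osc.*.max_dirty_mb
  lctl set_param osc.*.max_dirty_mb=32   # restore the default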
Re: [Lustre-discuss] OSS replacement
Hi Johann,

interesting. Is there no need to set the file system volume name of the new OST via tune2fs to the same string?

Michael

Am Donnerstag, den 24.02.2011, 10:48 +0100 schrieb Johann Lombardi:
> Hi,
>
> On Thu, Feb 24, 2011 at 10:39:32AM +0100, Gizo Nanava wrote:
>> we need to replace one of the OSSes in the cluster. We wonder whether simply copying (e.g. rsync) over the network the content of all /dev/sdX (ldiskfs mounted) from the OSS to be replaced to the new, already Lustre-formatted OSS (all /dev/sdX on both servers are the same) will work?
>
> Yes, the procedure is detailed in the manual:
> http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTroubleshooting.html#50651190_pgfId-1291458
>
> Johann
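If the label does need to match, it can be inspected and set with tune2fs; the device name and label below are placeholders:

  tune2fs -l /dev/sdX | grep 'volume name'   # shows e.g. lustre-OST0003
  tune2fs -L lustre-OST0003 /dev/sdX         # set the same label on the replacement device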
Re: [Lustre-discuss] Running MGS and OSS on the same machine
Hi Arya,

if I remember correctly, Lustre uses 0@lo for the localhost address. Does using the other NID, 192.168.0.10@tcp0, give any error message?

Michael

Am 18.02.2011 16:10, schrieb Arya Mazaheri:
> Hi again,
> I have planned to use one server as MGS and OSS simultaneously. But how can I format the OSTs as a Lustre FS? For example, the line below tells the OST that its mgsnode is at 192.168.0.10@tcp0:
> mkfs.lustre --fsname lustre --ost --mgsnode=192.168.0.10@tcp0 /dev/vg00/ost1
>
> But now the mgsnode is the same machine. I tried to put localhost instead of the IP address, but it didn't work.
>
> What should I do?
>
> Arya
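A sketch of how this is usually laid out when the MGS and an OSS share a node: keep --mgsnode pointing at the node's real NID (not 0@lo), so that other servers and clients can reach the MGS as well. The MGS device path and mount points are hypothetical; the OST line is taken from the quoted example:

  mkfs.lustre --mgs /dev/vg00/mgs
  mkfs.lustre --fsname lustre --ost --mgsnode=192.168.0.10@tcp0 /dev/vg00/ost1
  mount -t lustre /dev/vg00/mgs /mnt/mgs
  mount -t lustre /dev/vg00/ost1 /mnt/ost1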
Re: [Lustre-discuss] How to detect process owner on client
But it does not give you PIDs or user names? Or is there a way to find these with standard Lustre tools?

Michael

Am 11.02.2011 17:34, schrieb Andreas Dilger:
> On 2011-02-10, at 23:18, Michael Kluge wrote:
>> I am not aware of any possibility to map the current statistics in /proc to UIDs. But I might be wrong. We had a script like this a while ago which did not kill the I/O intensive processes but told us the PIDs.
>>
>> What we did was collect, for ~30 seconds, the number of I/O operations per node via /proc on all nodes. Then we attached an strace process to each process on nodes with heavy I/O load. This strace intercepted only the I/O calls and wrote one log file per process. If this strace runs for the same amount of time for each process on a host, you just need to sort the log files by size.
>
> On the OSS and MDS nodes there are per-client statistics that allow this kind of tracking. They can be seen in /proc/fs/lustre/obdfilter/*/exports/*/stats for detailed information (e.g. broken down by RPC type, bytes read/written), or /proc/fs/lustre/ost/OSS/*/req_history to just get a dump of the recent RPCs sent by each client.
>
> A little script was discussed in the thread "How to determine which lustre clients are loading filesystem" (2010-07-08):
>
>> Another way that I heard some sites were doing this is to use the "rpc history". They may already have a script to do this, but the basics are below:
>>
>> oss# lctl set_param ost.OSS.*.req_buffer_history_max=10240
>> {wait a few seconds to collect some history}
>> oss# lctl get_param ost.OSS.*.req_history
>>
>> This will give you a list of the past (up to) 10240 RPCs for the "ost_io" RPC service, which is what you are observing the high load on:
>>
>> 3436037:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957534353:448:Complete:1278612656:0s(-6s) opc 3
>> 3436038:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957536190:448:Complete:1278615489:1s(-41s) opc 3
>> 3436039:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957536193:448:Complete:1278615490:0s(-6s) opc 3
>>
>> This output is in the format:
>>
>> identifier:target_nid:source_nid:rpc_xid:rpc_size:rpc_status:arrival_time:service_time(deadline) opcode
>>
>> Using some shell scripting, one can find the clients sending the most RPC requests:
>>
>> oss# lctl get_param ost.OSS.*.req_history | tr ":" " " | cut -d" " -f3,9,10 | sort | uniq -c | sort -nr | head -20
>>
>> 3443 12345-192.168.20.159@tcp opc 3
>> 1215 12345-192.168.20.157@tcp opc 3
>> 121 12345-192.168.20.157@tcp opc 4
>>
>> This will give you a sorted list of the top 20 clients that are sending the most RPCs to the ost and ost_io services, along with the operation being done (3 = OST_READ, 4 = OST_WRITE, etc., see lustre/include/lustre/lustre_idl.h).
>
>> Am Donnerstag, den 10.02.2011, 21:16 -0600 schrieb Satoshi Isono:
>>> Dear members,
>>>
>>> I am looking into a way to detect the userid or jobid on the Lustre client. Assume the following conditions:
>>>
>>> 1) Users run jobs through a scheduler like PBS Pro, LSF or SGE.
>>> 2) A user's processes occupy Lustre I/O.
>>> 3) Some Lustre servers (MDS?/OSS?) can detect high I/O stress on each server.
>>> 4) But the Lustre server cannot make the mapping between jobid/userid and the Lustre I/O processes causing heavy stress, because there aren't userids on Lustre servers.
>>> 5) I expect that Lustre can monitor and can make the mapping.
>>> 6) If possible for (5), we can make a script which launches a scheduler command like qdel.
>>> 7) The heavy user's job will be killed by the job scheduler.
>>>
>>> I want (5) as a Lustre capability, but I guess current Lustre 1.8 cannot perform (5). On the other hand, in order to map a Lustre process to userid/jobid, are there any ways using something like rpctrace or nid stats? Can you please give your advice or comments?
>>>
>>> Regards,
>>> Satoshi Isono
Re: [Lustre-discuss] How to detect process owner on client
Hi Satoshi,

I am not aware of any possibility to map the current statistics in /proc to UIDs, but I might be wrong. We had a script like this a while ago which did not kill the I/O-intensive processes but told us the PIDs.

What we did was collect, for ~30 seconds, the number of I/O operations per node via /proc on all nodes. Then we attached an strace process to each process on the nodes with heavy I/O load. This strace intercepted only the I/O calls and wrote one log file per process. If this strace runs for the same amount of time for each process on a host, you just need to sort the log files by size.

Regards, Michael

Am Donnerstag, den 10.02.2011, 21:16 -0600 schrieb Satoshi Isono:
> Dear members,
>
> I am looking into a way to detect the userid or jobid on the Lustre client. Assume the following conditions:
>
> 1) Users run jobs through a scheduler like PBS Pro, LSF or SGE.
> 2) A user's processes occupy Lustre I/O.
> 3) Some Lustre servers (MDS?/OSS?) can detect high I/O stress on each server.
> 4) But the Lustre server cannot make the mapping between jobid/userid and the Lustre I/O processes causing heavy stress, because there aren't userids on Lustre servers.
> 5) I expect that Lustre can monitor and can make the mapping.
> 6) If possible for (5), we can make a script which launches a scheduler command like qdel.
> 7) The heavy user's job will be killed by the job scheduler.
>
> I want (5) as a Lustre capability, but I guess current Lustre 1.8 cannot perform (5). On the other hand, in order to map a Lustre process to userid/jobid, are there any ways using something like rpctrace or nid stats? Can you please give your advice or comments?
>
> Regards,
> Satoshi Isono
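A minimal sketch of that second step, attaching a time-limited strace to a suspicious PID; the PID, duration and syscall list are only illustrative:

  # trace only I/O-related syscalls of PID 12345 for 30 seconds
  timeout 30 strace -f -p 12345 -e trace=open,read,write,close -o /tmp/iotrace.12345
  # the largest log files belong to the most I/O-active processes
  ls -lS /tmp/iotrace.*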
Re: [Lustre-discuss] "up" a router that is marked "down"
Hi Jeremy,

yup, it is marked "obsolete (DANGEROUS)", but whatever, it did the trick :)

Thanks a lot, Michael

Am Dienstag, den 25.01.2011, 18:55 -0500 schrieb Jeremy Filizetti:
> Though I think it's marked as development or experimental in the Lustre documentation or source, "lctl set_route" has worked fine for me in the past with no issues.
>
> lctl set_route <gateway NID> up
>
> is the syntax, I believe.
>
> Jeremy
>
> On Tue, Jan 25, 2011 at 9:52 AM, Michael Kluge wrote:
>> Jason, Michael,
>>
>> thanks a lot for your replies. I pinged everyone from all directions but the router is still marked "down" on the client. I even removed and re-added the router entry via lctl --net tcp1 del_route xyz@o2ib and lctl --net tcp1 add_route xyz@o2ib. No luck. So I think I'll wait for the next maintenance window. Oh, and I forgot to mention that the servers run 1.6.7.2, the router as well, and the clients 1.8.5. Works well so far.
>>
>> Thanks, Michael
>>
>> Am Dienstag, den 25.01.2011, 15:12 +0100 schrieb Temple Jason:
>>> I've found that even with the Protocol Error, it still works.
>>>
>>> -Jason
>>>
>>> From: Michael Shuey
>>> Sent: martedì, 25. gennaio 2011 14:45
>>> Subject: Re: [Lustre-discuss] "up" a router that is marked "down"
>>>
>>> You'll want to add the "dead_router_check_interval" lnet module parameter as soon as you are able. As near as I can tell, without that there's no automatic check to make sure the router is alive.
>>>
>>> I've had some success in getting machines to recognize that a router is alive again by doing an lctl ping of their side of a router (e.g., on a tcp0 client, `lctl ping <router>@tcp0`, then `lctl ping <router>@o2ib0` from an o2ib0 client). If you have a server/client version mismatch, where lctl ping returns a protocol error, you may be out of luck.
>>>
>>> --
>>> Mike Shuey
>>>
>>> On Tue, Jan 25, 2011 at 8:38 AM, Michael Kluge wrote:
>>>> Hi list,
>>>>
>>>> if a Lustre router is down, comes back to life, and the servers do not actively test the routers periodically: is it possible to mark a Lustre router as "up"? Or to tell the servers to ping the router?
>>>>
>>>> Or can I enable the "router pinger" in a live system without unloading and loading the Lustre kernel modules?
>>>>
>>>> Regards, Michael
Re: [Lustre-discuss] "up" a router that is marked "down"
Jason, Michael,

thanks a lot for your replies. I pinged everyone from all directions but the router is still marked "down" on the client. I even removed and re-added the router entry via lctl --net tcp1 del_route xyz@o2ib and lctl --net tcp1 add_route xyz@o2ib. No luck. So I think I'll wait for the next maintenance window. Oh, and I forgot to mention that the servers run 1.6.7.2, the router as well, and the clients 1.8.5. Works well so far.

Thanks, Michael

Am Dienstag, den 25.01.2011, 15:12 +0100 schrieb Temple Jason:
> I've found that even with the Protocol Error, it still works.
>
> -Jason
>
> From: Michael Shuey
> Sent: martedì, 25. gennaio 2011 14:45
> To: Michael Kluge
> Cc: Lustre Diskussionsliste
> Subject: Re: [Lustre-discuss] "up" a router that is marked "down"
>
> You'll want to add the "dead_router_check_interval" lnet module parameter as soon as you are able. As near as I can tell, without that there's no automatic check to make sure the router is alive.
>
> I've had some success in getting machines to recognize that a router is alive again by doing an lctl ping of their side of a router (e.g., on a tcp0 client, `lctl ping <router>@tcp0`, then `lctl ping <router>@o2ib0` from an o2ib0 client). If you have a server/client version mismatch, where lctl ping returns a protocol error, you may be out of luck.
>
> --
> Mike Shuey
>
> On Tue, Jan 25, 2011 at 8:38 AM, Michael Kluge wrote:
>> Hi list,
>>
>> if a Lustre router is down, comes back to life, and the servers do not actively test the routers periodically: is it possible to mark a Lustre router as "up"? Or to tell the servers to ping the router?
>>
>> Or can I enable the "router pinger" in a live system without unloading and loading the Lustre kernel modules?
>>
>> Regards, Michael
[Lustre-discuss] "up" a router that is marked "down"
Hi list,

if a Lustre router is down, comes back to life, and the servers do not actively test the routers periodically: is it possible to mark a Lustre router as "up"? Or to tell the servers to ping the router?

Or can I enable the "router pinger" in a live system without unloading and loading the Lustre kernel modules?

Regards, Michael
Re: [Lustre-discuss] lnet router immediately marked as down
Hi Liang,

sure, but my current question is: why are the nodes within o2ib considering the router as down? I add the route on a node within o2ib, and instantly afterwards lctl show_route says the router is down. That does not make much sense to me. And if I try to send a message through the router from this node, I see that it can't send the message because all routers are down.

Regards, Michael

Am 03.12.2010 16:29, schrieb liang Zhen:
> Hi Michael,
>
> To add a router dynamically, you also have to run "--net o2ib add_route a.b@tcp1" on all nodes of tcp1, so the better choice is using a universal modprobe.conf by defining "ip2nets" and "routes". You can see some examples here:
> http://wiki.lustre.org/manual/LustreManual18_HTML/MoreComplicatedConfigurations.html
>
> Regards
> Liang
>
> On 12/3/10 9:32 PM, Michael Kluge wrote:
>> Hi list,
>>
>> we have Lustre 1.6.7.2 running on our (IB SDR) cluster and have added one additional NIC (tcp1) to one node and would like to use this node as a router. I have added an ip2nets statement and forwarding=enabled to the modprobe files on the router and reloaded the modules. I see two NIDs now and no trouble.
>>
>> The MDS server that needs to go through the router to a handful of additional clients is in production and I can't take it down. So I added the route to the additional network via lctl --net tcp1 add_route W.X.Y.Z@o2ib, where W.X.Y.Z is the IPoIB address of the router. When I do an lctl show_route, this router is marked as "down". Is there a way to bring it to life? I can lctl ping the router node from the MDS but can't reload LNet to enable active router tests. Right now, on the MDS, the only option for the lnet module is the network config for the IB network interface.
>>
>> Any ideas how to enable this router?
>>
>> Regards, Michael
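A universal modprobe fragment along the lines of what the manual describes might look like this; addresses and network names are placeholders, and the forwarding line belongs only in the router's own configuration:

  options lnet ip2nets="o2ib0(ib0) 172.30.0.*; tcp1(eth0) 192.168.1.*"
  options lnet routes="tcp1 172.30.0.240@o2ib0; o2ib0 192.168.1.240@tcp1"
  # on the router only:
  options lnet forwarding=enabled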
[Lustre-discuss] lnet router immediately marked as down
Hi list,

we have Lustre 1.6.7.2 running on our (IB SDR) cluster and have added one additional NIC (tcp1) to one node and would like to use this node as a router. I have added an ip2nets statement and forwarding=enabled to the modprobe files on the router and reloaded the modules. I see two NIDs now and no trouble.

The MDS server that needs to go through the router to a handful of additional clients is in production and I can't take it down. So I added the route to the additional network via lctl --net tcp1 add_route W.X.Y.Z@o2ib, where W.X.Y.Z is the IPoIB address of the router. When I do an lctl show_route, this router is marked as "down". Is there a way to bring it to life? I can lctl ping the router node from the MDS but can't reload LNet to enable active router tests. Right now, on the MDS, the only option for the lnet module is the network config for the IB network interface.

Any ideas how to enable this router?

Regards, Michael
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi Bernd,

I get the same message with your kernel RPMs:

In file included from include/linux/list.h:6,
from include/linux/mutex.h:13,
from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.c:36:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:9: error: redeclaration of enumerator 'false'
include/linux/stddef.h:16: error: previous definition of 'false' was here
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:11: error: redeclaration of enumerator 'true'
include/linux/stddef.h:18: error: previous definition of 'true' was here

Could it be that this '2.6.18 being almost a 2.6.28/29' confuses the OFED backports and the 2.6.18 backport does not work anymore? Is that solvable? I found nothing in the OFED bugzilla.

Michael

Am 23.10.2010 17:51, schrieb Michael Kluge:
> Hi Bernd,
>
> do you have an RPM with OFED 1.4 kernel modules for your kernel? I took a 2.6.18-164 from the Lustre kernels and OFED won't build against it. The OFED backports report lots and lots of symbols as "redefined".
>
> Michael
>
> Am 22.10.2010 23:30, schrieb Bernd Schubert:
>> Hello Michael,
>>
>> On Friday, October 22, 2010, you wrote:
>>> Hi Bernd,
>>>
>>>> I'm sorry to hear that. Unfortunately, I really do not have the time to port this version to your kernel version.
>>>
>>> No worries. I don't expect this :)
>>>
>>>> I remember that you use Debian. But I guess you are still using a SLES kernel then? You could ask Suse about it, although I guess they only care about SP1 with 2.6.32-sles now. If you use Debian Lenny, the RHEL5 kernel should work (and besides its name, it is internally more or less a 2.6.29 to 2.6.32 kernel). Later Debian and Ubuntu releases have a more recent udev, which requires at least 2.6.27.
>>>
>>> OK, if the 2.6.18 works like a charm, I'll give the 2.6.18-194 a try.
>>
>> Just don't forget that -194 requires 1.8.4 (I think you had been at 1.8.3 previously). We also have this driver added as a Lustre kernel patch in our -ddn releases. 1.8.4 is in testing, but I have not uploaded it yet. 1.8.3-ddn also includes the driver together with recent security backports.
>>
>> http://eu.ddn.com:8080/lustre/lustre/1.8.3/
>>
>> Cheers,
>> Bernd
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi Bernd,

do you have an RPM with OFED 1.4 kernel modules for your kernel? I took a 2.6.18-164 from the Lustre kernels and OFED won't build against it. The OFED backports report lots and lots of symbols as "redefined".

Michael

Am 22.10.2010 23:30, schrieb Bernd Schubert:
> Hello Michael,
>
> On Friday, October 22, 2010, you wrote:
>> Hi Bernd,
>>
>>> I'm sorry to hear that. Unfortunately, I really do not have the time to port this version to your kernel version.
>>
>> No worries. I don't expect this :)
>>
>>> I remember that you use Debian. But I guess you are still using a SLES kernel then? You could ask Suse about it, although I guess they only care about SP1 with 2.6.32-sles now. If you use Debian Lenny, the RHEL5 kernel should work (and besides its name, it is internally more or less a 2.6.29 to 2.6.32 kernel). Later Debian and Ubuntu releases have a more recent udev, which requires at least 2.6.27.
>>
>> OK, if the 2.6.18 works like a charm, I'll give the 2.6.18-194 a try.
>
> Just don't forget that -194 requires 1.8.4 (I think you had been at 1.8.3 previously). We also have this driver added as a Lustre kernel patch in our -ddn releases. 1.8.4 is in testing, but I have not uploaded it yet. 1.8.3-ddn also includes the driver together with recent security backports.
>
> http://eu.ddn.com:8080/lustre/lustre/1.8.3/
>
> Cheers,
> Bernd
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi Bernd, > I'm sorry to hear that. Unfortunately, I really do not have the time to port > this version to your kernel version. No worries. I don't expect this :) > I remember that you use Debian. But I guess you are still using a SLES kernel > then? You could ask Suse about it, although I guess they only do care about > SP1 with 2.6.32-sles now. If you use Debian Lenny, the RHEL5 kernel should > work (and besides its name, it is internally more or less a 2.6.29 to 2.6.32 > kernel). Later Debian and Ubuntu releases have a more recent udev, which > requires at least 2.6.27. OK, if the 2.6.18 works like a charm, I'll give the 2.6.18-194 it a try. Michael > > You could also ask our support department, if they have any news for 2.6.27. > I'm in Lustre engineering and as we only support RHEL5 right now, I so far > did > not care about other kernel versions too much. > > If all doesn't help, you will need to set the queue depth to 1, but that will > also impose a big performance hit :( > > > Cheers, > Bernd > > > On Friday, October 22, 2010, Michael Kluge wrote: > > Hi Bernd, > > > > I have found a RHEL-only release for this version. It does not compile > > on a 2.6.27 kernel :( I actually don't want to go back to 2.6.18 just to > > get a new driver. > > > > > > Michael > > > > Am Freitag, den 22.10.2010, 13:34 +0200 schrieb Bernd Schubert: > > > On Friday, October 22, 2010, Michael Kluge wrote: > > > > Hi list, > > > > > > > > DID_BUS_BUSY means that the controller is unable to handle the SCSI > > > > command and is basically asking the host to send it again later. I had > > > > I think just one concurrent region and 32 threads running. What would > > > > be the appropriate action in this case? Reducing the queue depth on > > > > the HBA? We have Qlogic here, there is an option for the kernel module > > > > for this. > > > > > > I think you run into a known issue with the Q-Logic driver an the SFA10K. > > > You will need at least qla2xxx version 8.03.01.06.05.06-k. And the > > > optimal numbers of commands is likely to be 16 (with 4 OSS connected). > > > > > > > > > Hope it helps, > > > Bernd > > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi Bernd, I have found a RHEL-only release for this version. It does not compile on a 2.6.27 kernel :( I actually don't want to go back to 2.6.18 just to get a new driver. Michael Am Freitag, den 22.10.2010, 13:34 +0200 schrieb Bernd Schubert: > On Friday, October 22, 2010, Michael Kluge wrote: > > Hi list, > > > > DID_BUS_BUSY means that the controller is unable to handle the SCSI > > command and is basically asking the host to send it again later. I had I > > think just one concurrent region and 32 threads running. What would be > > the appropriate action in this case? Reducing the queue depth on the > > HBA? We have Qlogic here, there is an option for the kernel module for > > this. > > I think you run into a known issue with the Q-Logic driver an the SFA10K. You > will need at least qla2xxx version 8.03.01.06.05.06-k. And the optimal > numbers > of commands is likely to be 16 (with 4 OSS connected). > > > Hope it helps, > Bernd > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Reducing the queue depth from the default of 32 to 8 did not help. It looks like this problem always shows up when I am writing to more than one region. 2 regions and 2 threads are enough to see the problem. The last test that succeeds is one region and 16 threads. The 1-region/32-thread combination is not being tested. Michael Am Freitag, den 22.10.2010, 10:48 +0200 schrieb Michael Kluge: > Hi list, > > DID_BUS_BUSY means that the controller is unable to handle the SCSI > command and is basically asking the host to send it again later. I had I > think just one concurrent region and 32 threads running. What would be > the appropriate action in this case? Reducing the queue depth on the > HBA? We have Qlogic here, there is an option for the kernel module for > this. > > > Regards, Michael > > ___ > Lustre-discuss mailing list > Lustre-discuss@lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
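For reference, the region/thread combinations above map directly onto the sgpdd-survey environment variables. A minimal sketch of the two interesting cases, assuming the lustre-iokit variable names (crglo/crghi for concurrent regions, thrlo/thrhi for threads, scsidevs for the sg device) and a placeholder device:

  # one region, 16 threads: the largest case reported to still succeed
  size=8192 crglo=1 crghi=1 thrlo=16 thrhi=16 scsidevs="/dev/sg2" ./sgpdd-survey
  # two regions, two threads: already enough to provoke DID_BUS_BUSY here
  size=8192 crglo=2 crghi=2 thrlo=2 thrhi=2 scsidevs="/dev/sg2" ./sgpdd-survey

Check the copy of sgpdd-survey shipped with your lustre-iokit for the exact variable names before running this.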
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi list, DID_BUS_BUSY means that the controller is unable to handle the SCSI command and is basically asking the host to send it again later. I had I think just one concurrent region and 32 threads running. What would be the appropriate action in this case? Reducing the queue depth on the HBA? We have Qlogic here, there is an option for the kernel module for this. Regards, Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
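On the queue depth question: for the QLogic driver this is normally a module parameter. A sketch, where the parameter name ql2xmaxqdepth is an assumption for this driver generation and modinfo will tell for sure:

  # check which depth-related options the installed driver exposes
  modinfo qla2xxx | grep -i depth
  # cap the per-LUN queue depth, e.g. in /etc/modprobe.d/qla2xxx.conf
  options qla2xxx ql2xmaxqdepth=16
  # the module has to be reloaded (or the node rebooted) for this to take effect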
Re: [Lustre-discuss] high CPU load limits bandwidth?
Disabling checksums boosts the performance to 660 MB/s for a single thread. Now placing 6 IOR processes one my eight core box gives with some striping 1.6 GB/s which is close to the LNET bandwidth. Thanks a lot again! Michael Am 20.10.2010 19:13, schrieb Michael Kluge: > Using O_DIRECT reduces the CPU load but the magical limit of 500 MB/s > for one thread remains. Are the CRC sums calculated on a per thread > base? Or stripe base? Is there a way to test the checksumming speed only? > > > Michael > > Am 20.10.2010 18:53, schrieb Andreas Dilger: >> On 2010-10-20, at 10:40, Michael Kluge wrote: >>> It is the CPU load on the client. The dd/IOR process is using one core >>> completely. The clients and the servers are connected via DDR IB. LNET >>> bandwidth is at 1.8 GB/s. Servers have 1.8.3, the client has 1.8.3 >>> patchless. >> >> If you only have a single threaded write, then this is somewhat unavoidable >> to saturate a CPU due to copy_from_user(). O_DIRECT will avoid this. >> >>Also, disabling data checksums and debugging can help considerably. There >> is a patch in bugzilla to add support for h/w crc32c on Nehalem CPUs to >> reduce this overhead, but still not as fast as no checksum at all. >> >> Cheers, Andreas >> >>> Am 20.10.2010 18:15, schrieb Andreas Dilger: >>>> Is this client CPU or server CPU? If you are using Ethernet it will >>>> definitely be CPU hungry and can easily saturate a single core. >>>> >>>> Cheers, Andreas >>>> >>>> On 2010-10-20, at 8:41, Michael Kluge >>>> wrote: >>>> >>>>> Hi list, >>>>> >>>>> is it normal, that a 'dd' or an 'IOR' pushing 10MB blocks to a lustre >>>>> file system shows up with a 100% CPU load within 'top'? The reason why I >>>>> am asking this is that I can write from one client to one OST with 500 >>>>> MB/s. The CPU load will be at 100% in this case. If I stripe over two >>>>> OSTs (which use different OSS servers and different RAID controllers) I >>>>> will get 500 as well (seeing 2x250 MB/s on the OSTs). The CPU load will >>>>> be at 100% again. >>>>> >>>>> A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10% >>>>> CPU load. >>>>> >>>>> Are there ways to tune this behavior? Changing max_rpcs_in_flight and >>>>> max_dirty_mb did not help. >>>>> >>>>> >>>>> Regards, Michael >>>>> >>>>> -- >>>>> >>>>> Michael Kluge, M.Sc. >>>>> >>>>> Technische Universität Dresden >>>>> Center for Information Services and >>>>> High Performance Computing (ZIH) >>>>> D-01062 Dresden >>>>> Germany >>>>> >>>>> Contact: >>>>> Willersbau, Room A 208 >>>>> Phone: (+49) 351 463-34217 >>>>> Fax:(+49) 351 463-37773 >>>>> e-mail: michael.kl...@tu-dresden.de >>>>> WWW:http://www.tu-dresden.de/zih >>>>> ___ >>>>> Lustre-discuss mailing list >>>>> Lustre-discuss@lists.lustre.org >>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>>> >>> >>> >>> -- >>> Michael Kluge, M.Sc. >>> >>> Technische Universität Dresden >>> Center for Information Services and >>> High Performance Computing (ZIH) >>> D-01062 Dresden >>> Germany >>> >>> Contact: >>> Willersbau, Room WIL A 208 >>> Phone: (+49) 351 463-34217 >>> Fax:(+49) 351 463-37773 >>> e-mail: michael.kl...@tu-dresden.de >>> WWW:http://www.tu-dresden.de/zih >> > > -- Michael Kluge, M.Sc. 
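For anyone wanting to repeat this: the checksums in question are the per-RPC data checksums computed on the client OSCs. A minimal sketch with the usual 1.8 tunable (the exact parameter path may differ slightly between versions):

  # show the current setting on the client (1 = checksums on)
  lctl get_param osc.*.checksums
  # switch data checksumming off for all OSCs on this client
  lctl set_param osc.*.checksums=0

Note that this trades end-to-end data integrity checking for CPU time, so it is a conscious decision per site, not a default recommendation.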
Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] high CPU load limits bandwidth?
Using O_DIRECT reduces the CPU load but the magical limit of 500 MB/s for one thread remains. Are the CRC sums calculated on a per thread base? Or stripe base? Is there a way to test the checksumming speed only? Michael Am 20.10.2010 18:53, schrieb Andreas Dilger: > On 2010-10-20, at 10:40, Michael Kluge wrote: >> It is the CPU load on the client. The dd/IOR process is using one core >> completely. The clients and the servers are connected via DDR IB. LNET >> bandwidth is at 1.8 GB/s. Servers have 1.8.3, the client has 1.8.3 patchless. > > If you only have a single threaded write, then this is somewhat unavoidable > to saturate a CPU due to copy_from_user(). O_DIRECT will avoid this. > > Also, disabling data checksums and debugging can help considerably. There > is a patch in bugzilla to add support for h/w crc32c on Nehalem CPUs to > reduce this overhead, but still not as fast as no checksum at all. > > Cheers, Andreas > >> Am 20.10.2010 18:15, schrieb Andreas Dilger: >>> Is this client CPU or server CPU? If you are using Ethernet it will >>> definitely be CPU hungry and can easily saturate a single core. >>> >>> Cheers, Andreas >>> >>> On 2010-10-20, at 8:41, Michael Kluge wrote: >>> >>>> Hi list, >>>> >>>> is it normal, that a 'dd' or an 'IOR' pushing 10MB blocks to a lustre >>>> file system shows up with a 100% CPU load within 'top'? The reason why I >>>> am asking this is that I can write from one client to one OST with 500 >>>> MB/s. The CPU load will be at 100% in this case. If I stripe over two >>>> OSTs (which use different OSS servers and different RAID controllers) I >>>> will get 500 as well (seeing 2x250 MB/s on the OSTs). The CPU load will >>>> be at 100% again. >>>> >>>> A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10% >>>> CPU load. >>>> >>>> Are there ways to tune this behavior? Changing max_rpcs_in_flight and >>>> max_dirty_mb did not help. >>>> >>>> >>>> Regards, Michael >>>> >>>> -- >>>> >>>> Michael Kluge, M.Sc. >>>> >>>> Technische Universität Dresden >>>> Center for Information Services and >>>> High Performance Computing (ZIH) >>>> D-01062 Dresden >>>> Germany >>>> >>>> Contact: >>>> Willersbau, Room A 208 >>>> Phone: (+49) 351 463-34217 >>>> Fax:(+49) 351 463-37773 >>>> e-mail: michael.kl...@tu-dresden.de >>>> WWW:http://www.tu-dresden.de/zih >>>> ___ >>>> Lustre-discuss mailing list >>>> Lustre-discuss@lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >> >> >> -- >> Michael Kluge, M.Sc. >> >> Technische Universität Dresden >> Center for Information Services and >> High Performance Computing (ZIH) >> D-01062 Dresden >> Germany >> >> Contact: >> Willersbau, Room WIL A 208 >> Phone: (+49) 351 463-34217 >> Fax:(+49) 351 463-37773 >> e-mail: michael.kl...@tu-dresden.de >> WWW:http://www.tu-dresden.de/zih > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] ldiskfs performance vs. XFS performance
> For your final final filesystem you still probably want to enable async > journals (unless you are willing to enable the S2A unmirrored device cache). OK, thanks. We'll give this a try. Michael > Most obdecho/obdfilter-survey bugs are gone in 1.8.4, except your ctrl+c > problem, for which a patch exists: > > https://bugzilla.lustre.org/show_bug.cgi?id=21745 > > Cheers, > Bernd > > > On Wednesday, October 20, 2010, Michael Kluge wrote: >> Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device. >> We trapped into one or two bugs with obdfilter-survey as lctl has at >> least one bug in 1.8.3 when is uses multiple threads and >> obdfilter-survey also causes an LBUG when you CTRL+C it. We see 600+ >> MB/s for obdfilter-survey over a reasonable parameter space after we >> changed to the ext4 based ldiskfs. So that seems to be the trick. >> >> Michael >> >> Am Montag, den 18.10.2010, 14:04 -0600 schrieb Andreas Dilger: >>> On 2010-10-18, at 10:40, Johann Lombardi wrote: >>>> On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote: >>>>> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000 >>>>> mke2fs -O journal_dev -b 4096 $RAM_DEV >>>>> >>>>> mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram >>>>> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b >>>>> 4096 -j -J device=$RAM_DEV" /dev/disk/by-path/... >>>>> >>>>> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1 >>>> >>>> In fact, Lustre uses additional mount options (see "Persistent mount >>>> opts" in tunefs.lustre output). If your ldiskfs module is based on >>>> ext3, you should add the extents and mballoc options which are known >>>> to improve performance. >>> >>> Even then, the IO submission path of ext3 from userspace is not very >>> good, and such a performance difference is not unexpected. When >>> submitting IO from userspace to ext3/ldiskfs it is being done in 4kB >>> blocks, and each block is allocated separately (regardless of mballoc, >>> unfortunately). When Lustre is doing IO from the kernel, the client is >>> aggregating the IO into 1MB chunks and the entire 1MB write is allocated >>> in one operation. >>> >>> That is why we developed the "delalloc" code for ext4 - so that userspace >>> could also get better IO performance, and utilize the multi-block >>> allocation (mballoc) routines that have been in ldiskfs for ages, but >>> only accessible from the kernel. >>> >>> For Lustre performance testing, I would suggest looking at lustre-iokit, >>> and in particular "sgpdd" to test the underlying block device, and then >>> obdfilter-survey to test the local Lustre IO submission path. >>> >>> Cheers, Andreas >>> -- >>> Andreas Dilger >>> Lustre Technical Lead >>> Oracle Corporation Canada Inc. > > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] high CPU load limits bandwidth?
It is the CPU load on the client. The dd/IOR process is using one core completely. The clients and the servers are connected via DDR IB. LNET bandwidth is at 1.8 GB/s. Servers have 1.8.3, the client has 1.8.3 patchless. Micha Am 20.10.2010 18:15, schrieb Andreas Dilger: > Is this client CPU or server CPU? If you are using Ethernet it will > definitely be CPU hungry and can easily saturate a single core. > > Cheers, Andreas > > On 2010-10-20, at 8:41, Michael Kluge wrote: > >> Hi list, >> >> is it normal, that a 'dd' or an 'IOR' pushing 10MB blocks to a lustre >> file system shows up with a 100% CPU load within 'top'? The reason why I >> am asking this is that I can write from one client to one OST with 500 >> MB/s. The CPU load will be at 100% in this case. If I stripe over two >> OSTs (which use different OSS servers and different RAID controllers) I >> will get 500 as well (seeing 2x250 MB/s on the OSTs). The CPU load will >> be at 100% again. >> >> A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10% >> CPU load. >> >> Are there ways to tune this behavior? Changing max_rpcs_in_flight and >> max_dirty_mb did not help. >> >> >> Regards, Michael >> >> -- >> >> Michael Kluge, M.Sc. >> >> Technische Universität Dresden >> Center for Information Services and >> High Performance Computing (ZIH) >> D-01062 Dresden >> Germany >> >> Contact: >> Willersbau, Room A 208 >> Phone: (+49) 351 463-34217 >> Fax:(+49) 351 463-37773 >> e-mail: michael.kl...@tu-dresden.de >> WWW:http://www.tu-dresden.de/zih >> ___ >> Lustre-discuss mailing list >> Lustre-discuss@lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] high CPU load limits bandwidth?
Hi list, is it normal that a 'dd' or an 'IOR' pushing 10MB blocks to a Lustre file system shows up with 100% CPU load within 'top'? The reason why I am asking this is that I can write from one client to one OST with 500 MB/s. The CPU load will be at 100% in this case. If I stripe over two OSTs (which use different OSS servers and different RAID controllers) I will get 500 MB/s as well (seeing 2x250 MB/s on the OSTs). The CPU load will be at 100% again. A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10% CPU load. Are there ways to tune this behavior? Changing max_rpcs_in_flight and max_dirty_mb did not help. Regards, Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
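For completeness, the two client-side tunables mentioned above are usually adjusted per OSC like this (the values are examples, not recommendations):

  # allow more concurrent RPCs per OSC (the default is 8)
  lctl set_param osc.*.max_rpcs_in_flight=32
  # allow more dirty, not-yet-written data to be cached per OSC (the default is 32 MB)
  lctl set_param osc.*.max_dirty_mb=128

As the thread shows, neither helps here because the bottleneck is the single-threaded copy and checksum work on the client, not the number of RPCs in flight.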
Re: [Lustre-discuss] ldiskfs performance vs. XFS performance
Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device. We trapped into one or two bugs with obdfilter-survey as lctl has at least one bug in 1.8.3 when is uses multiple threads and obdfilter-survey also causes an LBUG when you CTRL+C it. We see 600+ MB/s for obdfilter-survey over a reasonable parameter space after we changed to the ext4 based ldiskfs. So that seems to be the trick. Michael Am Montag, den 18.10.2010, 14:04 -0600 schrieb Andreas Dilger: > On 2010-10-18, at 10:40, Johann Lombardi wrote: > > On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote: > >> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000 > >> mke2fs -O journal_dev -b 4096 $RAM_DEV > >> > >> mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram > >> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096 > >> -j -J device=$RAM_DEV" /dev/disk/by-path/... > >> > >> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1 > > > > In fact, Lustre uses additional mount options (see "Persistent mount opts" > > in tunefs.lustre output). > > If your ldiskfs module is based on ext3, you should add the extents and > > mballoc options which are known to improve performance. > > Even then, the IO submission path of ext3 from userspace is not very good, > and such a performance difference is not unexpected. When submitting IO from > userspace to ext3/ldiskfs it is being done in 4kB blocks, and each block is > allocated separately (regardless of mballoc, unfortunately). When Lustre is > doing IO from the kernel, the client is aggregating the IO into 1MB chunks > and the entire 1MB write is allocated in one operation. > > That is why we developed the "delalloc" code for ext4 - so that userspace > could also get better IO performance, and utilize the multi-block allocation > (mballoc) routines that have been in ldiskfs for ages, but only accessible > from the kernel. > > For Lustre performance testing, I would suggest looking at lustre-iokit, and > in particular "sgpdd" to test the underlying block device, and then > obdfilter-survey to test the local Lustre IO submission path. > > Cheers, Andreas > -- > Andreas Dilger > Lustre Technical Lead > Oracle Corporation Canada Inc. > > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
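For the record, a sketch of an obdfilter-survey sweep like the one referred to above, run on the OSS and using the lustre-iokit variable names; the target name is only an example and must match the local obdfilter instance:

  # sweep 1..32 threads against a single OST, 1 GB of data per object, local disk case
  size=1024 nobjlo=1 nobjhi=8 thrlo=1 thrhi=32 case=disk \
      targets="luram-OST0000" ./obdfilter-survey

As with sgpdd-survey, double-check the variable names against the script version shipped with your lustre-iokit.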
[Lustre-discuss] ldiskfs performance vs. XFS performance
Hi list, we have Lustre 1.8.3 running on a DDN 9900. One LUN (10 discs) formatted with XFS shows 400 MB/s when pushed with a single 'dd' and large block sizes. One LUN formatted and mounted with ldiskfs (the ext3-based one that is the default in 1.8.3) shows 110 MB/s. Is this the expected behaviour? It looks a bit low compared to XFS. We think that with help from DDN we did everything we could from a hardware perspective. We formatted the LUN with the correct striping and stripe size, DDN adjusted some controller parameters and we even put the file system journal on a RAM disk. The LUN has 16 TB capacity. I formatted only 7 TB for the moment due to the 8 TB limit. This is what I did: mds_nid...@somehwere RAM_DEV=/dev/ram1 dd if=/dev/zero of=$RAM_DEV bs=1M count=1000 mke2fs -O journal_dev -b 4096 $RAM_DEV mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096 -j -J device=$RAM_DEV" /dev/disk/by-path/... mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1 Is there a way to push the bandwidth limit for a single data stream any further? Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
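As noted in the replies, the plain ldiskfs mount above lacks the options Lustre itself would pass. A sketch of the usual way to check and mount with them; the option pair is the common one for an ext3-based ldiskfs, and tunefs.lustre shows the authoritative list for the target:

  # show the persistent mount options recorded for this OST
  tunefs.lustre --print /dev/disk/by-path/...
  # mount with the extent and mballoc allocators enabled, as the obdfilter would
  mount -t ldiskfs -o extents,mballoc /dev/disk/by-path/... /mnt/ost_1

Even with these options, userspace 'dd' through ldiskfs will not match the in-kernel obdfilter path, which is why the iokit tools are the better comparison.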
Re: [Lustre-discuss] 1.8/2.6.32 support
Hi, is there any chance to get a 1.8.4 compiled on a 2.6.32+ kernel right now with the standard Lustre sources that are available through the download pages? The "build your own kernel" wiki page points to a collection of supported kernels http://downloads.lustre.org/public/kernels/sles11/ which has a 2.6.32 in it but I could not find a working set of patches for this. Has anyone been more successful? Michael Am Montag, den 26.04.2010, 12:11 -0600 schrieb Andreas Dilger: > On 2010-03-31, at 10:16, Stephen Willey wrote: > > Obviously there is no RH-6.0 just yet (at least not beta or release) and as > > such 2.6.32 is not on the supported kernels list - obviously fair enough. > > > > There are bugzilla entries with patches for 2.6.32 but these all apply to > > HEAD as opposed to the b1_8 branch. Particularly all the stuff that > > applied against libcfs/blah/blah.m4 > > > > I'm trying to build an up-to-date patchless 1.8 client for Fedora 12 > > (2.6.32) and given a few hours to mash patches from HEAD into b1_8, it's > > doable, albeit hacky (I'm not a programmer) whereas I can compile HEAD > > almost without modification. > > > > Is it the intention to backport these various changes into b1_8 or is that > > more or less as-is now until the release of 2.0? We're in a bit of an > > awkward place since we can't compile 1.6.7.2 on 2.6.32 and 2.0 is still not > > in a production state. > > There is work going on in bugzilla for b1_8 SLES11 SP1(?) kernel support, > which will hopefully also be usable for RHEL6, when it is available. > > Cheers, Andreas > -- > Andreas Dilger > Lustre Technical Lead > Oracle Corporation Canada Inc. > > ___________ > Lustre-discuss mailing list > Lustre-discuss@lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss > > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] ls does not work on ram disk for normal user
Ahh. This user has different UIDs on the clients and the server. Do they actually have to be the same? I thought the MDS and the OSS servers just store files with the uid/gid as reported by the client. I did not assume that the servers need to map these UIDs to a user name. Michael Am 22.09.2010 um 10:57 schrieb Thomas Roth: > Hi Michael, > > "Identifier removed" occured to me when the user data base was not accessible > by > the MDS - when the MDS didn't know about any normal user. "root" is of course > known > there, but what does e.g. "id mkluge" say on your MDS? > > Regards, > Thomas > > On 09/22/2010 10:29 AM, Michael Kluge wrote: >> Hi all, >> >> I have a 1.8.3 running on a couple of servers connected via IB to a >> small cluster. To test the network performance I have one MDS and 14 OST >> residing in ram disks. One the client it is mounted on /lustre. >> >> I have a file in this directory (created as root and then chown'ed to >> 'mkluge'): >> >> mkl...@r2i0n0:~> ls -la /lustre/dfddd/ball >> -rw-r--r-- 1 mkluge zih 14680064000 2010-09-22 10:14 /lustre/dfddd/ball >> mkl...@r2i0n0:~> cd /lustre/dfddd/ >> mkl...@r2i0n0:/lustre/dfddd> ls >> /bin/ls: .: Identifier removed >> mkl...@r2i0n0:/lustre/dfddd> ls -la >> /bin/ls: .: Identifier removed >> >> Has anyone an idea what this could be? I can't event create a directory >> in /lustre >> >> mkl...@r2i0n0:~> mkdir /lustre/ww >> mkdir: cannot create directory `/lustre/ww': Identifier removed >> >> 'root' is able to create the directory. Setting permissions to '777' or >> '1777' does not help either. >> >> The MDS was formated to use mgt and mgs from the same ram device. >> >> >> Regards, Michael >> >> >> >> >> >> ___ >> Lustre-discuss mailing list >> Lustre-discuss@lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > > -- > > Thomas Roth > Department: Informationstechnologie > Location: SB3 1.262 > Phone: +49-6159-71 1453 Fax: +49-6159-71 2986 > > GSI Helmholtzzentrum für Schwerionenforschung GmbH > Planckstraße 1 > 64291 Darmstadt > www.gsi.de > > Gesellschaft mit beschränkter Haftung > Sitz der Gesellschaft: Darmstadt > Handelsregister: Amtsgericht Darmstadt, HRB 1528 > > Geschäftsführung: Professor Dr. Dr. h.c. Horst Stöcker, > Dr. Hartmut Eickhoff > > Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph > Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
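The quickest check is simply to compare what the client and the MDS resolve for the user, since the numeric uid/gid the client sends has to be known on the MDS as well. A tiny sketch, where the MDS hostname is a placeholder:

  # on the client
  id mkluge
  # on the MDS: if the uid is unknown or maps differently here,
  # operations come back with EIDRM ("Identifier removed")
  ssh mds01 id mkluge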
[Lustre-discuss] ls does not work on ram disk for normal user
Hi all, I have Lustre 1.8.3 running on a couple of servers connected via IB to a small cluster. To test the network performance I have one MDS and 14 OSTs residing in ram disks. On the client it is mounted at /lustre. I have a file in this directory (created as root and then chown'ed to 'mkluge'): mkl...@r2i0n0:~> ls -la /lustre/dfddd/ball -rw-r--r-- 1 mkluge zih 14680064000 2010-09-22 10:14 /lustre/dfddd/ball mkl...@r2i0n0:~> cd /lustre/dfddd/ mkl...@r2i0n0:/lustre/dfddd> ls /bin/ls: .: Identifier removed mkl...@r2i0n0:/lustre/dfddd> ls -la /bin/ls: .: Identifier removed Does anyone have an idea what this could be? I can't even create a directory in /lustre: mkl...@r2i0n0:~> mkdir /lustre/ww mkdir: cannot create directory `/lustre/ww': Identifier removed 'root' is able to create the directory. Setting permissions to '777' or '1777' does not help either. The MDS was formatted to use the MGT and MGS from the same ram device. Regards, Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lnet router tuning
Hi Eric, --concurrency 2 already boosted the performance to 1026 MB/s. I don't think we'll get any more out of this :) Thanks a lot, Michael Am 13.09.2010 um 07:55 schrieb Eric Barton: > Michael, > > I think you may have only got 1 BRW READ in flight at a time with this script, > so I would expect the routed throughput to be getting on for half of direct > throughput. Can you try “--concurrency 8” to simulate the number of I/Os > a real client would keep in flight? > > Cheers, > Eric > > From: Michael Kluge [mailto:michael.kl...@tu-dresden.de] > Sent: 13 September 2010 10:35 PM > To: Eric Barton > Cc: 'Lustre Diskussionsliste' > Subject: Re: [Lustre-discuss] lnet router tuning > > Hi Eric, > > basically right now I have one IB node, one 10GE node and one router node > that has both types of network interfaces. > > I've got a small lnet test script on the router node, that does the work: > export LST_SESSION=$$ > lst new_session rw > lst add_group readers 192.168.1...@tcp > lst add_group writers 10.148.0...@o2ib > lst add_batch bulk_rw > lst add_test --batch bulk_rw --from writers --to readers brw read > check=simple size=1M > lst run bulk_rw > lst stat writers & sleep 30; kill $! > lst end_session > > Is there a way to figure out the messages in flight? I remember to have a > "rpc's in flight" tunable but this is connected to the OSC layer which does > not do anything in my case (I think). > > > Michael > > > > Am 13.09.2010 um 03:08 schrieb Eric Barton: > > > > Michael, > > > How are you generating load and measuring the throughput? I’m particularly > interested in the number > of nodes on each side of the router and how many messages you have in flight > between each one. > > > Cheers, > Eric > > > > > From: lustre-discuss-boun...@lists.lustre.org > [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Michael Kluge > Sent: 11 September 2010 12:56 AM > To: Michael Kluge > Cc: Lustre Diskussionsliste > Subject: Re: [Lustre-discuss] lnet router tuning > > And here are my params: > > r...@doss05:/home/tests/lnet# for F in /sys/module/lnet/parameters/* ; do > echo -n "$F: "; cat $F ; done > /sys/module/lnet/parameters/accept: secure > /sys/module/lnet/parameters/accept_backlog: 127 > /sys/module/lnet/parameters/accept_port: 988 > /sys/module/lnet/parameters/accept_timeout: 5 > /sys/module/lnet/parameters/auto_down: 1 > /sys/module/lnet/parameters/avoid_asym_router_failure: 0 > /sys/module/lnet/parameters/check_routers_before_use: 0 > /sys/module/lnet/parameters/config_on_load: 0 > /sys/module/lnet/parameters/dead_router_check_interval: 0 > /sys/module/lnet/parameters/forwarding: enabled > /sys/module/lnet/parameters/ip2nets: > /sys/module/lnet/parameters/large_router_buffers: 512 > /sys/module/lnet/parameters/live_router_check_interval: 0 > /sys/module/lnet/parameters/local_nid_dist_zero: 1 > /sys/module/lnet/parameters/networks: tcp0(eth2),o2ib(ib1) > /sys/module/lnet/parameters/peer_buffer_credits: 0 > /sys/module/lnet/parameters/portals_compatibility: none > /sys/module/lnet/parameters/router_ping_timeout: 50 > /sys/module/lnet/parameters/routes: > /sys/module/lnet/parameters/small_router_buffers: 8192 > /sys/module/lnet/parameters/tiny_router_buffers: 1024 > > I have not used ip2nets but configure routing but put explict routing > statements into the modprobe.d/ files. Is that OK? > > > Michael > > > Am 10.09.2010 um 17:48 schrieb Michael Kluge: > > > > OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s, with > additional lnet router I see 550 MB/s. 
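For anyone repeating the measurement: the only change against the script quoted below is the concurrency flag on the brw test, which controls how many bulk RPCs are kept in flight per peer (the default is 1). A sketch with Eric's suggested value:

  # same test definition as before, but with 8 bulk RPCs in flight instead of 1
  lst add_test --batch bulk_rw --concurrency 8 --from writers --to readers brw read check=simple size=1M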
Time for lnet tuning? > > Michael > > > > Hi Andreas, > > Am 10.09.2010 um 16:35 schrieb Andreas Dilger: > > > > On 2010-09-10, at 08:23, Michael Kluge wrote: > > > I have a Lustre 1.8.3 setup where I'd like to some lnet router performance > tests with routing between DDR IB<->10GE networks. Currently I have three > nodes, one with DDR IB, one with 10GE and one with both that does the > routing. A first short lnet test shows 520-550 MB/s performance. > > Has anyone an idea which of the variables of the lnet module are worth > playing with to get this number a bit closer to 1GB/s? > > I would start by testing the performance on just the 10GigE side, and then > separately on the IB side, to verify you are getting the expected performance > from the components before trying them both together. Often it is necessary > to tune the ethernet s
Re: [Lustre-discuss] lnet router tuning
Nic, thanks a lot. That made my day. Michael Am 13.09.2010 um 06:49 schrieb Nic Henke: > On 09/13/2010 08:35 AM, Michael Kluge wrote: >> Hi Eric, >> >> basically right now I have one IB node, one 10GE node and one router >> node that has both types of network interfaces. >> >> I've got a small lnet test script on the router node, that does the work: >> export LST_SESSION=$$ >> lst new_session rw >> lst add_group readers 192.168.1...@tcp >> lst add_group writers 10.148.0...@o2ib >> lst add_batch bulk_rw >> lst add_test --batch bulk_rw --from writers --to readers brw read >> check=simple size=1M >> lst run bulk_rw >> lst stat writers & sleep 30; kill $! >> lst end_session >> >> Is there a way to figure out the messages in flight? I remember to have >> a "rpc's in flight" tunable but this is connected to the OSC layer which >> does not do anything in my case (I think). > > If you don't specify --concurrency to the 'lst add_test', you get 1 RPC > in flight. > > Nic > _______ > Lustre-discuss mailing list > Lustre-discuss@lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lnet router tuning
Hi Eric, basically right now I have one IB node, one 10GE node and one router node that has both types of network interfaces. I've got a small lnet test script on the router node, that does the work: export LST_SESSION=$$ lst new_session rw lst add_group readers 192.168.1...@tcp lst add_group writers 10.148.0...@o2ib lst add_batch bulk_rw lst add_test --batch bulk_rw --from writers --to readers brw read check=simple size=1M lst run bulk_rw lst stat writers & sleep 30; kill $! lst end_session Is there a way to figure out the messages in flight? I remember to have a "rpc's in flight" tunable but this is connected to the OSC layer which does not do anything in my case (I think). Michael Am 13.09.2010 um 03:08 schrieb Eric Barton: > > Michael, > > > How are you generating load and measuring the throughput? I’m particularly > interested in the number > of nodes on each side of the router and how many messages you have in flight > between each one. > > > Cheers, >Eric > > > > > From: lustre-discuss-boun...@lists.lustre.org > [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Michael Kluge > Sent: 11 September 2010 12:56 AM > To: Michael Kluge > Cc: Lustre Diskussionsliste > Subject: Re: [Lustre-discuss] lnet router tuning > > And here are my params: > > r...@doss05:/home/tests/lnet# for F in /sys/module/lnet/parameters/* ; do > echo -n "$F: "; cat $F ; done > /sys/module/lnet/parameters/accept: secure > /sys/module/lnet/parameters/accept_backlog: 127 > /sys/module/lnet/parameters/accept_port: 988 > /sys/module/lnet/parameters/accept_timeout: 5 > /sys/module/lnet/parameters/auto_down: 1 > /sys/module/lnet/parameters/avoid_asym_router_failure: 0 > /sys/module/lnet/parameters/check_routers_before_use: 0 > /sys/module/lnet/parameters/config_on_load: 0 > /sys/module/lnet/parameters/dead_router_check_interval: 0 > /sys/module/lnet/parameters/forwarding: enabled > /sys/module/lnet/parameters/ip2nets: > /sys/module/lnet/parameters/large_router_buffers: 512 > /sys/module/lnet/parameters/live_router_check_interval: 0 > /sys/module/lnet/parameters/local_nid_dist_zero: 1 > /sys/module/lnet/parameters/networks: tcp0(eth2),o2ib(ib1) > /sys/module/lnet/parameters/peer_buffer_credits: 0 > /sys/module/lnet/parameters/portals_compatibility: none > /sys/module/lnet/parameters/router_ping_timeout: 50 > /sys/module/lnet/parameters/routes: > /sys/module/lnet/parameters/small_router_buffers: 8192 > /sys/module/lnet/parameters/tiny_router_buffers: 1024 > > I have not used ip2nets but configure routing but put explict routing > statements into the modprobe.d/ files. Is that OK? > > > Michael > > > Am 10.09.2010 um 17:48 schrieb Michael Kluge: > > > OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s, with > additional lnet router I see 550 MB/s. Time for lnet tuning? > > Michael > > > Hi Andreas, > > Am 10.09.2010 um 16:35 schrieb Andreas Dilger: > > > On 2010-09-10, at 08:23, Michael Kluge wrote: > > I have a Lustre 1.8.3 setup where I'd like to some lnet router performance > tests with routing between DDR IB<->10GE networks. Currently I have three > nodes, one with DDR IB, one with 10GE and one with both that does the > routing. A first short lnet test shows 520-550 MB/s performance. > > Has anyone an idea which of the variables of the lnet module are worth > playing with to get this number a bit closer to 1GB/s? 
> > I would start by testing the performance on just the 10GigE side, and then > separately on the IB side, to verify you are getting the expected performance > from the components before trying them both together. Often it is necessary > to tune the ethernet send/receive buffers. > > Ethernet back to back is at 950 MB/s. I have not looked at IB back to back > yet. > > > Michael > > -- > > Michael Kluge, M.Sc. > > Technische Universität Dresden > Center for Information Services and > High Performance Computing (ZIH) > D-01062 Dresden > Germany > > Contact: > Willersbau, Room WIL A 208 > Phone: (+49) 351 463-34217 > Fax:(+49) 351 463-37773 > e-mail: michael.kl...@tu-dresden.de > WWW:http://www.tu-dresden.de/zih > > ___ > Lustre-discuss mailing list > Lustre-discuss@lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss > > > -- > > Michael Kluge, M.Sc. > > Technische Universität Dresden > Center for Information Services and > High Performance Computing (ZIH) > D-01062 Dresden > Germany > > Conta
Re: [Lustre-discuss] lnet router tuning
Has anyone else a 10GE<->IB Lustre router? What are the typical performance numbers? How close do you get to 1GB/s? Michael Am 10.09.2010 17:55, schrieb Michael Kluge: > And here are my params: > > r...@doss05:/home/tests/lnet# for F in /sys/module/lnet/parameters/* ; > do echo -n "$F: "; cat $F ; done > /sys/module/lnet/parameters/accept: secure > /sys/module/lnet/parameters/accept_backlog: 127 > /sys/module/lnet/parameters/accept_port: 988 > /sys/module/lnet/parameters/accept_timeout: 5 > /sys/module/lnet/parameters/auto_down: 1 > /sys/module/lnet/parameters/avoid_asym_router_failure: 0 > /sys/module/lnet/parameters/check_routers_before_use: 0 > /sys/module/lnet/parameters/config_on_load: 0 > /sys/module/lnet/parameters/dead_router_check_interval: 0 > /sys/module/lnet/parameters/forwarding: enabled > /sys/module/lnet/parameters/ip2nets: > /sys/module/lnet/parameters/large_router_buffers: 512 > /sys/module/lnet/parameters/live_router_check_interval: 0 > /sys/module/lnet/parameters/local_nid_dist_zero: 1 > /sys/module/lnet/parameters/networks: tcp0(eth2),o2ib(ib1) > /sys/module/lnet/parameters/peer_buffer_credits: 0 > /sys/module/lnet/parameters/portals_compatibility: none > /sys/module/lnet/parameters/router_ping_timeout: 50 > /sys/module/lnet/parameters/routes: > /sys/module/lnet/parameters/small_router_buffers: 8192 > /sys/module/lnet/parameters/tiny_router_buffers: 1024 > > I have not used ip2nets but configure routing but put explict routing > statements into the modprobe.d/ files. Is that OK? > > > Michael > > > Am 10.09.2010 um 17:48 schrieb Michael Kluge: > >> OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s, >> with additional lnet router I see 550 MB/s. Time for lnet tuning? >> >> Michael >> >>> Hi Andreas, >>> >>> Am 10.09.2010 um 16:35 schrieb Andreas Dilger: >>> >>>> On 2010-09-10, at 08:23, Michael Kluge wrote: >>>>> I have a Lustre 1.8.3 setup where I'd like to some lnet router >>>>> performance tests with routing between DDR IB<->10GE networks. >>>>> Currently I have three nodes, one with DDR IB, one with 10GE and >>>>> one with both that does the routing. A first short lnet test shows >>>>> 520-550 MB/s performance. >>>>> >>>>> Has anyone an idea which of the variables of the lnet module are >>>>> worth playing with to get this number a bit closer to 1GB/s? >>>> >>>> I would start by testing the performance on just the 10GigE side, >>>> and then separately on the IB side, to verify you are getting the >>>> expected performance from the components before trying them both >>>> together. Often it is necessary to tune the ethernet send/receive >>>> buffers. >>> >>> Ethernet back to back is at 950 MB/s. I have not looked at IB back to >>> back yet. >>> >>> >>> Michael >>> >>> -- >>> >>> Michael Kluge, M.Sc. >>> >>> Technische Universität Dresden >>> Center for Information Services and >>> High Performance Computing (ZIH) >>> D-01062 Dresden >>> Germany >>> >>> Contact: >>> Willersbau, Room WIL A 208 >>> Phone: (+49) 351 463-34217 >>> Fax: (+49) 351 463-37773 >>> e-mail: michael.kl...@tu-dresden.de <mailto:michael.kl...@tu-dresden.de> >>> WWW: http://www.tu-dresden.de/zih >>> >>> ___ >>> Lustre-discuss mailing list >>> Lustre-discuss@lists.lustre.org <mailto:Lustre-discuss@lists.lustre.org> >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> >> -- >> >> Michael Kluge, M.Sc. 
>> >> Technische Universität Dresden >> Center for Information Services and >> High Performance Computing (ZIH) >> D-01062 Dresden >> Germany >> >> Contact: >> Willersbau, Room WIL A 208 >> Phone: (+49) 351 463-34217 >> Fax: (+49) 351 463-37773 >> e-mail: michael.kl...@tu-dresden.de <mailto:michael.kl...@tu-dresden.de> >> WWW: http://www.tu-dresden.de/zih >> >> ___ >> Lustre-discuss mailing list >> Lustre-discuss@lists.lustre.org <mailto:Lustre-discuss@lists.lustre.org> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > > > -- > > Michael Kluge, M.Sc. > > Technische Universität Dresden > Center for Information Services and > High Performance Computing (ZIH) > D-01062 Dresden > Germany > > Contact: > Willersbau, Room WIL A 208 > Phone: (+49) 351 463-34217 > Fax: (+49) 351 463-37773 > e-mail: michael.kl...@tu-dresden.de <mailto:michael.kl...@tu-dresden.de> > WWW: http://www.tu-dresden.de/zih > > > > ___ > Lustre-discuss mailing list > Lustre-discuss@lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lnet router tuning
And here are my params: r...@doss05:/home/tests/lnet# for F in /sys/module/lnet/parameters/* ; do echo -n "$F: "; cat $F ; done /sys/module/lnet/parameters/accept: secure /sys/module/lnet/parameters/accept_backlog: 127 /sys/module/lnet/parameters/accept_port: 988 /sys/module/lnet/parameters/accept_timeout: 5 /sys/module/lnet/parameters/auto_down: 1 /sys/module/lnet/parameters/avoid_asym_router_failure: 0 /sys/module/lnet/parameters/check_routers_before_use: 0 /sys/module/lnet/parameters/config_on_load: 0 /sys/module/lnet/parameters/dead_router_check_interval: 0 /sys/module/lnet/parameters/forwarding: enabled /sys/module/lnet/parameters/ip2nets: /sys/module/lnet/parameters/large_router_buffers: 512 /sys/module/lnet/parameters/live_router_check_interval: 0 /sys/module/lnet/parameters/local_nid_dist_zero: 1 /sys/module/lnet/parameters/networks: tcp0(eth2),o2ib(ib1) /sys/module/lnet/parameters/peer_buffer_credits: 0 /sys/module/lnet/parameters/portals_compatibility: none /sys/module/lnet/parameters/router_ping_timeout: 50 /sys/module/lnet/parameters/routes: /sys/module/lnet/parameters/small_router_buffers: 8192 /sys/module/lnet/parameters/tiny_router_buffers: 1024 I have not used ip2nets but configure routing but put explict routing statements into the modprobe.d/ files. Is that OK? Michael Am 10.09.2010 um 17:48 schrieb Michael Kluge: > OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s, with > additional lnet router I see 550 MB/s. Time for lnet tuning? > > Michael > >> Hi Andreas, >> >> Am 10.09.2010 um 16:35 schrieb Andreas Dilger: >> >>> On 2010-09-10, at 08:23, Michael Kluge wrote: >>>> I have a Lustre 1.8.3 setup where I'd like to some lnet router performance >>>> tests with routing between DDR IB<->10GE networks. Currently I have three >>>> nodes, one with DDR IB, one with 10GE and one with both that does the >>>> routing. A first short lnet test shows 520-550 MB/s performance. >>>> >>>> Has anyone an idea which of the variables of the lnet module are worth >>>> playing with to get this number a bit closer to 1GB/s? >>> >>> I would start by testing the performance on just the 10GigE side, and then >>> separately on the IB side, to verify you are getting the expected >>> performance from the components before trying them both together. Often it >>> is necessary to tune the ethernet send/receive buffers. >> >> Ethernet back to back is at 950 MB/s. I have not looked at IB back to back >> yet. >> >> >> Michael >> >> -- >> >> Michael Kluge, M.Sc. >> >> Technische Universität Dresden >> Center for Information Services and >> High Performance Computing (ZIH) >> D-01062 Dresden >> Germany >> >> Contact: >> Willersbau, Room WIL A 208 >> Phone: (+49) 351 463-34217 >> Fax:(+49) 351 463-37773 >> e-mail: michael.kl...@tu-dresden.de >> WWW:http://www.tu-dresden.de/zih >> >> ___ >> Lustre-discuss mailing list >> Lustre-discuss@lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > > > -- > > Michael Kluge, M.Sc. > > Technische Universität Dresden > Center for Information Services and > High Performance Computing (ZIH) > D-01062 Dresden > Germany > > Contact: > Willersbau, Room WIL A 208 > Phone: (+49) 351 463-34217 > Fax:(+49) 351 463-37773 > e-mail: michael.kl...@tu-dresden.de > WWW:http://www.tu-dresden.de/zih > > ___ > Lustre-discuss mailing list > Lustre-discuss@lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss -- Michael Kluge, M.Sc. 
Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lnet router tuning
OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s, with additional lnet router I see 550 MB/s. Time for lnet tuning? Michael > Hi Andreas, > > Am 10.09.2010 um 16:35 schrieb Andreas Dilger: > >> On 2010-09-10, at 08:23, Michael Kluge wrote: >>> I have a Lustre 1.8.3 setup where I'd like to some lnet router performance >>> tests with routing between DDR IB<->10GE networks. Currently I have three >>> nodes, one with DDR IB, one with 10GE and one with both that does the >>> routing. A first short lnet test shows 520-550 MB/s performance. >>> >>> Has anyone an idea which of the variables of the lnet module are worth >>> playing with to get this number a bit closer to 1GB/s? >> >> I would start by testing the performance on just the 10GigE side, and then >> separately on the IB side, to verify you are getting the expected >> performance from the components before trying them both together. Often it >> is necessary to tune the ethernet send/receive buffers. > > Ethernet back to back is at 950 MB/s. I have not looked at IB back to back > yet. > > > Michael > > -- > > Michael Kluge, M.Sc. > > Technische Universität Dresden > Center for Information Services and > High Performance Computing (ZIH) > D-01062 Dresden > Germany > > Contact: > Willersbau, Room WIL A 208 > Phone: (+49) 351 463-34217 > Fax:(+49) 351 463-37773 > e-mail: michael.kl...@tu-dresden.de > WWW:http://www.tu-dresden.de/zih > > ___ > Lustre-discuss mailing list > Lustre-discuss@lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lnet router tuning
Hi Andreas, Am 10.09.2010 um 16:35 schrieb Andreas Dilger: > On 2010-09-10, at 08:23, Michael Kluge wrote: >> I have a Lustre 1.8.3 setup where I'd like to some lnet router performance >> tests with routing between DDR IB<->10GE networks. Currently I have three >> nodes, one with DDR IB, one with 10GE and one with both that does the >> routing. A first short lnet test shows 520-550 MB/s performance. >> >> Has anyone an idea which of the variables of the lnet module are worth >> playing with to get this number a bit closer to 1GB/s? > > I would start by testing the performance on just the 10GigE side, and then > separately on the IB side, to verify you are getting the expected performance > from the components before trying them both together. Often it is necessary > to tune the ethernet send/receive buffers. Ethernet back to back is at 950 MB/s. I have not looked at IB back to back yet. Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] lnet router tuning
Hi all, I have a Lustre 1.8.3 setup where I'd like to some lnet router performance tests with routing between DDR IB<->10GE networks. Currently I have three nodes, one with DDR IB, one with 10GE and one with both that does the routing. A first short lnet test shows 520-550 MB/s performance. Has anyone an idea which of the variables of the lnet module are worth playing with to get this number a bit closer to 1GB/s? parm: tiny_router_buffers:# of 0 payload messages to buffer in the router (int) parm: small_router_buffers:# of small (1 page) messages to buffer in the router (int) parm: large_router_buffers:# of large messages to buffer in the router (int) parm: peer_buffer_credits:# router buffer credits per peer (int) The CPU on the router node is less utilized than it was when I did back to back 10GE tests. I have 6 cores in the machine, 5 have been idle and one showing a load of about 60%. Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
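If the router buffers turn out to be the bottleneck, they are set as lnet module options on the router node. A sketch based on the parameters listed above; the values are illustrative only and the interface names match the router configuration shown elsewhere in this thread:

  # /etc/modprobe.d/lustre.conf on the router node
  options lnet networks="tcp0(eth2),o2ib(ib1)" forwarding="enabled" tiny_router_buffers=2048 small_router_buffers=16384 large_router_buffers=1024

The lnet module has to be reloaded for new buffer sizes to take effect, so plan this for a maintenance window on a production router.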
[Lustre-discuss] O_DIRECT
Hi all, how does Lustre handle write() requests to files opened with O_DIRECT? Does the OSS enforce that the OST has physically written the data to disk before the operation completes, or does the write() call return on the client before this? I do not see the whole file content going through the FC port of the RAID controller, but it could also be that my measurement is wrong ... Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
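One way to see the difference from the client side is to compare a buffered and an O_DIRECT write of the same size. A minimal sketch (the file name is a placeholder; whether the OST has committed the data to its disks when it acknowledges the RPC still depends on the server-side cache and journal settings, which is exactly the question above):

  # buffered write: returns once the data sits in the client page cache
  dd if=/dev/zero of=/lustre/odirect_test bs=1M count=1024
  # O_DIRECT write: bypasses the client cache, so each 1M write() waits for its RPC
  dd if=/dev/zero of=/lustre/odirect_test bs=1M count=1024 oflag=direct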
Re: [Lustre-discuss] Complete lnet routing example
Hi Josh, thanks a lot! Michael Am 24.06.2010 um 15:40 schrieb Joshua Walgenbach: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Hi Michael, > > This is what I'm using on my test systems: > > I have the servers set up on 192.168.1.0/24 and clients set up on > 192.168.2.0/24, with no network routing between them and a lustre router > bridging the two networks with ip addresses of 192.168.1.31 and > 192.168.2.31. I've a attached a quick diagram. > > modprobe.conf for MDS and OSS servers: > > options lnet networks="tcp0(eth2)" routes="tcp1 192.168.1...@tcp0" > > modprobe.conf for router: > > options lnet networks="tcp0(eth2), tcp1(eth3)" forwarding="enabled" > > modprobe.conf for clients: > > options lnet networks="tcp1(eth2)" routes="tcp0 192.168.2...@tcp1" > > What I have is pretty minimal, but it gets the job done. > > - -Josh > > On 06/24/2010 06:15 AM, Michael Kluge wrote: >> Hi there, >> >> does anyone have a complete lnet routing example that he/she wants to >> share that contains a network diagram and all modprobe.conf options for >> clients, servers and the routers? I found only one mail in the mailing >> list and the interesting parts have gone through a filter and now a lot >> of the configuration options are '[EMAIL PROTECTED]'. >> >> >> Thanks a lot in advance, >> Michael >> >> -- >> >> Michael Kluge, M.Sc. >> >> Technische Universität Dresden >> Center for Information Services and >> High Performance Computing (ZIH) >> D-01062 Dresden >> Germany >> >> Contact: >> Willersbau, Room WIL A 208 >> Phone: (+49) 351 463-34217 >> Fax:(+49) 351 463-37773 >> e-mail: michael.kl...@tu-dresden.de <mailto:michael.kl...@tu-dresden.de> >> WWW:http://www.tu-dresden.de/zih >> >> >> >> ___ >> Lustre-discuss mailing list >> Lustre-discuss@lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > -BEGIN PGP SIGNATURE- > Version: GnuPG v1.4.10 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAkwjYEIACgkQcqyJPuRTYp9tTACeIGttWBu44dc4SKB/0IIjHhF9 > i3QAn17sBD38/3MdsYuiGcUOruZVS8j/ > =SLQp > -END PGP SIGNATURE- > ___ > Lustre-discuss mailing list > Lustre-discuss@lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Complete lnet routing example
Hi there, does anyone have a complete lnet routing example that he/she is willing to share, containing a network diagram and all modprobe.conf options for clients, servers and routers? I found only one mail in the mailing list archive, but the interesting parts have gone through a filter, so a lot of the configuration options now read '[EMAIL PROTECTED]'. Thanks a lot in advance, Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] MDS overload, why?
LMT (http://code.google.com/p/lmt) might be able to give some hints if users are using the FS in a 'wild' fashion. For the question "what can cause this behaviour of my MDS" I guess the answer is like: a million things ;) There is no way of being more specific without more input about the problem itself. Michael Am Freitag, den 09.10.2009, 16:15 +0200 schrieb Arne Brutschy: > Hi, > > thanks for replying! > > I understand that without further information we can't do much about the > oopses. I was more hoping for some information regarding possible > sources of such an overload. Is it normal that a MDS gets overloaded > like this, while the OSTs have nothing to do, and what can I do about > it? How can I find the source of the problem? > > More specifically, what are the operations that lead to a lot of MDS > load and none for the OSTs? Although our MDS (8GB ram, 2x4core, SATA) is > not a top-notch server, it's fairly recent and I feel the load we're > experiencing is not handable by a single MDS. > > My problem is that I can't make out major problems in the user's jobs > running on the cluster, and I can't quantify nor track down the problem > because I don't know what behavior might have caused it. > > As I said, ooppses appeared only twice, and all other problems where > just apparent by a non-responsive MDS. > > Thanks, > Arne > > > On Fr, 2009-10-09 at 07:44 -0400, Brian J. Murrell wrote: > > On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote: > > > > > > The clients showed the following error: > > > > Oct 8 09:58:55 majorana kernel: LustreError: > > > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 > > > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens > > > > 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > > > Oct 8 09:58:55 majorana kernel: LustreError: > > > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar > > > > messages > > > > > > So, my question is: what could cause such a load? The cluster was not > > > exessively used... Is this a bug or a user's job that creates the load? > > > How can I protect lustre against this kind of failure? > > > > Without any more information we could not possibly know. If you really > > are getting oopses then you will need console logs (i.e. serial console) > > so that we can see the stack trace. > > > > b. > > > > ___ > > Lustre-discuss mailing list > > Lustre-discuss@lists.lustre.org > > http://lists.lustre.org/mailman/listinfo/lustre-discuss -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
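Besides LMT, the MDS itself exports per-operation counters that can show what kind of metadata load it is handling; sampling them before and during an overload helps narrow down whether it is, for example, an open/close or unlink storm. The parameter names below are what 1.6-era servers are assumed to expose, so verify them locally:

# snapshot the MDS request counters twice and compare
lctl get_param mds.*.stats > /tmp/mds-stats.before
sleep 60
lctl get_param mds.*.stats > /tmp/mds-stats.after
diff /tmp/mds-stats.before /tmp/mds-stats.after

# per-export statistics, useful to spot a single misbehaving client
lctl get_param mds.*.exports.*.stats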
Re: [Lustre-discuss] MDS overload, why?
Hmm. Should be enough. I guess you need to set up a loghost for syslog then and a reliable serial console to get stack traces. Everything else would be just a wild guess (as the question for the ram size was). Michael > Hi, > > 8GB of ram, 2x 4core Intel Xeon E5410 @ 2.33GHz > > Arne > > On Fr, 2009-10-09 at 12:16 +0200, Michael Kluge wrote: > > Hi Arne, > > > > could be memory pressure and the OOM running and shooting at things. How > > much memory does you server has? > > > > > > Michael > > > > Am Freitag, den 09.10.2009, 10:26 +0200 schrieb Arne Brutschy: > > > Hi everyone, > > > > > > 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1 > > > MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux > > > 2.6.9-78.0.22. > > > > > > We were quite happy with lustre's performance, especially because > > > bottlenecks caused by /home disk access were history. > > > > > > Saturday, the cluster went down (= was inaccessible). After some > > > investigation I found out that the reason seems to be an overloaded MDS. > > > Over the following 4 days, this happened multiple times and could only > > > be resolved by 1) killing all user jobs and 2) hard-resetting the MDS. > > > > > > The MDS did not respond to any command, if I managed to get a video > > > signal (not often), load was >170. Additionally, 2 times kernel oops got > > > displayed, but unfortunately I have to record of them. > > > > > > The clients showed the following error: > > > > Oct 8 09:58:55 majorana kernel: LustreError: > > > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 > > > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens > > > > 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > > > Oct 8 09:58:55 majorana kernel: LustreError: > > > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar > > > > messages > > > > > > So, my question is: what could cause such a load? The cluster was not > > > exessively used... Is this a bug or a user's job that creates the load? > > > How can I protect lustre against this kind of failure? > > > > > > Thanks in advance, > > > Arne > > > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
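Both pieces are plain Linux configuration rather than anything Lustre-specific. A minimal sketch, assuming a central host named loghost and the first serial port on the MDS:

# /etc/syslog.conf on the MDS: forward all messages to the loghost
*.*     @loghost

# kernel command line (e.g. in grub.conf) on the MDS: mirror console
# output to the serial port so oops traces survive a hang
console=tty0 console=ttyS0,115200n8

With that in place, a terminal server or a conserver/minicom session attached to ttyS0 captures the stack trace even when the box no longer responds over the network.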
Re: [Lustre-discuss] MDS overload, why?
Hi Arne, could be memory pressure and the OOM killer running and shooting at things. How much memory does your server have? Michael Am Freitag, den 09.10.2009, 10:26 +0200 schrieb Arne Brutschy: > Hi everyone, > > 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1 > MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux > 2.6.9-78.0.22. > > We were quite happy with lustre's performance, especially because > bottlenecks caused by /home disk access were history. > > Saturday, the cluster went down (= was inaccessible). After some > investigation I found out that the reason seems to be an overloaded MDS. > Over the following 4 days, this happened multiple times and could only > be resolved by 1) killing all user jobs and 2) hard-resetting the MDS. > > The MDS did not respond to any command, if I managed to get a video > signal (not often), load was >170. Additionally, 2 times kernel oops got > displayed, but unfortunately I have to record of them. > > The clients showed the following error: > > Oct 8 09:58:55 majorana kernel: LustreError: > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 304/456 > > e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0 > > Oct 8 09:58:55 majorana kernel: LustreError: > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar > > messages > > So, my question is: what could cause such a load? The cluster was not > exessively used... Is this a bug or a user's job that creates the load? > How can I protect lustre against this kind of failure? > > Thanks in advance, > Arne > -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
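If the OOM killer is involved it normally leaves traces in the kernel log, so a quick check after the next incident could be (standard log locations assumed):

# look for OOM killer activity and check current memory headroom on the MDS
dmesg | grep -i -E 'out of memory|oom'
grep -i oom /var/log/messages
free -m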
Re: [Lustre-discuss] Read/Write performance problem
Am Dienstag, den 06.10.2009, 09:33 -0600 schrieb Andreas Dilger: > > ... bla bla ... > > Is there a reason why an immediate read after a write on the same node > > from/to a shared file is slow? Is there any additional communication, > > e.g. is the client flushing the buffer cache before the first read? The > > statistics show that the average time to complete a 1.44MB read request > > is increasing during the runtime of our program. At some point it hits > > an upper limit or a saturation point and stays there. Is there some kind > > of queue or something that is getting full in this kind of > > write/read-scenario? May tuneable some stuff in /proc/fs/luste? > > One possible issue is that you don't have enough extra RAM to cache 1.5GB > of the checkpoint, so during the write it is being flushed to the OSTs > and evicted from cache. When you immediately restart there is still dirty > data being written from the clients that is contending with the reads to > restart. > Cheers, Andreas Well, I do call fsync() after the write is finished. During the write process I see a constant stream of 4 GB/s running from the lustre servers to the raid controllers which finishes when the write process terminates. When I start reading, there are no more writes going this way, so I suspect it might be something else ... Even if I wait between the writes and reads 5 minutes (all dirty pages should have been flushed by then) the picture does not change. Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
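One experiment that separates client-side cache effects from server-side ones is to drop the clients' page cache between the write phase and the read phase and see whether the slow re-read changes. The proc paths below are the usual 1.6-era locations and should be verified locally:

# on each client, after the checkpoint write and before the restart read
sync
echo 3 > /proc/sys/vm/drop_caches

# how much dirty data each OSC is allowed to cache on the client
cat /proc/fs/lustre/osc/*/max_dirty_mb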
[Lustre-discuss] Read/Write performance problem
Hi all, our Lustre FS shows an interesting performance problem which I'd like to discuss, as some of you might have seen this kind of thing before and maybe someone has a quick explanation of what's going on. We are running Lustre 1.6.5.1. The problem shows up when we read a shared file from multiple nodes that has just been written from the same set of nodes. 512 processes write a checkpoint (1.5 GB from each node) into a shared file by seeking to position RANK*1.5GB and writing 1.5GB in 1.44M chunks. Writing works fine and gives the full file system performance. The data is written using write() with no flags aside from O_CREAT and O_WRONLY. Once the checkpoint is written, the program is terminated, restarted, and reads back the same portion of the file. For some reason this almost immediate reading of the same data that was just written on the same node is very slow. If we a) change the set of nodes or b) wait a day, we get the full read performance with the same executable and the same shared file. Is there a reason why an immediate read after a write on the same node from/to a shared file is slow? Is there any additional communication, e.g. is the client flushing the buffer cache before the first read? The statistics show that the average time to complete a 1.44MB read request is increasing during the runtime of our program. At some point it hits an upper limit or a saturation point and stays there. Is there some kind of queue or something that is getting full in this kind of write/read scenario? Maybe there is something tunable in /proc/fs/lustre? Regards, Michael -- Michael Kluge, M.Sc. Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany Contact: Willersbau, Room A 208 Phone: (+49) 351 463-34217 Fax:(+49) 351 463-37773 e-mail: michael.kl...@tu-dresden.de WWW:http://www.tu-dresden.de/zih smime.p7s Description: S/MIME cryptographic signature ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
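Regarding the growing per-request read times mentioned above: 1.6-era clients keep per-OSC RPC histograms that make this visible without extra instrumentation. Clearing them before the read phase and dumping them afterwards shows how many pages each read RPC carried and how many RPCs were in flight. The paths, and the idea of clearing the counters by writing to them, are assumptions to check against the installed version; the checkpoint path is a placeholder:

# reset the client-side RPC statistics, run the read phase, then inspect them
for f in /proc/fs/lustre/osc/*/rpc_stats; do echo 0 > $f; done
cat /proc/fs/lustre/osc/*/rpc_stats

# stripe layout of the checkpoint file, to see how the 1.44 MB chunks
# line up with the stripe size
lfs getstripe /path/to/checkpoint.file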