[lustre-discuss] IOR input for pathologic file system abuse

2017-06-02 Thread Michael Kluge
Hi all,

I am looking for IOR scripts that represent pathological use cases for file
systems - something like shared-file access with a small, unaligned block size
or random I/O to a shared file. Does anyone have some input for me that
they are willing to share?
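
To make it concrete, I am thinking of runs roughly like these (the IOR flags
are from memory, so please check them against your IOR version; sizes and
paths are only placeholders):

# shared file, tiny unaligned transfers, tasks reordered so reads miss the cache
mpirun -np 64 ./IOR -a POSIX -w -r -e -C -b 7m -t 7k -o /lustre/scratch/ior_testfile

# shared file, random offsets
mpirun -np 64 ./IOR -a POSIX -w -r -e -z -b 64m -t 4k -o /lustre/scratch/ior_testfile

(-b/-t are block and transfer size, -C reorders tasks for the read phase,
-z uses random offsets, -e does an fsync after each write phase.)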


Regards, Michael

-- 
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Falkenbrunnen, Room 240
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih




smime.p7s
Description: S/MIME cryptographic signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] What happens if my stripe count is set to more than my number of stripes

2015-04-20 Thread Michael Kluge

Hi Oleg,

I tried it and it looks like it actually stores the requested stripe count of
128 (at least for dirs). lfs getstripe tells me that my dir is now set to
stripe over 128 OSTs (I have 48).


[/scratch/mkluge] lctl dl | grep osc | wc -l
48
[/scratch/mkluge] mkdir  p
[/scratch/mkluge] lfs setstripe -c 128 p
[/scratch/mkluge] lfs getstripe p
p
stripe_count:   128 stripe_size:1048576 stripe_offset:  -1
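
For a regular file (rather than a directory) I would expect the requested
count to be capped at the number of available OSTs, i.e. to behave like -c -1.
A quick way to check (untested sketch):

[/scratch/mkluge] touch p/somefile
[/scratch/mkluge] lfs getstripe p/somefile | grep stripe_count

which should then report 48 rather than 128.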


Regards, Michael

Am 20.04.2015 um 18:44 schrieb Drokin, Oleg:

Hello!

Current allocator behaviour is such that when you specify more
stripes than you have OSTs, it'll treat it the same as if you set
stripe count to -1 (that is - the maximum possible stripes).

Bye, Oleg
On Apr 20, 2015, at 4:47 AM,  wrote:


Hi,

I have a question regarding the Lustre file system. If I have a file of
size 64 GB and I set the stripe size to 1 GB, my number of stripes
becomes 64. But if I set my stripe count to 128, what does
Lustre do in that case?

Thanks and Regards, Prakrati
___ lustre-discuss
mailing list lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___ lustre-discuss
mailing list lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org






smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] New community release model and 2.5.3 (and 2.x.0) patch lists?

2015-04-16 Thread Michael Kluge

On Wed, Apr 15, 2015 at 11:44 AM, Scott Nolin 
wrote:


Since Intel will not be making community releases for 2.5.4 or 2.x.0
releases now, it seems the community will need to maintain some sort of
patch list against these releases.


I don't think this is how I understood it at LUG. What I took with me:
Intel will make 2.x.0 releases every 6 months, including fixes. New
releases may or may not have new features. But there will be a regular
release cycle.


Michael

--
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih



smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [Lustre-discuss] [HPDD-discuss] will obdfilter-survey destroy an already formatted file system

2013-05-22 Thread Michael Kluge
Hi Cory,

I have been running this for a few weeks now. Only a few users are using the
file system so far. Either I was lucky or Andreas is right. No one has
complained yet that data got lost. I am running integrity checks in parallel
and they have not found anything yet. So we can say it is "most probably safe" :)
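
For reference, this is the kind of run I mean; the parameters are from memory,
so check them against the script shipped with lustre-iokit before using it:

# on one OSS, local "disk" case, ~1 GB worth of objects per OST
nobjlo=1 nobjhi=2 thrlo=1 thrhi=16 size=1024 case=disk sh obdfilter-survey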


Regards, Michael


> Michael,
> 
> Unfortunately, the current Lustre Ops Manual indicates the opposite.  From
> section 24.3 "Testing OST Performance (obdfilter_survey)":
> 
> "The obdfilter_survey script is destructive and should not be run on
> devices that containing existing data that needs to be preserved. Thus,
> tests using obdfilter_survey should be run before the Lustre file system
> is placed in production."
> 
> I opened LUDOC-146 to track the issue previously and updated the details
> to include Andreas' explanation.
> 
> Thanks,
> -Cory
> 
> 
> On 3/21/13 7:18 PM, "Dilger, Andreas"  wrote:
> 
>> On 2013/21/03 4:09 AM, "Michael Kluge" 
>> wrote:
>>> I have read through the documentation for obdfilter-survey but could not
>>> found any information on how invasive the test is. Will it destroy an
>>> already formatted OST or render user data unusable?
>> 
>> It shouldn't - the obdfilter-survey uses a different object sequence (2)
>> compared to normal filesystem objects (currently always 0), so the two do
>> not collide.
>> 
>> Cheers, Andreas
>> -- 
>> Andreas Dilger
>> 
>> Lustre Software Architect
>> Intel High Performance Data Division
>> 
>> 
>> ___
>> HPDD-discuss mailing list
>> hpdd-disc...@lists.01.org
>> https://lists.01.org/mailman/listinfo/hpdd-discuss
> 

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre On Two Clusters

2013-05-13 Thread Michael Kluge
Hi Mark,

I remember that the NRL used them. They had a couple of presentations at 
the Lustre User Group. Here is some pretty old stuff:
http://wiki.lustre.org/images/3/3a/JamesHoffman.pdf


Regards, Michael


Am 09.05.2013 17:15, schrieb Mr. Mark L. Dotson (Contractor):
> Thanks, Lee.
>
> Has anyone done any work with Lustre and IB WAN extenders? I need help
> with my configuration.
>
> Thanks,
>
> Mark
>
> On 05/08/13 11:03, Lee, Brett wrote:
>>> -Original Message-
>>> From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss-
>>> boun...@lists.lustre.org] On Behalf Of Mr. Mark L. Dotson (Contractor)
>>> Sent: Tuesday, May 07, 2013 9:16 AM
>>> To: lustre-discuss@lists.lustre.org
>>> Subject: [Lustre-discuss] Lustre On Two Clusters
>>>
>>> I have Lustre installed and working on 1 cluster. Everything is IB. I can 
>>> mount
>>> clients in this cluster with no problems. I want to mount this Lustre FS on
>>> another cluster that is attached to a separate IB switch.
>>> What's the best way to do this? Does it require a separate subnet for the IB
>>> interfaces, or does it matter?
>>
>> Hi Mark,
>>
>> Good to hear from you on the list.
>>
>> Regarding your question, a couple options jump out at me.
>>
>> 1.  Add additional interfaces to the servers.  This will allow the Lustre
>> servers to be on both IB networks, and be able to directly serve the file
>> system to the clients.
>> 2.  Use LNet router(s), the basics of which are documented in the operations
>> manual.
>>
>> Either way, you'll need to perform some network configuration in (at least)
>> the servers' "lustre.conf".
>>
>> -Brett
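
For illustration, option 1 boils down to something like this in the servers'
lustre.conf (interface names and net numbers made up):

options lnet networks="o2ib0(ib0),o2ib1(ib1)"

with the clients on the second switch using networks="o2ib1(ib0)". For option 2
the clients keep a single network and add a routes= entry pointing at the LNet
router's NID, and the router node itself additionally needs forwarding="enabled".
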
>>
>>>
>>> Currently, my /etc/modprobe.d/lustre.conf has the following:
>>>
>>> options lnet networks="o2ib0(ib0)"
>>>
>>> Lustre version is 2.3
>>> OS's are CentOS 6.4.
>>>
>>> Any help would be much appreciated. Thanks.
>>>
>>> Mark
>>>
>>>
>>> --
>>> Mark Dotson
>>> Systems Administrator
>>> Lockheed-Martin
>>> dotsonml@afrl.hpc.mil
>>> ___
>>> Lustre-discuss mailing list
>>> Lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>> --
>> Brett Lee
>> Sr. Systems Engineer
>> Intel High Performance Data Division
>>
>
>

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] will obdfilter-survey destroy an already formatted file system

2013-03-21 Thread Michael Kluge
Hi,

I have read through the documentation for obdfilter-survey but could not find
any information on how invasive the test is. Will it destroy an already
formatted OST or render user data unusable?


Regards, Michael

--
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih



smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] df -h question

2012-07-11 Thread Michael Kluge
Dear list,

we are in the process of copying the whole content of a 1.6.7 Lustre FS 
to a 1.8.7 Lustre FS. For this I precreated all individual directories 
on the new FS to set striping information based on the #bytes/#files 
ratio. Then we used a parallel rsync to copy all directories over. All 
of this worked fine. Now, on the old FS the user data consumed 63 TB 
while on the new FS 'df -h' reports only 56 TB as used. I'm sure we 
copied all dirs and all rsyncs finished successfully.

Is this difference expected if one moves from 1.6->1.8? Or did I miss 
something?
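
For context, the precreation step was roughly a loop of the following shape
(thresholds, stripe counts and paths here are only illustrative, not the real
ones):

cd /old_fs
find . -type d | while read -r d; do
    bytes=$(du -sb "$d" | cut -f1)
    files=$(find "$d" -maxdepth 1 -type f | wc -l)
    mkdir -p "/new_fs/$d"
    # wider striping only for directories holding few, large files
    if [ "$files" -gt 0 ] && [ $((bytes / files)) -gt $((1024*1024*1024)) ]; then
        lfs setstripe -c 4 "/new_fs/$d"
    else
        lfs setstripe -c 1 "/new_fs/$d"
    fi
done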


Regards, Michael

-- 
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] wrong free inode count on the client side with 1.8.7

2012-06-19 Thread Michael Kluge
Hi list,

the number of free inodes seems to be reported wrongly on the client side. If I 
create files, the number of free inodes does not change. If I delete the files, 
the number of free inodes increases. So, from a client perspective, if I repeatedly
create and remove files, I can have more and more free inodes. I tried to 
find a bug for this in Whamcloud's database but could not find one. 'df -i' for 
the mdt on the MDS looks OK.

I think the behaviour is described here:
http://lists.lustre.org/pipermail/lustre-discuss/2011-July/015789.html

Right now I don't think this is a big problem. Can this turn into a real
problem? Like when the number of free inodes as seen by the client exceeds 2^64,
or whatever the limit is there?


Michael

-- 

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] sgpdd-survey and /dev/dm-0

2012-06-12 Thread Michael Kluge
Hi Frank,

thanks a lot, that helped.


Regards,
Michael


Am Dienstag, 12. Juni 2012, 14:24:27 schrieb Frank Riley:
> Mount your OSTs as raw devices using raw. Do a "man raw". I can't remember
> if you create the raw device from the /dev/mapper/* device or the /dev/dm-N
> device, but one of those works. Then run sgpdd_survey on the /dev/rawN
> devices.
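
For the archives, the recipe boils down to something like this (device names
are examples; the sgpdd-survey variable names are from memory, so check the
script header before running it):

# bind a raw device to the device-mapper/multipath device
raw /dev/raw/raw1 /dev/mapper/mpath0

# then point sgpdd-survey at the raw device
size=8192 crglo=1 crghi=16 thrlo=1 thrhi=16 scsidevs="/dev/raw/raw1" sh sgpdd-survey
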
>
> From: lustre-discuss-boun...@lists.lustre.org
> [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Michael Kluge
> Sent: Tuesday, June 12, 2012 5:51 AM
> To: lustre-discuss
> Subject: [Lustre-discuss] sgpdd-survey and /dev/dm-0
>
> Hi list,
>
> is there way to run sgpdd-survey on device mapper disks?
>
>
> Regards, Michael
>
> --
>
> Dr.-Ing. Michael Kluge
>
> Technische Universität Dresden
> Center for Information Services and
> High Performance Computing (ZIH)
> D-01062 Dresden
> Germany
>
> Contact:
> Willersbau, Room WIL A 208
> Phone:  (+49) 351 463-34217
> Fax:(+49) 351 463-37773
> e-mail: michael.kl...@tu-dresden.de<mailto:michael.kl...@tu-dresden.de>
> WWW:http://www.tu-dresden.de/zih
--

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] sgpdd-survey and /dev/dm-0

2012-06-12 Thread Michael Kluge
Hi list,

is there a way to run sgpdd-survey on device-mapper disks?


Regards, Michael

-- 

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] performance: hard vs. soft links

2012-05-26 Thread Michael Kluge
> Hard links are only directory entries with refcounts on the target inode, so 
> that when the last link to an inode is removed the inode will be deleted.
>
> Symlinks are inodes with a string that points to the original name. They are 
> not refcounted on the target, but require a new inode to be allocated for each 
> one.
>
> It isn't obvious which one would be slower, since they both have some 
> overhead.
>
> Is your sample size large enough?  1000 may only take 1s to complete and may 
> not provide consistent results.

The 1000 creates need  between 2.9 and 3.0 s (3 runs) for the hard links 
and 2.2-2.3 s (3 runs as well) for the soft links. I think the numbers 
are "not so bad" in terms of accuracy. Thanks for the explanation.


Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] performance: hard vs. soft links

2012-05-25 Thread Michael Kluge
Hi list,

for creating hard links instead of soft links (1.6.7, 1000 links created by
one process, all in the same subdir, the node is behind one lnet router) I see
about 25% overhead (time) on the client side. Is this OK/normal/expected?
Lustre probably needs to increment some ref. counter on the link target if
hard links are used?


Michael

--

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Most recent Linux Kernel on the client for a 1.8.7 server

2012-05-16 Thread Michael Kluge
Hi Adrian,

OK, thanks. Then the state is the same as I remember.


Regards, Michael

On 16.05.2012 20:14, Adrian Ulrich wrote:
>
>> could someone please tell me what the most recent kernel version (and lustre 
>> version) is on the client side, if I have to stick to 1.8.7 on the server 
>> side?
>
> 2.x clients will refuse to talk to 1.8.x servers.
>
> You can build the 1.8.x client with a few patches on CentOS6 (2.6.32), but 
> you should really consider to upgrade to 2.x in the future.
>
> Regards,
>   Adrian
>
>
>

-- 
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Most recent Linux Kernel on the client for a 1.8.7 server

2012-05-16 Thread Michael Kluge
Hi list,

could someone please tell me what the most recent kernel version (and lustre
version) is on the client side, if I have to stick to 1.8.7 on the server side?
I think Lustre 2.1 is not compatible, the 1.8.8 client can be compiled
with 2.6.32, but I do not know how 2.0 is doing ...


Regards, Michael

-- 

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih



smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] IOR writing to a shared file, performance does not scale

2012-02-10 Thread Michael Kluge
Hi Kshitij,

I would recommend running sgpdd-survey on the servers for one and for
multiple disks, and then obdfilter-survey. Then you know what your
storage can deliver. You could then do LNet tests as well to see whether
the network works fine. If the disks and the network deliver the
expected performance, IOR will most probably run with good performance
as well.

Please see:
http://wiki.lustre.org/images/4/40/Wednesday_shpc-2009-benchmarking.pdf
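
A minimal LNet test would be lnet-selftest, roughly like this (NIDs are
placeholders and the syntax is from memory - see the manual section on lst):

modprobe lnet_selftest
export LST_SESSION=$$
lst new_session rw_test
lst add_group clients 10.1.0.[10-20]@tcp
lst add_group servers 10.1.0.2@tcp
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers      # let it run for a while, then Ctrl-C
lst end_session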


Regards, Michael

On 10.02.2012 23:27, Kshitij Mehta wrote:
> We have lustre 1.6.7 configured using 64 OSTs.
> I am testing the performance using IOR, which is a file system benchmark.
>
> When I run IOR using mpi such that processes write to a shared file,
> performance does not scale. I tested with 1,2 and 4 processes, and the
> performance remains constant at 230 MBps.
>
> When processes write to separate files, performance improves greatly,
> reaching 475 MBps.
>
> Note that all processes are spawned on a single node.
>
> Here is the output:
> Writing to a shared file:
>
>> Command line used: ./IOR -a POSIX -b 2g -e -t 32m -w -o
>> /fastfs/gabriel/ss_64/km_ior.out
>> Machine: Linux deimos102
>>
>> Summary:
>>  api= POSIX
>>  test filename  = /fastfs/gabriel/ss_64/km_ior.out
>>  access = single-shared-file
>>  ordering in a file = sequential offsets
>>  ordering inter file= no tasks offsets
>>  clients= 4 (4 per node)
>>  repetitions= 1
>>  xfersize   = 32 MiB
>>  blocksize  = 2 GiB
>>  aggregate filesize = 8 GiB
>>
>> Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min
>> (OPs)  Mean (OPs)   Std Dev  Mean (s)
>> -  -  -  --   ---  -
>> -  --   ---  
>> write 233.61 233.61  233.61  0.00   7.30
>> 7.307.30  0.00  35.06771   EXCEL
>>
>> Max Write: 233.61 MiB/sec (244.95 MB/sec)
>
> Writing to separate files:
>
>> Command line used: ./IOR -a POSIX -b 2g -e -t 32m -w -o
>> /fastfs/gabriel/ss_64/km_ior.out -F
>> Machine: Linux deimos102
>>
>> Summary:
>>  api= POSIX
>>  test filename  = /fastfs/gabriel/ss_64/km_ior.out
>>  access = file-per-process
>>  ordering in a file = sequential offsets
>>  ordering inter file= no tasks offsets
>>  clients= 4 (4 per node)
>>  repetitions= 1
>>  xfersize   = 32 MiB
>>  blocksize  = 2 GiB
>>  aggregate filesize = 8 GiB
>>
>> Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min
>> (OPs)  Mean (OPs)   Std Dev  Mean (s)
>> -  -  -  --   ---  -
>> -  --   ---  ----
>> write 475.95 475.95  475.95  0.00  14.87
>> 14.87   14.87  0.00  17.21191   EXCEL
>>
>> Max Write: 475.95 MiB/sec (499.07 MB/sec)
>
> I am trying to understand where the bottleneck is, when processes write
> to a shared file.
> Your help is appreciated.
>

-- 
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] 1.8 client loses contact to 1.6 router

2012-02-03 Thread Michael Kluge
Hi list,

we have a 1.6.7 fs running which still works nicely. One node exports this FS
(via 10GE) to another cluster that has some 1.8.5 patchless clients. These
clients at some point (randomly, I think) mark the router as down (lctl
show_route). It is always a different client and usually a few clients each
week that do this. Although we configured the clients to ping the router
again from time to time, the route never comes back. On these clients I can
still "ping" the IP of the router, but "lctl ping" gives me an Input/Output
error. If I do something like:

lctl --net o2ib set_route 172.30.128.241@tcp1 down
sleep 45
lctl --net o2ib del_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib add_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib set_route 172.30.128.241@tcp1 up

the route comes back; sometimes the client works again, but sometimes the
clients issue an "unexpected aliveness of peer .." and need a reboot.

I looked around and could not find a note on whether 1.8 clients and 1.6 routers
will work together as expected. Does anyone have experience with this kind of setup
or an idea for further debugging?


Regards, Michael

modprobe.d/lustre.conf on the 1.8.5 clients
-8<--
options lnet networks=tcp1(eth0)
options lnet routes="o2ib 172.30.128.241@tcp1;"
options lnet dead_router_check_interval=60 router_ping_timeout=30
-8<----------



-- 

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] MDS failover: SSD+DRBD or shared 15K-SAS-Storage RAID with approx. 10 disks

2012-01-22 Thread Michael Kluge
Hi Carlos,

> In my experience SSDs didn't help much, since the MDS bottleneck is not
> only a disk problem but rather the entire Lustre metadata mechanism.

Yes, but one does not need much space on the MDS, and four SSDs (as MDT)
are way cheaper than a RAID controller with 10 15K disks. So the
question is basically how the DRBD latency will influence the MDT
performance. I know sync/async makes a big difference here, but I have
no idea about the performance impact of either or how the reliability is
affected.

> One remark about DRBD: I've seen customers using it, but IMHO an
> active/standby HA-type configuration would be more reliable and will
> provide you better resilience. Again, I don't know about your uptime and
> reliability needs, but the customers I've worked with that require
> minimum downtime in production always go for RAID controllers rather than
> DRBD replication.

OK, thanks. That is good information. So SSD+DRBD is considered to be
the "cheap" solution. Even for small clusters?


Regards, Michael

>
> Regards,
> Carlos.
>
>
> --
> Carlos Thomaz | Systems Architect
> Mobile: +1 (303) 519-0578
> ctho...@ddn.com | Skype ID: carlosthomaz
> DataDirect Networks, Inc.
> 9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
> ddn.com<http://www.ddn.com/>  | Twitter: @ddn_limitless
> <http://twitter.com/ddn_limitless>  | 1.800.TERABYTE
>
>
>
>
>
> On 1/22/12 12:04 PM, "Michael Kluge"  wrote:
>
>> Hi,
>>
>> I have been asked, which one of the two I would chose for two MDS
>> servers (active/passive). Whether I would like to have SSDs, maybe two
>> (mirrored) in both servers and DRBD for synching, or a RAID controller
>> that has a 15K disks. I have not done benchmarks on this topic myself
>> and would like to ask if anyone has an idea or numbers? The cluster will
>> be pretty small, about 50 clients.
>>
>>
>> Regards, Michael
>>
>> --
>> Dr.-Ing. Michael Kluge
>>
>> Technische Universität Dresden
>> Center for Information Services and
>> High Performance Computing (ZIH)
>> D-01062 Dresden
>> Germany
>>
>> Contact:
>> Willersbau, Room WIL A 208
>> Phone:  (+49) 351 463-34217
>> Fax:(+49) 351 463-37773
>> e-mail: michael.kl...@tu-dresden.de
>> WWW:http://www.tu-dresden.de/zih
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>

-- 
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] MDS failover: SSD+DRBD or shared 15K-SAS-Storage RAID with approx. 10 disks

2012-01-22 Thread Michael Kluge
Hi,

I have been asked which one of the two I would choose for two MDS
servers (active/passive): whether I would like to have SSDs, maybe two
(mirrored) in both servers and DRBD for syncing, or a RAID controller
with 15K disks. I have not done benchmarks on this topic myself
and would like to ask if anyone has an idea or numbers. The cluster will
be pretty small, about 50 clients.


Regards, Michael

-- 
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client behind Router can't mount with failover mgs

2011-12-20 Thread Michael Kluge
Hi Colin,

> > our mgs server (Lustre 1.6.7) failed and we mounted it on the failover
> > node. Our clients (1.6.7) on the same IB network are still functional.
> 
> Ok.. Well aside from the fact that 1.6.7 is long since deprecated, what
> else isn't functional after failover?

Nothing. Everything is fine. Just the 1.8.5 clients behind an IB<->10GE router
can't mount anymore.

> >   We have exported the fs via a Lustre/10GE router to another cluster
> >   with a patchless 1.8.5. The router works , we can ping around and get
> >   the usual protocol errors. But mounting the fs from the failover node
> >   does not work on these clients. Is this expected or is this supposed
> >   to work?
> 
> Sorry, what are you actually trying to do here???

We have a (pretty old) SDR IB based cluster with ~700 nodes and 10 Lustre
servers. We use an IB<->10GE router to attach this Lustre FS to another
cluster. This works pretty well, but only when the MGS is mounted on the
primary node, not when the MGS is mounted on the failover node. I just want to
know if this is expected behaviour or not.
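
(To be clear, "mounting from the failover node" means the routed clients use
the usual failover syntax with both MGS NIDs listed, i.e. something of the form

mount -t lustre <primary-mgs-nid>:<failover-mgs-nid>:/<fsname> /mnt/lustre

with the real NIDs and fsname filled in.)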


Regards, Michael

-- 

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Client behind Router can't mount with failover mgs

2011-12-18 Thread Michael Kluge
Hi list,

our mgs server (Lustre 1.6.7) failed and we mounted it on the failover node.
Our clients (1.6.7) on the same IB network are still functional. We have
exported the fs via a Lustre/10GE router to another cluster with patchless
1.8.5 clients. The router works; we can ping around and get the usual protocol errors.
But mounting the fs from the failover node does not work on these clients. Is
this expected or is this supposed to work?


Regards, Michael

--
Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] most recent kernel version for patchless client?

2011-12-14 Thread Michael Kluge
Hi list,

I am looking for information on what the most recent kernel version is that I can
use to build a patchless client. OFED, for example, refuses to build on
kernels > 3.0.0. Has someone recently tried newer kernels with 1.8.7?


Regards, Michael

-- 

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih



smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Interpreting iozone measurements

2011-03-09 Thread Michael Kluge
Hi Jeremy,

> > I write a single 64 byte file (with one I/O request), iozone tells me
> > something like '295548', which means ~295 MB/s. Dividing the file size
> > by the bandwidth, I get the time that was needed to create the file and
> > write the 64 bytes (with a single request). In this case, the time is
> > about 0,2 micro seconds which is way below the RTT. 
> 
> That seems oddly fast for such a small file over any latency.  Since you
> shouldn't even be able to lock the file in that time.

OK, thanks. So I have to see at least a latency of one RTT. I think I
need to dig through the data again. I might have made a mistake in one
of the formulas ...

> > That mean for a Lustre file system, if I create a file and write 64
> > bytes, the client sends two(?) RPCs to the server and does not wait for
> > the completion. Is this correct? But it will wait for the completion of
> > both RPCs when the file ist closed?
> You can see what Lustre is doing if the client isn't doing any other
> activity and you enable some tracing.  "sysctl -w lnet.debug='+rpctrace
> vfstrace'" should allow you to see the VFS ops ll_file_open,
> ll_file_writev/ll_file_aio_write, ll_file_release, along with any RPCs
> generated by them.  You should see an RPC for the file open which will
> be a 101 opcode for requesting the lock and you should see a reply AFAIK
> before the client actually attempts to write any data.  So that should
> bring your time upto at least 4 ms for 1 RTT.  The initial write should
> request a lock from the first stripe followed by a OST write RPC (opcode
> 4) followed by a file close (opcode 35).   I ran a test over 4 ms
> latency so you can see what I'm referring to.  I thought that there was
> a patch in Lustre a few months back that forced a flush before a file
> close, but this is from a 1.8.5 client so I'm guessing that isn't how it
> works because between when I closed the file and the end I had to "sync"
> for the OST write to show up.

OK, understood.

> > The numbers look different when I disable the client side cache by
> > setting max_dirty_mb to 0.
> 
> Without any grant I think all RPCs have to be synchronous so you'll see
> a huge performance hit over latency.

These numbers look different :) I'm still trying to make sense of a
couple of measurements and to put some useful data into some charts for
the LUG this year.


Michael


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Interpreting iozone measurements

2011-03-09 Thread Michael Kluge
Hi all,

we have a testbed running with Lustre 1.8.3 and a RTT of ~4ms (10GE
network cards everywhere) for a ping between client and servers. If I
have read the iozone source code correctly, iozone reports bandwidth in
KB/s and includes the time for the open() call, but not for close(). If
I write a single 64 byte file (with one I/O request), iozone tells me
something like '295548', which means ~295 MB/s. Dividing the file size
by the bandwidth, I get the time that was needed to create the file and
write the 64 bytes (with a single request). In this case, the time is
about 0.2 microseconds, which is way below the RTT.
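
(Back-of-the-envelope check, assuming iozone's KB means 1024 bytes:

echo 'scale=10; 64 / (295548 * 1024)' | bc
.0000002114

i.e. about 2.1e-7 s, or ~0.2 microseconds.)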

That means for a Lustre file system, if I create a file and write 64
bytes, the client sends two(?) RPCs to the server and does not wait for
the completion. Is this correct? But it will wait for the completion of
both RPCs when the file is closed?

The numbers look different when I disable the client side cache by
setting max_dirty_mb to 0.


Regards, Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] OSS replacement

2011-02-24 Thread Michael Kluge
Hi Johann,

interesting. Is there no need to set the file system volume name of the new
OSS via tune2fs to the same string?

Michael


Am Donnerstag, den 24.02.2011, 10:48 +0100 schrieb Johann Lombardi: 
> Hi,
> 
> On Thu, Feb 24, 2011 at 10:39:32AM +0100, Gizo Nanava wrote:
> > we need to replace one of the OSSes in the cluster. We wonder whether
> > simply copying (e.g. rsync) over the network
> > the content of all /dev/sdX (ldiskfs mounted) from the OSS to be replaced to
> > the new, already Lustre-formatted OSS
> > (all /dev/sdX on both servers are the same) will work?
> 
> Yes, the procedure is detailed in the manual:
> http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTroubleshooting.html#50651190_pgfId-1291458
> 
> Johann
> 

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Running MGS and OSS on the same machine

2011-02-18 Thread Michael Kluge
Hi Arya,

If I remember correctly, Lustre uses 0@lo for the localhost address. Does
using the other NID 192.168.0.10@tcp0 give any error message?


Michael

Am 18.02.2011 16:10, schrieb Arya Mazaheri:
> Hi again,
> I have planned to use one server as MGS and OSS simultaneously. But how
> can I format the OSTs as a Lustre FS?
> For example, the line below tells the OST that its mgsnode is at
> 192.168.0.10@tcp0:
> mkfs.lustre --fsname lustre --ost --mgsnode=192.168.0.10@tcp0 /dev/vg00/ost1
>
> But now the mgsnode is the same machine. I tried to put localhost instead of
> the IP address, but it didn't work.
>
> What should I do?
>
> Arya
>
>
>
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss


-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] How to detect process owner on client

2011-02-11 Thread Michael Kluge
But it does not give you PIDs or user names? Or is there a way to find 
these with standard lustre tools?

Michael

Am 11.02.2011 17:34, schrieb Andreas Dilger:
> On 2011-02-10, at 23:18, Michael Kluge wrote:
>> I am not aware of any possibility to map the current statistics in /proc
>> to UIDs. But I might be wrong. We had a script like this a while ago
>> which did not kill the I/O intensive processes but told us the PIDs.
>>
>> What we did is collecting for ~30 seconds the number of I/O operations
>> per node via /proc on all nodes. Then we attached an strace process to
>> each process on nodes with heavy I/O load. This strace intercepted only
>> the I/O calls and wrote one log file per process. If this strace is
>> running for the same amount of time for each process on a host, you just
>> need to sort the log files for size.
>
> On the OSS and MDS nodes there are per-client statistics that allow this kind 
> of tracking.  They can be seen in /proc/fs/lustre/obdfilter/*/exports/*/stats 
> for detailed information (e.g. broken down by RPC type, bytes read/written), 
> or /proc/fs/lustre/ost/OSS/*/req_history to just get a dump of the recent 
> RPCs sent by each client.
>
> A little script was discussed in the thread "How to determine which lustre 
> clients are loading filesystem" (2010-07-08):
>
>> Another way that I heard some sites were doing this is to use the "rpc 
>> history".  They may already have a script to do this, but the basics are 
>> below:
>>
>> oss# lctl set_param ost.OSS.*.req_buffer_history_max=10240
>> {wait a few seconds to collect some history}
>> oss# lctl get_param ost.OSS.*.req_history
>>
>> This will give you a list of the past (up to) 10240 RPCs for the "ost_io" 
>> RPC service, which is what you are observing the high load on:
>>
>> 3436037:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957534353:448:Complete:1278612656:0s(-6s)
>>  opc 3
>> 3436038:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957536190:448:Complete:1278615489:1s(-41s)
>>  opc 3
>> 3436039:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957536193:448:Complete:1278615490:0s(-6s)
>>  opc 3
>>
>> This output is in the format:
>>
>> identifier:target_nid:source_nid:rpc_xid:rpc_size:rpc_status:arrival_time:service_time(deadline)
>>  opcode
>>
>> Using some shell scripting, one can find the clients sending the most RPC 
>> requests:
>>
>> oss# lctl get_param ost.OSS.*.req_history | tr ":" " " | cut -d" " -f3,9,10 
>> | sort | uniq -c | sort -nr | head -20
>>
>>
>>3443 12345-192.168.20.159@tcp opc 3
>>1215 12345-192.168.20.157@tcp opc 3
>> 121 12345-192.168.20.157@tcp opc 4
>>
>> This will give you a sorted list of the top 20 clients that are sending the 
>> most RPCs to the ost and ost_io services, along with the operation being 
>> done (3 = OST_READ, 4 = OST_WRITE, etc. see 
>> lustre/include/lustre/lustre_idl.h).
>
>
>> Am Donnerstag, den 10.02.2011, 21:16 -0600 schrieb Satoshi Isono:
>>> Dear members,
>>>
>>> I am looking into the way which can detect userid or jobid on the Lustre 
>>> client. Assumed the following condition;
>>>
>>> 1) Any users run any jobs through scheduler like PBS Pro, LSF or SGE.
>>> 2) A users processes occupy Lustre I/O.
>>> 3) Some Lustre servers (MDS?/OSS?) can detect high I/O stress on each 
>>> server.
>>> 4) But Lustre server cannot make the mapping between jobid/userid and 
>>> Lustre I/O processes having heavy stress, because there aren't userid on 
>>> Lustre servers.
>>> 5) I expect that Lustre can monitor and can make the mapping.
>>> 6) If possible for (5), we can make a script which launches scheduler 
>>> command like as qdel.
>>> 7) Heavy users job will be killed by job scheduler.
>>>
>>> I want (5) for Lustre capability, but I guess current Lustre 1.8 cannot 
>>> perform (5). On the other hand, in order to map Lustre process to 
>>> userid/jobid, are there any ways using like rpctrace or nid stats? Can you 
>>> please your advice or comments?
>>>
>>> Regards,
>>> Satoshi Isono
>>> ___
>>> Lustre-discuss mailing list
>>> Lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>
>> --
>>
>> Michael Kluge, M.Sc.
>>
>> Technische Universität Dresden
>> Center for Information Servi

Re: [Lustre-discuss] How to detect process owner on client

2011-02-10 Thread Michael Kluge
Hi Satoshi,

I am not aware of any possibility to map the current statistics in /proc
to UIDs. But I might be wrong. We had a script like this a while ago
which did not kill the I/O intensive processes but told us the PIDs. 

What we did is collecting for ~30 seconds the number of I/O operations
per node via /proc on all nodes. Then we attached an strace process to
each process on nodes with heavy I/O load. This strace intercepted only
the I/O calls and wrote one log file per process. If this strace is
running for the same amount of time for each process on a host, you just
need to sort the log files by size.
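
A stripped-down sketch of that second step (syscall list, duration and paths
are only illustrative):

# on a node with high I/O load, trace every process of the job owner for 30 s
for pid in $(pgrep -u jobuser); do
    timeout 30 strace -f -p "$pid" -e trace=read,write,open,close \
        -o /tmp/iotrace.$pid 2>/dev/null &
done
wait
ls -lS /tmp/iotrace.*     # the biggest logs belong to the busiest processes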


Regards, Michael


Am Donnerstag, den 10.02.2011, 21:16 -0600 schrieb Satoshi Isono: 
> Dear members,
> 
> I am looking into the way which can detect userid or jobid on the Lustre 
> client. Assumed the following condition;
> 
>  1) Any users run any jobs through scheduler like PBS Pro, LSF or SGE.
>  2) A users processes occupy Lustre I/O.
>  3) Some Lustre servers (MDS?/OSS?) can detect high I/O stress on each server.
>  4) But Lustre server cannot make the mapping between jobid/userid and Lustre 
> I/O processes having heavy stress, because there aren't userid on Lustre 
> servers.
>  5) I expect that Lustre can monitor and can make the mapping.
>  6) If possible for (5), we can make a script which launches scheduler 
> command like as qdel.
>  7) Heavy users job will be killed by job scheduler.
> 
> I want (5) for Lustre capability, but I guess current Lustre 1.8 cannot 
> perform (5). On the other hand, in order to map Lustre process to 
> userid/jobid, are there any ways using like rpctrace or nid stats? Can you 
> please your advice or comments?
> 
> Regards,
> Satoshi Isono
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] "up" a router that is marked "down"

2011-01-25 Thread Michael Kluge
Hi Jeremy,

yup, it's marked " obsolete (DANGEROUS) ", whatever, it did the
trick :)


Thanks a lot, Michael



Am Dienstag, den 25.01.2011, 18:55 -0500 schrieb Jeremy Filizetti: 
> Though I think its marked as development or experimental in the Lustre
> documention or source "lctl set_route" has worked fine for me in the
> past with no issues.
>  
> lctl set_route  up
>  
> is the syntax I believe.
>  
> Jeremy
> 
> 
> On Tue, Jan 25, 2011 at 9:52 AM, Michael Kluge
>  wrote:
> Jason, Michael,
> 
> thanks y lot for your replies. I pinged everone from all
> directions but
> the router is still marked "down" on the client. I even
> removed and
> re-added the router entry via lctl --net tcp1 del_route
> xyz@o2ib and
> lctl --net tcp1 add_route xyz@o2ib . No luck. So I think I'll
> wait for
> the next maintenance window. Oh, and I forgot to mention that
> the
> servers run a 1.6.7.2, the router as well and the clients
> 1.8.5. Works
> good so far.
> 
> 
> Thanks, Michael
> 
> 
> Am Dienstag, den 25.01.2011, 15:12 +0100 schrieb Temple
> Jason: 
> 
> > I've found that even with the Protocal Error, it still
> works.
> >
> > -Jason
> >
> > -Original Message-
> > From: lustre-discuss-boun...@lists.lustre.org
> [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of
> Michael Shuey
> > Sent: martedì, 25. gennaio 2011 14:45
> > To: Michael Kluge
> > Cc: Lustre Diskussionsliste
> > Subject: Re: [Lustre-discuss] "up" a router that is marked
> "down"
> >
> > You'll want to add the "dead_router_check_interval" lnet
> module
> > parameter as soon as you are able.  As near as I can tell,
> without
> > that there's no automatic check to make sure the router is
> alive.
> >
> > I've had some success in getting machines to recognize that
> a router
> > is alive again by doing an lctl ping of their side of a
> router (e.g.,
> > on a tcp0 client, `lctl ping @tcp0`, then `lctl
> ping
> > @o2ib0` from an o2ib0 client).  If you have a
> server/client
> > version mismatch, where lctl ping returns a protocol error,
> you may be
> > out of luck.
> >
> > --
> > Mike Shuey
> >
> >
> >
> > On Tue, Jan 25, 2011 at 8:38 AM, Michael Kluge
> >  wrote:
> > > Hi list,
> > >
> > > if a Lustre router is down, comes back to life and the
> servers do not
>     > > actively test the routers periodically: is it possible to
> mark a Lustre
> > > router as "up"? Or to tell the servers to ping the router?
> > >
> > > Or can I enable the "router pinger" in a live system
> without unloading
> > > and loading the Lustre kernel modules?
> > >
> > >
> > > Regards, Michael
> > >
> > > --
> > >
> > > Michael Kluge, M.Sc.
> > >
> > > Technische Universität Dresden
> > > Center for Information Services and
> > > High Performance Computing (ZIH)
> > > D-01062 Dresden
> > > Germany
> > >
> > > Contact:
> > > Willersbau, Room A 208
> > > Phone:  (+49) 351 463-34217
> > > Fax:(+49) 351 463-37773
> > > e-mail: michael.kl...@tu-dresden.de
> > > WWW:http://www.tu-dresden.de/zih
> > >
> > > ___
> > > Lustre-discuss mailing list
> > > Lustre-discuss@lists.lustre.org
> > > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> > >
> > >
> > ___
> > Lustre-discuss mailing list
> > Lustre-discuss@lists.lustre.org
> > http://lists.lustre.org

Re: [Lustre-discuss] "up" a router that is marked "down"

2011-01-25 Thread Michael Kluge
Jason, Michael,

thanks a lot for your replies. I pinged everyone from all directions, but
the router is still marked "down" on the client. I even removed and
re-added the router entry via lctl --net tcp1 del_route xyz@o2ib and
lctl --net tcp1 add_route xyz@o2ib . No luck. So I think I'll wait for
the next maintenance window. Oh, and I forgot to mention that the
servers run 1.6.7.2, the router as well, and the clients 1.8.5. Works
well so far.


Thanks, Michael


Am Dienstag, den 25.01.2011, 15:12 +0100 schrieb Temple Jason: 
> I've found that even with the Protocal Error, it still works.
> 
> -Jason
> 
> -Original Message-
> From: lustre-discuss-boun...@lists.lustre.org 
> [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Michael Shuey
> Sent: martedì, 25. gennaio 2011 14:45
> To: Michael Kluge
> Cc: Lustre Diskussionsliste
> Subject: Re: [Lustre-discuss] "up" a router that is marked "down"
> 
> You'll want to add the "dead_router_check_interval" lnet module
> parameter as soon as you are able.  As near as I can tell, without
> that there's no automatic check to make sure the router is alive.
> 
> I've had some success in getting machines to recognize that a router
> is alive again by doing an lctl ping of their side of a router (e.g.,
> on a tcp0 client, `lctl ping @tcp0`, then `lctl ping
> @o2ib0` from an o2ib0 client).  If you have a server/client
> version mismatch, where lctl ping returns a protocol error, you may be
> out of luck.
> 
> --
> Mike Shuey
> 
> 
> 
> On Tue, Jan 25, 2011 at 8:38 AM, Michael Kluge
>  wrote:
> > Hi list,
> >
> > if a Lustre router is down, comes back to life and the servers do not
> > actively test the routers periodically: is it possible to mark a Lustre
> > router as "up"? Or to tell the servers to ping the router?
> >
> > Or can I enable the "router pinger" in a live system without unloading
> > and loading the Lustre kernel modules?
> >
> >
> > Regards, Michael
> >
> > --
> >
> > Michael Kluge, M.Sc.
> >
> > Technische Universität Dresden
> > Center for Information Services and
> > High Performance Computing (ZIH)
> > D-01062 Dresden
> > Germany
> >
> > Contact:
> > Willersbau, Room A 208
> > Phone:  (+49) 351 463-34217
> > Fax:(+49) 351 463-37773
> > e-mail: michael.kl...@tu-dresden.de
> > WWW:http://www.tu-dresden.de/zih
> >
> > _______
> > Lustre-discuss mailing list
> > Lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
> >
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] "up" a router that is marked "down"

2011-01-25 Thread Michael Kluge
Hi list,

if a Lustre router is down, comes back to life and the servers do not
actively test the routers periodically: is it possible to mark a Lustre
router as "up"? Or to tell the servers to ping the router?

Or can I enable the "router pinger" in a live system without unloading
and loading the Lustre kernel modules?


Regards, Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet rounter immediatelly marked as down

2010-12-03 Thread Michael Kluge
Hi Liang,

sure, but my current question is: Why are the nodes within o2ib 
considering the router as down?

I add the route on a node within o2ib and instantly afterwards lctl
show_route says the router is down. That does not make much sense to me.

And if I try to send a message through the router from this node, I see
that it can't send the message because all routers are down.


Regards, Michael

Am 03.12.2010 16:29, schrieb liang Zhen:
>   Hi Michael,
>
> To add a router dynamically, you also have to run "--net o2ib add_route
> a.b@tcp1" on all nodes of tcp1, so the better choice is using a
> universal modprobe.conf by defining "ip2nets" and "routes"; you can see
> some examples here:
> http://wiki.lustre.org/manual/LustreManual18_HTML/MoreComplicatedConfigurations.html
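
(For the record, the universal style meant here would look something like this,
with made-up addresses:

options lnet ip2nets="o2ib0 192.168.10.*; tcp1 10.10.0.*"
options lnet routes="tcp1 192.168.10.250@o2ib0; o2ib0 10.10.0.250@tcp1"

plus forwarding="enabled" in the lnet options on the router node itself.)
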
>
> Regards
> Liang
>
> On 12/3/10 9:32 PM, Michael Kluge wrote:
>> Hi list,
>>
>> we have a Lustr 1.6.7.2 running on our (IB SDR) cluster and have added
>> one additional NIC (tcp1) to one node and like to use this node as
>> router. I have added a ip2nets statement and forwaring=enabled to the
>> modprobe files on the router and reloaded the modules. I see two NIDS
>> now and no trouble.
>>
>> The MDS server that need to go through the router to a hand full of
>> additional clients is in production and I can't take it down. So I added
>> the route to the additional network via lctl --net tcp1 add_route
>> w.x@o2ib where W.X.Y.Z is the ipoib address of the router. When I do
>> an lctl show_routes, this router is marked as "down". Is there a way to
>> bring it to life? I can lctl ping the router node from the MDS but can't
>> reload lnet to enable active router tests. Right now on the MDS the only
>> option for the lnet module is the network config for the IB network
>> interface.
>>
>> Any ideas who to enable this router?
>>
>>
>> Regards, Michael
>>
>>
>>
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
>
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] lnet rounter immediatelly marked as down

2010-12-03 Thread Michael Kluge
Hi list,

we have Lustre 1.6.7.2 running on our (IB SDR) cluster and have added
one additional NIC (tcp1) to one node and would like to use this node as a
router. I have added an ip2nets statement and forwarding=enabled to the
modprobe files on the router and reloaded the modules. I see two NIDs
now and no trouble.

The MDS server that needs to go through the router to a handful of
additional clients is in production and I can't take it down. So I added
the route to the additional network via lctl --net tcp1 add_route
w.x@o2ib where W.X.Y.Z is the IPoIB address of the router. When I do
an lctl show_routes, this router is marked as "down". Is there a way to
bring it to life? I can lctl ping the router node from the MDS but can't
reload lnet to enable active router tests. Right now on the MDS the only
option for the lnet module is the network config for the IB network
interface.

Any ideas how to enable this router?


Regards, Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K

2010-10-23 Thread Michael Kluge
Hi Bernd,

I get the same message with your kernel RPMs:

In file included from include/linux/list.h:6,
  from include/linux/mutex.h:13,
  from 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.c:36:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:9:
 
error: redeclaration of enumerator 'false'
include/linux/stddef.h:16: error: previous definition of 'false' was here
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:11:
 
error: redeclaration of enumerator 'true'
include/linux/stddef.h:18: error: previous definition of 'true' was here

Could it be that this '2.6.18 being almost a 2.6.28/29' confuses the 
OFED backports and the 2.6.18 backport does not work anymore? Is that 
solvable? I found nothing in the OFED bugzilla.


Michael

Am 23.10.2010 17:51, schrieb Michael Kluge:
> Hi Bernd,
>
> do you have a rpm with OFED 1.4 kernel modules for your kernel? I took a
> 2.6.18-164 from the Lustre kernels and OFED won't built against it. The
> OFED backports report lot and lots of symbols as "redefined".
>
>
> Michael
>
> Am 22.10.2010 23:30, schrieb Bernd Schubert:
>> Hello Michael,
>>
>> On Friday, October 22, 2010, you wrote:
>>> Hi Bernd,
>>>
>>>> I'm sorry to hear that. Unfortunately, I really do not have the time to
>>>> port this version to your kernel version.
>>>
>>> No worries. I don't expect this :)
>>>
>>>> I remember that you use Debian. But I guess you are still using a SLES
>>>> kernel then? You could ask Suse about it, although I guess they only do
>>>> care about SP1 with 2.6.32-sles now. If you use Debian Lenny, the RHEL5
>>>> kernel should work (and besides its name, it is internally more or less
>>>> a 2.6.29 to 2.6.32 kernel). Later Debian and Ubuntu releases have a more
>>>> recent udev, which requires at least 2.6.27.
>>>
>>> OK, if the 2.6.18 works like a charm, I'll give the 2.6.18-194 it a try.
>>
>> Just don't forget that -194 requires 1.8.4 (I think you had been at 1.8.3
>> previously). We also have this driver added as Lustre kernel patch in our 
>> -ddn
>> releases. 1.8.4 is in testing, but I have not uploaded it yet. 1.8.3-ddn also
>> includes the driver together with with recent security backports.
>>
>> http://eu.ddn.com:8080/lustre/lustre/1.8.3/
>>
>>
>> Cheers,
>> Bernd
>>
>
>


-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K

2010-10-23 Thread Michael Kluge
Hi Bernd,

do you have a rpm with OFED 1.4 kernel modules for your kernel? I took a 
2.6.18-164 from the Lustre kernels and OFED won't built against it. The 
OFED backports report lot and lots of symbols as "redefined".


Michael

Am 22.10.2010 23:30, schrieb Bernd Schubert:
> Hello Michael,
>
> On Friday, October 22, 2010, you wrote:
>> Hi Bernd,
>>
>>> I'm sorry to hear that. Unfortunately, I really do not have the time to
>>> port this version to your kernel version.
>>
>> No worries. I don't expect this :)
>>
>>> I remember that you use Debian. But I guess you are still using a SLES
>>> kernel then? You could ask Suse about it, although I guess they only do
>>> care about SP1 with 2.6.32-sles now. If you use Debian Lenny, the RHEL5
>>> kernel should work (and besides its name, it is internally more or less
>>> a 2.6.29 to 2.6.32 kernel). Later Debian and Ubuntu releases have a more
>>> recent udev, which requires at least 2.6.27.
>>
>> OK, if the 2.6.18 works like a charm, I'll give the 2.6.18-194 it a try.
>
> Just don't forget that -194 requires 1.8.4 (I think you had been at 1.8.3
> previously). We also have this driver added as Lustre kernel patch in our -ddn
> releases. 1.8.4 is in testing, but I have not uploaded it yet. 1.8.3-ddn also
> includes the driver together with with recent security backports.
>
> http://eu.ddn.com:8080/lustre/lustre/1.8.3/
>
>
> Cheers,
> Bernd
>


-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K

2010-10-22 Thread Michael Kluge
Hi Bernd,

> I'm sorry to hear that. Unfortunately, I really do not have the time to port 
> this version to your kernel version.

No worries. I don't expect this :)

> I remember that you use Debian. But I guess you are still using a SLES kernel 
> then? You could ask Suse about it, although I guess they only do care about 
> SP1 with 2.6.32-sles now. If you use Debian Lenny, the RHEL5 kernel should 
> work (and besides its name, it is internally more or less a 2.6.29 to 2.6.32 
> kernel). Later Debian and Ubuntu releases have a more recent udev, which 
> requires at least 2.6.27.

OK, if the 2.6.18 works like a charm, I'll give the 2.6.18-194 it a try.


Michael

> 
> You could also ask our support department, if they have any news for 2.6.27. 
> I'm in Lustre engineering and as we only support RHEL5 right now, I so far 
> did 
> not care about other kernel versions too much.
> 
> If all doesn't help, you will need to set the queue depth to 1, but that will 
> also impose a big performance hit :(
> 
> 
> Cheers,
> Bernd
> 
> 
> On Friday, October 22, 2010, Michael Kluge wrote:
> > Hi Bernd,
> > 
> > I have found a RHEL-only release for this version. It does not compile
> > on a 2.6.27 kernel :( I actually don't want to go back to 2.6.18 just to
> > get a new driver.
> > 
> > 
> > Michael
> > 
> > Am Freitag, den 22.10.2010, 13:34 +0200 schrieb Bernd Schubert:
> > > On Friday, October 22, 2010, Michael Kluge wrote:
> > > > Hi list,
> > > > 
> > > > DID_BUS_BUSY means that the controller is unable to handle the SCSI
> > > > command and is basically asking the host to send it again later. I had
> > > > I think just one concurrent region and 32 threads running. What would
> > > > be the appropriate action in this case? Reducing the queue depth on
> > > > the HBA? We have Qlogic here, there is an option for the kernel module
> > > > for this.
> > > 
> > > I think you run into a known issue with the Q-Logic driver an the SFA10K.
> > > You will need at least qla2xxx version 8.03.01.06.05.06-k. And the
> > > optimal numbers of commands is likely to be 16 (with 4 OSS connected).
> > > 
> > > 
> > > Hope it helps,
> > > Bernd
> 
> 

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K

2010-10-22 Thread Michael Kluge
Hi Bernd,

I have found a RHEL-only release for this version. It does not compile
on a 2.6.27 kernel :( I actually don't want to go back to 2.6.18 just to
get a new driver.


Michael

Am Freitag, den 22.10.2010, 13:34 +0200 schrieb Bernd Schubert: 
> On Friday, October 22, 2010, Michael Kluge wrote:
> > Hi list,
> > 
> > DID_BUS_BUSY means that the controller is unable to handle the SCSI
> > command and is basically asking the host to send it again later. I had I
> > think just one concurrent region and 32 threads running. What would be
> > the appropriate action in this case? Reducing the queue depth on the
> > HBA? We have Qlogic here, there is an option for the kernel module for
> > this.
> 
> I think you run into a known issue with the Q-Logic driver an the SFA10K. You 
> will need at least qla2xxx version 8.03.01.06.05.06-k. And the optimal 
> numbers 
> of commands is likely to be 16 (with 4 OSS connected).
> 
> 
> Hope it helps,
> Bernd
> 

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K

2010-10-22 Thread Michael Kluge
Reducing the queue depth from the default of 32 to 8 did not help. It
looks like this problem always shows up when I am writing to more than
one region. 2 regions and 2 threads are enough to see the problem. The
last test that succeeds is one region and 16 threads. 1/32 was not
tested.

Michael

Am Freitag, den 22.10.2010, 10:48 +0200 schrieb Michael Kluge: 
> Hi list,
> 
> DID_BUS_BUSY means that the controller is unable to handle the SCSI
> command and is basically asking the host to send it again later. I had I
> think just one concurrent region and 32 threads running. What would be
> the appropriate action in this case? Reducing the queue depth on the
> HBA? We have Qlogic here, there is an option for the kernel module for
> this.
> 
> 
> Regards, Michael
> 
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K

2010-10-22 Thread Michael Kluge
Hi list,

DID_BUS_BUSY means that the controller is unable to handle the SCSI
command and is basically asking the host to send it again later. I think
I had just one concurrent region and 32 threads running. What would be
the appropriate action in this case? Reducing the queue depth on the
HBA? We have QLogic HBAs here; there is an option for the kernel module
for this.
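
The knob I have in mind is the qla2xxx module parameter for the per-device
queue depth. A sketch of what I would put into a modprobe.d file (the value
8 is just a first guess, and I have not verified that this is the right
parameter for our driver version):

options qla2xxx ql2xmaxqdepth=8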


Regards, Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] high CPU load limits bandwidth?

2010-10-20 Thread Michael Kluge
Disabling checksums boosts the performance to 660 MB/s for a single
thread. Placing 6 IOR processes on my eight-core box now gives, with some
striping, 1.6 GB/s, which is close to the LNET bandwidth. Thanks a lot
again!
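
In case someone finds this thread later: disabling the checksums on the
client can be done with something like this (1.8-style lctl; the setting is
not persistent across remounts):

lctl set_param osc.*.checksums=0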

Michael

Am 20.10.2010 19:13, schrieb Michael Kluge:
> Using O_DIRECT reduces the CPU load but the magical limit of 500 MB/s
> for one thread remains. Are the CRC sums calculated on a per thread
> base? Or stripe base? Is there a way to test the checksumming speed only?
>
>
> Michael
>
> Am 20.10.2010 18:53, schrieb Andreas Dilger:
>> On 2010-10-20, at 10:40, Michael Kluge   wrote:
>>> It is the CPU load on the client. The dd/IOR process is using one core 
>>> completely. The clients and the servers are connected via DDR IB. LNET 
>>> bandwidth is at 1.8 GB/s. Servers have 1.8.3, the client has 1.8.3 
>>> patchless.
>>
>> If you only have a single threaded write, then this is somewhat unavoidable 
>> to saturate a CPU due to copy_from_user().  O_DIRECT will avoid this.
>>
>>Also, disabling data checksums and debugging can help considerably. There 
>> is a patch in bugzilla to add support for h/w crc32c on Nehalem CPUs to 
>> reduce this overhead, but still not as fast as no checksum at all.
>>
>> Cheers, Andreas
>>
>>> Am 20.10.2010 18:15, schrieb Andreas Dilger:
>>>> Is this client CPU or server CPU?  If you are using Ethernet it will 
>>>> definitely be CPU hungry and can easily saturate a single core.
>>>>
>>>> Cheers, Andreas
>>>>
>>>> On 2010-10-20, at 8:41, Michael Kluge
>>>> wrote:
>>>>
>>>>> Hi list,
>>>>>
>>>>> is it normal, that a 'dd' or an 'IOR' pushing 10MB blocks to a lustre
>>>>> file system shows up with a 100% CPU load within 'top'? The reason why I
>>>>> am asking this is that I can write from one client to one OST with 500
>>>>> MB/s. The CPU load will be at 100% in this case. If I stripe over two
>>>>> OSTs (which use different OSS servers and different RAID controllers) I
>>>>> will get 500 as well (seeing 2x250 MB/s on the OSTs). The CPU load will
>>>>> be at 100% again.
>>>>>
>>>>> A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10%
>>>>> CPU load.
>>>>>
>>>>> Are there ways to tune this behavior? Changing max_rpcs_in_flight and
>>>>> max_dirty_mb did not help.
>>>>>
>>>>>
>>>>> Regards, Michael
>>>>>
>>>>> --
>>>>>
>>>>> Michael Kluge, M.Sc.
>>>>>
>>>>> Technische Universität Dresden
>>>>> Center for Information Services and
>>>>> High Performance Computing (ZIH)
>>>>> D-01062 Dresden
>>>>> Germany
>>>>>
>>>>> Contact:
>>>>> Willersbau, Room A 208
>>>>> Phone:  (+49) 351 463-34217
>>>>> Fax:(+49) 351 463-37773
>>>>> e-mail: michael.kl...@tu-dresden.de
>>>>> WWW:http://www.tu-dresden.de/zih
>>>>> ___
>>>>> Lustre-discuss mailing list
>>>>> Lustre-discuss@lists.lustre.org
>>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>
>>>
>>>
>>> --
>>> Michael Kluge, M.Sc.
>>>
>>> Technische Universität Dresden
>>> Center for Information Services and
>>> High Performance Computing (ZIH)
>>> D-01062 Dresden
>>> Germany
>>>
>>> Contact:
>>> Willersbau, Room WIL A 208
>>> Phone:  (+49) 351 463-34217
>>> Fax:(+49) 351 463-37773
>>> e-mail: michael.kl...@tu-dresden.de
>>> WWW:http://www.tu-dresden.de/zih
>>
>
>


-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] high CPU load limits bandwidth?

2010-10-20 Thread Michael Kluge
Using O_DIRECT reduces the CPU load but the magical limit of 500 MB/s
for one thread remains. Are the CRC sums calculated on a per-thread
basis? Or per stripe? Is there a way to test the checksumming speed only?


Michael

Am 20.10.2010 18:53, schrieb Andreas Dilger:
> On 2010-10-20, at 10:40, Michael Kluge  wrote:
>> It is the CPU load on the client. The dd/IOR process is using one core 
>> completely. The clients and the servers are connected via DDR IB. LNET 
>> bandwidth is at 1.8 GB/s. Servers have 1.8.3, the client has 1.8.3 patchless.
>
> If you only have a single threaded write, then this is somewhat unavoidable 
> to saturate a CPU due to copy_from_user().  O_DIRECT will avoid this.
>
>   Also, disabling data checksums and debugging can help considerably. There 
> is a patch in bugzilla to add support for h/w crc32c on Nehalem CPUs to 
> reduce this overhead, but still not as fast as no checksum at all.
>
> Cheers, Andreas
>
>> Am 20.10.2010 18:15, schrieb Andreas Dilger:
>>> Is this client CPU or server CPU?  If you are using Ethernet it will 
>>> definitely be CPU hungry and can easily saturate a single core.
>>>
>>> Cheers, Andreas
>>>
>>> On 2010-10-20, at 8:41, Michael Kluge   wrote:
>>>
>>>> Hi list,
>>>>
>>>> is it normal, that a 'dd' or an 'IOR' pushing 10MB blocks to a lustre
>>>> file system shows up with a 100% CPU load within 'top'? The reason why I
>>>> am asking this is that I can write from one client to one OST with 500
>>>> MB/s. The CPU load will be at 100% in this case. If I stripe over two
>>>> OSTs (which use different OSS servers and different RAID controllers) I
>>>> will get 500 as well (seeing 2x250 MB/s on the OSTs). The CPU load will
>>>> be at 100% again.
>>>>
>>>> A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10%
>>>> CPU load.
>>>>
>>>> Are there ways to tune this behavior? Changing max_rpcs_in_flight and
>>>> max_dirty_mb did not help.
>>>>
>>>>
>>>> Regards, Michael
>>>>
>>>> --
>>>>
>>>> Michael Kluge, M.Sc.
>>>>
>>>> Technische Universität Dresden
>>>> Center for Information Services and
>>>> High Performance Computing (ZIH)
>>>> D-01062 Dresden
>>>> Germany
>>>>
>>>> Contact:
>>>> Willersbau, Room A 208
>>>> Phone:  (+49) 351 463-34217
>>>> Fax:(+49) 351 463-37773
>>>> e-mail: michael.kl...@tu-dresden.de
>>>> WWW:http://www.tu-dresden.de/zih
>>>> ___
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss@lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>
>>
>> --
>> Michael Kluge, M.Sc.
>>
>> Technische Universität Dresden
>> Center for Information Services and
>> High Performance Computing (ZIH)
>> D-01062 Dresden
>> Germany
>>
>> Contact:
>> Willersbau, Room WIL A 208
>> Phone:  (+49) 351 463-34217
>> Fax:(+49) 351 463-37773
>> e-mail: michael.kl...@tu-dresden.de
>> WWW:http://www.tu-dresden.de/zih
>


-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] ldiskfs performance vs. XFS performance

2010-10-20 Thread Michael Kluge
> For your final final filesystem you still probably want to enable async
> journals (unless you are willing to enable the S2A unmirrored device cache).

OK, thanks. We'll give this a try.

Michael

> Most obdecho/obdfilter-survey bugs are gone in 1.8.4, except your ctrl+c
> problem, for which a patch exists:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=21745





>
> Cheers,
> Bernd
>
>
> On Wednesday, October 20, 2010, Michael Kluge wrote:
>> Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device.
>> We trapped into one or two bugs with obdfilter-survey as lctl has at
>> least one bug in 1.8.3 when is uses multiple threads and
>> obdfilter-survey also causes an LBUG when you CTRL+C it. We see 600+
>> MB/s for obdfilter-survey over a reasonable parameter space after we
>> changed to the ext4 based ldiskfs. So that seems to be the trick.
>>
>> Michael
>>
>> Am Montag, den 18.10.2010, 14:04 -0600 schrieb Andreas Dilger:
>>> On 2010-10-18, at 10:40, Johann Lombardi wrote:
>>>> On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote:
>>>>> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
>>>>> mke2fs -O journal_dev -b 4096 $RAM_DEV
>>>>>
>>>>> mkfs.lustre  --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
>>>>> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b
>>>>> 4096 -j -J device=$RAM_DEV" /dev/disk/by-path/...
>>>>>
>>>>> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1
>>>>
>>>> In fact, Lustre uses additional mount options (see "Persistent mount
>>>> opts" in tunefs.lustre output). If your ldiskfs module is based on
>>>> ext3, you should add the extents and mballoc options which are known
>>>> to improve performance.
>>>
>>> Even then, the IO submission path of ext3 from userspace is not very
>>> good, and such a performance difference is not unexpected.  When
>>> submitting IO from userspace to ext3/ldiskfs it is being done in 4kB
>>> blocks, and each block is allocated separately (regardless of mballoc,
>>> unfortunately).  When Lustre is doing IO from the kernel, the client is
>>> aggregating the IO into 1MB chunks and the entire 1MB write is allocated
>>> in one operation.
>>>
>>> That is why we developed the "delalloc" code for ext4 - so that userspace
>>> could also get better IO performance, and utilize the multi-block
>>> allocation (mballoc) routines that have been in ldiskfs for ages, but
>>> only accessible from the kernel.
>>>
>>> For Lustre performance testing, I would suggest looking at lustre-iokit,
>>> and in particular "sgpdd" to test the underlying block device, and then
>>> obdfilter-survey to test the local Lustre IO submission path.
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Lustre Technical Lead
>>> Oracle Corporation Canada Inc.
>
>


-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] high CPU load limits bandwidth?

2010-10-20 Thread Michael Kluge
It is the CPU load on the client. The dd/IOR process is using one core 
completely. The clients and the servers are connected via DDR IB. LNET 
bandwidth is at 1.8 GB/s. Servers have 1.8.3, the client has 1.8.3 
patchless.


Micha

Am 20.10.2010 18:15, schrieb Andreas Dilger:
> Is this client CPU or server CPU?  If you are using Ethernet it will 
> definitely be CPU hungry and can easily saturate a single core.
>
> Cheers, Andreas
>
> On 2010-10-20, at 8:41, Michael Kluge  wrote:
>
>> Hi list,
>>
>> is it normal, that a 'dd' or an 'IOR' pushing 10MB blocks to a lustre
>> file system shows up with a 100% CPU load within 'top'? The reason why I
>> am asking this is that I can write from one client to one OST with 500
>> MB/s. The CPU load will be at 100% in this case. If I stripe over two
>> OSTs (which use different OSS servers and different RAID controllers) I
>> will get 500 as well (seeing 2x250 MB/s on the OSTs). The CPU load will
>> be at 100% again.
>>
>> A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10%
>> CPU load.
>>
>> Are there ways to tune this behavior? Changing max_rpcs_in_flight and
>> max_dirty_mb did not help.
>>
>>
>> Regards, Michael
>>
>> --
>>
>> Michael Kluge, M.Sc.
>>
>> Technische Universität Dresden
>> Center for Information Services and
>> High Performance Computing (ZIH)
>> D-01062 Dresden
>> Germany
>>
>> Contact:
>> Willersbau, Room A 208
>> Phone:  (+49) 351 463-34217
>> Fax:(+49) 351 463-37773
>> e-mail: michael.kl...@tu-dresden.de
>> WWW:http://www.tu-dresden.de/zih
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>


-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] high CPU load limits bandwidth?

2010-10-20 Thread Michael Kluge
Hi list,

is it normal that a 'dd' or an 'IOR' pushing 10MB blocks to a Lustre
file system shows up with a 100% CPU load within 'top'? The reason why I
am asking is that I can write from one client to one OST with 500
MB/s. The CPU load will be at 100% in this case. If I stripe over two
OSTs (which use different OSS servers and different RAID controllers) I
will get 500 MB/s as well (seeing 2x250 MB/s on the OSTs). The CPU load
will be at 100% again.

A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10%
CPU load.

Are there ways to tune this behavior? Changing max_rpcs_in_flight and
max_dirty_mb did not help.
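
For completeness, changing those parameters looks like this on the client
(the values are just examples, not the exact ones I used):

lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param osc.*.max_dirty_mb=256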


Regards, Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] ldiskfs performance vs. XFS performance

2010-10-20 Thread Michael Kluge
Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device.
We ran into one or two bugs with obdfilter-survey, as lctl has at
least one bug in 1.8.3 when it uses multiple threads, and
obdfilter-survey also causes an LBUG when you Ctrl+C it. We see 600+
MB/s for obdfilter-survey over a reasonable parameter space after we
changed to the ext4-based ldiskfs. So that seems to be the trick.
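
In case someone wants to reproduce the numbers: the surveys were driven
roughly like this (a sketch from memory -- the device name, target name and
parameter ranges are placeholders; please check the lustre-iokit README for
the exact variable names):

size=8192 crglo=1 crghi=2 thrlo=1 thrhi=32 scsidevs=/dev/sg2 sgpdd-survey
nobjlo=1 nobjhi=2 thrlo=1 thrhi=64 size=1024 targets="fsname-OST0000" obdfilter-survey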

Michael

Am Montag, den 18.10.2010, 14:04 -0600 schrieb Andreas Dilger: 
> On 2010-10-18, at 10:40, Johann Lombardi wrote:
> > On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote:
> >> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
> >> mke2fs -O journal_dev -b 4096 $RAM_DEV
> >> 
> >> mkfs.lustre  --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
> >> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096
> >> -j -J device=$RAM_DEV" /dev/disk/by-path/...
> >> 
> >> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1
> > 
> > In fact, Lustre uses additional mount options (see "Persistent mount opts" 
> > in tunefs.lustre output).
> > If your ldiskfs module is based on ext3, you should add the extents and 
> > mballoc options which are known to improve performance.
> 
> Even then, the IO submission path of ext3 from userspace is not very good, 
> and such a performance difference is not unexpected.  When submitting IO from 
> userspace to ext3/ldiskfs it is being done in 4kB blocks, and each block is 
> allocated separately (regardless of mballoc, unfortunately).  When Lustre is 
> doing IO from the kernel, the client is aggregating the IO into 1MB chunks 
> and the entire 1MB write is allocated in one operation.
> 
> That is why we developed the "delalloc" code for ext4 - so that userspace 
> could also get better IO performance, and utilize the multi-block allocation 
> (mballoc) routines that have been in ldiskfs for ages, but only accessible 
> from the kernel.
> 
> For Lustre performance testing, I would suggest looking at lustre-iokit, and 
> in particular "sgpdd" to test the underlying block device, and then 
> obdfilter-survey to test the local Lustre IO submission path.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
> 
> 

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] ldiskfs performance vs. XFS performance

2010-10-18 Thread Michael Kluge
Hi list,

we have Lustre 1.8.3 running on a DDN 9900. One LUN (10 disks) formatted
with XFS shows 400 MB/s if stressed with one 'dd' and large block
sizes. One LUN formatted and mounted with ldiskfs (the ext3-based one that
is the default in 1.8.3) shows 110 MB/s. Is this the expected behaviour?
It looks a bit low compared to XFS.

We think that, with help from DDN, we did everything we could from a
hardware perspective. We formatted the LUN with the correct striping and
stripe size, DDN adjusted some controller parameters and we even put the
file system journal on a RAM disk. The LUN has 16 TB capacity. I formatted
only 7 TB for the moment due to the 8 TB limit.

This is what I did:

MDS_NID=...@somewhere
RAM_DEV=/dev/ram1
dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
mke2fs -O journal_dev -b 4096 $RAM_DEV

mkfs.lustre  --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
--mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096
-j -J device=$RAM_DEV" /dev/disk/by-path/...

mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1

Is there a way to push the bandwidth limit for a single data stream any
further?



Michael


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] 1.8/2.6.32 support

2010-10-04 Thread Michael Kluge
Hi,

is there any chance to get 1.8.4 compiled on a 2.6.32+ kernel right
now with the standard Lustre sources that are available through the
download pages? The "build your own kernel" wiki page points to a
collection of supported kernels,
http://downloads.lustre.org/public/kernels/sles11/
which has a 2.6.32 kernel in it, but I could not find a working set of
patches for it. Has anyone been more successful?
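
In case it helps to make the question concrete: for the client side I am
essentially trying a plain patchless build against the 2.6.32 tree, along
these lines (paths are examples):

cd lustre-1.8.4
./configure --with-linux=/usr/src/linux-2.6.32 --disable-server
make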


Michael

Am Montag, den 26.04.2010, 12:11 -0600 schrieb Andreas Dilger: 
> On 2010-03-31, at 10:16, Stephen Willey wrote:
> > Obviously there is no RH-6.0 just yet (at least not beta or release) and as 
> > such 2.6.32 is not on the supported kernels list - obviously fair enough.
> > 
> > There are bugzilla entries with patches for 2.6.32 but these all apply to 
> > HEAD as opposed to the b1_8 branch.  Particularly all the stuff that 
> > applied against libcfs/blah/blah.m4
> > 
> > I'm trying to build an up-to-date patchless 1.8 client for Fedora 12 
> > (2.6.32) and given a few hours to mash patches from HEAD into b1_8, it's 
> > doable, albeit hacky (I'm not a programmer) whereas I can compile HEAD 
> > almost without modification.
> > 
> > Is it the intention to backport these various changes into b1_8 or is that 
> > more or less as-is now until the release of 2.0?  We're in a bit of an 
> > awkward place since we can't compile 1.6.7.2 on 2.6.32 and 2.0 is still not 
> > in a production state.
> 
> There is work going on in bugzilla for b1_8 SLES11 SP1(?) kernel support, 
> which will hopefully also be usable for RHEL6, when it is available.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
> 
> ___________
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] ls does not work on ram disk for normal user

2010-09-22 Thread Michael Kluge
Ahh. This user has different UIDs on the clients and the server. Do they
actually have to be the same? I thought the MDS and the OSS servers just store
files with the uid/gid as reported by the client. I did not realize that the
servers need to map these UIDs to a user name.
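
A quick way to compare what each side resolves (just the numeric IDs) would
be to run this on a client and on the MDS and compare the output:

id mkluge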

Michael


Am 22.09.2010 um 10:57 schrieb Thomas Roth:

> Hi Michael,
> 
> "Identifier removed" occured to me when the user data base was not accessible 
> by
> the MDS - when the MDS didn't know about any normal user. "root" is of course 
> known
> there, but what does e.g. "id mkluge" say on your MDS?
> 
> Regards,
> Thomas
> 
> On 09/22/2010 10:29 AM, Michael Kluge wrote:
>> Hi all,
>> 
>> I have a 1.8.3 running on a couple of servers connected via IB to a
>> small cluster. To test the network performance I have one MDS and 14 OST
>> residing in ram disks. One the client it is mounted on /lustre.
>> 
>> I have a file in this directory (created as root and then chown'ed to
>> 'mkluge'):
>> 
>> mkl...@r2i0n0:~>  ls -la /lustre/dfddd/ball
>> -rw-r--r-- 1 mkluge zih 14680064000 2010-09-22 10:14 /lustre/dfddd/ball
>> mkl...@r2i0n0:~>  cd /lustre/dfddd/
>> mkl...@r2i0n0:/lustre/dfddd>  ls
>> /bin/ls: .: Identifier removed
>> mkl...@r2i0n0:/lustre/dfddd>  ls -la
>> /bin/ls: .: Identifier removed
>> 
>> Has anyone an idea what this could be? I can't event create a directory
>> in /lustre
>> 
>> mkl...@r2i0n0:~>  mkdir /lustre/ww
>> mkdir: cannot create directory `/lustre/ww': Identifier removed
>> 
>> 'root' is able to create the directory. Setting permissions to '777' or
>> '1777' does not help either.
>> 
>> The MDS was formated to use mgt and mgs from the same ram device.
>> 
>> 
>> Regards, Michael
>> 
>> 
>> 
>> 
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> -- 
> 
> Thomas Roth
> Department: Informationstechnologie
> Location: SB3 1.262
> Phone: +49-6159-71 1453  Fax: +49-6159-71 2986
> 
> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> Planckstraße 1
> 64291 Darmstadt
> www.gsi.de
> 
> Gesellschaft mit beschränkter Haftung
> Sitz der Gesellschaft: Darmstadt
> Handelsregister: Amtsgericht Darmstadt, HRB 1528
> 
> Geschäftsführung: Professor Dr. Dr. h.c. Horst Stöcker,
> Dr. Hartmut Eickhoff
> 
> Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph
> Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
> 


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] ls does not work on ram disk for normal user

2010-09-22 Thread Michael Kluge
Hi all,

I have a 1.8.3 running on a couple of servers connected via IB to a
small cluster. To test the network performance I have one MDS and 14 OSTs
residing in RAM disks. On the client it is mounted on /lustre.

I have a file in this directory (created as root and then chown'ed to
'mkluge'):

mkl...@r2i0n0:~> ls -la /lustre/dfddd/ball
-rw-r--r-- 1 mkluge zih 14680064000 2010-09-22 10:14 /lustre/dfddd/ball
mkl...@r2i0n0:~> cd /lustre/dfddd/
mkl...@r2i0n0:/lustre/dfddd> ls
/bin/ls: .: Identifier removed
mkl...@r2i0n0:/lustre/dfddd> ls -la
/bin/ls: .: Identifier removed

Does anyone have an idea what this could be? I can't even create a directory
in /lustre:

mkl...@r2i0n0:~> mkdir /lustre/ww
mkdir: cannot create directory `/lustre/ww': Identifier removed

'root' is able to create the directory. Setting permissions to '777' or 
'1777' does not help either.

The MDS was formatted with the MGS and MDT on the same RAM device.


Regards, Michael


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet router tuning

2010-09-13 Thread Michael Kluge
Hi Eric,

--concurrency 2 already boosted the performance to 1026 MB/s. I don't think 
we'll get any more out of this :)
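
For the archives: the only change needed in the lst script quoted below is
the add_test line, i.e. something like

lst add_test --batch bulk_rw --concurrency 2 --from writers --to readers brw read check=simple size=1M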


Thanks a lot, Michael

Am 13.09.2010 um 07:55 schrieb Eric Barton:

> Michael,
>  
> I think you may have only got 1 BRW READ in flight at a time with this script,
> so I would expect the routed throughput to be getting on for half of direct
> throughput.  Can you try “--concurrency 8” to simulate the number of I/Os
> a real client would keep in flight?
>  
> Cheers,
>        Eric
>  
>  From: Michael Kluge [mailto:michael.kl...@tu-dresden.de] 
> Sent: 13 September 2010 10:35 PM
> To: Eric Barton
> Cc: 'Lustre Diskussionsliste'
> Subject: Re: [Lustre-discuss] lnet router tuning
>  
> Hi Eric,
>  
> basically right now I have one IB node, one 10GE node and one router node 
> that has both types of network interfaces.
>  
> I've got a small lnet test script on the router node, that does the work:
> export LST_SESSION=$$
> lst new_session rw
> lst add_group readers 192.168.1...@tcp
> lst add_group writers 10.148.0...@o2ib
> lst add_batch bulk_rw
> lst add_test --batch bulk_rw --from writers --to readers brw read 
> check=simple size=1M
> lst run bulk_rw
> lst stat writers & sleep 30; kill $!
> lst end_session
>  
> Is there a way to figure out the messages in flight? I remember to have a 
> "rpc's in flight" tunable but this is connected to the OSC layer which does 
> not do anything in my case (I think).
>  
>  
> Michael
>  
>  
>  
> Am 13.09.2010 um 03:08 schrieb Eric Barton:
> 
> 
>  
> Michael,
>  
>  
> How are you generating load and measuring the throughput?   I’m particularly 
> interested in the number
> of nodes on each side of the router and how many messages you have in flight 
> between each one.
>  
>  
> Cheers,
>    Eric
>  
>  
>  
>  
> From: lustre-discuss-boun...@lists.lustre.org 
> [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Michael Kluge
> Sent: 11 September 2010 12:56 AM
> To: Michael Kluge
> Cc: Lustre Diskussionsliste
> Subject: Re: [Lustre-discuss] lnet router tuning
>  
> And here are my params:
>  
> r...@doss05:/home/tests/lnet# for F in /sys/module/lnet/parameters/* ; do 
> echo -n "$F: "; cat $F ; done
> /sys/module/lnet/parameters/accept: secure
> /sys/module/lnet/parameters/accept_backlog: 127
> /sys/module/lnet/parameters/accept_port: 988
> /sys/module/lnet/parameters/accept_timeout: 5
> /sys/module/lnet/parameters/auto_down: 1
> /sys/module/lnet/parameters/avoid_asym_router_failure: 0
> /sys/module/lnet/parameters/check_routers_before_use: 0
> /sys/module/lnet/parameters/config_on_load: 0
> /sys/module/lnet/parameters/dead_router_check_interval: 0
> /sys/module/lnet/parameters/forwarding: enabled
> /sys/module/lnet/parameters/ip2nets: 
> /sys/module/lnet/parameters/large_router_buffers: 512
> /sys/module/lnet/parameters/live_router_check_interval: 0
> /sys/module/lnet/parameters/local_nid_dist_zero: 1
> /sys/module/lnet/parameters/networks: tcp0(eth2),o2ib(ib1)
> /sys/module/lnet/parameters/peer_buffer_credits: 0
> /sys/module/lnet/parameters/portals_compatibility: none
> /sys/module/lnet/parameters/router_ping_timeout: 50
> /sys/module/lnet/parameters/routes: 
> /sys/module/lnet/parameters/small_router_buffers: 8192
> /sys/module/lnet/parameters/tiny_router_buffers: 1024
>  
> I have not used ip2nets but configure routing but put explict routing 
> statements into the modprobe.d/ files. Is that OK? 
>  
>  
> Michael
>  
>  
> Am 10.09.2010 um 17:48 schrieb Michael Kluge:
> 
> 
> 
> OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s, with 
> additional lnet router I see 550 MB/s. Time for lnet tuning?
>  
> Michael
> 
> 
> 
> Hi Andreas,
>  
> Am 10.09.2010 um 16:35 schrieb Andreas Dilger:
> 
> 
> 
> On 2010-09-10, at 08:23, Michael Kluge wrote:
> 
> 
> I have a Lustre 1.8.3 setup where I'd like to some lnet router performance 
> tests with routing between DDR IB<->10GE networks. Currently I have three 
> nodes, one with DDR IB, one with 10GE and one with both that does the 
> routing. A first short lnet test shows 520-550 MB/s performance.
>  
> Has anyone an idea which of the variables of the lnet module are worth 
> playing with to get this number a bit closer to 1GB/s?
> 
> I would start by testing the performance on just the 10GigE side, and then 
> separately on the IB side, to verify you are getting the expected performance 
> from the components before trying them both together.  Often it is necessary 
> to tune the ethernet s

Re: [Lustre-discuss] lnet router tuning

2010-09-13 Thread Michael Kluge
Nic,

thanks a lot. That made my day.


Michael

Am 13.09.2010 um 06:49 schrieb Nic Henke:

> On 09/13/2010 08:35 AM, Michael Kluge wrote:
>> Hi Eric,
>> 
>> basically right now I have one IB node, one 10GE node and one router
>> node that has both types of network interfaces.
>> 
>> I've got a small lnet test script on the router node, that does the work:
>> export LST_SESSION=$$
>> lst new_session rw
>> lst add_group readers 192.168.1...@tcp
>> lst add_group writers 10.148.0...@o2ib
>> lst add_batch bulk_rw
>> lst add_test --batch bulk_rw --from writers --to readers brw read
>> check=simple size=1M
>> lst run bulk_rw
>> lst stat writers & sleep 30; kill $!
>> lst end_session
>> 
>> Is there a way to figure out the messages in flight? I remember to have
>> a "rpc's in flight" tunable but this is connected to the OSC layer which
>> does not do anything in my case (I think).
> 
> If you don't specify --concurrency to the 'lst add_test', you get 1 RPC 
> in flight.
> 
> Nic
> _______
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet router tuning

2010-09-13 Thread Michael Kluge
Hi Eric,

basically right now I have one IB node, one 10GE node and one router node that 
has both types of network interfaces.

I've got a small lnet test script on the router node, that does the work:
export LST_SESSION=$$
lst new_session rw
lst add_group readers 192.168.1...@tcp
lst add_group writers 10.148.0...@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from writers --to readers brw read check=simple 
size=1M
lst run bulk_rw
lst stat writers & sleep 30; kill $!
lst end_session

Is there a way to figure out the messages in flight? I remember to have a 
"rpc's in flight" tunable but this is connected to the OSC layer which does not 
do anything in my case (I think).


Michael



Am 13.09.2010 um 03:08 schrieb Eric Barton:

>  
> Michael,
>  
>  
> How are you generating load and measuring the throughput?   I’m particularly 
> interested in the number
> of nodes on each side of the router and how many messages you have in flight 
> between each one.
>  
>  
> Cheers,
>Eric
>  
>  
>  
>  
> From: lustre-discuss-boun...@lists.lustre.org 
> [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Michael Kluge
> Sent: 11 September 2010 12:56 AM
> To: Michael Kluge
> Cc: Lustre Diskussionsliste
> Subject: Re: [Lustre-discuss] lnet router tuning
>  
> And here are my params:
>  
> r...@doss05:/home/tests/lnet# for F in /sys/module/lnet/parameters/* ; do 
> echo -n "$F: "; cat $F ; done
> /sys/module/lnet/parameters/accept: secure
> /sys/module/lnet/parameters/accept_backlog: 127
> /sys/module/lnet/parameters/accept_port: 988
> /sys/module/lnet/parameters/accept_timeout: 5
> /sys/module/lnet/parameters/auto_down: 1
> /sys/module/lnet/parameters/avoid_asym_router_failure: 0
> /sys/module/lnet/parameters/check_routers_before_use: 0
> /sys/module/lnet/parameters/config_on_load: 0
> /sys/module/lnet/parameters/dead_router_check_interval: 0
> /sys/module/lnet/parameters/forwarding: enabled
> /sys/module/lnet/parameters/ip2nets: 
> /sys/module/lnet/parameters/large_router_buffers: 512
> /sys/module/lnet/parameters/live_router_check_interval: 0
> /sys/module/lnet/parameters/local_nid_dist_zero: 1
> /sys/module/lnet/parameters/networks: tcp0(eth2),o2ib(ib1)
> /sys/module/lnet/parameters/peer_buffer_credits: 0
> /sys/module/lnet/parameters/portals_compatibility: none
> /sys/module/lnet/parameters/router_ping_timeout: 50
> /sys/module/lnet/parameters/routes: 
> /sys/module/lnet/parameters/small_router_buffers: 8192
> /sys/module/lnet/parameters/tiny_router_buffers: 1024
>  
> I have not used ip2nets but configure routing but put explict routing 
> statements into the modprobe.d/ files. Is that OK? 
>  
>  
> Michael
>  
>  
> Am 10.09.2010 um 17:48 schrieb Michael Kluge:
> 
> 
> OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s, with 
> additional lnet router I see 550 MB/s. Time for lnet tuning?
>  
> Michael
> 
> 
> Hi Andreas,
>  
> Am 10.09.2010 um 16:35 schrieb Andreas Dilger:
> 
> 
> On 2010-09-10, at 08:23, Michael Kluge wrote:
> 
> I have a Lustre 1.8.3 setup where I'd like to some lnet router performance 
> tests with routing between DDR IB<->10GE networks. Currently I have three 
> nodes, one with DDR IB, one with 10GE and one with both that does the 
> routing. A first short lnet test shows 520-550 MB/s performance.
>  
> Has anyone an idea which of the variables of the lnet module are worth 
> playing with to get this number a bit closer to 1GB/s?
> 
> I would start by testing the performance on just the 10GigE side, and then 
> separately on the IB side, to verify you are getting the expected performance 
> from the components before trying them both together.  Often it is necessary 
> to tune the ethernet send/receive buffers.
>  
> Ethernet back to back is at 950 MB/s. I have not looked at IB back to back 
> yet.
>  
>  
> Michael
> 
> -- 
> 
> Michael Kluge, M.Sc.
> 
> Technische Universität Dresden
> Center for Information Services and
> High Performance Computing (ZIH)
> D-01062 Dresden
> Germany
> 
> Contact:
> Willersbau, Room WIL A 208
> Phone:  (+49) 351 463-34217
> Fax:(+49) 351 463-37773
> e-mail: michael.kl...@tu-dresden.de
> WWW:http://www.tu-dresden.de/zih
>  
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>  
> 
> -- 
> 
> Michael Kluge, M.Sc.
> 
> Technische Universität Dresden
> Center for Information Services and
> High Performance Computing (ZIH)
> D-01062 Dresden
> Germany
> 
> Conta

Re: [Lustre-discuss] lnet router tuning

2010-09-10 Thread Michael Kluge
Does anyone else have a 10GE<->IB Lustre router? What are typical
performance numbers? How close do you get to 1 GB/s?

Michael


Am 10.09.2010 17:55, schrieb Michael Kluge:
> And here are my params:
>
> r...@doss05:/home/tests/lnet# for F in /sys/module/lnet/parameters/* ;
> do echo -n "$F: "; cat $F ; done
> /sys/module/lnet/parameters/accept: secure
> /sys/module/lnet/parameters/accept_backlog: 127
> /sys/module/lnet/parameters/accept_port: 988
> /sys/module/lnet/parameters/accept_timeout: 5
> /sys/module/lnet/parameters/auto_down: 1
> /sys/module/lnet/parameters/avoid_asym_router_failure: 0
> /sys/module/lnet/parameters/check_routers_before_use: 0
> /sys/module/lnet/parameters/config_on_load: 0
> /sys/module/lnet/parameters/dead_router_check_interval: 0
> /sys/module/lnet/parameters/forwarding: enabled
> /sys/module/lnet/parameters/ip2nets:
> /sys/module/lnet/parameters/large_router_buffers: 512
> /sys/module/lnet/parameters/live_router_check_interval: 0
> /sys/module/lnet/parameters/local_nid_dist_zero: 1
> /sys/module/lnet/parameters/networks: tcp0(eth2),o2ib(ib1)
> /sys/module/lnet/parameters/peer_buffer_credits: 0
> /sys/module/lnet/parameters/portals_compatibility: none
> /sys/module/lnet/parameters/router_ping_timeout: 50
> /sys/module/lnet/parameters/routes:
> /sys/module/lnet/parameters/small_router_buffers: 8192
> /sys/module/lnet/parameters/tiny_router_buffers: 1024
>
> I have not used ip2nets but configure routing but put explict routing
> statements into the modprobe.d/ files. Is that OK?
>
>
> Michael
>
>
> Am 10.09.2010 um 17:48 schrieb Michael Kluge:
>
>> OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s,
>> with additional lnet router I see 550 MB/s. Time for lnet tuning?
>>
>> Michael
>>
>>> Hi Andreas,
>>>
>>> Am 10.09.2010 um 16:35 schrieb Andreas Dilger:
>>>
>>>> On 2010-09-10, at 08:23, Michael Kluge wrote:
>>>>> I have a Lustre 1.8.3 setup where I'd like to some lnet router
>>>>> performance tests with routing between DDR IB<->10GE networks.
>>>>> Currently I have three nodes, one with DDR IB, one with 10GE and
>>>>> one with both that does the routing. A first short lnet test shows
>>>>> 520-550 MB/s performance.
>>>>>
>>>>> Has anyone an idea which of the variables of the lnet module are
>>>>> worth playing with to get this number a bit closer to 1GB/s?
>>>>
>>>> I would start by testing the performance on just the 10GigE side,
>>>> and then separately on the IB side, to verify you are getting the
>>>> expected performance from the components before trying them both
>>>> together. Often it is necessary to tune the ethernet send/receive
>>>> buffers.
>>>
>>> Ethernet back to back is at 950 MB/s. I have not looked at IB back to
>>> back yet.
>>>
>>>
>>> Michael
>>>
>>> --
>>>
>>> Michael Kluge, M.Sc.
>>>
>>> Technische Universität Dresden
>>> Center for Information Services and
>>> High Performance Computing (ZIH)
>>> D-01062 Dresden
>>> Germany
>>>
>>> Contact:
>>> Willersbau, Room WIL A 208
>>> Phone: (+49) 351 463-34217
>>> Fax: (+49) 351 463-37773
>>> e-mail: michael.kl...@tu-dresden.de <mailto:michael.kl...@tu-dresden.de>
>>> WWW: http://www.tu-dresden.de/zih
>>>
>>> ___
>>> Lustre-discuss mailing list
>>> Lustre-discuss@lists.lustre.org <mailto:Lustre-discuss@lists.lustre.org>
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>> --
>>
>> Michael Kluge, M.Sc.
>>
>> Technische Universität Dresden
>> Center for Information Services and
>> High Performance Computing (ZIH)
>> D-01062 Dresden
>> Germany
>>
>> Contact:
>> Willersbau, Room WIL A 208
>> Phone: (+49) 351 463-34217
>> Fax: (+49) 351 463-37773
>> e-mail: michael.kl...@tu-dresden.de <mailto:michael.kl...@tu-dresden.de>
>> WWW: http://www.tu-dresden.de/zih
>>
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org <mailto:Lustre-discuss@lists.lustre.org>
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> --
>
> Michael Kluge, M.Sc.
>
> Technische Universität Dresden
> Center for Information Services and
> High Performance Computing (ZIH)
> D-01062 Dresden
> Germany
>
> Contact:
> Willersbau, Room WIL A 208
> Phone: (+49) 351 463-34217
> Fax: (+49) 351 463-37773
> e-mail: michael.kl...@tu-dresden.de <mailto:michael.kl...@tu-dresden.de>
> WWW: http://www.tu-dresden.de/zih
>
>
>
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet router tuning

2010-09-10 Thread Michael Kluge
And here are my params:

r...@doss05:/home/tests/lnet# for F in /sys/module/lnet/parameters/* ; do echo 
-n "$F: "; cat $F ; done
/sys/module/lnet/parameters/accept: secure
/sys/module/lnet/parameters/accept_backlog: 127
/sys/module/lnet/parameters/accept_port: 988
/sys/module/lnet/parameters/accept_timeout: 5
/sys/module/lnet/parameters/auto_down: 1
/sys/module/lnet/parameters/avoid_asym_router_failure: 0
/sys/module/lnet/parameters/check_routers_before_use: 0
/sys/module/lnet/parameters/config_on_load: 0
/sys/module/lnet/parameters/dead_router_check_interval: 0
/sys/module/lnet/parameters/forwarding: enabled
/sys/module/lnet/parameters/ip2nets: 
/sys/module/lnet/parameters/large_router_buffers: 512
/sys/module/lnet/parameters/live_router_check_interval: 0
/sys/module/lnet/parameters/local_nid_dist_zero: 1
/sys/module/lnet/parameters/networks: tcp0(eth2),o2ib(ib1)
/sys/module/lnet/parameters/peer_buffer_credits: 0
/sys/module/lnet/parameters/portals_compatibility: none
/sys/module/lnet/parameters/router_ping_timeout: 50
/sys/module/lnet/parameters/routes: 
/sys/module/lnet/parameters/small_router_buffers: 8192
/sys/module/lnet/parameters/tiny_router_buffers: 1024

I have not used ip2nets to configure routing but put explicit routing
statements into the modprobe.d/ files. Is that OK?


Michael


Am 10.09.2010 um 17:48 schrieb Michael Kluge:

> OK, IB back to back is at 1,2 GB/s, 10GE back to back at 950 MB/s, with 
> additional lnet router I see 550 MB/s. Time for lnet tuning?
> 
> Michael
> 
>> Hi Andreas,
>> 
>> Am 10.09.2010 um 16:35 schrieb Andreas Dilger:
>> 
>>> On 2010-09-10, at 08:23, Michael Kluge wrote:
>>>> I have a Lustre 1.8.3 setup where I'd like to some lnet router performance 
>>>> tests with routing between DDR IB<->10GE networks. Currently I have three 
>>>> nodes, one with DDR IB, one with 10GE and one with both that does the 
>>>> routing. A first short lnet test shows 520-550 MB/s performance. 
>>>> 
>>>> Has anyone an idea which of the variables of the lnet module are worth 
>>>> playing with to get this number a bit closer to 1GB/s? 
>>> 
>>> I would start by testing the performance on just the 10GigE side, and then 
>>> separately on the IB side, to verify you are getting the expected 
>>> performance from the components before trying them both together.  Often it 
>>> is necessary to tune the ethernet send/receive buffers.
>> 
>> Ethernet back to back is at 950 MB/s. I have not looked at IB back to back 
>> yet.
>> 
>> 
>> Michael
>> 
>> -- 
>> 
>> Michael Kluge, M.Sc.
>> 
>> Technische Universität Dresden
>> Center for Information Services and
>> High Performance Computing (ZIH)
>> D-01062 Dresden
>> Germany
>> 
>> Contact:
>> Willersbau, Room WIL A 208
>> Phone:  (+49) 351 463-34217
>> Fax:(+49) 351 463-37773
>> e-mail: michael.kl...@tu-dresden.de
>> WWW:http://www.tu-dresden.de/zih
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 
> -- 
> 
> Michael Kluge, M.Sc.
> 
> Technische Universität Dresden
> Center for Information Services and
> High Performance Computing (ZIH)
> D-01062 Dresden
> Germany
> 
> Contact:
> Willersbau, Room WIL A 208
> Phone:  (+49) 351 463-34217
> Fax:(+49) 351 463-37773
> e-mail: michael.kl...@tu-dresden.de
> WWW:http://www.tu-dresden.de/zih
> 
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet router tuning

2010-09-10 Thread Michael Kluge
OK, IB back to back is at 1.2 GB/s, 10GE back to back at 950 MB/s; with
the additional lnet router in between I see 550 MB/s. Time for lnet tuning?

Michael

> Hi Andreas,
> 
> Am 10.09.2010 um 16:35 schrieb Andreas Dilger:
> 
>> On 2010-09-10, at 08:23, Michael Kluge wrote:
>>> I have a Lustre 1.8.3 setup where I'd like to some lnet router performance 
>>> tests with routing between DDR IB<->10GE networks. Currently I have three 
>>> nodes, one with DDR IB, one with 10GE and one with both that does the 
>>> routing. A first short lnet test shows 520-550 MB/s performance. 
>>> 
>>> Has anyone an idea which of the variables of the lnet module are worth 
>>> playing with to get this number a bit closer to 1GB/s? 
>> 
>> I would start by testing the performance on just the 10GigE side, and then 
>> separately on the IB side, to verify you are getting the expected 
>> performance from the components before trying them both together.  Often it 
>> is necessary to tune the ethernet send/receive buffers.
> 
> Ethernet back to back is at 950 MB/s. I have not looked at IB back to back 
> yet.
> 
> 
> Michael
> 
> -- 
> 
> Michael Kluge, M.Sc.
> 
> Technische Universität Dresden
> Center for Information Services and
> High Performance Computing (ZIH)
> D-01062 Dresden
> Germany
> 
> Contact:
> Willersbau, Room WIL A 208
> Phone:  (+49) 351 463-34217
> Fax:(+49) 351 463-37773
> e-mail: michael.kl...@tu-dresden.de
> WWW:http://www.tu-dresden.de/zih
> 
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet router tuning

2010-09-10 Thread Michael Kluge
Hi Andreas,

Am 10.09.2010 um 16:35 schrieb Andreas Dilger:

> On 2010-09-10, at 08:23, Michael Kluge wrote:
>> I have a Lustre 1.8.3 setup where I'd like to some lnet router performance 
>> tests with routing between DDR IB<->10GE networks. Currently I have three 
>> nodes, one with DDR IB, one with 10GE and one with both that does the 
>> routing. A first short lnet test shows 520-550 MB/s performance. 
>> 
>> Has anyone an idea which of the variables of the lnet module are worth 
>> playing with to get this number a bit closer to 1GB/s? 
> 
> I would start by testing the performance on just the 10GigE side, and then 
> separately on the IB side, to verify you are getting the expected performance 
> from the components before trying them both together.  Often it is necessary 
> to tune the ethernet send/receive buffers.

Ethernet back to back is at 950 MB/s. I have not looked at IB back to back yet.
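
In case the ethernet send/receive buffer tuning Andreas mentions has not been
done yet, a common starting point for 10GE looks like this (illustrative
values only, not tested recommendations):

# on both 10GE endpoints
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"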


Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] lnet router tuning

2010-09-10 Thread Michael Kluge
Hi all,

I have a Lustre 1.8.3 setup where I'd like to do some lnet router
performance tests with routing between DDR IB<->10GE networks. Currently
I have three nodes, one with DDR IB, one with 10GE and one with both
that does the routing. A first short lnet test shows 520-550 MB/s
performance.

Does anyone have an idea which of the lnet module variables are worth
playing with to get this number a bit closer to 1 GB/s?

parm:   tiny_router_buffers:# of 0 payload messages to buffer in the router (int)
parm:   small_router_buffers:# of small (1 page) messages to buffer in the router (int)
parm:   large_router_buffers:# of large messages to buffer in the router (int)
parm:   peer_buffer_credits:# router buffer credits per peer (int)
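
For reference, these could be bumped via the lnet module options on the
router. A minimal sketch; the interface names and values are purely
illustrative placeholders, not tested recommendations, and the Lustre
modules need to be reloaded for the change to take effect:

# /etc/modprobe.conf on the router node
options lnet networks="o2ib0(ib0),tcp0(eth2)" forwarding="enabled" small_router_buffers=16384 large_router_buffers=2048 peer_buffer_credits=32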

The CPU on the router node is less utilized than it was during the back
to back 10GE tests. I have 6 cores in the machine; 5 were idle and
one showed a load of about 60%.


Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] O_DIRECT

2010-08-14 Thread Michael Kluge
Hi all,

how does Lustre handle write() requests to files opened with O_DIRECT?
Does the OSS enforce that the OST has physically written the data to disk
before the op is completed, or does the write() call return on the
client before this? I do not see the whole file content passing through
the FC port of the RAID controller, but it could also be that my
measurement is wrong ...
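
For anyone who wants to poke at this, a rough way to compare direct and
buffered writes from a client while watching the FC/RAID traffic (path and
size are placeholders):

# write with O_DIRECT
dd if=/dev/zero of=/mnt/lustre/odirect_test bs=1M count=1024 oflag=direct
# for comparison, the same write through the page cache
dd if=/dev/zero of=/mnt/lustre/buffered_test bs=1M count=1024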


Michael


-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Complete lnet routing example

2010-06-24 Thread Michael Kluge
Hi Josh,

thanks a lot!
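
Just to add for the archive: once such a setup is in place, the routing can
be sanity-checked from a client with something like the following (the NID is
a placeholder for a real server NID on the other side of the router):

lctl list_nids
lctl ping 192.168.1.10@tcp0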


Michael

On 24.06.2010 at 15:40, Joshua Walgenbach wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Hi Michael,
> 
> This is what I'm using on my test systems:
> 
> I have the servers set up on 192.168.1.0/24 and clients set up on
> 192.168.2.0/24, with no network routing between them and a lustre router
> bridging the two networks with ip addresses of 192.168.1.31 and
> 192.168.2.31. I've a attached a quick diagram.
> 
> modprobe.conf for MDS and OSS servers:
> 
> options lnet networks="tcp0(eth2)" routes="tcp1 192.168.1...@tcp0"
> 
> modprobe.conf for router:
> 
> options lnet networks="tcp0(eth2), tcp1(eth3)" forwarding="enabled"
> 
> modprobe.conf for clients:
> 
> options lnet networks="tcp1(eth2)" routes="tcp0 192.168.2...@tcp1"
> 
> What I have is pretty minimal, but it gets the job done.
> 
> - -Josh
> 
> On 06/24/2010 06:15 AM, Michael Kluge wrote:
>> Hi there,
>> 
>> does anyone have a complete lnet routing example that he/she wants to
>> share that contains a network diagram and all modprobe.conf options for
>> clients, servers and the routers? I found only one mail in the mailing
>> list archive, and the interesting parts have gone through a filter, so a
>> lot of the configuration options now read '[EMAIL PROTECTED]'.
>> 
>> 
>> Thanks a lot in advance,
>> Michael
>> 
>> -- 
>> 
>> Michael Kluge, M.Sc.
>> 
>> Technische Universität Dresden
>> Center for Information Services and
>> High Performance Computing (ZIH)
>> D-01062 Dresden
>> Germany
>> 
>> Contact:
>> Willersbau, Room WIL A 208
>> Phone:  (+49) 351 463-34217
>> Fax:(+49) 351 463-37773
>> e-mail: michael.kl...@tu-dresden.de <mailto:michael.kl...@tu-dresden.de>
>> WWW:http://www.tu-dresden.de/zih
>> 
>> 
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iEYEARECAAYFAkwjYEIACgkQcqyJPuRTYp9tTACeIGttWBu44dc4SKB/0IIjHhF9
> i3QAn17sBD38/3MdsYuiGcUOruZVS8j/
> =SLQp
> -END PGP SIGNATURE-
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Complete lnet routing example

2010-06-24 Thread Michael Kluge
Hi there,

does anyone have a complete lnet routing example that he/she wants to share 
that contains a network diagram and all modprobe.conf options for clients, 
servers and the routers? I found only one mail in the mailing list archive, and 
the interesting parts have gone through a filter, so a lot of the configuration 
options now read '[EMAIL PROTECTED]'.


Thanks a lot in advance,
Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Michael Kluge
LMT (http://code.google.com/p/lmt) might be able to give some hints if
users are using the FS in a 'wild' fashion. For the question "what can
cause this behaviour of my MDS" I guess the answer is: a million
things ;) There is no way of being more specific without more input about
the problem itself.
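
If LMT is not an option, sampling the MDS request counters by hand can also
help to see what kind of metadata load is coming in. A rough sketch; the
/proc paths are from the 1.6.x era and may differ on other versions:

# on the MDS: take two samples and compare the per-op counters
cat /proc/fs/lustre/mds/*/stats > /tmp/mds_stats.1
sleep 10
cat /proc/fs/lustre/mds/*/stats > /tmp/mds_stats.2
diff /tmp/mds_stats.1 /tmp/mds_stats.2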

Michael

On Friday, 09.10.2009, 16:15 +0200, Arne Brutschy wrote:
> Hi,
> 
> thanks for replying!
> 
> I understand that without further information we can't do much about the
> oopses. I was more hoping for some information regarding possible
> sources of such an overload. Is it normal that a MDS gets overloaded
> like this, while the OSTs have nothing to do, and what can I do about
> it? How can I find the source of the problem?
> 
> More specifically, what are the operations that lead to a lot of MDS
> load and none for the OSTs? Although our MDS (8GB ram, 2x4core, SATA) is
> not a top-notch server, it's fairly recent and I feel the load we're
> experiencing is not handleable by a single MDS.
> 
> My problem is that I can't make out major problems in the user's jobs
> running on the cluster, and I can't quantify nor track down the problem
> because I don't know what behavior might have caused it. 
> 
> As I said, oopses appeared only twice, and all other problems were
> just apparent through a non-responsive MDS.
> 
> Thanks,
> Arne
> 
> 
> On Fri, 2009-10-09 at 07:44 -0400, Brian J. Murrell wrote:
> > On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote:
> > > 
> > > The clients showed the following error:
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 
> > > > 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > > > messages
> > > 
> > > So, my question is: what could cause such a load? The cluster was not
> > > excessively used... Is this a bug or a user's job that creates the load?
> > > How can I protect lustre against this kind of failure?
> > 
> > Without any more information we could not possibly know.  If you really
> > are getting oopses then you will need console logs (i.e. serial console)
> > so that we can see the stack trace.
> > 
> > b.
> > 
> > ___
> > Lustre-discuss mailing list
> > Lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Michael Kluge
Hmm. That should be enough. I guess you need to set up a loghost for syslog
then, and a reliable serial console to get stack traces. Everything else
would be just a wild guess (as the question about the RAM size was).

Michael

> Hi,
> 
> 8GB of ram, 2x 4core Intel Xeon E5410 @ 2.33GHz
> 
> Arne
> 
> On Fri, 2009-10-09 at 12:16 +0200, Michael Kluge wrote:
> > Hi Arne,
> > 
> > could be memory pressure and the OOM killer running and shooting at things.
> > How much memory does your server have?
> > 
> > 
> > Michael
> > 
> > On Friday, 09.10.2009, 10:26 +0200, Arne Brutschy wrote:
> > > Hi everyone,
> > > 
> > > 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1
> > > MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux
> > > 2.6.9-78.0.22.
> > > 
> > > We were quite happy with lustre's performance, especially because
> > > bottlenecks caused by /home disk access were history.
> > > 
> > > Saturday, the cluster went down (= was inaccessible). After some
> > > investigation I found out that the reason seems to be an overloaded MDS.
> > > Over the following 4 days, this happened multiple times and could only
> > > be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.
> > > 
> > > The MDS did not respond to any command; if I managed to get a video
> > > signal (not often), the load was >170. Additionally, a kernel oops was
> > > displayed twice, but unfortunately I have no record of them.
> > > 
> > > The clients showed the following error:
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 
> > > > 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > > > messages
> > > 
> > > So, my question is: what could cause such a load? The cluster was not
> > > excessively used... Is this a bug or a user's job that creates the load?
> > > How can I protect lustre against this kind of failure?
> > > 
> > > Thanks in advance,
> > > Arne 
> > > 
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Michael Kluge
Hi Arne,

could be memory pressure and the OOM killer running and shooting at things.
How much memory does your server have?
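
A quick way to check for that on the MDS while it is still reachable, using
nothing Lustre-specific:

free -m
dmesg | grep -i oom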


Michael

On Friday, 09.10.2009, 10:26 +0200, Arne Brutschy wrote:
> Hi everyone,
> 
> 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1
> MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux
> 2.6.9-78.0.22.
> 
> We were quite happy with lustre's performance, especially because
> bottlenecks caused by /home disk access were history.
> 
> Saturday, the cluster went down (= was inaccessible). After some
> investigation I found out that the reason seems to be an overloaded MDS.
> Over the following 4 days, this happened multiple times and could only
> be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.
> 
> The MDS did not respond to any command; if I managed to get a video
> signal (not often), the load was >170. Additionally, a kernel oops was
> displayed twice, but unfortunately I have no record of them.
> 
> The clients showed the following error:
> > Oct  8 09:58:55 majorana kernel: LustreError: 
> > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 304/456 
> > e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > Oct  8 09:58:55 majorana kernel: LustreError: 
> > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > messages
> 
> So, my question is: what could cause such a load? The cluster was not
> excessively used... Is this a bug or a user's job that creates the load?
> How can I protect lustre against this kind of failure?
> 
> Thanks in advance,
> Arne 
> 
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Read/Write performance problem

2009-10-06 Thread Michael Kluge
On Tuesday, 06.10.2009, 09:33 -0600, Andreas Dilger wrote:
> > ... bla bla ...
> > Is there a reason why an immediate read after a write on the same node
> > from/to a shared file is slow? Is there any additional communication,
> > e.g. is the client flushing the buffer cache before the first read? The
> > statistics show that the average time to complete a 1.44MB read request
> > is increasing during the runtime of our program. At some point it hits
> > an upper limit or a saturation point and stays there. Is there some kind
> > of queue or something that is getting full in this kind of
> > write/read-scenario? Maybe some stuff in /proc/fs/lustre is tunable?
> 
> One possible issue is that you don't have enough extra RAM to cache 1.5GB
> of the checkpoint, so during the write it is being flushed to the OSTs
> and evicted from cache.  When you immediately restart there is still dirty
> data being written from the clients that is contending with the reads to
> restart.
> Cheers, Andreas

Well, I do call fsync() after the write is finished. During the write
phase I see a constant stream of 4 GB/s running from the Lustre servers
to the RAID controllers, which stops when the write process terminates.
When I start reading, there are no more writes going this way, so I
suspect it might be something else ... Even if I wait 5 minutes between
the writes and the reads (all dirty pages should have been flushed by
then), the picture does not change.
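
One way to rule out client-side caching completely would be to drop the page
cache on each client between the write and the read phase (needs root); a
sketch of that check:

sync
echo 3 > /proc/sys/vm/drop_caches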


Michael

-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Read/Write performance problem

2009-10-06 Thread Michael Kluge
Hi all,

our Lustre FS shows an interesting performance problem which I'd like to
discuss, as some of you might have seen this kind of thing before and
maybe someone has a quick explanation of what's going on.

We are running Lustre 1.6.5.1. The problem shows up when we read a
shared file from multiple nodes that has just been written from the same
set of nodes. 512 processes write a checkpoint (1.5 GB from each node)
into a shared file by seeking to position RANK*1.5GB and writing 1.5GB
in 1.44M chunks. Writing works fine and gives the full file system
performance. The data is being written using write() and no flags
aside from O_CREAT and O_WRONLY. Once the checkpoint is written, the program
is terminated, restarted, and reads back the same portion of the file. For
some reason this almost immediate reading of the same data that was just
written on the same node is very slow. If we a) change the set of nodes
or b) wait a day, we get the full read performance when we use the same
executable and the same shared file. 

Is there a reason why an immediate read after a write on the same node
from/to a shared file is slow? Is there any additional communication,
e.g. is the client flushing the buffer cache before the first read? The
statistics show that the average time to complete a 1.44MB read request
is increasing during the runtime of our program. At some point it hits
an upper limit or a saturation point and stays there. Is there some kind
of queue or something that is getting full in this kind of
write/read-scenario? Maybe some stuff in /proc/fs/lustre is tunable?
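
For reference, these are a few client-side knobs and statistics that might be
worth checking in this scenario (1.6.x /proc paths; names may differ in other
versions):

cat /proc/fs/lustre/llite/*/max_read_ahead_mb
cat /proc/fs/lustre/llite/*/max_cached_mb
cat /proc/fs/lustre/osc/*/rpc_stats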


Regards, Michael


-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih


smime.p7s
Description: S/MIME cryptographic signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss