Re: [Lustre-discuss] Lustre community build server

2011-01-28 Thread DEGREMONT Aurelien
Hi

Nathan Rutman wrote:
> Hi Aurelien, Robert  - 
>
> We also use Hudson and are interested in using it to do Lustre builds 
> and testing.
>> Hi
>>
>> Robert Read wrote:
>> > Hi Aurélien,
>> >
>> > Yes, we've noticed Hudson's support for testing is not quite what we need, 
>> > so 
>> > we're planning to use Hudson to trigger our testing system, but not 
>> > necessarily to manage it.  We'd definitely be interested in learning more 
>> > about your experiences, though. 
>> >   
>> I do not know what you mean by triggering your testing system. But here 
>> is what I set up.
>> Hudson has only 1 slave node dedicated to testing Lustre 2.
>> Hudson launches a shell script on it through ssh.
>>
>> This script:
>>  - retrieves Lustre source (managed by Hudson git plugin)
>>  - compiles it.
>>  - launches acceptance-small with several parameters.
>>  - acceptance-small will connect to other nodes dedicated to these tests.
>>
>> acc-sm has been patched:
>> - to be more error resilient (does not stop at first failure)
>> - to generate a test report in JUNIT format.
>> 
> Is this the yaml acc-sm that Robert was talking about, or an older one?
I think so.
We modified the current test-framework.sh and acceptance-small.sh from 
master.
We reused the call that was introduced to produce a result.yml file and 
made it produce a junit-report.xml file instead.
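For the record, the whole build step on the slave boils down to something 
like the sketch below (paths, node names and acc-sm options are simplified 
placeholders, not our exact configuration):

    #!/bin/bash
    # Run by Hudson on the test slave; the git plugin has already checked
    # out the Lustre sources into the job workspace.
    set -e
    cd "$WORKSPACE"

    # Build Lustre from the checked-out sources.
    sh autogen.sh
    ./configure --enable-tests
    make -j4

    # Run acceptance-small against the satellite test nodes
    # (node names and test list are placeholders).
    cd lustre/tests
    ACC_SM_ONLY="sanity sanityn" \
    mds_HOST=mds1 ost_HOST=oss1 CLIENTS=client1 \
        sh acceptance-small.sh

    # Our patched acc-sm writes junit-report.xml into the workspace,
    # which Hudson then collects with its JUnit plugin.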

>> Hudson fetches the JUnit report and parses it with its JUnit plugin.
>> Hudson can then display all test successes and failures in its interface.
>>
>> Everything goes fine as long as:
>>  - the testsuite leaves the node in good shape. It is difficult to 
>> have an automatic way to put the node back. Currently, we need to manually 
>> fix that.
>> 
> Would it be helpful to run the test in a VM?  Hudson has a 
> libvirt-slave plugin that
> seems like it can start and stop a VM for you.  Another point I like 
> about VM's is
> that they can be suspended and shipped to an investigator for local 
> debugging.
I noticed this plugin for Hudson and came to the same conclusion.
I have not tried to set it up yet, but it has some side effects anyway and 
certainly won't be a perfect solution.

I find your idea of shipping the VM interesting. I am not sure it could be 
done easily, but it could be a great help for debugging if it is doable.
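Just to give an idea of the mechanics (the guest name and paths here are 
hypothetical), libvirt already lets you freeze a guest and move its state 
around:

    # On the test host: save the guest's memory and device state to a file,
    # then ship it together with the disk image.
    virsh save lustre-client1 /tmp/lustre-client1.state
    scp /tmp/lustre-client1.state /var/lib/libvirt/images/lustre-client1.img \
        investigator:/scratch/

    # On the investigator's machine (same hypervisor setup, disk image put
    # back at its original path): resume the guest and debug locally.
    virsh restore /scratch/lustre-client1.state

Whether that restores cleanly on a different host is exactly the part I am 
not sure about.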
>>  - Hudson does not know about the other nodes used by acc-sm, and so can 
>> trigger tests even if some satellite nodes are unavailable.
>> 
> Don't know if libvirt-slave can handle multiple
> VM's for a multi-node Lustre test, but maybe it can be extended.
Even beyond Lustre testing, I would be very pleased if Hudson could 
allocate several nodes for one run and pass the node list to the test 
script through a variable, for example.
This would be very interesting for us: since we are working on clustering 
software, we always need groups of nodes whatever the tests are. This is 
not easy to handle in Hudson, so we have to cheat. :p
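What I have in mind is something as simple as Hudson exporting the 
allocated machines to the job, so the test script could just consume them 
(variable and script names below are hypothetical):

    # NODES would be set by Hudson for this run, e.g. NODES="node12 node13 node14"
    for node in $NODES; do
        ssh "$node" uptime          # check every allocated node is reachable
    done
    sh run_cluster_tests.sh $NODES  # hand the node list to the test script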

>> How do you do this on your side?
>> 
> It seems that like you, we are more interested in reporting the results
> within Hudson as opposed to a different custom tool.
In this case, I'm quite sure we can converge toward the same patch for that.

Robert's ideas are more ambitious and aim to cover more generic use cases. 
I would not be displeased by that, as long as the test results can still be 
analyzed by Hudson plugins.



Aurelien


Re: [Lustre-discuss] question about size on MDS (MDT) for lustre-1.8

2011-01-28 Thread Jason Rappleye

On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:

> On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:
>>> It would probably be better to set:
>>> 
>>> lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>>> 
>>> or similar, to limit the read cache to files 32MB in size or less (or 
>>> whatever you consider "small" files at your site).  That allows the read 
>>> cache for config files and such, while not thrashing the cache while 
>>> accessing large files.
>>> 
>>> We should probably change this to be the default, but at the time the read 
>>> cache was introduced, we didn't know what should be considered a small vs. 
>>> large file, and the amount of RAM and number of OSTs on an OSS, and the 
>>> uses varies so much that it is difficult to pick a single correct value for 
>>> this.
> 
> limiting the total amount of OSS cache used in order to leave room for
> inodes/dentries might be more useful. the data cache will always fill
> up and push out inodes otherwise.

The inode and dentry objects in the slab cache aren't so much of an issue as 
having the disk blocks from which each is generated available in the buffer 
cache. Constructing the in-memory inode and dentry objects is cheap as long as 
the corresponding disk blocks are available. Doing the disk reads, depending on 
your hardware and some other factors, is not.

> Nathan's approach of turning off the caches entirely is extreme, but if
> it gives us back some metadata performance then it might be worth it.

We went to the extreme and disabled the OSS read cache (plus the writethrough 
cache). In addition, on the OSSes we pre-read all of the inode blocks that 
contain at least one used inode, along with all of the directory blocks.
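For anyone who wants to try the same thing, disabling the caches is just the 
standard obdfilter tunables, set on each OSS (shown here with lctl set_param 
for all OSTs on a running server):

    lctl set_param obdfilter.*.read_cache_enable=0
    lctl set_param obdfilter.*.writethrough_cache_enable=0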

The results have been promising so far. Firing off a du on an entire 
filesystem, 3000-6000 stats/second is typical. I've noted a few causes of 
slowdowns so far; there may be more.

First, no attempt has been made to pre-read metadata from the MDT. The need to 
read in inode and directory blocks may slow things down quite a bit. I can't 
find the numbers in my notes at the moment, but I recall seeing 200-500 
stats/second when the MDS needed to do I/O.

When memory runs low on a client, kswapd kicks in to try and free up pages. On 
the client I'm currently testing on, almost all of the memory used is in the 
slab. It looks like kswapd has a difficult time clearing things up, and the 
client can go several seconds before the current stat call is completed. 
Dropping caches will (temporarily) get the performance back to expected rates. 
I haven't dug into this one too much yet.
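(The cache drop itself is nothing Lustre-specific, just the usual VM knob:

    sync
    echo 3 > /proc/sys/vm/drop_caches    # free pagecache, dentries and inodes

and the relief only lasts until the slab fills up again.)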

Sometimes the performance drop is worse, and we see just tens of stats/second 
(or fewer!). This is due to the fact that filter_{fid2dentry,precreate,destroy} 
all need to take a lock on the parent directory of the object on the OST. 
Unlink or precreate operations whose critical sections, protected by this lock, 
take a long time to complete will slow down stat requests. I'm working on 
tracking down the cause of this; it may be journal related. BZ 22107 is 
probably relevant as well.

> or is there a Lustre or VM setting to limit overall OSS cache size?

No, but I think that would be really useful in this situation.

> I presume that Lustre's OSS caches are subject to normal Linux VM
> pagecache tweakables, but I don't think such a knob exists in Linux at
> the moment...

Correct on both counts. A patch was proposed to do this, but I don't see any 
evidence of it making it into the kernel:

http://lwn.net/Articles/218890/

I have a small set of perl, bash, and SystemTap scripts to read the inode and 
directory blocks from disk and monitor the performance of the relevant Lustre 
calls on the servers. I'll clean them up and send them to the list next week. A 
more elegant solution would be to get e2scan to do the job, but I haven't taken 
a hack at that yet.
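Until then, the core of the inode pre-read is nothing more than walking the 
inode tables and reading them through the buffer cache. A stripped-down sketch 
(the real scripts also skip block groups with no used inodes and pre-read the 
directory blocks):

    #!/bin/bash
    # Pre-read every inode table block on an ldiskfs OST device.
    dev=${1:?usage: $0 <ost-device>}
    bs=$(dumpe2fs -h "$dev" 2>/dev/null | awk '/^Block size:/ {print $3}')
    dumpe2fs "$dev" 2>/dev/null | awk '/Inode table at/ {print $4}' |
    while IFS=- read start end; do
        # Reading through the block device populates the buffer cache.
        dd if="$dev" of=/dev/null bs="$bs" skip="$start" \
           count=$((end - start + 1)) 2>/dev/null
    done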

Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST, and 
15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups * 8 
inode blocks/group), ~36% have at least one inode used. We pre-read those and 
ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an 
average of 3891 directory blocks per OST.

In the absence of controls on the size of the page cache, or enough RAM to 
cache all of the inode and directory blocks in memory, another potential 
solution is to place the metadata on an SSD. One can generate a dm linear 
target table that carves up an ext3/ext4 filesystem such that the inode blocks 
go on one device, and the data blocks go on another. Ideally the inode blocks 
would be placed on an SSD. 
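The mechanism is just a device-mapper linear table whose segments alternate 
between the two devices. A toy example (device names and sector counts are 
made up; a real table has one pair of segments per inode-table extent, which 
is why flex_bg matters for keeping it small):

    # ost0.table: "<start> <length> linear <device> <offset>", in 512-byte sectors
    # first 32 MB, holding the inode tables        -> SSD
    0       65536     linear /dev/ssd0  0
    # remainder of the filesystem, the data blocks -> HDD
    65536   20905984  linear /dev/sdb1  0

    dmsetup create ost0 ost0.table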

I've tried this with both ext3, and with ext4 using flex_bg to reduce the size 
of the dm table. IIRC the overhead is acceptable in both cases - 1us, on 
average.

Placing the inodes on separate storage is not sufficient, though. Slow 
directory block reads contribute to poor stat performance as well. Adding a 
feature to ext4 to reserve a number of fixed block groups for directory 
blocks, and always allocating them there, would help. Those block groups 
could then be placed on an SSD as well.

Re: [Lustre-discuss] question about size on MDS (MDT) for lustre-1.8

2011-01-28 Thread Andreas Dilger
On 2011-01-28, at 10:45, Jason Rappleye wrote:
> Sometimes the performance drop is worse, and we see just tens of stats/second 
> (or fewer!). This is due to the fact that 
> filter_{fid2dentry,precreate,destroy} all need to take a lock on the parent 
> directory of the object on the OST. Unlink or precreate operations whose 
> critical sections, protected by this lock, take a long time to complete will 
> slow down stat requests. I'm working on tracking down the cause of this; it 
> may be journal related. BZ 22107 is probably relevant as well.

There is work underway to allow the locking of the ldiskfs directories to be 
multi-threaded.  This should significantly improve performance in such cases.

> Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST, 
> and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups * 
> 8 inode blocks/group), ~36% have at least one inode used. We pre-read those 
> and ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an 
> average of 3891 directory blocks per OST.
> 
> In the absence of controls on the size of the page cache, or enough RAM to 
> cache all of the inode and directory blocks in memory, another potential 
> solution is to place the metadata on an SSD. One can generate a dm linear 
> target table that carves up an ext3/ext4 filesystem such that the inode 
> blocks go on one device, and the data blocks go on another. Ideally the inode 
> blocks would be placed on an SSD. 
> 
> I've tried this with both ext3, and with ext4 using flex_bg to reduce the 
> size of the dm table. IIRC the overhead is acceptable in both cases - 1us, on 
> average.

I'd be quite interested to see the results of such testing.

> Placing the inodes on separate storage is not sufficient, though. Slow 
> directory block reads contribute to poor stat performance as well. Adding a 
> feature to ext4 to reserve a number of fixed block groups for directory 
> blocks, and always allocating them there, would help. Those block groups 
> could then be placed on an SSD as well.

I believe there is a heuristic that allocates directory blocks in the first 
group of a flex_bg, so if that entire group is on SSD it would potentially 
avoid this problem.

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.


