[Lustre-discuss] Unbalanced OST--for discussion purposes

2010-03-03 Thread Ms. Megan Larko
Thanks to both Brian and Andreas for the timely responses.
Brian asked whether or not the OSTs were more or less balanced a week
ago.  The answer is that I believe they were.  Usually all OSTs report
a similar percentage of usage (within 1% to 3% of one another), which
is why this new report piqued my curiosity.

Regarding Andreas' remark about individual OST size, yes, I understand
that having larger individual OSTs can prevent any one OST from
becoming so full that the others degrade in performance (per A.
Dilger, not B. Murrell).  For that reason I personally like the option
available in newer Lustre releases (I think 1.8.x and higher) to allow
up to 16TB in a single OST slice.  I know the previous limit was 8TB
per OST slice as a precaution against data corruption.  (I was able to
build a larger OST slice with 1.6.7, but I was cautioned that some data
might become unreachable and/or corrupted because Lustre had not at
that time been modified to accept the larger partition sizes which the
underlying file systems--ext4, xfs--would accept.)  The OST formatted
size of 6.3TB fit nicely into the JBOD scheme of evenly-sized
partitions.

Thanks,
megan

On Tue, 2010-03-02 at 15:45 -0500, Ms. Megan Larko wrote:
> Hi,

Hi,

> I logged directly into the OSS (OSS4) and just ran a df (along with a
> periodic check of the log files).  I last looked about two weeks ago
> (I know it was after 17 Feb).

Is the implication that at this point the OSTs were more or less well
balanced?

> Anyway, the OST0007 is more full than
> any of the other OSTs.  The default lustre stripe (I believe that is
> set to 1) is used.  Can just one file shift the space used on one OST
> that significantly?

Sure.  As an example, if one had a 1KiB file on that OST, called, let's
say, "1K_file.dat" and one did:

$ dd if=/dev/zero of=1K_file.dat bs=1G count=1024

that would overwrite the 1KiB file on that OST with a 1TiB file.
Recognizing of course that that would be 1TiB in a single object on an
OST.
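
If you want to see which files actually have objects on that OST, "lfs
find" can search by OST UUID; the filesystem name and mount point below
are just placeholders for your own:

$ lfs find --obd lustre-OST0007_UUID /mnt/lustre | head

Running "lfs getstripe" on anything it turns up will show which OST
objects that file uses.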

> What other reasonable explanation for a
> difference on one OST in comparison with the others?

Any kind of variation on the above.

> Could this cause
> a lustre performance hit at this point?

Not really.

b.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Curious about iozone findings of new Lustre FS

2010-03-03 Thread Jagga Soorma
Hi Guys,

(P.S. My apologies if this is a duplicate - I sent this email earlier today
with an attachment and did not see it appear in Google Groups, so I was not
sure if attachments were allowed.)

I have just deployed a new Lustre FS with 2 MDS servers, 2 active OSS
servers (5x2TB OSTs per OSS) and 16 compute nodes.  It looks like the
iozone throughput tests have demonstrated almost linear scalability of
Lustre except for when WRITING files that exceed 128MB in size.  When
multiple clients create/write files larger than 128MB, Lustre throughput
levels off at approximately 1GB/s. This behavior has been observed with
almost all tested block size ranges except for 4KB.  I don't have any
explanation as to why Lustre performs poorly when writing large files.

Here is the iozone report:
http://docs.google.com/fileview?id=0Bz8GxDEZOhnwYjQyMDlhMWMtODVlYi00MTgwLTllN2QtYzU2MWJlNTEwMjA1&hl=en

The only changes I have made to the defaults are:
stripe_count: 2 stripe_size: 1048576 stripe_offset: -1
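
For reference, the layout was applied per directory with lfs, roughly as
below; /mnt/lustre is just a placeholder for our actual mount point, and
the options are the 1.8-style ones from memory:

$ lfs getstripe /mnt/lustre
$ lfs setstripe -c 2 -s 1048576 -i -1 /mnt/lustre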

I am using Lustre 1.8.1.1 and my MDS/OSS servers are running RHEL 5.3.  All
the clients are SLES 11.

Has anyone experienced this behaviour?  Any comments on our findings?

Thanks,
-J
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Curious about iozone findings of new Lustre FS

2010-03-03 Thread Andreas Dilger
On 2010-03-03, at 12:50, Jagga Soorma wrote:
> I have just deployed a new Lustre FS with 2 MDS servers, 2 active  
> OSS servers (5x2TB OST's per OSS) and 16 compute nodes.

Does this mean you are using 5 2TB disks in a single RAID-5 OST per  
OSS (i.e. total OST size is 8TB), or are you using 5 separate 2TB OSTs?

> Attached are our findings from the iozone tests and it looks like  
> the iozone throughput tests have demonstrated almost linear  
> scalability of Lustre except for when WRITING files that exceed  
> 128MB in size.  When multiple clients create/write files larger than  
> 128MB, Lustre throughput levels off at approximately 1GB/s. This  
> behavior has been observed with almost all tested block size ranges  
> except for 4KB.  I don't have any explanation as to why Lustre  
> performs poorly when writing large files.
>
> Has anyone experienced this behaviour?  Any comments on our findings?


The default client tunable is max_dirty_mb=32MB per OSC (i.e. the maximum  
amount of unwritten dirty data per OST before blocking the process  
submitting IO).  If you have 2 OST/OSCs and you have a stripe count of  
2 then you can cache up to 64MB on the client without having to wait  
for any RPCs to complete.  That is why you see a performance cliff for  
writes beyond 32MB.
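
If you want to experiment with that limit, it can be read and raised per
client with lctl; the value of 256 below is only an example, and
set_param changes do not persist across a remount:

# lctl get_param osc.*.max_dirty_mb
# lctl set_param osc.*.max_dirty_mb=256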

It should be clear that the read graphs are meaningless, due to local  
cache of the file.  I'd hazard a guess that you are not getting 100GB/s
from 2 OSS nodes.

Also, what is the interconnect on the client?  If you are using a  
single 10GigE then 1GB/s is as fast as you can possibly write large  
files to the OSTs, regardless of the striping.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Curious about iozone findings of new Lustre FS

2010-03-03 Thread Jagga Soorma
On Wed, Mar 3, 2010 at 2:30 PM, Andreas Dilger  wrote:

> On 2010-03-03, at 12:50, Jagga Soorma wrote:
>
>> I have just deployed a new Lustre FS with 2 MDS servers, 2 active OSS
>> servers (5x2TB OST's per OSS) and 16 compute nodes.
>>
>
> Does this mean you are using 5 2TB disks in a single RAID-5 OST per OSS
> (i.e. total OST size is 8TB), or are you using 5 separate 2TB OSTs?


No, I am using 5 independent 2TB OSTs per OSS.


>
>
>  Attached are our findings from the iozone tests and it looks like the
>> iozone throughput tests have demonstrated almost linear scalability of
>> Lustre except for when WRITING files that exceed 128MB in size.  When
>> multiple clients create/write files larger than 128MB, Lustre throughput
>> levels off at approximately 1GB/s. This behavior has been observed with
>> almost all tested block size ranges except for 4KB.  I don't have any
>> explanation as to why Lustre performs poorly when writing large files.
>>
>> Has anyone experienced this behaviour?  Any comments on our findings?
>>
>
>
> The default client tunable max_dirty_mb=32MB per OSC (i.e. the maximum
> amount of unwritten dirty data per OST before blocking the process
> submitting IO).  If you have 2 OST/OSCs and you have a stripe count of 2
> then you can cache up to 64MB on the client without having to wait for any
> RPCs to complete.  That is why you see a performance cliff for writes beyond
> 32MB.
>

So the true write performance should be measured using files larger than
128MB?  If we do see a large number of large files being created on the
Lustre fs, is this something that can be tuned on the client side?  If so,
where/how can I get this done and what would be the recommended settings?


> It should be clear that the read graphs are meaningless, due to local cache
> of the file.  I'd hazard a guess that you are not getting 100GB/s from 2 OSS
> nodes.
>

Agreed.  Is there a way to find out the size of the local cache on the
clients?


>
> Also, what is the interconnect on the client?  If you are using a single
> 10GigE then 1GB/s is as fast as you can possibly write large files to the
> OSTs, regardless of the striping.
>

I am using Infiniband (QDR) interconnects for all nodes.


>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Curious about iozone findings of new Lustre FS

2010-03-03 Thread Jagga Soorma
Or would it be better to increase the stripe count for my Lustre filesystem
to the max number of OSTs?
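
I was thinking of trying this on a test directory first, something like the
sketch below (the directory name is hypothetical; -c -1 should stripe over
all available OSTs):

$ lfs setstripe -c -1 /mnt/lustre/widefiles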

On Wed, Mar 3, 2010 at 3:27 PM, Jagga Soorma  wrote:

> On Wed, Mar 3, 2010 at 2:30 PM, Andreas Dilger  wrote:
>
>> On 2010-03-03, at 12:50, Jagga Soorma wrote:
>>
>>> I have just deployed a new Lustre FS with 2 MDS servers, 2 active OSS
>>> servers (5x2TB OST's per OSS) and 16 compute nodes.
>>>
>>
>> Does this mean you are using 5 2TB disks in a single RAID-5 OST per OSS
>> (i.e. total OST size is 8TB), or are you using 5 separate 2TB OSTs?
>
>
> No I am using 5 independent 2TB OST's per OSS.
>
>
>>
>>
>>  Attached are our findings from the iozone tests and it looks like the
>>> iozone throughput tests have demonstrated almost linear scalability of
>>> Lustre except for when WRITING files that exceed 128MB in size.  When
>>> multiple clients create/write files larger than 128MB, Lustre throughput
>>> levels off at approximately 1GB/s. This behavior has been observed with
>>> almost all tested block size ranges except for 4KB.  I don't have any
>>> explanation as to why Lustre performs poorly when writing large files.
>>>
>>> Has anyone experienced this behaviour?  Any comments on our findings?
>>>
>>
>>
>> The default client tunable max_dirty_mb=32MB per OSC (i.e. the maximum
>> amount of unwritten dirty data per OST before blocking the process
>> submitting IO).  If you have 2 OST/OSCs and you have a stripe count of 2
>> then you can cache up to 64MB on the client without having to wait for any
>> RPCs to complete.  That is why you see a performance cliff for writes beyond
>> 32MB.
>>
>
> So the true write performance should be measured for data captured for
> files larger than 128MB?  If we do see a large number of large files being
> created on the lustre fs, is this something that can be tuned on the client
> side?  If so, where/how can I get this done and what would be the
> recommended settings?
>
>
>> It should be clear that the read graphs are meaningless, due to local
>> cache of the file.  I'd hazard a guess that you are not getting 100GB/s from
>> 2 OSS nodes.
>>
>
> Agreed.  Is there a way to find out the size of the local cache on the
> clients?
>
>
>>
>> Also, what is the interconnect on the client?  If you are using a single
>> 10GigE then 1GB/s is as fast as you can possibly write large files to the
>> OSTs, regardless of the striping.
>>
>
> I am using Infiniband (QDR) interconnects for all nodes.
>
>
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>>
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] One or two OSS, no difference?

2010-03-03 Thread Jeffrey Bennett
Hi Lustre experts

We are building a very small Lustre cluster with 32 clients (patchless) and two 
OSS servers. Each OSS server has 1 OST with 1TB of solid-state drives.
Everything is connected using dual-port DDR IB.

For testing purposes, I am enabling/disabling one of the OSS/OST by using the 
"lfs setstripe" command. I am running XDD and vdbench benchmarks.

Does anybody have an idea why there is no difference in MB/sec or random IOPS 
when using one OSS or two?  A quick test with "dd" also shows the same 
MB/sec when using one or two OSTs.

Thanks

jab
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] One or two OSS, no difference?

2010-03-03 Thread Oleg Drokin
Hello!

On Mar 3, 2010, at 6:35 PM, Jeffrey Bennett wrote:
> We are building a very small Lustre cluster with 32 clients (patchless) and 
> two OSS servers. Each OSS server has 1 OST with 1 TB of Solid State Drives. 
> All is connected using dual-port DDR IB.
>  
> For testing purposes, I am enabling/disabling one of the OSS/OST by using the 
> “lfs setstripe” command. I am running XDD and vdbench benchmarks.
>  
> Does anybody have an idea why there is no difference in MB/sec or random IOPS 
> when using one OSS or two OSS? A quick test with “dd” also shows the same 
> MB/sec when using one or two OSTs.

I wonder if you are simply not saturating even one OST (both the backend SSD 
and the IB interconnect) with this number of clients.  Does the total 
throughput decrease as you reduce the number of active clients, and increase 
as you add even more?
Increasing the maximum number of in-flight RPCs might help in that case (see 
the example below).
Also, are all of your clients writing to the same file, or does each client do 
I/O to a separate file (I hope)?
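
For example, on a client (32 is just an illustrative value; the default is 8):

# lctl get_param osc.*.max_rpcs_in_flight
# lctl set_param osc.*.max_rpcs_in_flight=32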

Bye,
Oleg
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] LUN reassignment DDN and OSS

2010-03-03 Thread syed haider
All,
I recently raised a question about unbalanced OSTs and received the right
answer: increase the size of the OSTs.  So I set out to do this on our DDN
controllers, and rather than having 32 1TB LUNs I decided to go with 4 8TB
LUNs instead.  In doing this I learned that our LUNs were created with the
default 512-byte block size, and from reading the manual it appears it would
improve performance for our workload to go with 4096.  Since most jobs run
on Lustre will be creating larger, sequential files, we're not concerned
about losing space from smaller files taking up a 4K block.  Is there any
other concern I should have in going with the larger block size?

Second question, for the DDN expert: we have 4 OSSes connected to dual DDN
9550 controllers via fiber.  With the older configuration each 1TB LUN used
a single tier, so the LUN output looked something like this:

 Logical Unit Status

                                 Capacity   Block
 LUN  Label   Owner  Status      (Mbytes)    Size  Tiers  Tier list
 ------------------------------------------------------------------
   0  lun 0     1    Ready        1120098     512      1  1
   1  lun 1     1    Ready        1120098     512      1  2
   2  lun 2     1    Ready        1120098     512      1  3
   3  lun 3     1    Ready        1120098     512      1  4
   4  lun 4     1    Ready        1120098     512      1  5
   5  lun 5     1    Ready        1120098     512      1  6
   6  lun 6     1    Ready        1120098     512      1  7
   7  lun 7     1    Ready        1120098     512      1  8
   8  lun 8     2    Ready        1120098     512      1  9
   9  lun 9     2    Ready        1120098     512      1  10


After making the change to only 4 LUNs, two of my OSSes don't see any SCSI
devices when I run fdisk.  Did I somehow cause a zoning issue by deleting and
recreating the LUNs?  Thanks in advance.
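
One thing I plan to try first is rescanning the FC/SCSI hosts on the two
affected OSSes, in case they simply have not picked up the recreated LUNs;
the sysfs rescan below is the usual RHEL 5 mechanism:

# for h in /sys/class/scsi_host/host*; do echo "- - -" > "$h/scan"; done
# cat /proc/scsi/scsi    # check whether the new LUNs show up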

Syed
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre reported "No space left on device"

2010-03-03 Thread xiaobao
Hi all, we are running Lustre 1.8.1 on about 800 clients.  Some days ago we
started seeing a weird problem, which has now happened several times.  When
writing data, some clients reported "LustreError: 11-0: an error occurred
while communicating with xx.xx.xx...@o2ib. The ost_write operation failed
with -28", where xx.xx.xx.xx is one of our OSS nodes.  But neither the MDS
nor the OSS is anywhere near full.  As we know, errno 28 is "no space left
on device".  After some time, everything appears to be OK again.

the space used:

 # lfs df -h
UUID                   bytes    Used  Available  Use%  Mounted on
lustre-MDT0000_UUID   350.0G    1.1G     328.9G    0%  /home[MDT:0]
lustre-OST0000_UUID     6.2T  263.5G       5.6T    4%  /home[OST:0]
lustre-OST0001_UUID     6.2T  264.3G       5.6T    4%  /home[OST:1]
lustre-OST0002_UUID     5.7T  261.7G       5.2T    4%  /home[OST:2]
lustre-OST0003_UUID     5.4T  206.6G       4.9T    3%  /home[OST:3]
lustre-OST0004_UUID     4.6T  197.1G       4.2T    4%  /home[OST:4]
lustre-OST0005_UUID     4.6T  160.6G       4.2T    3%  /home[OST:5]
lustre-OST0006_UUID     4.6T  300.7G       4.1T    6%  /home[OST:6]
lustre-OST0007_UUID     4.6T  174.1G       4.2T    3%  /home[OST:7]
lustre-OST0008_UUID     6.9T  232.7G       6.4T    3%  /home[OST:8]
lustre-OST0009_UUID     6.9T  237.7G       6.4T    3%  /home[OST:9]
lustre-OST000a_UUID     6.2T  219.9G       5.6T    3%  /home[OST:10]
lustre-OST000b_UUID     6.2T  257.8G       5.6T    4%  /home[OST:11]
lustre-OST000c_UUID     6.2T  784.6G       5.1T   12%  /home[OST:12]
lustre-OST000d_UUID     6.2T  227.2G       5.6T    3%  /home[OST:13]
lustre-OST000e_UUID     5.7T  199.2G       5.2T    3%  /home[OST:14]
lustre-OST000f_UUID     5.4T  221.9G       4.9T    4%  /home[OST:15]
lustre-OST0010_UUID     4.6T  176.4G       4.2T    3%  /home[OST:16]
lustre-OST0011_UUID     4.6T  160.9G       4.2T    3%  /home[OST:17]
lustre-OST0012_UUID     3.1T  118.3G       2.8T    3%  /home[OST:18]
lustre-OST0013_UUID     3.1T   99.0G       2.8T    3%  /home[OST:19]
lustre-OST0014_UUID     6.9T  243.7G       6.4T    3%  /home[OST:20]
lustre-OST0015_UUID     6.9T  273.6G       6.3T    3%  /home[OST:21]
lustre-OST0016_UUID     6.2T  335.6G       5.5T    5%  /home[OST:22]
lustre-OST0017_UUID     6.2T  219.1G       5.6T    3%  /home[OST:23]

the inode used:

 # lfs df -ih
UUID                  Inodes   IUsed  IFree  IUse%  Mounted on
lustre-MDT0000_UUID    89.3M    2.1M  87.2M     2%  /home[MDT:0]
lustre-OST0000_UUID     6.2M   91.3K   6.1M     1%  /home[OST:0]
lustre-OST0001_UUID     6.2M   90.7K   6.1M     1%  /home[OST:1]
lustre-OST0002_UUID     5.7M   83.1K   5.6M     1%  /home[OST:2]
lustre-OST0003_UUID     5.4M   80.1K   5.3M     1%  /home[OST:3]
lustre-OST0004_UUID     4.6M   68.8K   4.6M     1%  /home[OST:4]
lustre-OST0005_UUID     4.6M   69.8K   4.6M     1%  /home[OST:5]
lustre-OST0006_UUID     4.6M   69.4K   4.6M     1%  /home[OST:6]
lustre-OST0007_UUID     4.6M   69.2K   4.6M     1%  /home[OST:7]
lustre-OST0008_UUID     6.9M  101.7K   6.8M     1%  /home[OST:8]
lustre-OST0009_UUID     6.9M  101.3K   6.8M     1%  /home[OST:9]
lustre-OST000a_UUID     6.2M   91.0K   6.1M     1%  /home[OST:10]
lustre-OST000b_UUID     6.2M   90.9K   6.1M     1%  /home[OST:11]
lustre-OST000c_UUID     6.2M   86.1K   6.1M     1%  /home[OST:12]
lustre-OST000d_UUID     6.2M   90.2K   6.1M     1%  /home[OST:13]
lustre-OST000e_UUID     5.7M   83.4K   5.6M     1%  /home[OST:14]
lustre-OST000f_UUID     5.4M   80.4K   5.3M     1%  /home[OST:15]
lustre-OST0010_UUID     4.6M   69.0K   4.6M     1%  /home[OST:16]
lustre-OST0011_UUID     4.6M   69.3K   4.6M     1%  /home[OST:17]
lustre-OST0012_UUID     3.1M   46.7K   3.0M     1%  /home[OST:18]
lustre-OST0013_UUID     3.1M   46.6K   3.0M     1%  /home[OST:19]
lustre-OST0014_UUID     6.9M  101.5K   6.8M     1%  /home[OST:20]
lustre-OST0015_UUID     6.9M  101.3K   6.8M     1%  /home[OST:21]
lustre-OST0016_UUID     6.2M   90.1K   6.1M     1%  /home[OST:22]
lustre-OST0017_UUID     6.2M   90.7K   6.1M     1%  /home[OST:23]

So why did this happen?  BTW, the OSS and MDS didn't say anything about "no
space left" in their logs.  The OS is SLES 10 SP2.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss