Re: [lustre-discuss] Backup software for Lustre

2017-02-07 Thread Andrew Holway
Would it be difficult to suspend I/O and snapshot all the nodes (assuming
ZFS)? Could you be sure that your MDS and OSS are synchronised?

On 7 February 2017 at 19:52, Mike Selway  wrote:

> Hello Brett,
>
> Actually, I am looking for someone who uses a commercial
> approach (one that retains user metadata and Lustre extended metadata), not
> the manual approaches of Chapter 17.
>
>
>
> Thanks!
> Mike
>
>
>
> Mike Selway | Sr. Tiered Storage Architect | Cray Inc.
>
> Work +1-301-332-4116 | msel...@cray.com
>
> 146 Castlemaine Ct,   Castle Rock,  CO  80104 | www.cray.com
>
> From: Brett Lee [mailto:brettlee.lus...@gmail.com]
> Sent: Monday, February 06, 2017 11:45 AM
> To: Mike Selway
> Cc: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] Backup software for Lustre
>
>
>
> Hey Mike,
>
>
>
> "Chapter 17" and
>
> http://www.intel.com/content/www/us/en/lustre/backup-and-restore-training.html
>
> both contain methods to back up and restore the entire Lustre file system.
>
>
>
> Are you looking for a solution that backs up only the (user) data files
> and their associated metadata (e.g. xattrs)?
>
>
> Brett
>
> --
>
> Protect Yourself From Cybercrime
>
> PDS Software Solutions LLC
>
> https://www.TrustPDS.com 
>
>
>
> On Mon, Feb 6, 2017 at 11:12 AM, Mike Selway  wrote:
>
> Hello,
>
> Is anyone aware of and/or using a backup software package to
> protect their LFS environment (not referring to the tools/scripts suggested
> in Chapter 17)?
>
>
>
> Regards,
>
> Mike
>
>
>
> Mike Selway | Sr. Tiered Storage Architect | Cray Inc.
>
> Work +1-301-332-4116 | msel...@cray.com
>
> 146 Castlemaine Ct,   Castle Rock,  CO  80104 | www.cray.com
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Backup software for Lustre

2017-02-07 Thread Mike Selway
Hello Brett,
Actually, I am looking for someone who uses a commercial approach
(one that retains user metadata and Lustre extended metadata), not
the manual approaches of Chapter 17.

Thanks!
Mike

Mike Selway | Sr. Tiered Storage Architect | Cray Inc.
Work +1-301-332-4116 | msel...@cray.com
146 Castlemaine Ct, Castle Rock, CO 80104 | www.cray.com

From: Brett Lee [mailto:brettlee.lus...@gmail.com]
Sent: Monday, February 06, 2017 11:45 AM
To: Mike Selway 
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Backup software for Lustre

Hey Mike,

"Chapter 17" and

http://www.intel.com/content/www/us/en/lustre/backup-and-restore-training.html

both contain methods to back up and restore the entire Lustre file system.

Are you looking for a solution that backs up only the (user) data files and 
their associated metadata (e.g. xattrs)?

Brett
--
Protect Yourself From Cybercrime
PDS Software Solutions LLC
https://www.TrustPDS.com

On Mon, Feb 6, 2017 at 11:12 AM, Mike Selway wrote:
Hello,
Is anyone aware of and/or using a backup software package to
protect their LFS environment (not referring to the tools/scripts suggested in
Chapter 17)?

Regards,
Mike

Mike Selway | Sr. Tiered Storage Architect | Cray Inc.
Work +1-301-332-4116 | msel...@cray.com
146 Castlemaine Ct, Castle Rock, CO 80104 | www.cray.com



[lustre-discuss] What do you use to monitor Lustre

2017-02-07 Thread E.S. Rosenberg
As a continuation of my recent question on traffic compression/caching, I
was wondering what others use to monitor their Lustre performance.

Currently I have collectl running on all clients; the data gets shipped by
filebeat to an ELK+Grafana stack.
I hope to soon deploy collectl on the OSS/MDS/MGS as well, so that I can see
the other side of the traffic at the same time.
So far collectl is the only tool I have found that provides metrics for both
InfiniBand and Lustre.

Thanks!
Eli


Re: [lustre-discuss] LNET Self-test

2017-02-07 Thread Oucharek, Doug S
Because the stat command is “lst stat servers”, the statistics you are seeing 
are from the perspective of the server.  The “from” and “to” parameters can get 
quite confusing for the read case.  When reading, you are transferring the bulk 
data from the “to” group to the “from” group (yes, seems the opposite of what 
you would expect).  I think the “from” and “to” labels were designed to make 
sense in the write case and the logic was just flipped for the read case.

So, the stats you show indicate that you are writing an average of about
3.5 GiB/s (note: the lnet-selftest stats are mislabeled and should be MiB/s
rather than MB/s; I have fixed this in the latest release. In decimal units
you are then getting about 3.8 GB/s).
The reason you see traffic in the read direction is responses/acks.
That is why there are a lot of small messages going back to the server (high
RPC rate, small bandwidth).
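
As a quick check of the unit arithmetic, here is a minimal sketch that assumes the 3608.44 figure in the [W] bandwidth line is really MiB/s, as noted above. The helper name mib_to_units is made up for illustration and is not part of lst.

```shell
#!/bin/bash
# Convert a rate given in MiB/s into GiB/s (binary) and GB/s (decimal).
# mib_to_units is a hypothetical helper name, not an lst command.
mib_to_units() {
    awk -v mib="$1" 'BEGIN {
        bytes = mib * 1024 * 1024                      # 1 MiB = 1048576 bytes
        gib   = bytes / (1024 * 1024 * 1024)           # binary gigabytes
        gb    = bytes / 1000000000                     # decimal gigabytes
        printf "%.2f GiB/s %.2f GB/s\n", gib, gb
    }'
}

mib_to_units 3608.44
```

For the value in the output above this prints 3.52 GiB/s 3.78 GB/s, which is where the roughly 3.8 GB/s figure comes from.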

So, your test looks like it is working to me.

Doug

> On Feb 7, 2017, at 2:13 AM, Jon Tegner  wrote:
> 
> Probably doing something wrong here, but I tried to test only READING with 
> the following:
> 
> #!/bin/bash
> export LST_SESSION=$$
> lst new_session read
> lst add_group servers 10.0.12.12@o2ib
> lst add_group readers 10.0.12.11@o2ib
> lst add_batch bulk_read
> lst add_test --batch bulk_read --concurrency 12 --from readers --to servers \
> brw read check=simple size=1M
> lst run bulk_read
> lst stat servers & sleep 10; kill $!
> lst end_session
> 
> which in my case gives:
> 
> [LNet Rates of servers]
> [R] Avg: 3633 RPC/s Min: 3633 RPC/s Max: 3633 RPC/s
> [W] Avg: 7241 RPC/s Min: 7241 RPC/s Max: 7241 RPC/s
> [LNet Bandwidth of servers]
> [R] Avg: 2.29 MB/s  Min: 2.29 MB/s  Max: 2.29 MB/s
> [W] Avg: 3608.44  MB/s  Min: 3608.44  MB/s  Max: 3608.44  MB/s
> 
> It seems strange that it reports non-zero numbers in the [W] positions.
> In particular, why is the bandwidth low in the [R] position, when I explicitly
> requested "read"? Also note that if I change "brw read" to "brw write" in the
> script above, the results are "reversed", in the sense that the higher
> bandwidth number is reported in the [R] position. That is, "brw read"
> reports (almost) the expected bandwidth in the [W] position, whereas "brw
> write" reports it in the [R] position.
> 
> This is on CentOS-6.5/Lustre-2.5.3. Will try 7.3/2.9.0 later.
> 
> Thanks,
> /jon
> 
> 
> On 02/06/2017 05:45 PM, Oucharek, Doug S wrote:
>> Try running just a read test and then just a write test rather than having 
>> both at the same time and see if the performance goes up.
> 



Re: [lustre-discuss] Traffic compression?

2017-02-07 Thread E.S. Rosenberg
Hi Ben,

On Mon, Feb 6, 2017 at 10:51 PM, Ben Evans  wrote:

> My initial question is what are you measuring and where are you measuring
> it?
>
The tool I'm using is collectl. It in turn calls perfquery once a
minute and reports the difference between the previous and current
readings divided by 256 * secondInterval, which gives a number in kB/s.
(perfquery reports its counters divided by 4, a legacy left over from the
days of 32-bit counters.)

The Lustre stats seem to be gathered in more or less the same way: the Lustre
plugin takes the delta of written/read bytes and divides it by
1024 * secondInterval to get kB/s.
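
To make the two calculations concrete, here is a minimal sketch of the arithmetic as I understand it. The function names and the sample counter values are made up for illustration; this is not collectl's actual code.

```shell
#!/bin/bash
# Hypothetical helpers mirroring the arithmetic described above.
# They are NOT taken from collectl; names and sample values are assumptions.

# The InfiniBand data counters count 4-byte words, so
# words * 4 / 1024 = words / 256 gives kB.
ib_kb_per_sec() {      # args: previous_counter current_counter interval_seconds
    awk -v p="$1" -v c="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (c - p) / (256 * s) }'
}

# The Lustre counters are already in bytes, so divide by 1024 to get kB.
lustre_kb_per_sec() {  # args: previous_bytes current_bytes interval_seconds
    awk -v p="$1" -v c="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (c - p) / (1024 * s) }'
}

# The same traffic seen both ways over a 60 second interval:
# 15360000 words * 4 = 61440000 bytes.
ib_kb_per_sec 0 15360000 60
lustre_kb_per_sec 0 61440000 60
```

Both calls print 1000.0, so if the two sources saw the same bytes the graphs should line up.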

>
> There are many different layers of caching happening, possibly all at the
> same time.  If you're benchmarking it's much better to figure out your max
> sustained read/write speeds than rely on peaks.
>
I'm not benchmarking; I was mainly trying to understand why my InfiniBand
graphs weren't showing at least the same amount of traffic as Lustre...

Most of the time though the graphs do more or less coincide so I guess
maybe there was either a measurement glitch or we do see some limited
effects of caching.

Thanks,
Eli

-Ben

From: lustre-discuss  on behalf of
"E.S. Rosenberg" 
Date: Monday, February 6, 2017 at 3:25 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] Traffic compression?

We started monitoring resources on our cluster more closely, and I noticed that
there is sometimes a big discrepancy between the read traffic reported by
Lustre and the incoming traffic reported by InfiniBand (which is the
interface carrying the Lustre traffic).

Currently I have a 4.4GB peak on Lustre while InfiniBand at the same time
is showing just 1.4GB/s of traffic (there is also a 2 minute difference
between the two peaks).
This is the sum over all the nodes (without the servers) in the cluster.
The stats are gathered using collectl at a 1 minute interval.

Thanks,
Eli

(There are also lots of stats that match 1:1 which makes me less sure what
to make of this)



Re: [lustre-discuss] LNET Self-test

2017-02-07 Thread Jon Tegner
Probably doing something wrong here, but I tried to test only READING 
with the following:


#!/bin/bash
export LST_SESSION=$$
lst new_session read
lst add_group servers 10.0.12.12@o2ib
lst add_group readers 10.0.12.11@o2ib
lst add_batch bulk_read
lst add_test --batch bulk_read --concurrency 12 --from readers --to servers \
brw read check=simple size=1M
lst run bulk_read
lst stat servers & sleep 10; kill $!
lst end_session

which in my case gives:

[LNet Rates of servers]
[R] Avg: 3633 RPC/s Min: 3633 RPC/s Max: 3633 RPC/s
[W] Avg: 7241 RPC/s Min: 7241 RPC/s Max: 7241 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 2.29 MB/s  Min: 2.29 MB/s  Max: 2.29 MB/s
[W] Avg: 3608.44  MB/s  Min: 3608.44  MB/s  Max: 3608.44  MB/s

It seems strange that it reports non-zero numbers in the [W]
positions. In particular, why is the bandwidth low in the [R] position,
when I explicitly requested "read"? Also note that if I change "brw read" to
"brw write" in the script above, the results are "reversed", in the sense
that the higher bandwidth number is reported in the [R]
position. That is, "brw read" reports (almost) the expected bandwidth in
the [W] position, whereas "brw write" reports it in the [R] position.
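
To compare the read and write runs side by side, the average bandwidth per direction can be pulled out of saved output. This is a minimal sketch that assumes the output looks exactly like the sample above; the helper name lst_avg_bandwidth is made up and is not part of lst.

```shell
#!/bin/bash
# Print "<direction> <average bandwidth>" for each line in the
# [LNet Bandwidth ...] section of saved `lst stat` output read from stdin.
# lst_avg_bandwidth is a hypothetical helper; the format is assumed from
# the sample output in this message.
lst_avg_bandwidth() {
    awk '
        /^\[LNet Bandwidth/  { in_bw = 1; next }   # start of the bandwidth section
        /^\[LNet/            { in_bw = 0; next }   # any other section header
        in_bw && /^\[[RW]\]/ { print $1, $3 }      # $1 is [R] or [W], $3 is the Avg value
    '
}

lst_avg_bandwidth <<'EOF'
[LNet Rates of servers]
[R] Avg: 3633 RPC/s Min: 3633 RPC/s Max: 3633 RPC/s
[W] Avg: 7241 RPC/s Min: 7241 RPC/s Max: 7241 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 2.29 MB/s  Min: 2.29 MB/s  Max: 2.29 MB/s
[W] Avg: 3608.44  MB/s  Min: 3608.44  MB/s  Max: 3608.44  MB/s
EOF
```

Running it on the "brw read" output prints one line for [R] (2.29) and one for [W] (3608.44); doing the same for a "brw write" run makes the swap easy to see.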


This is on CentOS-6.5/Lustre-2.5.3. Will try 7.3/2.9.0 later.

Thanks,
/jon


On 02/06/2017 05:45 PM, Oucharek, Doug S wrote:

Try running just a read test and then just a write test rather than having both 
at the same time and see if the performance goes up.

