Re: [lustre-discuss] High MDS load

2020-05-28 Thread Carlson, Timothy S
Since some mailers don't like attachments, I'll just paste in the script we use 
here.  

I call the script with

./parse.sh | sort -k3 -n

You just need to change out the name of your MDT in two places.

#!/bin/bash
set -e
SLEEP=10

# Clear the per-export stats under the given exports directory
stats_clear()
{
    cd "$1"
    echo clear > clear
}

# Print the non-empty, non-ping stats for every client export
stats_print()
{
    cd "$1"
    echo "===================== $1 ====================="
    for i in *; do
        [ -d "$i" ] || continue
        out=`cat "${i}/stats" | grep -v "snapshot_time" | grep -v "ping" || true`
        [ -n "$out" ] || continue
        # unquoted on purpose: collapse each client's stats onto one line
        echo $i $out
    done
    echo "==============================================="
    echo
}

for i in /proc/fs/lustre/mdt/lzfs-MDT /proc/fs/lustre/obdfilter/*OST*; do
dir="${i}/exports"
[ -d "$dir" ] || continue
stats_clear "$dir"
done
echo "Waiting ${SLEEP}s after clearing stats"
sleep $SLEEP

for i in /proc/fs/lustre/mdt/lzfs-MDT/ /proc/fs/lustre/obdfilter/*OST*; do
dir="${i}/exports"
[ -d "$dir" ] || continue
stats_print "$dir"
done
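
For what it's worth, the sorted output ends up looking roughly like the sketch below (the NIDs and numbers are made up for illustration). Each export collapses onto one line, and the sort key (column 3) is the sample count of the first stat on that line:

===================== /proc/fs/lustre/obdfilter/lzfs-OST0000/exports =====================
10.0.1.15@o2ib write_bytes 212 samples [bytes] 4096 1048576 104857600 read_bytes 8 samples [bytes] 4096 1048576 2097152
10.0.1.42@o2ib write_bytes 58123 samples [bytes] 4096 1048576 59506098176 read_bytes 3 samples [bytes] 4096 1048576 1048576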




On 5/28/20, 9:28 AM, "lustre-discuss on behalf of Bernd Melchers" 
 wrote:

>I have 2 MDSs and periodically the load on one of them (either one or
>the other) peaks above 300, causing the file system to basically stop.
>This lasts for a few minutes and then goes away.  We can't identify any
>one user running jobs at the times we see this, so it's hard to
>pinpoint this on a user doing something to cause it.   Could anyone
>point me in the direction of how to begin debugging this?  Any help is
>greatly appreciated.

I am not able to solve this problem, but...
We saw this behaviour (Lustre 2.12.3 and 2.12.4) together with Lustre
kernel-thread BUG messages in the kernel log (dmesg output); if I remember
correctly they came from the ll_ost_io threads on the OSSes, with other
messages on the MDSes. At that time the Omni-Path interface was no longer
pingable. We could not tell what crashed first, the Omni-Path or the Lustre
parts of the kernel. Perhaps you can check whether your MDSes are pingable
from your clients (using the network interface of your Lustre installation).
Otherwise it is expected that you get a high load, because your Lustre I/O
threads cannot satisfy requests.

Mit freundlichen Grüßen
Bernd Melchers

-- 
Archiv- und Backup-Service | fab-serv...@zedat.fu-berlin.de
Freie Universität Berlin   | Tel. +49-30-838-55905
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] Re: Lustre Timeouts/Filesystem Hanging

2019-10-29 Thread Carlson, Timothy S
We use a simple script to find clients that are hitting the OST.   I borrowed 
this from somewhere a decade ago and still use it.

It is a simple bash script you run on your OSS to see all the clients hitting the
various OSTs.

#!/bin/bash
set -e
SLEEP=10

# Clear the per-export stats under the given exports directory
stats_clear()
{
    cd "$1"
    echo clear > clear
}

# Print the non-empty, non-ping stats for every client export
stats_print()
{
    cd "$1"
    echo "===================== $1 ====================="
    for i in *; do
        [ -d "$i" ] || continue
        out=`cat "${i}/stats" | grep -v "snapshot_time" | grep -v "ping" || true`
        [ -n "$out" ] || continue
        # unquoted on purpose: collapse each client's stats onto one line
        echo $i $out
    done
    echo "==============================================="
    echo
}

for i in /proc/fs/lustre/obdfilter/*OST*; do
dir="${i}/exports"
[ -d "$dir" ] || continue
stats_clear "$dir"
done

echo "Waiting ${SLEEP}s after clearing stats"
sleep $SLEEP

for i in  /proc/fs/lustre/obdfilter/*OST*; do
dir="${i}/exports"
[ -d "$dir" ] || continue

stats_print "$dir"
done

From: Moreno Diego (ID SIS) 
Sent: Tuesday, October 29, 2019 10:08 AM
To: Louis Allen ; Oral, H. ; Carlson, 
Timothy S ; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] [EXTERNAL] Re: Lustre Timeouts/Filesystem Hanging

Hi Louis,

If you don’t have any particular monitoring on the servers (Prometheus, 
ganglia, etc..) you could also use sar (sysstat) or a similar tool to confirm 
the CPU waits for IO. Also the device saturation on sar or with iostat. For 
instance:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.19    0.00    6.09    0.10    0.06   93.55

Device: rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda       0.00     1.20    0.20    0.60     0.00     0.01    20.00     0.00    0.75    1.00    0.67   0.75   0.06
sdb       0.00   136.80    2.80   96.60     0.81     9.21   206.42     0.19    1.91   26.29    1.20   0.55   5.46
sdc       0.00   144.20   58.80  128.00     2.34    16.82   210.08     0.24    1.31    2.68    0.68   0.66  12.40

Then if you enable lustre job stats you can check on that specific device which 
job is doing most IO. Last but not least you could also parse which specific 
NID is doing the intensive IO on that OST 
(/proc/fs/lustre/obdfilter/-OST0007/exports/*/stats).
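
As a rough sketch (the job-ID variable and OST index below are only examples):

# tag RPCs with the job id your scheduler exports (Slurm shown here)
lctl set_param jobid_var=SLURM_JOB_ID

# per-job I/O counters for the busy OST
lctl get_param obdfilter.*OST0007*.job_stats

# quick per-NID view: write sample counts for each client export
for d in /proc/fs/lustre/obdfilter/*OST0007*/exports/*@*; do
    echo "$(basename $d) $(awk '/write_bytes/ {print $2}' $d/stats)"
done | sort -k2 -n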

Regards,

Diego


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Louis Allen <louisal...@live.co.uk>
Date: Tuesday, 29 October 2019 at 17:43
To: "Oral, H." <ora...@ornl.gov>, "Carlson, Timothy S" <timothy.carl...@pnnl.gov>, "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] [EXTERNAL] Re: Lustre Timeouts/Filesystem Hanging

Thanks, will take a look.

Any other areas i should be looking? Should i be applying any Lustre tuning?

Thanks

Get Outlook for Android <https://aka.ms/ghei36>

From: Oral, H. <ora...@ornl.gov>
Sent: Monday, October 28, 2019 7:06:41 PM
To: Louis Allen <louisal...@live.co.uk>; Carlson, Timothy S <timothy.carl...@pnnl.gov>; lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: Re: [EXTERNAL] Re: [lustre-discuss] Lustre Timeouts/Filesystem Hanging

For inspecting client side I/O, you can use Darshan.

Thanks,

Sarp

--
Sarp Oral, PhD

National Center for Computational Sciences
Oak Ridge National Laboratory
ora...@ornl.gov
865-574-2173


On 10/28/19, 1:58 PM, "lustre-discuss on behalf of Louis Allen"
<lustre-discuss-boun...@lists.lustre.org on behalf of louisal...@live.co.uk> wrote:


Thanks for the reply, Tim.


Are there any tools I can use to see if that is the cause?


Could any tuning possibly help the situation?


Thanks






From: Carlson, Timothy S <timothy.carl...@pnnl.gov>
Sent: Monday, 28 October 2019, 17:24
To: Louis Allen; lustre-discuss@lists.lustre.org
Subject: RE: Lustre Timeouts/Filesystem Hanging


In my experience, this is almost always related to some code doing really
bad I/O. Let's say you have a 1000-rank MPI code doing open/read 4k/close on a
few specific files on that OST.  That will make for a bad day.

Re: [lustre-discuss] Lustre Timeouts/Filesystem Hanging

2019-10-28 Thread Carlson, Timothy S
In my experience, this is almost always related to some code doing really bad
I/O. Let's say you have a 1000-rank MPI code doing open/read 4k/close on a few
specific files on that OST.  That will make for a bad day.

The other place you can see this, and this isn't your case, is when ZFS refuses
to give up on a disk that is failing and your overall I/O suffers from ZFS
continuing to try to read from a disk that it should just kick out.

Tim


From: lustre-discuss  On Behalf Of 
Louis Allen
Sent: Monday, October 28, 2019 10:16 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre Timeouts/Filesystem Hanging

Hello,

Lustre (2.12) seems to be hanging quite frequently (5+ times a day) for us, and
one of the OSS servers (out of 4) is reporting an extremely high load average
(150+), but the CPU usage of that server is actually very low - so it must be
related to something else - possibly CPU_IO_WAIT.

On the OSS server where we are seeing the high load averages, we can also see
multiple LustreError messages in /var/log/messages:

Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Service thread pid 2403 was 
inactive for 200.08s. The thread might be hung, or it might only be slow and 
will resume later. Dumping the stack trace for debugging purposes:
Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Skipped 4 previous similar 
messages
Oct 28 11:22:23 pazlustreoss001 kernel: Pid: 2403, comm: ll_ost00_068 
3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Sun May 26 21:48:35 UTC 2019
Oct 28 11:22:23 pazlustreoss001 kernel: Call Trace:
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
jbd2_log_wait_commit+0xc5/0x140 [jbd2]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
jbd2_complete_transaction+0x52/0xa0 [jbd2]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ldiskfs_sync_file+0x2e2/0x320 [ldiskfs]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
vfs_fsync_range+0x20/0x30
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
osd_object_sync+0xb1/0x160 [osd_ldiskfs]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
tgt_sync+0xb7/0x270 [ptlrpc]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ofd_sync_hdl+0x111/0x530 [ofd]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
tgt_request_handle+0xaea/0x1580 [ptlrpc]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Oct 28 11:22:23 pazlustreoss001 kernel: [] kthread+0xd1/0xe0
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ret_from_fork_nospec_end+0x0/0x39
Oct 28 11:22:23 pazlustreoss001 kernel: [] 0x
Oct 28 11:22:23 pazlustreoss001 kernel: LustreError: dumping log to 
/tmp/lustre-log.1572261743.2403
Oct 28 11:22:23 pazlustreoss001 kernel: Pid: 2292, comm: ll_ost03_043 
3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Sun May 26 21:48:35 UTC 2019
Oct 28 11:22:23 pazlustreoss001 kernel: Call Trace:
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
jbd2_log_wait_commit+0xc5/0x140 [jbd2]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
jbd2_complete_transaction+0x52/0xa0 [jbd2]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ldiskfs_sync_file+0x2e2/0x320 [ldiskfs]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
vfs_fsync_range+0x20/0x30
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
osd_object_sync+0xb1/0x160 [osd_ldiskfs]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
tgt_sync+0xb7/0x270 [ptlrpc]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ofd_sync_hdl+0x111/0x530 [ofd]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
tgt_request_handle+0xaea/0x1580 [ptlrpc]
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Service thread pid 2403 completed 
after 200.29s. This indicates the system was overloaded (too many service 
threads, or there were not enough hardware resources).
Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Skipped 48 previous similar 
messages
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Oct 28 11:22:23 pazlustreoss001 kernel: [] kthread+0xd1/0xe0
Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ret_from_fork_nospec_end+0x0/0x39
Oct 28 11:22:23 pazlustreoss001 kernel: [] 0x


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Is it a good practice to use big OST?

2019-10-08 Thread Carlson, Timothy S
I've been running 100-200TB OSTs making up small petabyte file systems for the
last 4 or 5 years with no pain, from Lustre 2.5.x through the current generation.

I've had plenty of ZFS rebuilds when I ran across a set of bad disks, and they went fine.

From: lustre-discuss  On Behalf Of 
w...@umich.edu
Sent: Tuesday, October 8, 2019 10:43 AM
To: lustre-discuss 
Subject: [lustre-discuss] Is it a good practice to use big OST?

Hi All
We recently purchased new storage hardware, and that gives us the option of
creating big zpools for OSTs (>100TB per OST).
I am wondering if anyone has any experience using big OSTs and whether that would
impact the performance of Lustre in a good or bad way?


Any comments or suggestions are appreciated!

Cheers!

-Wenjing
w...@umich.edu
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS tuning for MDT/MGS

2019-03-13 Thread Carlson, Timothy S
+1 on

options zfs zfs_prefetch_disable=1

Might not be as critical now, but that was a must-have on Lustre 2.5.x

Tim

From: lustre-discuss  On Behalf Of 
Riccardo Veraldi
Sent: Wednesday, March 13, 2019 3:00 PM
To: Kurt Strosahl ; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] ZFS tuning for MDT/MGS

these are the zfs settings I use on my MDSes

 zfs set mountpoint=none mdt0
 zfs set sync=disabled mdt0
 zfs set atime=off mdt0
 zfs set redundant_metadata=most mdt0
 zfs set xattr=sa mdt0

If your MDT partition is on a 4KB-sector disk you can use ashift=12 when you
create the pool, but ZFS is pretty smart; in my case it recognized this and
used ashift=12 automatically.
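
For example (the pool name and devices below are placeholders):

# create the MDT pool explicitly aligned for 4K sectors
zpool create -o ashift=12 mdt0 mirror /dev/sda /dev/sdb

# check what ashift the pool actually got
zdb -C mdt0 | grep ashift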

Also, here are the ZFS kernel module parameters I use to get better
performance. I use them on both the MDS and the OSSes:

options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32

Some people may disagree with me; anyway, after years of trying different options
I reached this stable configuration.

There are also a bunch of other important Lustre-level optimizations that you
can do if you are looking for a performance increase.

Cheers

Rick

On 3/13/19 11:44 AM, Kurt Strosahl wrote:

Good Afternoon,



I'm reviewing the zfs parameters for a new metadata system and I was 
looking to see if anyone had examples (good or bad) of zfs parameters?  I'm 
assuming that the MDT won't benefit from a recordsize of 1MB, and I've already 
set the ashift to 12.  I'm using an MDT/MGS made up of a stripe across mirrored 
ssds.



w/r,

Kurt



___

lustre-discuss mailing list

lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Rebooting storage nodes while jobs are running?

2019-02-27 Thread Carlson, Timothy S
I will say YMMV.  I've rebooted storage nodes and have had mixed results where
we land in one of three buckets:

1) Codes breeze through and have just been stuck in D state while OSS's reboot
2) RPCs get stuck somewhere and when the OSS comes back I eventually have to 
force an abort_recovery
3) A code dies by not handling the timeout (not sure if this is due to the code 
itself or the client improperly handling the timeout)

On our current setup with around 1000 clients, 50-ish OSSes, and 2.5.x vintage
Lustre servers, I would say option 1 is by far the largest percentage (>95%). 2
and 3 happen from time to time with likelihood greater than 0.

It's always a best practice to take a scheduled outage for a kernel/version 
upgrade. You never know what oddity your particular setup might encounter.

Tim

-Original Message-
From: lustre-discuss  On Behalf Of 
Paul Edmon
Sent: Wednesday, February 27, 2019 7:54 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Rebooting storage nodes while jobs are running?

From experience, rebooting the storage nodes is fine; the processes accessing
them will just hang until service is restored.  I've done this many times on our cluster
with no ill effect.

That said I have not tried it with kernel upgrades or lustre release changes.  
That may do something different and unexpected. Some one else on the list may 
have insight on these.

-Paul Edmon-

On 2/27/19 10:17 AM, Bernd Melchers wrote:
> Hi all,
> our environment: CentOS-7.6, lustre-2.12.0@zfs-0.7.12, 2 mds, 7 ods, 180 
> clients.
>
> Is it possible to reboot the mds and ods server (e.g. for new kernel 
> or new lustre releases) without affecting running jobs on the client nodes?
> The reboot can take up to 15 minutes. Did the clients still wait for 
> the storage nodes to reappear or will i/o operations get errors?
> Is the behaviour of a client influenced by the timeout parameter ( 
> "lctl get_param timeout") or by other parameters?
>
> Mit freundlichen Grüßen
> Bernd Melchers
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre for home directories

2018-04-25 Thread Carlson, Timothy S
I would work on fixing your NFS server before moving to Lustre.   That being 
said, I have no idea of how big an installation you have. How many nodes you 
have for NFS clients, how much data you are talking about moving around, etc.

As others will point out, even with improvements in Lustre metadata operations 
recently, you are likely to be metadata bound by Lustre well before you are 
bound up with NFS operations. 


Tim

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Dilger, Andreas
Sent: Wednesday, April 25, 2018 11:31 AM
To: Riccardo Veraldi 
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lustre for home directories

On Apr 25, 2018, at 11:09, Riccardo Veraldi  
wrote:
> 
> Hello,
> just wondering if who is using lustre for home directories with 
> several users is happy or not.

I can't comment for other people, but there are definitely some sites that are 
using Lustre for the /home directories.  Hopefully they will speak up here.

> I am considering to move home directories from NFS to Lustre/ZFS.
> it is quite easy to get the NFS server into trouble with just a few
> users copying files around.
> What special tuning is needed to optimize Lustre usage with small files?
> I guess a 1M record size would not be a good choice anymore.

You should almost certainly use a default of stripe_count=1 for home 
directories, on the assumption that files should not be gigantic.

In that case, stripe size does not matter if you have 1-stripe files.  This 
does not affect the on-disk allocation size.  If you have dedicated OSTs for 
the /home directories, then I'd recommend NOT to use recordsize=1M for ZFS, and 
instead leave it at the default (recordsize=128k).
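
Roughly, assuming /home is the client-side mount point and the home OSTs live on a pool called ostpool (both names are placeholders):

# make single-stripe files the default for everything under /home
lfs setstripe -c 1 /home

# on the OSSes, keep (or restore) the default 128k recordsize for the home OST datasets
zfs get recordsize ostpool/home-ost0
zfs set recordsize=128k ostpool/home-ost0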

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre as /home directory

2018-02-16 Thread Carlson, Timothy S
I'll just add  +1 to this thread. /home on NFS for software builds, small 
files, lots of metadata operations.  Lustre for the rest.  Users will do the 
wrong thing even after education. 

Tim

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Steve Barnet
Sent: Friday, February 16, 2018 7:21 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre as /home directory

On 2/16/18 8:53 AM, Michael Watters wrote:
> Can't be much worse than NFS.


Oh yes, it can! It will really depend upon how your users work with it. If you 
regularly have many people building their software chains, that can drag your 
interactive response to its knees very quickly.
If your users also like to drop many thousands of small files into just a few
directories, you will definitely notice that as well.

We have found that NFS (while it certainly will also
suffer) generally holds up better than Lustre in those use cases. No 
particularly earth shattering news there.

We have generally kept /home on NFS, and have fairly restrictive quotas there. 
The idea is that /home is used for software builds and final results, but that 
the heavy cluster processing workloads are handled by lustre.

This has worked out OK. We still run into plenty of cases where people do the 
wrong thing, but we can generally redirect them to the right place.

Best,

---Steve



> 
> 
> On 02/15/2018 10:30 AM, Mark Hahn wrote:
>>> My question is, Is it advisable to have /home in Lustre since users 
>>> data will be of small files (less than 5MB)?
>>
>> certainly it works, but is not very pleasant for metadata-intensive 
>> activity, such as compiling.
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Designing a new Lustre system

2017-12-21 Thread Carlson, Timothy S
Isilon is truly an enterprise solution. We have one (about a dozen bricks 
worth) and use it for home directories on our super computers and it allows 
easy access via CIFS to users on Windows/Mac.

It is highly configurable with “smart pools” and policies to move data around
based on age/size/access time/etc. How you “stripe” the data depends on the
policy you set.

It is a BSD with Isilon magic sauce. It uses its own InfiniBand networking on
the backend to move data around and pipe it out through the various front ends that
you have configured. But as has been pointed out, it doesn't really do parallel I/O.

For our users with millions (or hundreds of millions) of files, it works as a
solution.

Tim

From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of John Bent
Sent: Thursday, December 21, 2017 4:44 PM
To: Glenn Lockwood 
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Designing a new Lustre system

Last I looked at Isilon, it serializes parallel writes to a single file.  
Ultimately, the data is striped across multiple data servers but it all 
channels through a single data server.  If you only have file-per-process 
workloads, and you have a lot of money, then Isilon is considered a solid 
enterprise solution.

On Thu, Dec 21, 2017 at 7:15 PM, Glenn Lockwood <gl...@lbl.gov> wrote:

On Wed, Dec 20, 2017 at 8:21 AM, E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:

4. One of my colleagues likes Isilon very much, I have not been able to find 
any literature on if/how Lustre compares any pointers/knowledge on the subject 
is very welcome.


I haven't looked at Isilon in a while, but my recollection was that

1. It's phenomenally expensive, especially at smaller scales.  This is the most 
obvious detractor vs. Lustre, especially at low node counts.

2. It's completely proprietary and architecturally complex, so management and 
support model is difficult to shape into existing operations.  There are also 
cost implications here.

3. It uses NFS for transport, so it doesn't offer POSIX consistency between 
clients.  This makes shared-file parallel I/O extremely hazardous.

In my experience, Isilon is popular in markets flush with money but scarce in 
institutional storage expertise.  In such cases, #1 and #2 are non-issues, and 
#3 often doesn't apply because such industries' workloads are 
throughput-oriented and rarely use MPI.

Glenn

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Does lustre 2.10 client support 2.5 server ?

2017-11-07 Thread Carlson, Timothy S
FWIW, we have successfully been running 2.9 clients (RHEL 7.3) with 2.5.3 
servers (RHEL 6.6) at a small scale. About 40 OSSes and dozens of 2.9 clients 
with hundreds of 2.5.3 clients mixed in.

Tim

From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of E.S. Rosenberg
Sent: Tuesday, November 07, 2017 7:30 AM
To: Biju C P 
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Does lustre 2.10 client support 2.5 server ?

Hi Biju,
The 2.10 client is multi-rail aware while the 2.5 server is not; there have been
multiple reports on the list and several open bugs that this combination
doesn't work.
A 2.9 client may work, but in general it is my understanding that compatibility
is only checked one version back (so 2.9-2.8, etc.).
HTH,
Eli

On Tue, Nov 7, 2017 at 10:55 AM, Biju C P <cpb...@gmail.com> wrote:
Hi,

My Lustre server is running the version 2.5 and I want to use 2.10 client. Is 
this combination supported ? Is there anything that I need to be aware of ?

--
Biju C P

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] problems accessing files as non-root user.

2016-12-12 Thread Carlson, Timothy S
Does your new MDS server have all the UIDs of these people in /etc/passwd?

Tim

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Phill Harvey-Smith
Sent: Monday, December 12, 2016 9:16 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] problems accessing files as non-root user.

Hi All,

I'm in the final step of upgrading our storage servers to lustre 2.8.

The MDS/OSS are running on CentOS 7.2; the clients are Ubuntu 12.04, though I
also have a virtual machine running CentOS 7.2 as a client. Both seem to
exhibit the same problem.

Our old environment was a 2.4 client and a 2.1 server.

I have used rsync across ssh to sync the data from the old environment to the 
new, this appeared to work correctly.

However, trying to access the data on the new clients (either the Ubuntu or
CentOS ones) as a non-root user results in strange errors and an inability to
access the files. Accessing them as root seems to work without error.

for example one of my filesystems is mounted as /home :

192.168.0.6@tcp0:/storage   /storagelustre 
defaults,_netdev,flock,noauto 0 0
192.168.0.6@tcp0:/home  /home   lustre 
defaults,_netdev,flock,noauto 0 0
192.168.0.6@tcp0:/scratch   /scratchlustre 
defaults,_netdev,flock,noauto 0 0

For example, as user stsxab, doing an ls -la of /home results in many entries
such as:

stsxab@test-r420:/home$ ls -la
ls: .: Bad address
ls: cannot access margaw: Permission denied
ls: home.old: Bad address
ls: strjab: Bad address
ls: margbe: Bad address
ls: strmah: Bad address
ls: strkar: Bad address

etc before a list of some of the directories.

The following is also logged in /var/log/syslog :
Dec 12 17:09:30 test-r420 kernel: [ 1622.96] Lustre: Unmounted home-client
Dec 12 17:09:40 test-r420 kernel: [ 1633.122522] Lustre: Unmounted home-client
Dec 12 17:10:04 test-r420 kernel: [ 1656.968193] Lustre: Mounted home-client
Dec 12 17:10:04 test-r420 kernel: [ 1656.968199] Lustre: Skipped 3 previous similar messages
Dec 12 17:11:07 test-r420 kernel: [ 1720.170159] LustreError: 11-0: home-MDT-mdc-881002e2b000: operation ldlm_enqueue to node 192.168.0.6@tcp failed: rc = -14

But doing the ls as root works fine.

Any idea what the problem is with this? I need to get it resolved, or roll
back to the old clients/servers and postpone until it is working correctly.
Cheers.

Phill.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Odd problem with new OSTs not being used

2016-09-01 Thread Carlson, Timothy S
Following up on my own email.

Looks like I triggered this bug

https://jira.hpdd.intel.com/browse/LU-5778

While all of the OSTs are listed as "UP", the reality is that 4 had been made 
INACTIVE for various reasons. Once I reactivated those OSTs, the empty OSTs 
began to take data.   Looks like I will be upgrading to 2.5.4 soon as I really 
need to be able to deactivate OSTs and have the algorithm on the MDS still be 
able to choose new OSTs to write to.
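
For reference, the reactivation itself is just the usual lctl step on the MDS; roughly the following (the device name below is only an example, and would normally carry the full MDT index):

# find the MDS-side device for the OST in question
lctl dl | grep OST001c

# mark it active again so the MDS will allocate new objects there
lctl --device lzfs-OST001c-osc-MDT0000 activate

# confirm
lctl get_param osp.lzfs-OST001c*.active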

Tim

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Carlson, Timothy S
Sent: Thursday, September 1, 2016 2:00 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Odd problem with new OSTs not being used

Running Lustre 2.5.3(ish) backed with ZFS. 

We’ve added a few OSTs and they show as being “UP” but aren’t taking any data

[root@lzfs01a ~]# lctl dl
  0 UP osd-zfs MGS-osd MGS-osd_UUID 5
  1 UP mgs MGS MGS 1085
  2 UP mgc MGC172.17.210.11@o2ib9 77cf08da-86a4-7824-1878-84b540993c6d 5
  3 UP osd-zfs lzfs-MDT-osd lzfs-MDT-osd_UUID 42
  4 UP mds MDS MDS_uuid 3
  5 UP lod lzfs-MDT-mdtlov lzfs-MDT-mdtlov_UUID 4
  6 UP mdt lzfs-MDT lzfs-MDT_UUID 1087
  7 UP mdd lzfs-MDD lzfs-MDD_UUID 4
  8 UP qmt lzfs-QMT lzfs-QMT_UUID 4
  9 UP osp lzfs-OST0008-osc-MDT lzfs-MDT-mdtlov_UUID 5
 10 UP osp lzfs-OST0003-osc-MDT lzfs-MDT-mdtlov_UUID 5
 11 UP osp lzfs-OST0006-osc-MDT lzfs-MDT-mdtlov_UUID 5
 12 UP osp lzfs-OST0007-osc-MDT lzfs-MDT-mdtlov_UUID 5
 13 UP osp lzfs-OST0004-osc-MDT lzfs-MDT-mdtlov_UUID 5
 14 UP osp lzfs-OST000a-osc-MDT lzfs-MDT-mdtlov_UUID 5
 15 UP osp lzfs-OST-osc-MDT lzfs-MDT-mdtlov_UUID 5
 16 UP osp lzfs-OST0002-osc-MDT lzfs-MDT-mdtlov_UUID 5
 17 UP osp lzfs-OST0001-osc-MDT lzfs-MDT-mdtlov_UUID 5
 18 UP osp lzfs-OST0005-osc-MDT lzfs-MDT-mdtlov_UUID 5
 19 UP osp lzfs-OST0009-osc-MDT lzfs-MDT-mdtlov_UUID 5
 20 UP osp lzfs-OST000b-osc-MDT lzfs-MDT-mdtlov_UUID 5
 21 UP osp lzfs-OST000c-osc-MDT lzfs-MDT-mdtlov_UUID 5
 22 UP osp lzfs-OST000d-osc-MDT lzfs-MDT-mdtlov_UUID 5
 23 UP osp lzfs-OST0010-osc-MDT lzfs-MDT-mdtlov_UUID 5
 24 UP osp lzfs-OST000f-osc-MDT lzfs-MDT-mdtlov_UUID 5
 25 UP osp lzfs-OST000e-osc-MDT lzfs-MDT-mdtlov_UUID 5
 26 UP osp lzfs-OST0011-osc-MDT lzfs-MDT-mdtlov_UUID 5
 27 UP osp lzfs-OST0015-osc-MDT lzfs-MDT-mdtlov_UUID 5
 28 UP osp lzfs-OST0016-osc-MDT lzfs-MDT-mdtlov_UUID 5
 29 UP osp lzfs-OST0017-osc-MDT lzfs-MDT-mdtlov_UUID 5
 30 UP osp lzfs-OST0018-osc-MDT lzfs-MDT-mdtlov_UUID 5
 31 UP osp lzfs-OST0019-osc-MDT lzfs-MDT-mdtlov_UUID 5
 32 UP osp lzfs-OST001b-osc-MDT lzfs-MDT-mdtlov_UUID 5
 33 UP osp lzfs-OST0013-osc-MDT lzfs-MDT-mdtlov_UUID 5
 34 UP osp lzfs-OST0014-osc-MDT lzfs-MDT-mdtlov_UUID 5
 35 UP lwp lzfs-MDT-lwp-MDT lzfs-MDT-lwp-MDT_UUID 5
 36 UP osp lzfs-OST001c-osc-MDT lzfs-MDT-mdtlov_UUID 5
 37 UP osp lzfs-OST0012-osc-MDT lzfs-MDT-mdtlov_UUID 5
 38 UP osp lzfs-OST001a-osc-MDT lzfs-MDT-mdtlov_UUID 5
 39 UP osp lzfs-OST001d-osc-MDT lzfs-MDT-mdtlov_UUID 5
 40 UP osp lzfs-OST001e-osc-MDT lzfs-MDT-mdtlov_UUID 5
 41 UP osp lzfs-OST001f-osc-MDT lzfs-MDT-mdtlov_UUID 5
 42 UP osp lzfs-OST0020-osc-MDT lzfs-MDT-mdtlov_UUID 5
 43 UP osp lzfs-OST0021-osc-MDT lzfs-MDT-mdtlov_UUID 5
 44 UP osp lzfs-OST0022-osc-MDT lzfs-MDT-mdtlov_UUID 5

Now if you look at devices 36 and higher, you’ll see that they don’t have much 
data even though they have been online for a few weeks and this is a fairly 
active file system.   The data that is in there is data that I have “forced” 
onto the OSTs for testing by setting the stripe to that specific OST.  

# lfs df
UUID   1K-blocksUsed   Available Use% Mounted on
lzfs-MDT_UUID60762585216   262897152 60499686016   0% /pic[MDT:0]
lzfs-OST_UUID90996712832 82190600320  8805795072  90% /pic[OST:0]
lzfs-OST0001_UUID90996823936 82773737088  8221323776  91% /pic[OST:1]
lzfs-OST0002_UUID90996723840 82547420928  844820  91% /pic[OST:2]
lzfs-OST0003_UUID90996780416 82570822400  8425071872  91% /pic[OST:3]
lzfs-OST0004_UUID90996792320 83526260096  7466092288  92% /pic[OST:4]
lzfs-OST0005_UUID90996764544 83071284864  7922972800  91% /pic[OST:5]
lzfs-OST0006_UUID90996729600 83348930304  7643451520  92% /pic[OST:6]
lzfs-OST0007_UUID9099680 82677238272  8314902016  91% /pic[OST:7]
lzfs-OST0008_UUID90996910208 83598099584  7396038656  92% /pic[OST:8]
lzfs-OST0009_UUID90997091328 85659415424  5335623552  94% /pic[OST:9]
lzfs-OST000a_UUID90996807680 83581871872  7410268800  92% /pic[OST:10]
lzfs-OST000b_UUID90996676352 77512128000 13484523136  85% /pic[OST:11]
lzfs-OST000c_UUID90996505984 86176576256  4819325824

[lustre-discuss] Odd problem with new OSTs not being used

2016-09-01 Thread Carlson, Timothy S
Running Lustre 2.5.3(ish) backed with ZFS. 

We’ve added a few OSTs and they show as being “UP” but aren’t taking any data

[root@lzfs01a ~]# lctl dl
  0 UP osd-zfs MGS-osd MGS-osd_UUID 5
  1 UP mgs MGS MGS 1085
  2 UP mgc MGC172.17.210.11@o2ib9 77cf08da-86a4-7824-1878-84b540993c6d 5
  3 UP osd-zfs lzfs-MDT-osd lzfs-MDT-osd_UUID 42
  4 UP mds MDS MDS_uuid 3
  5 UP lod lzfs-MDT-mdtlov lzfs-MDT-mdtlov_UUID 4
  6 UP mdt lzfs-MDT lzfs-MDT_UUID 1087
  7 UP mdd lzfs-MDD lzfs-MDD_UUID 4
  8 UP qmt lzfs-QMT lzfs-QMT_UUID 4
  9 UP osp lzfs-OST0008-osc-MDT lzfs-MDT-mdtlov_UUID 5
 10 UP osp lzfs-OST0003-osc-MDT lzfs-MDT-mdtlov_UUID 5
 11 UP osp lzfs-OST0006-osc-MDT lzfs-MDT-mdtlov_UUID 5
 12 UP osp lzfs-OST0007-osc-MDT lzfs-MDT-mdtlov_UUID 5
 13 UP osp lzfs-OST0004-osc-MDT lzfs-MDT-mdtlov_UUID 5
 14 UP osp lzfs-OST000a-osc-MDT lzfs-MDT-mdtlov_UUID 5
 15 UP osp lzfs-OST-osc-MDT lzfs-MDT-mdtlov_UUID 5
 16 UP osp lzfs-OST0002-osc-MDT lzfs-MDT-mdtlov_UUID 5
 17 UP osp lzfs-OST0001-osc-MDT lzfs-MDT-mdtlov_UUID 5
 18 UP osp lzfs-OST0005-osc-MDT lzfs-MDT-mdtlov_UUID 5
 19 UP osp lzfs-OST0009-osc-MDT lzfs-MDT-mdtlov_UUID 5
 20 UP osp lzfs-OST000b-osc-MDT lzfs-MDT-mdtlov_UUID 5
 21 UP osp lzfs-OST000c-osc-MDT lzfs-MDT-mdtlov_UUID 5
 22 UP osp lzfs-OST000d-osc-MDT lzfs-MDT-mdtlov_UUID 5
 23 UP osp lzfs-OST0010-osc-MDT lzfs-MDT-mdtlov_UUID 5
 24 UP osp lzfs-OST000f-osc-MDT lzfs-MDT-mdtlov_UUID 5
 25 UP osp lzfs-OST000e-osc-MDT lzfs-MDT-mdtlov_UUID 5
 26 UP osp lzfs-OST0011-osc-MDT lzfs-MDT-mdtlov_UUID 5
 27 UP osp lzfs-OST0015-osc-MDT lzfs-MDT-mdtlov_UUID 5
 28 UP osp lzfs-OST0016-osc-MDT lzfs-MDT-mdtlov_UUID 5
 29 UP osp lzfs-OST0017-osc-MDT lzfs-MDT-mdtlov_UUID 5
 30 UP osp lzfs-OST0018-osc-MDT lzfs-MDT-mdtlov_UUID 5
 31 UP osp lzfs-OST0019-osc-MDT lzfs-MDT-mdtlov_UUID 5
 32 UP osp lzfs-OST001b-osc-MDT lzfs-MDT-mdtlov_UUID 5
 33 UP osp lzfs-OST0013-osc-MDT lzfs-MDT-mdtlov_UUID 5
 34 UP osp lzfs-OST0014-osc-MDT lzfs-MDT-mdtlov_UUID 5
 35 UP lwp lzfs-MDT-lwp-MDT lzfs-MDT-lwp-MDT_UUID 5
 36 UP osp lzfs-OST001c-osc-MDT lzfs-MDT-mdtlov_UUID 5
 37 UP osp lzfs-OST0012-osc-MDT lzfs-MDT-mdtlov_UUID 5
 38 UP osp lzfs-OST001a-osc-MDT lzfs-MDT-mdtlov_UUID 5
 39 UP osp lzfs-OST001d-osc-MDT lzfs-MDT-mdtlov_UUID 5
 40 UP osp lzfs-OST001e-osc-MDT lzfs-MDT-mdtlov_UUID 5
 41 UP osp lzfs-OST001f-osc-MDT lzfs-MDT-mdtlov_UUID 5
 42 UP osp lzfs-OST0020-osc-MDT lzfs-MDT-mdtlov_UUID 5
 43 UP osp lzfs-OST0021-osc-MDT lzfs-MDT-mdtlov_UUID 5
 44 UP osp lzfs-OST0022-osc-MDT lzfs-MDT-mdtlov_UUID 5

Now if you look at devices 36 and higher, you’ll see that they don’t have much 
data even though they have been online for a few weeks and this is a fairly 
active file system.   The data that is in there is data that I have “forced” 
onto the OSTs for testing by setting the stripe to that specific OST.  
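The forcing is just single-OST striping, e.g. something like this (the index and path are only examples):

# pin new files in this directory to OST index 28 (0x1c)
lfs setstripe -c 1 -i 28 /pic/tmp/ost1c-test
dd if=/dev/zero of=/pic/tmp/ost1c-test/fill.dat bs=1M count=1024
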

# lfs df
UUID   1K-blocksUsed   Available Use% Mounted on
lzfs-MDT_UUID60762585216   262897152 60499686016   0% /pic[MDT:0]
lzfs-OST_UUID90996712832 82190600320  8805795072  90% /pic[OST:0]
lzfs-OST0001_UUID90996823936 82773737088  8221323776  91% /pic[OST:1]
lzfs-OST0002_UUID90996723840 82547420928  844820  91% /pic[OST:2]
lzfs-OST0003_UUID90996780416 82570822400  8425071872  91% /pic[OST:3]
lzfs-OST0004_UUID90996792320 83526260096  7466092288  92% /pic[OST:4]
lzfs-OST0005_UUID90996764544 83071284864  7922972800  91% /pic[OST:5]
lzfs-OST0006_UUID90996729600 83348930304  7643451520  92% /pic[OST:6]
lzfs-OST0007_UUID9099680 82677238272  8314902016  91% /pic[OST:7]
lzfs-OST0008_UUID90996910208 83598099584  7396038656  92% /pic[OST:8]
lzfs-OST0009_UUID90997091328 85659415424  5335623552  94% /pic[OST:9]
lzfs-OST000a_UUID90996807680 83581871872  7410268800  92% /pic[OST:10]
lzfs-OST000b_UUID90996676352 77512128000 13484523136  85% /pic[OST:11]
lzfs-OST000c_UUID90996505984 86176576256  4819325824  95% /pic[OST:12]
lzfs-OST000d_UUID90997104256 90339916032   656510208  99% /pic[OST:13]
lzfs-OST000e_UUID90996660480 86856594560  4134641792  95% /pic[OST:14]
lzfs-OST000f_UUID90996441472 82859149568  8134773888  91% /pic[OST:15]
lzfs-OST0010_UUID90996592896 88961102592  2034770816  98% /pic[OST:16]
lzfs-OST0011_UUID90996459264 83005755520  7989576448  91% /pic[OST:17]
lzfs-OST0012_UUID9016800 1073280 90998828928   0% /pic[OST:18]
lzfs-OST0013_UUID90996418560 83272862336  7716835328  92% /pic[OST:19]
lzfs-OST0014_UUID90996442496 84503368320  6486773504  93% /pic[OST:20]
lzfs-OST0015_UUID90996476416 82157845376  8831992320  90% /pic[OST:21]
lzfs-OST0016_UUID90996456960 83149106688  7844745088  91

Re: [lustre-discuss] wildly inaccurate file size

2016-06-30 Thread Carlson, Timothy S
Is this a ZFS-backed Lustre with compression? If so, that is not at all
surprising if that is a compressible file. I have a 1G file of zeros that shows
up as 512 bytes:

[root@pic-admin03 tim]# ls -sh 1G
512 1G
[root@pic-admin03 tim]# ls -l 1G
-rw-r--r-- 1 tim users 1073741824 Dec  2  2015 1G
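
If you want to confirm compression is what you are seeing, something like this works (the dataset name is just an example):

# apparent size vs. blocks actually allocated
du --apparent-size -h 1G
du -h 1G

# and the ZFS view on the server side
zfs get compression,compressratio ostpool/ost0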

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of John White
Sent: Thursday, June 30, 2016 3:15 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] wildly inaccurate file size

So I’ve got this file that’s really making a file copy tool quite upset.  Would 
the following point to metadata corruption? I’ve found several files similar to 
this.

Please note I killed the ‘dd’ after confirming it passed the 189MB mark:

[root@n0001 ~]# dd 
if=/global/scratch/kclosser/qcscratch/CH3Cl_0_160623-131335.scr.0/73.0 
of=/dev/null bs=1024k
^C23131+0 records in
23130+0 records out
24253562880 bytes (24 GB) copied, 22.1161 s, 1.1 GB/s

[root@n0001 ~]# stat 
/global/scratch/kclosser/qcscratch/CH3Cl_0_160623-131335.scr.0/73.0
  File: `/global/scratch/kclosser/qcscratch/CH3Cl_0_160623-131335.scr.0/73.0'
  Size: 2577009477120   Blocks: 385776 IO Block: 4194304 regular file
Device: 323d03b2h/842859442dInode: 148006512164371012  Links: 1
Access: (0644/-rw-r--r--)  Uid: (42015/kclosser)   Gid: (  505/ msd)
Access: 2016-06-30 15:12:57.0 -0700
Modify: 2016-06-23 21:33:36.0 -0700
Change: 2016-06-23 21:33:36.0 -0700
[root@n0001 ~]# ls -sh 
/global/scratch/kclosser/qcscratch/CH3Cl_0_160623-131335.scr.0/73.0
189M /global/scratch/kclosser/qcscratch/CH3Cl_0_160623-131335.scr.0/73.0
[root@n0001 ~]# 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS backed OSS out of memory

2016-06-24 Thread Carlson, Timothy S
Alex,

Answers to your questions.

1) I'll do more looking on github for OOM problems. I'll post if I find a 
resolution that fits my problem. I do not have Lustre mounted on the OSS.

2) We have not set any of the parameters but they do look interesting and 
possibly related. The OOM seems to be related to large read activity from the
cluster, but we haven't totally made that correlation. There are no active
scrubs or snapshots when we see the OOM killer fire.

3) We're looking at possible upgrade paths and to where the stars would align 
on a happy version of Lustre/ZFS that isn't too different from what we have now.

4) We are going to add more monitoring to our ZFS config. The standard 
"arcstat.py" really shows nothing interesting other than our 64GB ARC is always 
full. 

5) The "drop caches" has no impact. If you see my "top" output, there is 
nothing in the cache.

Thanks for the pointers. We'll keep investigating and probably implement a 
couple of the settings in (2).

Tim

-Original Message-
From: Alexander I Kulyavtsev [mailto:a...@fnal.gov] 
Sent: Thursday, June 23, 2016 3:07 PM
To: Carlson, Timothy S 
Cc: Alexander I Kulyavtsev ; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] ZFS backed OSS out of memory

1) https://github.com/zfsonlinux/zfs/issues/2581
suggests few things to monitor in /proc .  Searching for OOM at 
https://github.com/zfsonlinux/zfs/issues gives more hints where to look.

I guess OOM is not necessarily caused by zfs/spl.
Do you have lustre mounted on OSS and some process writing to it? (memory 
pressure).

2)
> http://lustre.ornl.gov/ecosystem-2016/documents/tutorials/Stearman-LLN
> L-ZFS.pdf
Last three pages.
2a) It may be worth setting, in /etc/modprobe.d/zfs.conf:
   options zfs zfs_prefetch_disable=1

2b) did you set metaslab_debug_unload ? It increases memory consumption.

Can you correlate OOM with some type of activity (read; write; scrub; snapshot 
delete)?
Do you actually re-read the same data? ARC helps with the second read.
Having 64GB of in-memory ARC seems a lot together with L2ARC on SSD.
lustre does not use zfs slog IIRC.

3) do you have option to upgrade zfs?

4) You may set up monitoring and feed ZFS and Lustre stats to influxdb
(monitoring node) with telegraf (OSS). Both are at influxdata.com. I have the DB on
SSD. Plot data with grafana, or query directly from influxdb.
> # fgrep plugins /etc/opt/telegraf/telegraf.conf ...
> [plugins]
> [[plugins.cpu]]
> [[plugins.disk]]
> [[plugins.io]]
> [[plugins.mem]]
> [[plugins.swap]]
> [[plugins.system]]
> [[plugins.zfs]]
> [[plugins.lustre2]]


5) Drop caches with "echo 3 > /proc/sys/vm/drop_caches". If it helps, add it to
cron to avoid OOM kills.
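
A crude sketch of such a cron entry (the interval is arbitrary):

# /etc/cron.d/drop-caches
0 * * * * root /bin/sync && /bin/echo 3 > /proc/sys/vm/drop_caches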

Alex.

> Folks,
> 
> I've done my fair share of googling and run across some good information on 
> ZFS backed Lustre tuning including this:
> 
> http://lustre.ornl.gov/ecosystem-2016/documents/tutorials/Stearman-LLN
> L-ZFS.pdf
> 
> and various discussions around how to limit (or not) the ARC and clear it if 
> needed.
> 
> That being said, here is my configuration.
> 
> RHEL 6
> Kernel 2.6.32-504.3.3.el6.x86_64
> ZFS 0.6.3
> Lustre 2.5.3 with a couple of patches
> Single OST per OSS with 4 x RAIDZ2 4TB SAS drives Log and Cache on 
> separate SSDs These OSSes are beefy with 128GB of memory and Dual 
> E5-2630 v2 CPUs
> 
> About 30 OSSes in all serving mostly a standard HPC cluster over FDR 
> IB with a sprinkle of 10G
> 
> # more /etc/modprobe.d/lustre.conf
> options lnet networks=o2ib9,tcp9(eth0)
> 
> ZFS backed MDS with same software stack.
> 
> The problem I am having is the OOM killer is whacking away at system 
> processes on a few of the OSSes. 
> 
> "top" shows all my memory is in use with very little Cache or Buffer usage.
> 
> Tasks: 1429 total,   5 running, 1424 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us,  2.9%sy,  0.0%ni, 94.0%id,  3.1%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  132270088k total, 131370888k used,   899200k free, 1828k buffers
> Swap: 61407100k total, 7940k used, 61399160k free,10488k cached
> 
>  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>   47 root  RT   0 000 S 30.0  0.0 372:57.33 migration/11
> 
> I had done zero tuning so I am getting the default ARC size of 1/2 the memory.
> 
> [root@lzfs18b ~]# arcstat.py 1
>time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz c
> 09:11:50 0 0  0 00 00 0063G   63G
> 09:11:51  6.2K  2.6K 41   2066  2.4K   71 0063G   63G
> 09:11:52   21K  4.0K 18   3052  3.7K   3418063G   63G
> 
> The question is, if I have 128GB of RAM and ARC is only taking 63, where did 
> the rest go and

[lustre-discuss] ZFS backed OSS out of memory

2016-06-23 Thread Carlson, Timothy S
Folks,

I've done my fair share of googling and run across some good information on ZFS 
backed Lustre tuning including this:

http://lustre.ornl.gov/ecosystem-2016/documents/tutorials/Stearman-LLNL-ZFS.pdf

and various discussions around how to limit (or not) the ARC and clear it if 
needed.
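
The limiting itself is just the zfs_arc_max module parameter; for example, to cap the ARC at 32GB (the value is only an example):

# persistent, takes effect when the zfs module loads
echo "options zfs zfs_arc_max=34359738368" > /etc/modprobe.d/zfs-arc.conf

# or on a running system
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max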

That being said, here is my configuration.

RHEL 6 
Kernel 2.6.32-504.3.3.el6.x86_64
ZFS 0.6.3
Lustre 2.5.3 with a couple of patches
Single OST per OSS with 4 x RAIDZ2 4TB SAS drives
Log and Cache on separate SSDs
These OSSes are beefy with 128GB of memory and Dual E5-2630 v2 CPUs

 About 30 OSSes in all serving mostly a standard HPC cluster over FDR IB with a 
sprinkle of 10G

# more /etc/modprobe.d/lustre.conf
options lnet networks=o2ib9,tcp9(eth0)

ZFS backed MDS with same software stack.

The problem I am having is the OOM killer is whacking away at system processes 
on a few of the OSSes. 

"top" shows all my memory is in use with very little Cache or Buffer usage.

Tasks: 1429 total,   5 running, 1424 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  2.9%sy,  0.0%ni, 94.0%id,  3.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132270088k total, 131370888k used,   899200k free, 1828k buffers
Swap: 61407100k total, 7940k used, 61399160k free,10488k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
   47 root  RT   0 000 S 30.0  0.0 372:57.33 migration/11

I had done zero tuning so I am getting the default ARC size of 1/2 the memory.

[root@lzfs18b ~]# arcstat.py 1
time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz c
09:11:50 0 0  0 00 00 0063G   63G
09:11:51  6.2K  2.6K 41   2066  2.4K   71 0063G   63G
09:11:52   21K  4.0K 18   3052  3.7K   3418063G   63G

The question is, if I have 128GB of RAM and ARC is only taking 63, where did 
the rest go and how can I get it back so that the OOM killer stops killing me?

Thanks!

Tim


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [Lustre-discuss] lustre on debian

2013-11-25 Thread Carlson, Timothy S
Lustre is not (yet) part of the mainstream kernel, so you are not going to find
Lustre by digging through the Linux kernel build process. Thus you see the link
below from Thomas with some Lustre packages.

Tim

> -Original Message-
> From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss-
> boun...@lists.lustre.org] On Behalf Of E.S. Rosenberg
> Sent: Monday, November 25, 2013 10:50 AM
> To: Thomas Stibor
> Cc: lustre-discuss@lists.lustre.org
> Subject: Re: [Lustre-discuss] lustre on debian
> 
> Hmm I feel a bit stupid, but I'm going through the different menus in
> menuconfig and I'm pretty sure I combed through every option in the
> filesystems menu multiple times but I can't find lustre...
> Am I missing something?
> (Downloaded a clean 3.12.1 kernel from kernel.org) Thanks, Eli
> 
> On Mon, Nov 25, 2013 at 6:42 PM, E.S. Rosenberg
>  wrote:
> > You mean the 3.11 kernel right?
> >
> > On Mon, Nov 25, 2013 at 6:43 PM, Thomas Stibor  wrote:
> >> Forgot to mention that: I have built Debian Wheezy packages which are
> >> available at:
> >>
> >> http://web-docs.gsi.de/~tstibor/lustre/lustre-builds/
> >>
> >> On Mon, Nov 25, 2013 at 05:48:06PM +0200, E.S. Rosenberg wrote:
> >>> Since in Linux we are mostly a debian shop we'd like to stick with
> >>> debian for our calculation nodes if possible.
> >>> So I wanted to ask the lustre 2.2 instructions for Debian are they
> >>> more or less relevant to lustre 2.4/2.5 or am I going headlong into
> >>> a tall brick wall.
> >>>
> >>> Also are newer clients backwards compatible with older server
> >>> software? I am currently just setting up a demo environment and
> >>> don't know what version of lustre the vendor will install on the
> >>> full fledged  version yet (though I hope they'll go with 2.4/2.5).
> >>>
> >>> Thanks,
> >>> Eli
> >>> ___
> >>> Lustre-discuss mailing list
> >>> Lustre-discuss@lists.lustre.org
> >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Can't increase effective client read cache

2013-11-18 Thread Carlson, Timothy S
Sitting in the PDSW workshop here at SC, I watched my colleague Evan Felix
try to repeat my problem on a Lustre 2.1 setup.  Looks like the problem does
not exist in that configuration, and Lustre caches exactly as much on the client
side as you configure.  So the problem is likely limited to 1.8 clients.  Oh
well.  Time to plan an upgrade.

From my phone, so short. Tim

> On Oct 4, 2013, at 9:53 AM, "Nathan Dauchy"  wrote:
> 
>> On 09/24/2013 02:14 PM, Carlson, Timothy S wrote:
>> I've got an odd situation that I can't seem to fix. 
>> 
>> My setup is Lustre 1.8.8-wc1 clients on RHEL 6 talking to 1.8.6 servers on 
>> RHEL 5.
>> 
>> My compute nodes have 64 GB of memory and I have a use case where an 
>> application has very low memory usage and needs to access a few thousand 
>> files in Lustre that range from 10 to 50 MB.  The files are subject to some 
>> reuse and it would be advantageous to cache as much of the data as possible. 
>>  The default cache for this configuration would be 48GB on the client as 
>> that is 75% of memory.   However the client never caches more than about 
>> 40GB of data according to /proc/meminfo 
>> 
>> Even if I tune the cached memory to 64GB the amount of cache in use never 
>> goes past 40GB. My current setting is as follows
>> 
>> # lctl get_param llite.*.max_cached_mb
>> llite.olympus-8804069da800.max_cached_mb=64000
>> 
>> I've also played with some of the VM tunable settings.  Like running 
>> vfs_cache_pressure down to 10
>> 
>> # vm.vfs_cache_pressure = 10
>> 
>> In no case do I see more than about 35GB of cache being used.   To do some 
>> more testing on this I created a bunch (40) 2G files in Lustre and then 
>> copied them to /dev/null on the client. While doing this I ran the fincore 
>> tool from http://code.google.com/p/linux-ftools/ to see if the file was 
>> still in cache. Once about 40GB of cache was used, the kernel started to 
>> drop files from the cache even though there was no memory pressure on the 
>> system. 
>> 
>> If I do the same test with files local to the system, I can fill all the 
>> cache to about 61GB before files start getting dropped. 
>> 
>> Is there some other Lustre tunable on the client that I can twiddle with to 
>> make more use of the local memory cache?
> 
> Tim,
> 
> Another kernel sysctl that might be in play here.  Have you looked at these?
> 
>  vm.dirty_background_ratio
>  vm.dirty_ratio
>  vm.dirty_background_bytes
>  vm.dirty_bytes
> 
> Those will control at what number of bytes or percentage of memory the
> kernel flushes buffer cache.
> 
> Hope this helps,
> Nathan
> 
> 
>> Thanks
>> 
>> Tim Carlson
>> Director, PNNL Institutional Computing
>> timothy.carl...@pnnl.gov
>> 
>> 
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Can't increase effective client read cache

2013-09-26 Thread Carlson, Timothy S
> -Original Message-
> From: Dilger, Andreas [mailto:andreas.dil...@intel.com]
> Sent: Wednesday, September 25, 2013 10:03 AM
> To: Carlson, Timothy S
> Cc: lustre-discuss@lists.lustre.org; hpdd-disc...@lists.01.org
> Subject: Re: [Lustre-discuss] Can't increase effective client read cache
> 
> On 2013-09-24, at 14:15, "Carlson, Timothy S" 
> wrote:
> 
> > I've got an odd situation that I can't seem to fix.
> >
> > My setup is Lustre 1.8.8-wc1 clients on RHEL 6 talking to 1.8.6 servers on 
> > RHEL
> 5.
> >
> > My compute nodes have 64 GB of memory and I have a use case where an
> application has very low memory usage and needs to access a few thousand
> files in Lustre that range from 10 to 50 MB.  The files are subject to some 
> reuse
> and it would be advantageous to cache as much of the data as possible.  The
> default cache for this configuration would be 48GB on the client as that is 
> 75%
> of memory.   However the client never caches more than about 40GB of data
> according to /proc/meminfo
> >
> > Even if I tune the cached memory to 64GB the amount of cache in use never
> goes past 40GB. My current setting is as follows
> >
> > # lctl get_param llite.*.max_cached_mb
> > llite.olympus-8804069da800.max_cached_mb=64000
> >
> > I've also played with some of the VM tunable settings.  Like running
> vfs_cache_pressure down to 10
> >
> > # vm.vfs_cache_pressure = 10
> >
> > In no case do I see more than about 35GB of cache being used.   To do some
> more testing on this I created a bunch (40) 2G files in Lustre and then copied
> them to /dev/null on the client. While doing this I ran the fincore tool from
> http://code.google.com/p/linux-ftools/ to see if the file was still in cache. 
> Once
> about 40GB of cache was used, the kernel started to drop files from the cache
> even though there was no memory pressure on the system.
> >
> > If I do the same test with files local to the system, I can fill all the 
> > cache to
> about 61GB before files start getting dropped.
> >
> > Is there some other Lustre tunable on the client that I can twiddle with to
> make more use of the local memory cache?
> 
> This might relate to the number of DLM locks cached on the client. Of the 
> locks
> get cancelled for some reason (e.g. memory pressure on the server, old age)
> then the pages covered by the locks will also be dropped.
> 
> You could try disabling the lock LRU and specify some large static number of
> locks (for testing, I wouldn't leave this set for production systems with 
> large
> numbers of clients):
> 
> lctl set_param ldlm.namespaces.*.lru_size=1
> 
> To reset it to dynamic DLM LRU size management set a value of "0".
> 
> Cheers, Andreas

I gave that a try but it didn't seem to help. On my generic example of copying 
2G files to /dev/null, the lru_size of all the OSTs is under 10 except for 3 
that I have permanently marked as inactive (and they are at 3200) and the MDS 
which is at 143. Here I dd'ed 20 2G files into /dev/null but only about 30GB is 
still in cache. 

# lctl get_param ldlm.namespaces.*.lru_size | awk -F\= '{print $2}' | sort -n | 
uniq -c
 29 0
129 1
 73 2
 32 3
  4 4
  1 5
  2 9
  1 423
  3 3200

Any other thoughts on parameters to twiddle?

Thanks!

Tim
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Can't increase effective client read cache

2013-09-24 Thread Carlson, Timothy S
I've got an odd situation that I can't seem to fix. 

My setup is Lustre 1.8.8-wc1 clients on RHEL 6 talking to 1.8.6 servers on RHEL 
5.

My compute nodes have 64 GB of memory and I have a use case where an 
application has very low memory usage and needs to access a few thousand files 
in Lustre that range from 10 to 50 MB.  The files are subject to some reuse and 
it would be advantageous to cache as much of the data as possible.  The default 
cache for this configuration would be 48GB on the client as that is 75% of 
memory.   However the client never caches more than about 40GB of data 
according to /proc/meminfo 

Even if I tune the cached memory to 64GB the amount of cache in use never goes 
past 40GB. My current setting is as follows

# lctl get_param llite.*.max_cached_mb
llite.olympus-8804069da800.max_cached_mb=64000

I've also played with some of the VM tunable settings.  Like running 
vfs_cache_pressure down to 10

# vm.vfs_cache_pressure = 10

 In no case do I see more than about 35GB of cache being used.   To do some 
more testing on this I created a bunch (40) 2G files in Lustre and then copied 
them to /dev/null on the client. While doing this I ran the fincore tool from 
http://code.google.com/p/linux-ftools/ to see if the file was still in cache. 
Once about 40GB of cache was used, the kernel started to drop files from the 
cache even though there was no memory pressure on the system. 
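
A rough sketch of that test, in case anyone wants to reproduce it (the path,
file count and sizes are illustrative, and the exact fincore invocation depends
on which linux-ftools build you have):

#!/bin/bash
# write 40 x 2GB files into Lustre, read them back to /dev/null,
# and watch how much of the data stays in the page cache
DIR=/lustre/cachetest                  # illustrative path
mkdir -p $DIR

for i in $(seq 1 40); do
    dd if=/dev/zero of=$DIR/file.$i bs=1M count=2048
done

for i in $(seq 1 40); do
    dd if=$DIR/file.$i of=/dev/null bs=1M
    grep ^Cached /proc/meminfo         # total page cache in use
done

# fincore (from linux-ftools) reports how many pages of each file are
# resident in cache; check its --help for the exact flags
fincore $DIR/file.*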

If I do the same test with files local to the system, I can fill all the cache 
to about 61GB before files start getting dropped. 

Is there some other Lustre tunable on the client that I can twiddle with to 
make more use of the local memory cache?

Thanks

Tim Carlson
Director, PNNL Institutional Computing
timothy.carl...@pnnl.gov


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.

2013-08-22 Thread Carlson, Timothy S
FWIW, we have seen the same issues with Lustre 1.8.x and a slightly older RHEL6 
kernel.  We do the "echo" as part of our slurm prolog/epilog scripts. Not a fix, 
but a workaround run before/after jobs.  No swap activity, but a very large 
buffer cache in use. 
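
A minimal sketch of such an epilog fragment (assuming it runs as root and that
dropping the whole cache between jobs is acceptable on your compute nodes):

#!/bin/bash
# slurm epilog fragment (sketch): flush dirty pages, then drop the
# pagecache, dentries and inodes so the next job starts with a cold cache
sync
echo 3 > /proc/sys/vm/drop_caches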

Tim

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Roger Sersted
Sent: Thursday, August 22, 2013 7:22 AM
To: lustre-discuss@lists.lustre.org
Cc: Roy Dragseth
Subject: Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.




Is this slowdown due to increased swap activity?  If "yes", then try lowering 
the "swappiness" value.  This will sacrifice buffer cache space to lower swap 
activity.

Take a look at http://en.wikipedia.org/wiki/Swappiness.
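
For example (a sketch; pick a value appropriate for your workload):

# lower swappiness on the running system
sysctl -w vm.swappiness=10

# and persist it across reboots
echo "vm.swappiness = 10" >> /etc/sysctl.conf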

Roger S.


On 08/22/2013 08:51 AM, Roy Dragseth wrote:
> We have just discovered that a large buffer cache generated from 
> traversing a lustre file system will cause a significant system 
> overhead for applications with high memory demands.  We have seen a 
> 50% slowdown or worse for applications.  Even High Performance 
> Linpack, that have no file IO whatsoever is affected.  The only remedy 
> seems to be to empty the buffer cache from memory by running "echo 3 > 
> /proc/sys/vm/drop_caches"
>
> Any hints on how to improve the situation is greatly appreciated.
>
>
> System setup:
> Client: Dual socket Sandy Bridge, with 32GB ram and infiniband 
> connection to lustre server.  CentOS 6.4, with kernel 
> 2.6.32-358.11.1.el6.x86_64 and lustre
> v2.1.6 rpms downloaded from whamcloud download site.
>
> Lustre: 1 MDS and 4 OSS running Lustre 2.1.3 (also from whamcloud 
> site).  Each OSS has 12 OST, total 1.1 PB storage.
>
> How to reproduce:
>
> Traverse the lustre file system until the buffer cache is large 
> enough.  In our case we run
>
>   find . -print0 -type f | xargs -0 cat > /dev/null
>
> on the client until the buffer cache reaches ~15-20GB.  (The lustre 
> file system has lots of small files so this takes up to an hour.)
>
> Kill the find process and start a single node parallel application, we 
> use HPL (high performance linpack).  We run on all 16 cores on the 
> system with 1GB ram per core (a normal run should complete in appr. 
> 150 seconds.)  The system monitoring shows a 10-20% system cpu 
> overhead and the HPL run takes more than
> 200 secs.  After running "echo 3 > /proc/sys/vm/drop_caches" the 
> system performance goes back to normal with a run time at 150 secs.
>
> I've created an infographic from our ganglia graphs for the above scenario.
>
> https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.p
> ng
>
> Attached is an excerpt from perf top indicating that the kernel 
> routine taking the most time is _spin_lock_irqsave if that means anything to 
> anyone.
>
>
> Things tested:
>
> It does not seem to matter if we mount lustre over infiniband or ethernet.
>
> Filling the buffer cache with files from an NFS filesystem does not 
> degrade performance.
>
> Filling the buffer cache with one large file does not give degraded 
> performance.
> (tested with iozone)
>
>
> Again, any hints on how to proceed is greatly appreciated.
>
>
> Best regards,
> Roy.
>
>
>
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Shrinking the mdt volume

2013-07-09 Thread Carlson, Timothy S
But how many inodes are in use on the MDT?  If you shrink the volume you will, 
by default, end up with far fewer inodes on the MDT.  For example, my MDT is 
450GB and uses 31GB (8%) of the space, but 17% of the available inodes. 

You might have mostly large files, in which case the inode count isn't a worry, 
but I would check how many inodes are actually in use before shrinking things 
down.
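
A quick way to check (the mount point and filesystem name below are just
examples):

# on the MDS, with the MDT mounted
df -i /mnt/mdt

# or from any client: inode totals and usage per MDT/OST
lfs df -i /lustre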

My past experience with the tar/xattr method of moving an MDT, following the 
manual verbatim, has never been successful.  YMMV.

Tim 

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Bob Ball
Sent: Tuesday, July 09, 2013 5:57 PM
To: hpdd-disc...@lists.01.org; Lustre discussion
Subject: [Lustre-discuss] Shrinking the mdt volume

When we set up our mdt volume, lo these many years past, we did it with a 2TB 
volume.  Overkill.  About 17G is actually in use.

This is a Lustre 1.8.4 system backed by about 450TB of OST on 8 servers.  I 
would _love_ to shrink this mdt volume to a more manageable size, say, 50GB or 
so, as we are now in a down time before we upgrade to Lustre 2.1.6 on SL6.4.  I 
have taken a "dd" of the volume, and am now in the process of doing

    getfattr -R -d -m '.*' -P . > /tmp/mdt_ea.bak

after which I will do

    tar czf {backup file}.tgz --sparse

/bin/tar is the SL5 version tar-1.20-6.x86_64.  This supports the --xattrs 
switch.  So, a choice here: should I instead use the --xattrs switch on the 
tar, or should I use --no-xattrs since the mdt_ea.bak will have all of them?

What are my prospects for success, if I restore that tar file to a smaller 
volume, then apply the attr backup, before I upgrade?

Answers and advice are greatly appreciated.

bob

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Anybody have a client running on a 2.6.37 or later kernel?

2011-10-23 Thread Carlson, Timothy S
And of course an hour after I sent this message I got a working RHEL 6 kernel 
(2.6.32-131.17.1.el6.x86_64)  on my RHEL 5 system, replaced OFED to get XRC 
support, rebuilt Lustre (1.8.6-wc1) and am now testing this configuration. It's 
not the prettiest setup in the world but it seems to be working so far.

Thanks

Tim

> -Original Message-
> From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss-
> boun...@lists.lustre.org] On Behalf Of Carlson, Timothy S
> Sent: Sunday, October 23, 2011 3:19 PM
> To: 'Kevin Van Maren'
> Cc: Lustre-discuss@lists.lustre.org
> Subject: Re: [Lustre-discuss] Anybody have a client running on a 2.6.37 or
> later kernel?
> 
> I'm trying that path as well but getting the RHEL 6 kernel going on my
> machine has proven a bit hard. I got 2.6.37.6 going on the first try but the
> RHEL 6 kernel is having a problem finding my root partition. This is usually a
> driver not getting loaded into the initrd. I  build the kernel just  fine, it 
> looks
> like I have the correct modules to support booting from my IDE (exposed as
> a SCSI) drive, but I get the dreaded "can't find /" and then an init panic.  
> It's a
> pretty basic system, with no LVM or other complications. I'll figure it out
> eventually, but I've had to use my IPMI console way too much in the past
> day to boot back into working kernels. :)
> 
> Tim
> 
> > -Original Message-
> > From: Kevin Van Maren [mailto:kvanma...@fusionio.com]
> > Sent: Saturday, October 22, 2011 8:24 AM
> > To: Carlson, Timothy S
> > Cc: Lustre-discuss@lists.lustre.org
> > Subject: Re: [Lustre-discuss] Anybody have a client running on a 2.6.37 or
> > later kernel?
> >
> > Why not use the RHEL6 kernel on RHEL5?  That's probably much easier.
> >
> > Kevin
> >
> >
> > On Oct 21, 2011, at 9:50 PM, "Carlson, Timothy S"
> >  wrote:
> >
> > > Folks,
> > >
> > > I've got a need to run a 2.6.37 or later kernel on client machines in 
> > > order
> to
> > properly support AMD Interlagos CPUs. My other option is to switch from
> > RHEL 5.x to RHEL 6.x and use the whamcloud 1.8.6-wc1 patchless client (the
> > latest RHEL 6 kernel also supports Interlagos). But I would first like to
> > investigate using a 2.6.37 or later kernel on RHEL 5.
> > >
> > > I have a running kernel and started down the path of building Lustre
> > against 2.6.37.6 and ran into the changes that have been made wrt to
> ioctl(),
> > proc structures, etc.  I am *not* a kernel programmer would rather not
> mess
> > around too much in the source.
> > >
> > > So I am asking if anyone has successfully patched up Lustre to get a 
> > > client
> > working with 2.6.37.6 or later.
> > >
> > > Thanks!
> > >
> > > Tim
> > > ___
> > > Lustre-discuss mailing list
> > > Lustre-discuss@lists.lustre.org
> > > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
> > Confidentiality Notice: This e-mail message, its contents and any
> > attachments to it are confidential to the intended recipient, and may
> > contain information that is privileged and/or exempt from disclosure
> under
> > applicable law. If you are not the intended recipient, please immediately
> > notify the sender and destroy the original e-mail message and any
> > attachments (and any copies that may have been made) from your system
> > or otherwise. Any unauthorized use, copying, disclosure or distribution of
> > this information is strictly prohibited.
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Anybody have a client running on a 2.6.37 or later kernel?

2011-10-23 Thread Carlson, Timothy S
I'm trying that path as well but getting the RHEL 6 kernel going on my machine 
has proven a bit hard. I got 2.6.37.6 going on the first try but the RHEL 6 
kernel is having a problem finding my root partition. This is usually a driver 
not getting loaded into the initrd. I built the kernel just fine, and it looks 
like I have the correct modules to support booting from my IDE (exposed as a 
SCSI) drive, but I get the dreaded "can't find /" and then an init panic.  It's 
a pretty basic system, with no LVM or other complications. I'll figure it out 
eventually, but I've had to use my IPMI console way too much in the past day to 
boot back into working kernels. :)  
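
If it does turn out to be a missing storage driver, forcing the module into the
initrd usually sorts it out; something along these lines (the module name and
kernel version are placeholders):

# RHEL5-style mkinitrd: force the disk controller module into the image
KVER=2.6.32-131.17.1.el6.x86_64        # placeholder kernel version
mkinitrd -f --with=ata_piix /boot/initrd-$KVER.img $KVER
# then point the grub entry for that kernel at the rebuilt initrd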

Tim

> -Original Message-
> From: Kevin Van Maren [mailto:kvanma...@fusionio.com]
> Sent: Saturday, October 22, 2011 8:24 AM
> To: Carlson, Timothy S
> Cc: Lustre-discuss@lists.lustre.org
> Subject: Re: [Lustre-discuss] Anybody have a client running on a 2.6.37 or
> later kernel?
> 
> Why not use the RHEL6 kernel on RHEL5?  That's probably much easier.
> 
> Kevin
> 
> 
> On Oct 21, 2011, at 9:50 PM, "Carlson, Timothy S"
>  wrote:
> 
> > Folks,
> >
> > I've got a need to run a 2.6.37 or later kernel on client machines in order 
> > to
> properly support AMD Interlagos CPUs. My other option is to switch from
> RHEL 5.x to RHEL 6.x and use the whamcloud 1.8.6-wc1 patchless client (the
> latest RHEL 6 kernel also supports Interlagos). But I would first like to
> investigate using a 2.6.37 or later kernel on RHEL 5.
> >
> > I have a running kernel and started down the path of building Lustre
> against 2.6.37.6 and ran into the changes that have been made wrt to ioctl(),
> proc structures, etc.  I am *not* a kernel programmer would rather not mess
> around too much in the source.
> >
> > So I am asking if anyone has successfully patched up Lustre to get a client
> working with 2.6.37.6 or later.
> >
> > Thanks!
> >
> > Tim
> > ___
> > Lustre-discuss mailing list
> > Lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> Confidentiality Notice: This e-mail message, its contents and any
> attachments to it are confidential to the intended recipient, and may
> contain information that is privileged and/or exempt from disclosure under
> applicable law. If you are not the intended recipient, please immediately
> notify the sender and destroy the original e-mail message and any
> attachments (and any copies that may have been made) from your system
> or otherwise. Any unauthorized use, copying, disclosure or distribution of
> this information is strictly prohibited.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Anybody have a client running on a 2.6.37 or later kernel?

2011-10-21 Thread Carlson, Timothy S
Folks,

I've got a need to run a 2.6.37 or later kernel on client machines in order to 
properly support AMD Interlagos CPUs. My other option is to switch from RHEL 
5.x to RHEL 6.x and use the whamcloud 1.8.6-wc1 patchless client (the latest 
RHEL 6 kernel also supports Interlagos). But I would first like to investigate 
using a 2.6.37 or later kernel on RHEL 5.

I have a running kernel and started down the path of building Lustre against 
2.6.37.6, and ran into the changes that have been made with respect to ioctl(), 
proc structures, etc.  I am *not* a kernel programmer and would rather not mess 
around too much in the source. 

So I am asking if anyone has successfully patched up Lustre to get a client 
working with 2.6.37.6 or later.

Thanks!

Tim 
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?

2011-05-19 Thread Carlson, Timothy S
> On May 19, 2011, at 10:28, Kevin Van Maren wrote:
> > Dardo D Kleiner - CONTRACTOR wrote:
> > As for putting the entire filesystem on flash, sure that would be pretty
> > nifty, but expensive.  Not being able to do failover, with storage on
> > internal PCIe cards, is a downside.
>
> [Andreas added this comment]
> I doubt this will be possible for a long time to come, due to cost, even if
> the PCI cards have external interfaces (as I've heard some high-end ones do).

I hate to snip out most of a thread, but I want to focus on the issues of cost 
and failover.

As for cost, I really don't think this is an issue. If I am investing in a file 
system that is approaching or larger than a petabyte, then I don't see that 
purchasing a 5K-10K flash device is really a cost factor. It is not quite in 
the noise, but it is going to be less than 5% of the total purchase price of 
the file system. 

Failover is an issue. I've been keeping some loose statistics on my current 
Lustre configurations (a Petabyte or so in total) and looking at what 
components fail and where redundancy/failover could be improved. So far, 
metadata server failure hasn't entered the picture.  The problem with Lustre is 
it is now just too damn robust to random reboots :). 

Thanks

Tim 
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?

2011-05-16 Thread Carlson, Timothy S

Folks,

I know that flash-based technology gets talked about from time to time on the 
list, but I was wondering if anybody has actually implemented FusionIO devices 
for metadata. The last thread I can find on the mailing list that relates to 
this topic dates from 3 years ago. The software driving the Fusion cards has 
come quite a ways since then, and I've got good experience using the device as 
a raw disk. I'm just fishing around to see if anybody has implemented one of 
these devices in a reasonably sized Lustre config, where "reasonably" is left 
open to interpretation. I'm thinking >500T and a few million files.  

Thanks!

Tim

  
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss