[Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Theodore Omtzigt
I configured a Lustre file system on a collection of storage servers 
that have 12TB raw devices. I configured a combined MGS/MDS with the 
default configuration. On the OSTs, however, I added the force_over_8tb 
flag to the mountfsoptions.

Two part question:
1- do I need to set that parameter on the MGS/MDS server as well?
2- if yes, how do I properly add this parameter to this running Lustre 
file system (100TB on 9 storage servers)?

I can't resolve the ambiguity in the documentation, as I can't find a 
good explanation of the configuration log mechanism that is 
referenced in the man pages. Since the documentation for --writeconf 
states "This is very dangerous", I am hesitant to pull the trigger, as 
there is 60TB of data on this file system that I'd rather not lose.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Michael Barnes

On Jul 14, 2011, at 1:15 PM, Theodore Omtzigt wrote:

 Two part question:
 1- do I need to set that parameter on the MGS/MDS server as well?

No, they are different filesystems.  You shouldn't need to do this on the OSTs 
either.  You must be using an older lustre release.

 2- if yes, how do I properly add this parameter to this running Lustre 
 file system (100TB on 9 storage servers)?

covered

 I can't resolve the ambiguity in the documentation, as I can't find a 
 good explanation of the configuration log mechanism that is 
 referenced in the man pages. Since the documentation for --writeconf 
 states "This is very dangerous", I am hesitant to pull the trigger, as 
 there is 60TB of data on this file system that I'd rather not lose.

I've had no issues with writeconf.  It's nice because it shows you the old and 
new parameters.  Make sure that the changes that you made were what you 
want, and that the old parameters that you want to keep are still intact.  I 
don't remember the exact circumstances, but I've found settings were lost when 
doing a writeconf, and I had to explicitly put these settings in the 
tunefs.lustre command to preserve them.
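
For example, something along these lines shows the change before and after
(device path is just an example):

  # print what tunefs.lustre would write, without touching the disk
  tunefs.lustre --dryrun /dev/sdb

  # regenerate the config logs; compare the "Read previous values" and
  # "Permanent disk data" sections that it prints
  tunefs.lustre --writeconf /dev/sdb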

-mb

--
+---
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| Scientific Computing Group
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+---




___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Andreas Dilger
If you are seeing this problem, it means you are using the ext3-based ldiskfs. 
Go back to the download site and get the lustre-ldiskfs and lustre-modules RPMs 
with ext4 in the name.

That is the code that was tested with LUNs over 8TB. We kept these separate for 
some time to reduce risk for users that did not need larger LUN sizes.  This is 
the default for the recent Whamcloud 1.8.6 release. 
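
A quick way to check which flavor is installed (package names here are
illustrative; exact versions will differ):

  rpm -qa | grep -E 'lustre-(ldiskfs|modules)'
  # ext4-based packages carry ext4 in the name, e.g.
  #   lustre-ldiskfs-3.1.4-2.6.18_..._lustre.1.8.5.ext4
  # the ext3-based packages lack the ext4 tag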

Cheers, Andreas

On 2011-07-14, at 11:15 AM, Theodore Omtzigt t...@stillwater-sc.com wrote:

 I configured a Lustre file system on a collection of storage servers 
 that have 12TB raw devices. I configured a combined MGS/MDS with the 
 default configuration. On the OSTs, however, I added the force_over_8tb 
 flag to the mountfsoptions.
 
 Two part question:
 1- do I need to set that parameter on the MGS/MDS server as well?
 2- if yes, how do I properly add this parameter to this running Lustre 
 file system (100TB on 9 storage servers)?
 
 I can't resolve the ambiguity in the documentation, as I can't find a 
 good explanation of the configuration log mechanism that is 
 referenced in the man pages. Since the documentation for --writeconf 
 states "This is very dangerous", I am hesitant to pull the trigger, as 
 there is 60TB of data on this file system that I'd rather not lose.
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] potential issue with data corruption

2011-07-14 Thread Lisa Giacchetti

Hi,
 We are seeing a problem where some running jobs attempted to copy a 
file from local disk
on a worker node to a lustre file system. 14 of those files ended up 
empty or truncated.


We have 7 OSSs with either 6 or 12 OSTs on each. All 14 files ended up 
being on an OST on one of the two systems that have 12 OSTs. There are 
12 different OSTs involved.


So if I look at the messages file on one of those OSSs and specifically 
look for messages related to one of the OSTs that holds a truncated or 
empty file, I see things like this:


Jul  7 07:10:08 cmsls6 kernel: Lustre: 
15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
Jul  7 07:59:42 cmsls6 kernel: Lustre: 
3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 
7s ago has timed out (7s prior to deadline).
Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A 
client on nid 131.225.191.35@tcp was evicted due to a lock completion 
callback to 131.225.191.35@tcp timed out: rc -107
Jul  7 09:26:58 cmsls6 kernel: Lustre: 
15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
Jul  7 09:53:50 cmsls6 kernel: Lustre: 
2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 
7s ago has timed out (7s prior to deadline).
Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A 
client on nid 131.225.204.88@tcp was evicted due to a lock blocking 
callback to 131.225.204.88@tcp timed out: rc -107
Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A 
client on nid 131.225.207.176@tcp was evicted due to a lock blocking 
callback to 131.225.207.176@tcp timed out: rc -107
Jul  7 10:23:01 cmsls6 kernel: Lustre: 
15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118@tcp 
7s ago has timed out (7s prior to deadline).
Jul  7 11:06:31 cmsls6 kernel: Lustre: 
15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
Jul  7 12:26:17 cmsls6 kernel: Lustre: 
15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151@tcp 
7s ago has timed out (7s prior to deadline).
Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A 
client on nid 131.225.190.151@tcp was evicted due to a lock blocking 
callback to 131.225.190.151@tcp timed out: rc -107
Jul  7 12:26:17 cmsls6 kernel: LustreError: 
15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed 
export 810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: 
8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 
rrc: 2 type: EXT [0-1048575] (req 0-1048575) flags: 0x0 remote: 
0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
Jul  7 12:26:17 cmsls6 kernel: Lustre: 
2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring 
bulk IO comm error with 
f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x283e1be97_UUID id 
12345-131.225.190.151@tcp - client will retry
Jul  7 12:26:19 cmsls6 kernel: Lustre: 
2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring 
bulk IO comm error with 
f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x283e1be97_UUID id 
12345-131.225.190.151@tcp - client will retry



Some of these errors seem really bad - like the bulk IO comm error or 
the eviction due to a lock callback.
What should I be looking for here?  I have determined that some of the 
messages saying a client has been evicted because the OSS thinks it's 
dead are not due to the system actually being down. So what makes the 
OSS think the client is dead?


Also is there any way to determine what files are involved in these errors?

lisa


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Theodore Omtzigt
Andreas:

   Thanks for taking a look at this. Unfortunately, I don't quite 
understand the guidance you present: "If you are seeing 'this' 
problem". I haven't seen 'any' problems pertaining to 8TB yet, so I 
cannot place your guidance in the context of the question I posted.

My question was whether or not I need this parameter on the MDS and if 
so, how to apply it retroactively.  The Lustre environment I installed 
was the 1.8.5 set. Any insight into the issues would be appreciated.

Theo

On 7/14/2011 1:41 PM, Andreas Dilger wrote:
 If you are seeing this problem it means you are using the ext3-based ldiskfs. 
 Go back to the download site and get the lustre-ldiskfs and lustre-modules 
 RPMs with ext4 in the name.

 That is the code that was tested with LUNs over 8TB. We kept these separate 
 for some time to reduce risk for users that did not need larger LUN sizes.  
 This is the default for the recent Whamcloud 1.8.6 release.

 Cheers, Andreas

 On 2011-07-14, at 11:15 AM, Theodore Omtzigt t...@stillwater-sc.com  wrote:

 I configured a Lustre file system on a collection of storage servers
 that have 12TB raw devices. I configured a combined MGS/MDS with the
 default configuration. On the OSTs, however, I added the force_over_8tb
 flag to the mountfsoptions.

 Two part question:
 1- do I need to set that parameter on the MGS/MDS server as well?
 2- if yes, how do I properly add this parameter to this running Lustre
 file system (100TB on 9 storage servers)?

 I can't resolve the ambiguity in the documentation, as I can't find a
 good explanation of the configuration log mechanism that is
 referenced in the man pages. Since the documentation for --writeconf
 states "This is very dangerous", I am hesitant to pull the trigger, as
 there is 60TB of data on this file system that I'd rather not lose.
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Packaged kerberized VM client image Re: Migrating virtual machines over Lustre using Proxmox

2011-07-14 Thread Josephine Palencia

Hi Paul,

I wanted to express our interest in your project, as we have
something similar and related.

As part of the OSG ExTENCI project, we've set up a kerberized Lustre
fs that uses virtual (VM) Lustre clients at remote sites. With proper 
network tuning/route analysis, we observe that it is possible to
saturate the full IO bandwidth even for remote VM Lustre clients
and obtain good IO rates.

So far, we've made available a kerberized Xen VM Lustre image 
(ftp://ftp.psc.edu/pub/jwan/Lustre-2.1/2.0.62/vm-images/) that
ExTENCI Tier3 remote sites can download and just boot up after
being given the proper Kerberos principals.

We will also provide the kerberized images for KVM (Proxmox)
and VMware.

Currently, we use Lustre 2.1 (2.0.62) with 2.0.63 for clients.
PSC locally runs the same setup on a separate Kerberos realm.

We invite collaboration with other parties who might be
interested in trying packaged kerberized Lustre VM clients
at their sites.

Regards,
josephine



On Sat, 9 Jul 2011, Paul Gray wrote:

 Like most of the readers on the list, my background with Lustre
 originates from cluster environments.  But as virtualization trends seem
 to be here to stay, the question of using Lustre to support large-scale
 distributed virtualization naturally arises.  Being able to leverage
 Lustre benefits in a VM cloud would seem to have quite a few advantages.

 As a test case, at UNI we extended the Proxmox Virtualization
 Environment to support *live* Virtual Machine migration across separate
 physical (bare-metal) hosts of the Proxmox virtualization cluster,
 supported by a distributed Lustre filesystem.

 If you aren't familiar with Proxmox and live migration support over
 Lustre, what we deployed at UNI is akin to being able to do VMWare's
 VMotion over Lustre (without the associated license costs).

 We put together two screencasts showing the prototype deployment and
 wanted to share the proof-of-concept results with the community:

 *)  A small demonstration of live migration with a small Debian VM whose
 root filesystem is supported over a distributed lustre implementation
 can be found here:
   http://dragon.cs.uni.edu/flash/proxmoxlustre.html

 *)  A short screencast showing live migration over Lustre using the
 Proxmox GUI can be viewed here:
http://dragon.cs.uni.edu/flash/gui-migration.html

 Our immediate interests are in the performance of large (in terms of
 quantity), dynamic, live migrations that would leverage our
 high-throughput IB-based Lustre subsystem from our clusters.  We'd
 welcome your comments, feedback, questions or requests for specific
 benchmarks to explore.

 ADVthanksANCE
 -- 
 Paul Gray -o)
 314 East Gym, Dept. of Computer Science   /\\
 University of Northern Iowa  _\_V
  Message void if penguin violated ...  Don't mess with the penguin
  No one says, "Hey, I can't read that ASCII attachment ya sent me."
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Theodore Omtzigt
Michael:

The reason I had to do it on the OSTs is that when issuing the 
mkfs.lustre command to build the OST, it would error out with a message 
saying I should use the force_over_8tb mount option. I was not able to 
create an OST on that device without the force_over_8tb option.
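
For reference, the format command was along these lines (fsname, MGS NID
and device are examples; note that --mountfsoptions replaces the default
mount options, so the usual ldiskfs options have to be repeated):

  mkfs.lustre --ost --fsname=lustre1 --mgsnode=192.168.1.10@tcp0 \
      --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_8tb' \
      /dev/sdb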

Your insights on the writeconf are excellent: good to know that 
writeconf is solid. Thank you.

Theo

On 7/14/2011 1:29 PM, Michael Barnes wrote:
 On Jul 14, 2011, at 1:15 PM, Theodore Omtzigt wrote:

 Two part question:
 1- do I need to set that parameter on the MGS/MDS server as well?
 No, they are different filesystems.  You shouldn't need to do this on the 
 OSTs either.  You must be using an older lustre release.

 2- if yes, how do I properly add this parameter to this running Lustre
 file system (100TB on 9 storage servers)?
 covered

 I can't resolve the ambiguity in the documentation, as I can't find a
 good explanation of the configuration log mechanism that is
 referenced in the man pages. Since the documentation for --writeconf
 states "This is very dangerous", I am hesitant to pull the trigger, as
 there is 60TB of data on this file system that I'd rather not lose.
 I've had no issues with writeconf.  It's nice because it shows you the old and 
 new parameters.  Make sure that the changes that you made were what you 
 want, and that the old parameters that you want to keep are still intact.  I 
 don't remember the exact circumstances, but I've found settings were lost 
 when doing a writeconf, and I had to explicitly put these settings in the 
 tunefs.lustre command to preserve them.

 -mb

 --
 +---
 | Michael Barnes
 |
 | Thomas Jefferson National Accelerator Facility
 | Scientific Computing Group
 | 12000 Jefferson Ave.
 | Newport News, VA 23606
 | (757) 269-7634
 +---





___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] potential issue with data corruption

2011-07-14 Thread Lisa Giacchetti

I am running 1.8.3 on servers and clients.
lisa

On 7/14/11 12:59 PM, Lisa Giacchetti wrote:

Hi,
 We are seeing a problem where some running jobs attempted to copy a 
file from local disk
on a worker node to a lustre file system. 14 of those files ended up 
empty or truncated.


We have 7 OSSs with either 6 or 12 OSTs on each. All 14 files ended 
up being on an OST on one of the two systems that have 12 OSTs. There 
are 12 different OSTs involved.


So if I look at the messages file on one of those OSSs and 
specifically look for messages related to one of the OSTs that holds 
a truncated or empty file, I see things like this:


Jul  7 07:10:08 cmsls6 kernel: Lustre: 
15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
Jul  7 07:59:42 cmsls6 kernel: Lustre: 
3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 
7s ago has timed out (7s prior to deadline).
Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A 
client on nid 131.225.191.35@tcp was evicted due to a lock completion 
callback to 131.225.191.35@tcp timed out: rc -107
Jul  7 09:26:58 cmsls6 kernel: Lustre: 
15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
Jul  7 09:53:50 cmsls6 kernel: Lustre: 
2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 
7s ago has timed out (7s prior to deadline).
Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A 
client on nid 131.225.204.88@tcp was evicted due to a lock blocking 
callback to 131.225.204.88@tcp timed out: rc -107
Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A 
client on nid 131.225.207.176@tcp was evicted due to a lock blocking 
callback to 131.225.207.176@tcp timed out: rc -107
Jul  7 10:23:01 cmsls6 kernel: Lustre: 
15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905675944 sent from cmsprod1-OST002d to NID 
131.225.204.118@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 11:06:31 cmsls6 kernel: Lustre: 
15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
Jul  7 12:26:17 cmsls6 kernel: Lustre: 
15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905703492 sent from cmsprod1-OST002d to NID 
131.225.190.151@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A 
client on nid 131.225.190.151@tcp was evicted due to a lock blocking 
callback to 131.225.190.151@tcp timed out: rc -107
Jul  7 12:26:17 cmsls6 kernel: LustreError: 
15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on 
destroyed export 810c3926f400 ns: filter-cmsprod1-OST002d_UUID 
lock: 8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 
337742/0 rrc: 2 type: EXT [0-1048575] (req 0-1048575) flags: 0x0 
remote: 0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
Jul  7 12:26:17 cmsls6 kernel: Lustre: 
2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring 
bulk IO comm error with 
f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x283e1be97_UUID id 
12345-131.225.190.151@tcp - client will retry
Jul  7 12:26:19 cmsls6 kernel: Lustre: 
2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring 
bulk IO comm error with 
f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x283e1be97_UUID id 
12345-131.225.190.151@tcp - client will retry



Some of these errors seem really bad - like the bulk IO comm error or 
the eviction due to a lock callback.
What should I be looking for here?  I have determined that some of the 
messages saying a client has been evicted because the OSS thinks it's 
dead are not due to the system actually being down. So what makes 
the OSS think the client is dead?


Also is there any way to determine what files are involved in these 
errors?


lisa



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Cliff White
--writeconf will erase parameters set via lctl conf_param, and will erase
pool definitions.
It will also allow you to set rather silly parameters that can prevent your
filesystem from starting, such
as incorrect server NIDs or incorrect failover NIDs. For this reason (and
from a history of customer
support) we caveat its use in the manual.

The --writeconf option never touches data, only server configs, so it will
not mess up your data.

So, given sensible precautions as mentioned above, it's safe to do.
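
For the record, the regeneration procedure in the manual is roughly the
following (device paths are examples):

  # 1. unmount all clients, then the MDT, then every OST
  # 2. regenerate the configuration logs on each server target
  tunefs.lustre --writeconf /dev/mdtdev   # on the MDS
  tunefs.lustre --writeconf /dev/ostdev   # on each OSS, once per OST
  # 3. remount in order: MDT first, then the OSTs, then the clients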
cliffw


On Thu, Jul 14, 2011 at 11:03 AM, Theodore Omtzigt
t...@stillwater-sc.com wrote:

 Andreas:

  Thanks for taking a look at this. Unfortunately, I don't quite
  understand the guidance you present: "If you are seeing 'this'
  problem". I haven't seen 'any' problems pertaining to 8TB yet, so I
  cannot place your guidance in the context of the question I posted.

 My question was whether or not I need this parameter on the MDS and if
 so, how to apply it retroactively.  The Lustre environment I installed
  was the 1.8.5 set. Any insight into the issues would be appreciated.

 Theo

 On 7/14/2011 1:41 PM, Andreas Dilger wrote:
  If you are seeing this problem it means you are using the ext3-based
 ldiskfs. Go back to the download site and get the lustre-ldiskfs and
 lustre-modules RPMs with ext4 in the name.
 
  That is the code that was tested with LUNs over 8TB. We kept these
 separate for some time to reduce risk for users that did not need larger LUN
 sizes.  This is the default for the recent Whamcloud 1.8.6 release.
 
  Cheers, Andreas
 
  On 2011-07-14, at 11:15 AM, Theodore Omtzigt t...@stillwater-sc.com
  wrote:
 
  I configured a Lustre file system on a collection of storage servers
  that have 12TB raw devices. I configured a combined MGS/MDS with the
  default configuration. On the OSTs, however, I added the force_over_8tb
  flag to the mountfsoptions.
 
  Two part question:
  1- do I need to set that parameter on the MGS/MDS server as well?
  2- if yes, how do I properly add this parameter to this running Lustre
  file system (100TB on 9 storage servers)?
 
  I can't resolve the ambiguity in the documentation, as I can't find a
  good explanation of the configuration log mechanism that is
  referenced in the man pages. Since the documentation for --writeconf
  states "This is very dangerous", I am hesitant to pull the trigger, as
  there is 60TB of data on this file system that I'd rather not lose.
  ___
  Lustre-discuss mailing list
  Lustre-discuss@lists.lustre.org
  http://lists.lustre.org/mailman/listinfo/lustre-discuss
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss




-- 
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Cliff White
The error message you are seeing is what Andreas was talking about - you
must use the
ext4-based version, as you will not need any option with LUNs of your size.
The 'must use force_over_8tb' error is the key here; you most certainly
want/need the *.ext4.rpm versions of stuff.
cliffw


On Thu, Jul 14, 2011 at 11:10 AM, Theodore Omtzigt
t...@stillwater-sc.com wrote:

 Michael:

The reason I had to do it on the OSTs is that when issuing the
 mkfs.lustre command to build the OST, it would error out with a message
 saying I should use the force_over_8tb mount option. I was not able to
 create an OST on that device without the force_over_8tb option.

 Your insights on the writeconf are excellent: good to know that
 writeconf is solid. Thank you.

 Theo

 On 7/14/2011 1:29 PM, Michael Barnes wrote:
  On Jul 14, 2011, at 1:15 PM, Theodore Omtzigt wrote:
 
  Two part question:
  1- do I need to set that parameter on the MGS/MDS server as well?
  No, they are different filesystems.  You shouldn't need to do this on the
 OSTs either.  You must be using an older lustre release.
 
  2- if yes, how do I properly add this parameter to this running Lustre
  file system (100TB on 9 storage servers)?
  covered
 
  I can't resolve the ambiguity in the documentation, as I can't find a
  good explanation of the configuration log mechanism that is
  referenced in the man pages. Since the documentation for --writeconf
  states "This is very dangerous", I am hesitant to pull the trigger, as
  there is 60TB of data on this file system that I'd rather not lose.
  I've had no issues with writeconf.  It's nice because it shows you the old
 and new parameters.  Make sure that the changes that you made were what
 you want, and that the old parameters that you want to keep are still
 intact.  I don't remember the exact circumstances, but I've found settings
 were lost when doing a writeconf, and I had to explicitly put these settings
 in the tunefs.lustre command to preserve them.
 
  -mb
 
  --
  +---
  | Michael Barnes
  |
  | Thomas Jefferson National Accelerator Facility
  | Scientific Computing Group
  | 12000 Jefferson Ave.
  | Newport News, VA 23606
  | (757) 269-7634
  +---
 
 
 
 
 
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss




-- 
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] LNET o2ib networking and MTU

2011-07-14 Thread Adesanya, Adeyemi

Just need some clarification on this: 

We use the o2ib driver for Lustre IB communication. We also use IPoIB to define 
IP addresses for the IB interfaces in the network. Does the MTU configuration 
parameter impact Lustre in any way? My understanding is that LNET is only using 
IPoIB for address resolution when using o2ib.

---
Yemi
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] potential issue with data corruption

2011-07-14 Thread Oleg Drokin
Hello!

On Jul 14, 2011, at 1:59 PM, Lisa Giacchetti wrote:

 Jul  7 07:10:08 cmsls6 kernel: Lustre: 
 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
 c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
 Jul  7 07:59:42 cmsls6 kernel: Lustre: 
 3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
 x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 7s ago 
 has timed out (7s prior to deadline).
 Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
 on nid 131.225.191.35@tcp was evicted due to a lock completion callback to 
 131.225.191.35@tcp timed out: rc -107
 Jul  7 09:26:58 cmsls6 kernel: Lustre: 
 15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
 9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
 Jul  7 09:53:50 cmsls6 kernel: Lustre: 
 2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
 x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 7s ago 
 has timed out (7s prior to deadline).
 Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
 on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 
 131.225.204.88@tcp timed out: rc -107
 Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
 on nid 131.225.207.176@tcp was evicted due to a lock blocking callback to 
 131.225.207.176@tcp timed out: rc -107
 Jul  7 10:23:01 cmsls6 kernel: Lustre: 
 15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
 x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118@tcp 7s 
 ago has timed out (7s prior to deadline).
 Jul  7 11:06:31 cmsls6 kernel: Lustre: 
 15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
 e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
 Jul  7 12:26:17 cmsls6 kernel: Lustre: 
 15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
 x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151@tcp 7s 
 ago has timed out (7s prior to deadline).
 Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
 on nid 131.225.190.151@tcp was evicted due to a lock blocking callback to 
 131.225.190.151@tcp timed out: rc -107
 Jul  7 12:26:17 cmsls6 kernel: LustreError: 
 15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed 
 export 810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: 
 8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 
 2 type: EXT [0-1048575] (req 0-1048575) flags: 0x0 remote: 
 0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
 Jul  7 12:26:17 cmsls6 kernel: Lustre: 
 2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk 
 IO comm error with 
 f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x283e1be97_UUID id 
 12345-131.225.190.151@tcp - client will retry
 Jul  7 12:26:19 cmsls6 kernel: Lustre: 
 2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk 
 IO comm error with 
 f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x283e1be97_UUID id 
 12345-131.225.190.151@tcp - client will retry
 
 
 Some of these errors seem really bad - like the bulk IO comm error or the 
 eviction due to a lock callback.
 What should I be looking for here?  I have determined that some of the messages 
 saying a client has been evicted because the OSS thinks it's dead are not due 
 to the system actually being down. So what makes the OSS 
 think the client is dead?

Well, the clients become unresponsive for some reason; you really need to look 
at the client-side logs for some clues on that.

 Also is there any way to determine what files are involved in these errors?

Well, the lock blocking callback messages will provide you with the OST number 
and object index, which you might be able to back-reference to a file.

All that said, 1.8.3 is quite old and I think it would be a much better idea to 
try 1.8.6 and see if it improves things.

Bye,
Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] potential issue with data corruption

2011-07-14 Thread Lisa Giacchetti

Oleg,
thanks for your response.
See my responses inline.

lisa

On 7/14/11 2:47 PM, Oleg Drokin wrote:

Hello!

On Jul 14, 2011, at 1:59 PM, Lisa Giacchetti wrote:


Jul  7 07:10:08 cmsls6 kernel: Lustre: 
15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
Jul  7 07:59:42 cmsls6 kernel: Lustre: 
3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 7s ago 
has timed out (7s prior to deadline).
Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
on nid 131.225.191.35@tcp was evicted due to a lock completion callback to 
131.225.191.35@tcp timed out: rc -107
Jul  7 09:26:58 cmsls6 kernel: Lustre: 
15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
Jul  7 09:53:50 cmsls6 kernel: Lustre: 
2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 7s ago 
has timed out (7s prior to deadline).
Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 
131.225.204.88@tcp timed out: rc -107
Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
on nid 131.225.207.176@tcp was evicted due to a lock blocking callback to 
131.225.207.176@tcp timed out: rc -107
Jul  7 10:23:01 cmsls6 kernel: Lustre: 
15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118@tcp 7s ago 
has timed out (7s prior to deadline).
Jul  7 11:06:31 cmsls6 kernel: Lustre: 
15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
Jul  7 12:26:17 cmsls6 kernel: Lustre: 
15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151@tcp 7s ago 
has timed out (7s prior to deadline).
Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
on nid 131.225.190.151@tcp was evicted due to a lock blocking callback to 
131.225.190.151@tcp timed out: rc -107
Jul  7 12:26:17 cmsls6 kernel: LustreError: 
15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export 
810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: 
8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 2 type: 
EXT [0-1048575] (req 0-1048575) flags: 0x0 remote: 0x6c03f21f59f6b4e6 expref: 
19 pid: 15352 timeout 0
Jul  7 12:26:17 cmsls6 kernel: Lustre: 
2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO 
comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x283e1be97_UUID 
id 12345-131.225.190.151@tcp - client will retry
Jul  7 12:26:19 cmsls6 kernel: Lustre: 
2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO 
comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x283e1be97_UUID 
id 12345-131.225.190.151@tcp - client will retry


Some of these errors seem really bad - like the bulk IO comm error or 
the eviction due to a lock callback.
What should I be looking for here?  I have determined that some of the messages 
saying a client has been evicted because the OSS thinks it's dead are not due 
to the system actually being down. So what makes the OSS 
think the client is dead?

Well, the clients become unresponsive for some reason; you really need to look 
at the client-side logs for some clues on that.
I have been doing this as I was waiting for a reply and going through 
the manual and lustre-discuss archives.
Here is an example of one of the client's logs during the appropriate 
time frame:
Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred 
while communicating with 131.225.191.164@tcp. The obd_ping operation 
failed with -107
Jul  7 11:55:33 cmswn1526 kernel: Lustre: 
cmsprod1-OST0033-osc-810617966400: Connection to service 
cmsprod1-OST0033 via nid 131.225.191.164@tcp was lost; in progress 
operations using this service will wait for recovery to complete.
Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred 
while communicating with 131.225.191.164@tcp. The ost_write operation 
failed with -107
Jul  7 11:55:35 cmswn1526 kernel: LustreError: 167-0: This client was 
evicted by cmsprod1-OST0033; in progress operations using this service 
will fail.
Jul  7 11:55:35 cmswn1526 kernel: LustreError: 
3750:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  
req@81031d414400 x1373265269802511/t0 
o4-cmsprod1-OST0033_UUID@131.225.191.164@tcp:6/4 lens 448/608 e 0 to 1 
dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Jul  7 11:55:35 cmswn1526 kernel: Lustre: 
cmsprod1-OST0033-osc-810617966400: Connection restored to service 
cmsprod1-OST0033 using nid 131.225.191.164@tcp.



Also is there any way to determine what files are involved in these errors?

Re: [Lustre-discuss] potential issue with data corruption

2011-07-14 Thread Oleg Drokin
Hello!

On Jul 14, 2011, at 3:55 PM, Lisa Giacchetti wrote:
 Jul  7 07:10:08 cmsls6 kernel: Lustre: 
 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
 c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
 Some of these errors seem really bad - like the bulk IO comm error or the 
 eviction due to a lock callback.
 What should I be looking for here?  I have determined that some of the messages 
 saying a client has been evicted because the OSS thinks it's dead are not due 
 to the system actually being down. So what makes the 
 OSS think the client is dead?
 Well, the clients become unresponsive for some reason; you really need to 
 look at the client-side logs for some clues on that.
 I have been doing this as I was waiting for a reply and going through the 
 manual and lustre-discuss archives.
 Here is an example of one of the client's logs during the appropriate time 
 frame:
 Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while 
 communicating with 131.225.191.164@tcp. The obd_ping operation failed with 
 -107
 Jul  7 11:55:33 cmswn1526 kernel: Lustre: 
 cmsprod1-OST0033-osc-810617966400: Connection to service cmsprod1-OST0033 
 via nid 131.225.191.164@tcp was lost; in progress operations using this 
 service will wait for recovery to complete.

This is way too late in the game; here the server has already evicted the client.
Was there anything before then?

 Also is there any way to determine what files are involved in these errors?
 Well, the lock blocking callback messages will provide you with the OST number 
 and object index, which you might be able to back-reference to a file.
 I know there is a way to do this from the /proc file system (at least I think 
 it's /proc) but I can't find any reference to this
 in the book I got from class on this or in the manual.
 Can someone refresh my memory?

Actually, I think you can do it with a combination of lfs find and lfs getattr.
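
A rough sketch of that approach (the mount point is an example; the OST
name and object index come from the log lines above):

  # narrow down to the files that have objects on the suspect OST
  lfs find --obd cmsprod1-OST002d_UUID /mnt/cmsprod1

  # then compare each candidate's object ids against the res number
  # from the lock message (e.g. res: 337742)
  lfs getstripe /mnt/cmsprod1/path/to/candidate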

 All that said, 1.8.3 is quite old and I think it would be a much better idea 
 to try 1.8.6 and see if it improves things.
 Downtimes are few and far between for us, so this may take a while to get 
 scheduled.
 If there is anything that can be done in the meantime I'd like to try it.

I suspect there might have been several bugs fixed since 1.8.3 that could have 
manifested as slowness in replying to lock callback requests,
and you'll end up having downtime to upgrade the clients one way or the other.

Bye,
Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] potential issue with data corruption

2011-07-14 Thread Lisa Giacchetti

On 7/14/11 3:05 PM, Oleg Drokin wrote:

Hello!

On Jul 14, 2011, at 3:55 PM, Lisa Giacchetti wrote:

Jul  7 07:10:08 cmsls6 kernel: Lustre: 
15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
Some of these errors seem really bad - like the bulk IO comm error or the 
eviction due to a lock callback.
What should I be looking for here?  I have determined that some of the messages 
saying a client has been evicted because the OSS thinks it's dead are not due 
to the system actually being down. So what makes the OSS 
think the client is dead?

Well, the clients become unresponsive for some reason; you really need to look 
at the client-side logs for some clues on that.

I have been doing this as I was waiting for a reply and going through the 
manual and lustre-discuss archives.
Here is an example of one of the client's logs during the appropriate time 
frame:
Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while 
communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107
Jul  7 11:55:33 cmswn1526 kernel: Lustre: 
cmsprod1-OST0033-osc-810617966400: Connection to service cmsprod1-OST0033 
via nid 131.225.191.164@tcp was lost; in progress operations using this service 
will wait for recovery to complete.

This is way too late in the game; here the server has already evicted the client.
Was there anything before then?


No, there is nothing before then.


Also is there any way to determine what files are involved in these errors?

Well, the lock blocking callback messages will provide you with the OST number and 
object index, which you might be able to back-reference to a file.

I know there is a way to do this from the /proc file system (at least I think 
it's /proc) but I can't find any reference to this
in the book I got from class on this or in the manual.
Can someone refresh my memory?

Actually, I think you can do it with a combination of lfs find and lfs getattr.

Hmm. OK, let me try that.


All that said, 1.8.3 is quite old and I think it would be a much better idea to 
try 1.8.6 and see if it improves things.

Downtimes are few and far between for us, so this may take a while to get 
scheduled.
If there is anything that can be done in the meantime I'd like to try it.

I suspect there might have been several bugs fixed since 1.8.3 that could have 
manifested as slowness in replying to lock callback requests,
and you'll end up having downtime to upgrade the clients one way or the other.

Bye,
 Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Kevin Van Maren
One other note: you should have used --mkfsoptions='-t ext4' when 
doing mkfs.lustre, and NOT the force option.
Given that it is already formatted and you don't want to lose the data, at 
least use the ext4 Lustre RPMs.
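
In other words, something like this at format time (fsname, MGS NID and
device are examples):

  mkfs.lustre --ost --fsname=lustre1 --mgsnode=192.168.1.10@tcp0 \
      --mkfsoptions='-t ext4' /dev/sdb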

Pretty sure you don't need a --writeconf -- you would either run as-is 
with ext4-based ldiskfs or reformat.

The MDT device should be limited to 8TB; I don't think anyone has tested 
a larger MDT.

Kevin


Cliff White wrote:
 The error message you are seeing is what Andreas was talking about - 
 you must use the
 ext4-based version, as you will not need any option with LUNs of your 
 size. The 'must use force_over_8tb'
 error is the key here; you most certainly want/need the *.ext4.rpm 
 versions of stuff. 
 cliffw


 On Thu, Jul 14, 2011 at 11:10 AM, Theodore Omtzigt 
 t...@stillwater-sc.com wrote:

 Michael:

   The reason I had to do it on the OSTs is that when issuing the
  mkfs.lustre command to build the OST, it would error out with a
  message
  saying I should use the force_over_8tb mount option. I was not able to
  create an OST on that device without the force_over_8tb option.

 Your insights on the writeconf are excellent: good to know that
 writeconf is solid. Thank you.

 Theo

 On 7/14/2011 1:29 PM, Michael Barnes wrote:
  On Jul 14, 2011, at 1:15 PM, Theodore Omtzigt wrote:
 
   Two part question:
   1- do I need to set that parameter on the MGS/MDS server as well?
   No, they are different filesystems.  You shouldn't need to do
  this on the OSTs either.  You must be using an older lustre release.
  
   2- if yes, how do I properly add this parameter to this running
  Lustre
   file system (100TB on 9 storage servers)?
   covered
  
   I can't resolve the ambiguity in the documentation, as I can't
  find a
   good explanation of the configuration log mechanism that is
   referenced in the man pages. Since the documentation for --writeconf
   states "This is very dangerous", I am hesitant to pull the
  trigger, as
   there is 60TB of data on this file system that I'd rather not lose.
   I've had no issues with writeconf.  It's nice because it shows
  you the old and new parameters.  Make sure that the changes that
  you made were what you want, and that the old parameters that
  you want to keep are still intact.  I don't remember the exact
  circumstances, but I've found settings were lost when doing a
  writeconf, and I had to explicitly put these settings in the
  tunefs.lustre command to preserve them.
 
  -mb
 
  --
  +---
  | Michael Barnes
  |
  | Thomas Jefferson National Accelerator Facility
  | Scientific Computing Group
  | 12000 Jefferson Ave.
  | Newport News, VA 23606
  | (757) 269-7634
  +---
 
 
 
 
 
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss




 -- 
 cliffw
 Support Guy
 WhamCloud, Inc. 
 www.whamcloud.com


 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] New wc-discuss Lustre Mailing List

2011-07-14 Thread Isaac Huang
On Tue, Jul 12, 2011 at 02:12:38PM -0700, Peter Jones wrote:
 Isaac
 
 If you (or anyone else for that matter) are having trouble joining the 
 group, let me know privately at pjo...@whamcloud.com which email address 
 you would like to use and I will add you manually.

Thanks Peter, I got an invitation from a subscriber and joined without
any problem; no Google account required at all.

- Isaac

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] LNET o2ib networking and MTU

2011-07-14 Thread Isaac Huang
On Thu, Jul 14, 2011 at 12:43:32PM -0700, Adesanya, Adeyemi wrote:
 
 Just need some clarification on this: 
 
 We use the o2ib driver for Lustre IB communication. We also use IPoIB to 
 define IP addresses for the IB interfaces in the network. Does the MTU 
 configuration parameter impact Lustre in any way? My understanding is that 
 LNET is only using IPoIB for address resolution when using o2ib.

IPoIB MTU settings don't affect Lustre/LNet in any way if the o2ib
driver is used. With recent Lustre releases, you can find out the path
MTU of each IB connection used by LNet with, for example:
lctl conn_list --net o2ib0
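
For context, a typical o2ib setup only names the IPoIB interface so LNet
can resolve addresses (interface name is an example):

  # /etc/modprobe.conf on servers and clients
  options lnet networks="o2ib0(ib0)"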

- Isaac

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss