[lustre-discuss] lustre-client is obsoleted

2017-07-26 Thread Jon Tegner

Hi,

when trying to update clients from 2.9 to 2.10.0 (on CentOS-7) I 
received the following:


"Package lustre-client is obsoleted by lustre, trying to install 
lustre-2.10.0-1.el7.x86_64 instead"


and then the update failed (my guess is because zfs-related packages are 
missing on the system; at the moment I don't intend to use zfs).


I managed to get past this by forcing the installation of the client, i.e.,

"yum install lustre-client-2.10.0-1.el7.x86_64.rpm"

Just curious, is lustre-client really obsoleted?
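
(As a side note, a possible workaround - only a sketch assuming standard 
yum behaviour, nothing specific to the Lustre repositories - would be to 
tell yum to ignore the Obsoletes tag for that transaction:

yum --setopt=obsoletes=0 update lustre-client

or to exclude the obsoleting package explicitly:

yum update lustre-client --exclude=lustre

I have not verified either against the 2.10 repo metadata.)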

Regards,

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Virtual servers

2017-02-17 Thread Jon Tegner
As I wrote, the situation was improved by moving the client to one of 
the machines where an OSS is running in a virtual instance.


But what could be the reason for this? It seems to me there has to be some 
kind of competition for resources. But which resources?


Would the situation be improved by more memory? A faster CPU or disk? 
When creating the virtual machine, the number of cores and the amount of 
memory are specified; should I increase or decrease these resources? Are 
there some kernel settings that could help improve the situation? Or maybe 
some BIOS settings related to the virtual machines?


Just curious here.

Thanks,

/jon

On 02/16/2017 04:55 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

On Feb 16, 2017, at 9:56 AM, Jon Tegner <teg...@foi.se> wrote:

I have three (physical) machines, and each one has a virtual machine on it 
(KVM). On one of the virtual machines there is an MDS and on two of them there 
are OSSes installed.

All systems use CentOS-7.3 and Lustre 2.9.0, and I mount the file system on one 
of the physical machines.


Which machine do you mount the file system on?  Is it an OSS node or the MDS 
node?


Anyway, when evaluating the performance (using FIO) I expected to see an increased 
transfer rate when going from one to two simultaneous files in the tests, and I do see 
this, but only for very small file sizes (around 0.2 GB). On the "non-virtual" 
file systems I have tested, on the other hand, the increase in transfer rate as the 
number of files grows is easy to verify.

Have you tried mounting the file system on different nodes?  This could help 
determine if the problem is always the same or if it might be affected by the 
type of node (MDS vs OSS) that is being used for the client.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Virtual servers

2017-02-16 Thread Jon Tegner

Thanks!

The first result was indeed with the client on the node where the 
virtual instance hosted the MDS. When I switched to put the client on 
one of the machines where the virtual machine hosts one of the OSSes, 
performance looked a lot more like what I would have expected (i.e., decent 
scaling when going from one to two files). Guess my "hypothesis" was wrong!


/jon

On 02/16/2017 04:55 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

On Feb 16, 2017, at 9:56 AM, Jon Tegner <teg...@foi.se> wrote:

I have three (physical) machines, and each one has a virtual machine on it 
(KVM). On one of the virtual machines there is an MDS and on two of them there 
are OSSes installed.

All systems use CentOS-7.3 and Lustre 2.9.0, and I mount the file system on one 
of the physical machines.


Which machine do you mount the file system on?  Is it an OSS node or the MDS 
node?


Anyway, when evaluating the performance (using FIO) I expected to see an increased 
transfer rate when going from one to two simultaneous files in the tests, and I do see 
this, but only for very small file sizes (around 0.2 GB). On the "non-virtual" 
file systems I have tested, on the other hand, the increase in transfer rate as the 
number of files grows is easy to verify.

Have you tried mounting the file system on different nodes?  This could help 
determine if the problem is always the same or if it might be affected by the 
type of node (MDS vs OSS) that is being used for the client.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Virtual servers

2017-02-16 Thread Jon Tegner

Hi,

I have been playing around with Lustre on virtual servers (mainly with 
the purpose of gaining some experience).


I have three (physical) machines, and each one has a virtual machine on 
it (KVM). On one of the virtual machines there is an MDS and on two of 
them there are OSSes installed.


The machines are connected over a 10Gbit network (and I have verified 
the network performance - there are some losses going over the "virtual 
layer" but not terribly much).


All systems use CentOS-7.3 and Lustre 2.9.0, and I mount the file system 
on one of the physical machines.


I know it is not recommended to use clients on the same machine as the 
servers, but my hypothesis was that this problem should be mitigated (or 
removed altogether) by "confining" the servers in virtual instances.


Anyway, when evaluating the performance (using FIO) I expected to see an 
increased transfer rate when going from one to two simultaneous files in 
the tests, and I do see this, but only for very small file sizes (around 
0.2 GB). On the "non-virtual" file systems I have tested, on the other 
hand, the increase in transfer rate as the number of files grows is easy 
to verify.


Question: could this lack of increased performance be an effect of the 
fact that the client sits on the same machine as one of the Lustre 
servers (even though the server is "confined" in a virtual instance)? Or 
could there be some other issues, possibly connected to running the 
servers on virtual machines?


Thanks,

/jon


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-08 Thread Jon Tegner

Thanks a lot!

A related question: is it possible to use the result from the "ping" 
test to verify the latency obtained from openmpi? Or, how do I know if 
the result from the "ping" test is "acceptable"?
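
For reference, the ping-only session I had in mind looks roughly like the 
following (group names and NIDs are just reused from my earlier scripts, 
so treat it as a sketch):

#!/bin/bash
export LST_SESSION=$$
lst new_session ping
# same two nodes as in the bulk tests
lst add_group servers 10.0.12.12@o2ib
lst add_group clients 10.0.12.11@o2ib
lst add_batch ping_test
# "ping" sends only small RPCs, so the RPC rate reflects latency rather than bandwidth
lst add_test --batch ping_test --concurrency 8 --from clients --to servers ping
lst run ping_test
lst stat servers & sleep 10; kill $!
lst end_session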


/jon

On 02/07/2017 06:38 PM, Oucharek, Doug S wrote:

Because the stat command is “lst stat servers”, the statistics you are seeing 
are from the perspective of the server.  The “from” and “to” parameters can get 
quite confusing for the read case.  When reading, you are transferring the bulk 
data from the “to” group to the “from” group (yes, seems the opposite of what 
you would expect).  I think the “from” and “to” labels were designed to make 
sense in the write case and the logic was just flipped for the read case.

So, the stats you show indicate that you are writing an average of 3.6 GiB/s 
(note: the lnet-selftest stats are mislabeled and should be MiB/s rather than 
MB/s…I have fixed this in the latest release.  You are then getting 3.8 GB/s).  
The reason you see traffic in the read direction is due to responses/acks.  
That is why there are a lot of small messages going back to the server (high 
RPC rate, small bandwidth).

So, your test looks like it is working to me.

Doug


On Feb 7, 2017, at 2:13 AM, Jon Tegner <teg...@foi.se> wrote:

Probably doing something wrong here, but I tried to test only READING with the 
following:

#!/bin/bash
export LST_SESSION=$$
lst new_session read
lst add_group servers 10.0.12.12@o2ib
lst add_group readers 10.0.12.11@o2ib
lst add_batch bulk_read
lst add_test --batch bulk_read --concurrency 12 --from readers --to servers \
brw read check=simple size=1M
lst run bulk_read
lst stat servers & sleep 10; kill $!
lst end_session

which in my case gives:

[LNet Rates of servers]
[R] Avg: 3633 RPC/s Min: 3633 RPC/s Max: 3633 RPC/s
[W] Avg: 7241 RPC/s Min: 7241 RPC/s Max: 7241 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 2.29 MB/s  Min: 2.29 MB/s  Max: 2.29 MB/s
[W] Avg: 3608.44  MB/s  Min: 3608.44  MB/s  Max: 3608.44  MB/s

It seems strange that it should report non-zero numbers in the [W] positions, especially since the 
bandwidth is low in the [R] position (given that I explicitly requested "read"). Also note that if I 
change "brw read" to "brw write" in the script above, the results are "reversed" in the sense that the 
higher bandwidth number is reported in the [R] position. That is, "brw read" reports (almost) the 
expected bandwidth in the [W] position, whereas "brw write" reports it in the [R] position.

This is on CentOS-6.5/Lustre-2.5.3. Will try 7.3/2.9.0 later.

Thanks,
/jon


On 02/06/2017 05:45 PM, Oucharek, Doug S wrote:

Try running just a read test and then just a write test rather than having both 
at the same time and see if the performance goes up.


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-07 Thread Jon Tegner
Probably doing something wrong here, but I tried to test only READING 
with the following:


#!/bin/bash
export LST_SESSION=$$
lst new_session read
lst add_group servers 10.0.12.12@o2ib
lst add_group readers 10.0.12.11@o2ib
lst add_batch bulk_read
lst add_test --batch bulk_read --concurrency 12 --from readers --to servers \
brw read check=simple size=1M
lst run bulk_read
lst stat servers & sleep 10; kill $!
lst end_session

which in my case gives:

[LNet Rates of servers]
[R] Avg: 3633 RPC/s Min: 3633 RPC/s Max: 3633 RPC/s
[W] Avg: 7241 RPC/s Min: 7241 RPC/s Max: 7241 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 2.29 MB/s  Min: 2.29 MB/s  Max: 2.29 MB/s
[W] Avg: 3608.44  MB/s  Min: 3608.44  MB/s  Max: 3608.44  MB/s

It seems strange that it should report non-zero numbers in the [W] 
positions, especially since the bandwidth is low in the [R] position 
(given that I explicitly requested "read"). Also note that if I change 
"brw read" to "brw write" in the script above, the results are "reversed" 
in the sense that the higher bandwidth number is reported in the [R] 
position. That is, "brw read" reports (almost) the expected bandwidth in 
the [W] position, whereas "brw write" reports it in the [R] position.


This is on CentOS-6.5/Lustre-2.5.3. Will try 7.3/2.9.0 later.

Thanks,
/jon


On 02/06/2017 05:45 PM, Oucharek, Doug S wrote:

Try running just a read test and then just a write test rather than having both 
at the same time and see if the performance goes up.


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-06 Thread Jon Tegner

Hi,

I used the following script:

#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 10.0.12.12@o2ib
lst add_group readers 10.0.12.11@o2ib
lst add_group writers 10.0.12.11@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --concurrency 12 --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --concurrency 12 --from writers --to servers \
brw write check=simple size=1M
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session

and tried with concurrency from 0,2,4,8,12,16, results in

http://renget.se/lnetBandwidth.png
and
http://renget.se/lnetRates.png

From the bandwidth plot, a maximum of just below 2800 MB/s can be noted. Since in 
this case "readers" and "writers" are the same node, I did a few tests with 
the line


lst add_test --batch bulk_rw --concurrency 12 --from writers --to servers \
brw write check=simple size=1M

removed from the script - which resulted in a bandwidth of around 3600 MB/s.

I also did tests using mpitests-osu_bw from openmpi, and in that case I 
monitored a bandwidth of about 3900 MB/s.


Considering the "openmpi bandwidth", should I be happy with the numbers 
obtained by LNet selftest? Is there a way to modify the test so that the 
result gets closer to what openmpi is giving? And what can be said of 
the "Rates of servers (RPC/s)" - are they "good" or "bad"? What should I 
compare them with?


Thanks!

/jon

On 02/05/2017 08:55 PM, Jeff Johnson wrote:

Without seeing your entire command it is hard to say for sure but I would make 
sure your concurrency option is set to 8 for starters.

--Jeff

Sent from my iPhone


On Feb 5, 2017, at 11:30, Jon Tegner <teg...@foi.se> wrote:

Hi,

I'm trying to use lnet selftest to evaluate network performance on a test setup 
(only two machines). Using e.g., iperf or Netpipe I've managed to demonstrate 
the bandwidth of the underlying 10 Gbits/s network (and typically you reach the 
expected bandwidth as the packet size increases).

How can I do the same using lnet selftest (i.e., verifying the bandwidth of the 
underlying hardware)? My initial thought was to increase the I/O size, but it seems the 
maximum size one can use is "--size=1M".

Thanks,

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] LNET Self-test

2017-02-05 Thread Jon Tegner

Hi,

I'm trying to use lnet selftest to evaluate network performance on a 
test setup (only two machines). Using e.g., iperf or Netpipe I've 
managed to demonstrate the bandwidth of the underlying 10 Gbits/s 
network (and typically you reach the expected bandwidth as the packet 
size increases).


How can I do the same using lnet selftest (i.e., verifying the bandwidth 
of the underlying hardware)? My initial thought was to increase the I/O 
size, but it seems the maximum size one can use is "--size=1M".


Thanks,

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre OSS and clients on same physical server

2016-08-11 Thread Jon Tegner
Regarding clients and OSS on the same physical server: it seems to me the 
problem is not (directly) related to the amount of memory on the 
machine, but rather to different applications "competing" for the memory?


Could this possibly be resolved by running Lustre in a virtual machine? 
Or would there be some other way to "partition" the memory into separate 
"batches" (or containers)? One for the application and one for the servers?


In most cases it seems wise to keep the servers separate from the 
clients, but, e.g., in a "desk side", personal, smaller cluster (with 
basically only one user) it would be nice (better use of the resources) 
if it were possible to put servers and clients on the same machines.


/jon

On 07/15/2016 10:17 PM, Christopher J. Morrone wrote:

On 07/15/2016 12:11 PM, Cory Spitz wrote:

Chris,

On 7/13/16, 2:00 PM, "lustre-discuss on behalf of Christopher J. Morrone" 
 wrote:


If you put both the client and server code on the same node and do any
serious amount of IO, it has been pretty easy in the past to get that
node to go completely out to lunch thrashing on memory issues

Chris, you wrote “in the past.”  How current is your experience?  I’m sure it 
is still a good word of caution, but I’d venture that modern Lustre (on a 
modern kernel) might fare a tad bit better.  Does anyone have experience on 
current releases?

Pretty recent.

We have had memory management issues with servers and clients
independently at pretty much all periods of time, recent history
included.  Putting the components together only exacerbates the issues.

Lustre still has too many of its own caches with fixed, or nearly fixed,
cache sizes, and places where it does not play well with the kernel
memory reclaim mechanisms.  There are too many places where Lustre
ignores the kernel's requests for memory reclaim, and often goes on to
use even more memory.  That significantly impedes the kernel's ability
to keep things responsive when memory contention arises.


I understand that it isn’t a design goal for us, but perhaps we should pay some 
attention to this possibility?  Perhaps we’ll have interest in co-locating 
clients on servers in the near future as part of a replication, network 
striping, or archiving capability?

There is going to need to be a lot of work to have Lustre's memory usage
be more dynamic, more aware of changing conditions on the system, and
more responsive to the kernel's requests to free memory.  I imagine it
won't be terribly easy, especially in areas such as dirty and unstable
data which cannot be freed until it is safe on disk.  But even for that,
there are no doubt ways to make things better.

Chris

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building RPMs for 2.5.3 on CentOS-7.2

2016-06-05 Thread Jon Tegner

Thanks a lot!

I did find "--disable-server" and that seemed to work (sort of); 
however, I had other issues when building the (client) RPMs.
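
For what it's worth, the client-only build I attempted looked roughly like 
this (the kernel source path is from my setup, so treat it as a sketch 
rather than a recipe):

cd lustre-release
sh autogen.sh
./configure --disable-server --with-linux=/usr/src/kernels/$(uname -r)
make rpms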


Do you have any general advice considering that we want CentOS-7.2 on 
the clients? We have happily been using 2.5.3 for quite some time now 
(6.5 on both servers and clients), and we would like to see similar 
stability on our updated system (that is, with 7.2 on the clients). What 
would be my best/safest bet under these circumstances?


I actually tried a client on 2.8.0 using our old servers (6.5/2.5.3) - 
and initial (very limited) tests seemed to indicate that this was OK. Is 
this something one should generally stay away from? Or would it be 
better to use 2.8.0 on the servers also, if that is what is used on the 
clients?


Thanks again,
/jon


On 06/05/2016 03:26 PM, Patrick Farrell wrote:

Do you need the server code?  If not, you can do --disable-server (I think) and 
skip the kernel patching steps.  If so, you're out of luck - the public 2.5 
code is not going to build against 7.2.

In fact, you may have some issues even just building the client.  But those may 
be surmountable, I'd say the server issues are not.


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Building RPMs for 2.5.3 on CentOS-7.2

2016-06-04 Thread Jon Tegner

Hi,

I want to build RPMs for 2.5.3 on CentOS-7.2 (the other option would be 
to use 2.8.0, but since we don't need the new features in 2.8.0, and since I've 
heard 2.5.3 would be more stable, I opted to go for 2.5.3).


Found the link

wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821

describing the process under CentOS-6.4, but I do get a bit hesitant 
about this procedure when I reach the kernel-patching part.


Is it in general a bad idea to build 2.5.3 on 7.2? Hints on proceeding?

Do prebuilt RPMs exist (would make life easier)?

Since we only require 7.2 on the clients, would it be an option to use 
2.5.3 on CentOS-6.5 on the servers, and 2.8.0 on CentOS-7.2 on the clients?


Regards, and thanks!

/jon

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Problem mounting over infiniband

2016-04-28 Thread Jon Tegner

Hi,

I have brought up a test system using

2.8.0-3.10.0_327.3.1.el7.x86_64_g96792ba

I can mount the system over tcp, but when I try to do so over infiniband 
I get errors of the type:


Can't accept conn from 10.0.51.1@o2ib, queue depth too large: 128 (<=8 
wanted)


Can't accept conn from 10.0.51.1@o2ib (version 12): max_frags 32 
incompatible without FMR pool (256 wanted)


After searching I suspected it had something to do with the fact that we 
have Mellanox (mlx4_ib) on the server and QLogic on the client (ib_qib).


I also found a possible solution, by putting

options ko2iblnd peer_credits=124 concurrent_sends=62 map_on_demand=256

However, there are a bunch of options to ko2iblnd, and to me it is not 
obvious which values to choose. Is there a specific strategy one should 
follow?
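
For completeness, one way to apply such a line (the file name is just a 
convention I picked, and whether a full module reload is needed I am not 
certain, so treat it as a sketch) would be:

cat > /etc/modprobe.d/ko2iblnd.conf <<EOF
options ko2iblnd peer_credits=124 concurrent_sends=62 map_on_demand=256
EOF
# unload and reload the Lustre/LNet modules so the options take effect
lustre_rmmod
modprobe lustre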


Regards,

/jon

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.8.0 released

2016-03-19 Thread Jon Tegner

This is great!

Will start working with this new version as soon as possible.

At this point I have a few questions:

the lustre server kernel is based on

kernel-3.10.0-327.3.1.el7.x86_64.rpm

surely this means that the clients should be on the same version of the 
kernel (i.e., 3.1, but the standard one)?


And if I want to use a later kernel I have to build it from source 
(something I have done on one of the release candidates, but didn't have 
time to test)?


Would it actually be recommended to build lustre on a newer kernel (if 
you don't have any general issues with this older kernel)?


Regards, and thanks again

On 03/16/2016 11:13 PM, Jones, Peter A wrote:

We are pleased to announce that the Lustre 2.8.0 Release has been declared GA and is available 
for download. You can also grab the source from git.


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Rebuild server

2016-03-10 Thread Jon Tegner

Hi,

yesterday I had an incident where the system disk of one of my servers 
(MDT/MGS) went down, but the raid could be rebuilt and the system went 
up again.


However, in the event of a complete failure of the system disk (assuming 
all relevant "lustre disks" are still intact), is there a clear procedure 
to follow in order to rebuild the file system once the OS has been 
reinstalled on a new disk?
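
My rough understanding - please correct me if this is wrong - is that 
since the MDT/MGS target itself is intact, the recovery would essentially 
be: reinstall the OS, install the matching lustre server RPMs and the 
LNet module options, and then simply mount the target again, e.g.

mount -t lustre /dev/mdXX /mnt/mgsmdt

with the device and mount point being whatever they were before. But it 
would be good to have this confirmed.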


Thanks,

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.8.0

2016-02-25 Thread Jon Tegner

Hi,

just downloaded 2.8.0-RC1 (git clone 
http://git.whamcloud.com/fs/lustre-release.git 2.8.0-RC1).


Have some questions before I get started:

* I would like to build on CentOS-7.2, and latest kernel there seems to 
be 3.10.0-327.10.1.el7, should I go for this one, or rather use 
3.10.0-327.3.1.el7?


* Are there any specific instructions for how to build on 
CentOS-7.2/RHEL-7.2 (I found one for 6.4)?

Thanks!

/jon

On 02/22/2016 01:27 PM, Jones, Peter A wrote:

Hi Jon

It should be quite soon. We're in final release testing atm. You can check
out the latest code yourself to test at
http://git.whamcloud.com/fs/lustre-release.git/shortlog/refs/heads/b2_8 if
you are interested.

Peter

On 2/22/16, 12:26 AM, "lustre-discuss on behalf of Jon Tegner"
<lustre-discuss-boun...@lists.lustre.org on behalf of teg...@foi.se> wrote:


Hi,

any news on when 2.8.0 will be released?

http://wiki.lustre.org/Release_2.8.0#Current_Schedule (is this the
relevant place to check?) states Feb 15.

Regards,

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Date release of 2.7.1

2016-01-21 Thread Jon Tegner

Hi,

I don't think there will be a 2.7.1, but 2.8.0 is scheduled for release in 
a few weeks:


http://wiki.lustre.org/Release_2.8.0

Regards,

/jon

On 01/21/2016 03:43 PM, Pardo Diaz, Alfonso wrote:

Hi,


We want to upgrade our Lustre environment from CentOS 6.5 with Lustre 
2.5.2 to CentOS 7 with Lustre 2.7. I see in the “Lustre Support 
Matrix” that servers with CentOS 7 are only supported by Lustre 2.7.1.


When is the date release of 2.7.1 version?



Regards,



Alfonso Pardo Diaz
System Administrator / Researcher
c/ Sola nº 1; 10200 Trujillo, ESPAÑA
Tel: +34 927 65 93 17 Fax: +34 927 32 32 37




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.7 deployment issues

2015-12-04 Thread Jon Tegner

Hi,

Where do you find the 2.7.x-releases? I thought fixes were only released 
for the Intel maintenance version?


Regards,

/jon

On 12/04/2015 11:43 AM, jerome.be...@inserm.fr wrote:

Hello Ray,

One consideration first: you are trying the 2.7 version, which is not the 
production one (aka 2.5). From this perspective, whether you run 2.7.0 
or 2.7.x won't make any big difference; it is the development release.


Then, if I understand correctly, the problem comes from the infiniband driver 
module, which is buggy in the 2.6.32-504.8.1 kernel, meaning that you 
have to update the kernel to fix it. Doing this may mean that the 
2.7.0 version on the site, compiled against an older kernel version, will 
then refuse to load (because kernel modules - i.e. the lustre ones 
here - rely on features that may change between different kernel 
versions, making them incompatible).


In any case you can try to rebuild the 2.7.0 version from source 
against your new kernel. The procedure is quite easy:


https://wiki.hpdd.intel.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel 



It will regenerate the 2.7.0 client upon your newer kernel with the 
working infiniband modules, but the stability is not guaranteed as the 
2.7 branch is under development anyway.


Or use a precompiled one on the build site if you can't (some nasty 
bugs in the base 2.x.0 version are fixed in the latest builds)


The only thing is to stick to the very same version on the MDS and OSS, and 
at least the same or a newer version for the clients.


Regards

Le 03-12-2015 16:13, Ray Muno a écrit :

I am trying to set up a test deployment of Lustre 2.7.

I pulled RPMs from http://lustre.org/download/ and installed them on a
set of servers running Scientific Linux 6.6, which seems to be a proper
OS for deployment.  Everything installs and I can format the
filesystems on the MDS (1) and OSS (2) servers. When I try to mount
the OST file systems, I get communication errors. I can "lctl ping"
the servers from each other, but cannot establish communication
between the MDS and OSS.

The installation is on servers connected over Infiniband (Qlogic DDR 
4X).


In trying to diagnose the issues related to the error messages, I
found mention in some list discussions that o2ib is broken in the
2.6.32-504.8.1 kernel.

After much frustration, I pulled a nightly build from
build.hpdd.intel.com (kernel
2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64) and tried the same set up.
Everything worked as I expected.

Am I missing something? Is the default release pointed to at
https://downloads.hpdd.intel.com/ for 2.7 broken in some way? Is it
just the hardware I am trying to deploy against?

I can provide specifics about the errors I see, I am just posting this
to make sure I am pulling the Lustre RPM's from the proper source.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.5.3 on centos 6.y with y>=5

2015-10-02 Thread Jon Tegner

Well, guess it depends on how secure your environment is.

We have no issues running 2.5.3 on CentOS-6.5 (we plan on moving to 
CentOS-7, but only once Lustre-2.8 is released).


/jon

On 10/02/2015 11:59 AM, Thomas Lorenzen wrote:

Hi'

Yes, this happens when doing yum update. However, sticking with older 
6.x versions seems deprecated, as per the following readme snippet from 
CentOS, and therefore the question remains: is lustre 2.5.3 (client 
only) bound to centos 6.5 explicitly, or will centos 6.y with y>=5 
be good as well?


#
This directory (and version of CentOS) is deprecated.  For normal users,
you should use /6/ and not /6.5/ in your path. Please see this FAQ
concerning the CentOS release scheme:

https://wiki.centos.org/FAQ/General

If you know what you are doing, and absolutely want to remain at the 6.5
level, go to http://vault.centos.org/ for packages.

Please keep in mind that 6.0, 6.1, 6.2, 6.3, 6.4 and 6.5 no longer gets any 
updates, nor
any security fix's.
#

Best regards.

Thomas.



Den 11:50 fredag den 2. oktober 2015 skrev Jon Tegner <teg...@foi.se>:


I think this happens when you do a default upgrade. To prevent it I
modified the files in /etc/yum.repos.d to explicitly point to "6.5".
Originally they are linked to "6" (or a variable with that value).

Regards,

/jon

On 10/02/2015 11:21 AM, Thomas Lorenzen wrote:
> Hi'
>
> The support matrix states that lustre 2.5.3 is supported on centos
> 6.5. However, it seems, that when updating a fresh centos 6.5, it
> turns into centos 6.7. The kernel is still 2.6.32, but "sub version"
> changes from -431 to -573.7.1. This makes installing the lustre client
> packages fail due to the

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org <mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.5.3 on centos 6.y with y>=5

2015-10-02 Thread Jon Tegner
I think this happens when you do a default upgrade. To prevent it I 
modified the files in /etc/yum.repos.d to explicitly point to "6.5". 
Originally they are linked to "6" (or a variable with that value).
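
Roughly, the stanza I ended up with looked like the following (the exact 
baseurl is just an example - older point releases are served from the 
vault, so check what applies to your mirror):

[base]
name=CentOS-6.5 - Base
baseurl=http://vault.centos.org/6.5/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6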


Regards,

/jon

On 10/02/2015 11:21 AM, Thomas Lorenzen wrote:

Hi'

The support matrix states that lustre 2.5.3 is supported on centos 
6.5. However, it seems, that when updating a fresh centos 6.5, it 
turns into centos 6.7. The kernel is still 2.6.32, but "sub version" 
changes from -431 to -573.7.1. This makes installing the lustre client 
packages fail due to the


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre 2.7 with EL7

2015-07-31 Thread Jon Tegner

Hi,

we have some new hardware, on which we want to bring up Lustre (as of 
today we have two Lustre systems, 2.5.3, running on CentOS-6.5). Without 
checking we installed CentOS-7.1 on them, but when looking for the 
server packages we realized that there are none for RHEL7 systems.


So it seems we have two options:

1. Wait for Lustre 2.8.0, or
2. Install CentOS-6.6 on our servers.

It is no big deal to reinstall the servers, but I just wanted to check if we 
have other options that we missed (or if, possibly, 2.8.0 is going to be 
released really soon).


Thanks!

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Cannot remove file

2015-05-28 Thread Jon Tegner

Hi,

I have a few files which are listed (ls -l) with:

-? ? ?  ? ??

I have tried to remove them, both with rm and with unlink, but 
neither of these works (unlink: cannot unlink `file': Invalid argument). 
This really doesn't bother me, except for an error message when backing 
up - but it still would be nice to know how to fix this.


We are using lustre-2.5.3 on CentOS-6.5 boxes.

Thanks!

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[Lustre-discuss] lfs find

2014-11-06 Thread Jon Tegner

Hi!

Short question (using lustre 2.5.3). I have added a new OST and am about to 
remove a faulty one. After deactivating it I would like to migrate the 
data from the faulty one. In chapter 14 the command:


lfs find --obd  ost_name /mount/point | lfs_migrate -y

is given, and in chapter 19:

lfs find --ost {OST_UUID} lfs_migrate -y

Are these two command equivalent?

And just to be certain, once the migration is finished, should the command:

lctl conf_param ost_name.osc.active=0

be run on all clients as well as on the MDS? And how do I find the name 
ost_name.osc in the command above (llobdstat states it can be found 
under /proc/fs/lustre/obdfilter - but there is no such file on my system)?
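
A possible way to list the OSC device names - assuming this is the right 
thing to look at - seems to be:

lctl dl | grep osc

run on a client or on the MDS, which should give entries like 
<fsname>-OST000N-osc-..., suggesting the name to use in conf_param would 
be <fsname>-OST000N.osc. Is that correct?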


Regards, and thanks!

/jon
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Replacing faulty OSS/OST

2014-10-21 Thread Jon Tegner

Hi again,

We are running lustre 2.5.3 on a small system, consisting of one 
combined MGS/MDT and four combined OSS/OSTs. One of the OSS/OSTs has 
faulty hardware and needs to be replaced. The procedure I plan on using 
is the following.


1. Deactivate the faulty OSS.
2. Make a file-level backup of it (not much data on it).
3. Temporarily deactivate it.

The manual seems clear on how to perform these steps; what I'm a bit 
uncertain about is how I introduce the replacement machine (which will 
be a completely new one). Are there certain steps I need to take into 
account, e.g., will it be possible to create the new OST file system 
with the old ost_index?


4. Once the replacement is up, and its OST activated, the backup should 
be restored. In the manual it is stated that e2label should be used to 
set the file system label (based on the old ost_index in hex, as I 
understand it). I am a bit curious on this point, since I thought that 
would be taken care of automatically (I have never done this when 
creating OSTs before).
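
My working assumption, for what it's worth, is that the new OST would be 
formatted with the old index along the lines of (values are of course 
placeholders for my actual fsname, index, MGS NID and device):

mkfs.lustre --ost --fsname=<fsname> --index=<old index> --replace \
  --mgsnode=<mgs nid> /dev/<ostdev>

with --replace telling the MGS that this index has been used before, and 
the file system label then being set by mkfs.lustre itself. Please 
correct me if that is the wrong way to go about it.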


Thanks again!

Regards,

/jon
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Replacing faulty OSS/OST

2014-10-21 Thread Jon Tegner

Thanks!

If I understand correctly this is taken care of by

tar czvf {backup file}.tgz --xattrs --sparse .

(which should work on CentOS-6.5)?
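
And for the restore on the replacement OST, I assume the mirror-image 
procedure applies (sketch only, device and mount point are placeholders):

mount -t ldiskfs /dev/<ostdev> /mnt/ost
cd /mnt/ost
tar xzvpf {backup file}.tgz --xattrs --sparse

before unmounting and mounting it again as type lustre?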

On 10/21/2014 10:16 AM, Dilger, Andreas wrote:

filesystems, you should also backup and restore the xattrs on all the
files as is needed for the MDT filesystem backup/restore process.


Kind regards,

/jon
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] OST RAID sets

2014-10-20 Thread Jon Tegner

Hi,

rather new to lustre, and read the following in the manual:

For better performance, we recommend that you create RAID sets with 4 
or 8 data disks plus one or two parity disks.


We were experimenting with OSTs with 6 disks, and tried raid sets of both 
4 and 5 disks (raid5). While testing for performance we didn't notice 
any difference (4 was not noticeably faster than 5), so I am just wondering 
if there are any references where the performance advantage of 4 or 8 
data disks is verified.
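
My naive guess at the reasoning - please correct me if this is off - is 
alignment with the 1 MiB Lustre RPC size. Assuming, for example, a 
256 KiB chunk size:

  4 data disks x 256 KiB = 1 MiB    -> one RPC maps to one full RAID stripe
  5 data disks x 256 KiB = 1.25 MiB -> a 1 MiB write covers only part of a
                                       stripe, forcing parity read-modify-write

so the benefit might only show up for write-heavy loads with large I/Os, 
which could explain why I did not see a difference in my tests.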


Thanks!

/jon
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Replace OST

2014-03-20 Thread Jon Tegner

Thanks a lot!

Much appreciated!

I'm using 2.4.2 on CentOS-6.5. Indeed it seems related to the issue 
Andreas links to. I did some further tests, and it seems that if I run 
the find command (or something like ls -R /lusterMountPoint) BEFORE 
I unmount the OST, the find command works after the OST has been 
unmounted. Related to some kind of caching?


Tried to use getstripe instead (thanks!), with the command:

lfs getstripe --ost lustre-OST0001 -v -r /home | grep home | xargs -n 1 
unlink


(where /home is my luster mount point).

By then following the procedure as indicated in the manual (thanks for 
pointing me to the latest!) I managed to introduce the new (in my case 
reformatted) OST into the file system.


At the moment I'm just testing, and the amount of data on the system is 
very small, and I have no clue whether the way I combine getstripe 
with xargs and unlink is an efficient way of doing it.


Another thing I didn't understand was that after reformatting the OST, 
and mounting it, df indicated that the file system was growing. Before 
mounting the failed OST it seemed as if about half of the space was 
gone (reasonable, since I'm playing around with two OSTs). However, 
after a while it seemed as if the file system was automagically 
repopulated - but not with ALL of the files (when compared with my 
backup).  But maybe this could be attributed to confusion on my part...


Anyway, thanks again for your help (realizing that I can get feedback 
from this list will make it less scary to move over to lustre)!


Regards,

/jon

On 03/19/2014 08:51 PM, Dilger, Andreas wrote:

On 2014/03/19, 9:35 AM, Lee, Brett brett@intel.com wrote:


Hi Jon,

Looks like you are jumping in with both feet (new to Lustre and trying to
replace an OST).  Pretty ambitious... ;)


From what I can tell you are following section 14.8.3 in the latest
Lustre manual:

14.8.3.  Removing an OST from the File System
http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/
lustre_manual.xhtml

I'm not familiar with the process (having not needed to do that), but I
believe that step describes how to remove an *active* OST from the file
system.

Here's what I'd suggest:

1.  Verify that each step in the *latest* manual has been executed
successfully.

2.  If the lfs find process has not completed yet, run strace -p
pid on the parent process to provide the list more detail.

Looks like https://jira.hpdd.intel.com/browse/LU-1738 which hasn't been
fixed yet.  In the meantime, you can use lfs getstripe instead.


3.  Provide the list which version of Lustre are you running.

4.  Provide the list relevant syslog messages from the client, MDS, MGS,
and the OSS with the deactivated OST.

Note that for replacing a *fully* failed OST, I believe you would
reformat the OST and then follow 14.8.5 (in the latest manual).

Dr. Brett Lee, Solutions Architect
High Performance Data Division, Intel
+1.303.625.3595






-Original Message-
From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss-
boun...@lists.lustre.org] On Behalf Of Jon Tegner
Sent: Monday, March 17, 2014 10:28 AM
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Replace OST

Hi,

I'm new to lustre, so please excuse me for probably some stupid
questions.

I have set up a small test system, consisting of

* 1 MGS/MDT
* 2 OSS/OSTs
* 6 clients on infiniband and one on gigabit.

I have verified the scaling effect (increased performance with two OSTs
compared to one). I further wanted to gain some experience with what happens
when components fail, and in a first test I wanted to replace one of my
OSS/OSTs.
I did the following (trying to follow chapter 14.7 in the manual):

1. Unmounted the OST on one of my two OSS/OSTs (simulating a crash).

2. Deactivating this OST on the MGS/MDT.

3. Trying to remove the files located on this OST with the command:

lfs find --obd lustre-OST0002 -print0 /home |  tee /tmp/files_to_restore
| xargs -0 -n 1 unlink

The last command was issued from one of the clients, but it just hangs.

Is there something wrong with the way I'm trying to do this? Any help
would be greatly appreciated!

Thanks!

/jon
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss



Cheers, Andreas


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Replace OST

2014-03-17 Thread Jon Tegner

Hi,

I'm new to lustre, so please excuse me for probably some stupid questions.

I have set up a small test system, consisting of

* 1 MGS/MDT
* 2 OSS/OSTs
* 6 clients on infiniband and one on gigabit.

I have verified the scaling effect (increased performance with two OSTs 
compared to one). I further wanted to gain some experience with what happens 
when components fail, and in a first test I wanted to replace one of my 
OSS/OSTs. I did the following (trying to follow chapter 14.7 in the manual):


1. Unmounted the OST on one of my two OSS/OSTs (simulating a crash).

2. Deactivating this OST on the MGS/MDT.

3. Trying to remove the files located on this OST with the command:

lfs find --obd lustre-OST0002 -print0 /home |  tee /tmp/files_to_restore 
| xargs -0 -n 1 unlink


The last command was issued from one of the clients, but it just hangs.

Is there something wrong with the way I'm trying to do this? Any help 
would be greatly appreciated!


Thanks!

/jon
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss