Re: [Lustre-discuss] 1.8.x server for el6

2012-01-05 Thread Michael Sternberg
Peter, Andreas,

Thank you both for your assessment.

I decided to proceed conservatively with 1.8.7 on el5. I'll take the 
opportunity to get more comfortable with 2.x on el6 in a test system on the 
earlier server nodes.


Best,
Michael

On January 5, 2012, at 12:12 , Peter Jones wrote:

> inline
> 
> On 12-01-05 9:21 AM, Andreas Dilger wrote:
>> On 2012-01-05, at 3:29 AM, Michael Sternberg wrote:
>>> I am wondering about the release matrix for Lustre vs. RHEL/CentOS.
>>> [..]
>>> It appears that the current recommended production release is Lustre-1.8, 
>>> and, since there is no server for el6 I take that as a strong hint to stick 
>>> to el5.
>> For new deployments the recommended version is 2.1.0 with RHEL6.1.  We are
>> starting work on a 2.1.1 maintenance release for the spring.
> While it is not often that I would disagree with Andreas, I would say 
> that the answer on this point depends upon your timing. Right now, if 
> stability is your primary driver (and it sounds like it is) then I would 
> recommend 1.8.7-wc1. The early feedback from 2.1 is very encouraging, 
> but I think that we need a little more production feedback before we 
> could confidently assert that 2.1.x is the default option.
>> 
>>> Is there some more information public? I was planning to upgrade a cluster 
>>> to el6. Ideally, I'd like to have servers and user nodes all on the same 
>>> major OS release. However, stability will trump that.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] 1.8.x server for el6

2012-01-05 Thread Michael Sternberg
I am wondering about the release matrix for Lustre vs. RHEL/CentOS.

==
Lustre            lustre.org        whamcloud
Release           el5     el6       el5     el6
--
1.8.7 client      yes     no        yes     yes
1.8.7 server      yes     no        yes     no

2.0 (discontinued)        n/a

2.1   client      n/a     n/a       yes     yes
2.1   server      n/a     n/a       yes     yes
==

http://downloads.lustre.org/public/lustre/v1.8/lustre_1.8.7/
http://downloads.whamcloud.com/public/lustre/lustre-1.8.7-wc1/el6/
http://downloads.whamcloud.com/public/lustre/lustre-2.1.0/


It appears that the current recommended production release is Lustre-1.8, and, 
since there is no server for el6, I take that as a strong hint to stick to el5.

Is there more information available publicly? I was planning to upgrade a cluster 
to el6. Ideally, I'd like to have servers and user nodes all on the same major OS 
release. However, stability will trump that.


Best, Michael



Re: [Lustre-discuss] Lustre-2.1, CentOS-6.2: unknown param quota_type=ug

2012-01-05 Thread Michael Sternberg
Phew!  OK, I found that a documentation bug on this is already open:
http://jira.whamcloud.com/browse/LUDOC-26


Thank you very much!
Michael


On Jan 5, 2012, at 3:10 , Johann Lombardi wrote:

> In lustre 2.x, this must be mdd.quota_type (instead of mdt.quota_type). A 
> patch was landed to master some time ago (will be available in 2.2) to 
> interpret mdt.quota_type as mdd.quota_type transparently, see 
> http://review.whamcloud.com/#change,354.
> 
> Cheers,
> Johann
> 
> On Thu, Jan 05, 2012 at 01:35:14AM -0600, Michael Sternberg wrote:
>> I cannot get quota to work on a clean new file system with Lustre-2.1 on 
>> CentOS-6.2.
>> 
>>  # tunefs.lustre --param mdt.quota_type=ug /dev/mapper/mdt1
>>  # mount /mnt/mdt1
>>  mount.lustre: mount /dev/mapper/mdt1 at /mnt/mdt1 failed: Function not 
>> implemented
>> 
>> This logs:
>> 
>>  LustreError: 3957:0:(obd_config.c:1140:class_process_proc_param()) 
>> carbonfs-MDT: unknown param quota_type=ug
>> 
>> Am I missing some component or parameter syntax? I went by 
>> http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#dbdoclet.50438217_31982
>>  .
>> 
>> 
>> I actually got "Function not implemented" already with mkfs.lustre --param 
>> mdt.quota_type=ug2 ...  I decided to defer quota setup until later because I 
>> had a note to this effect from a much earlier install. So, I got everything 
>> up and running without quota, MGS MDS on same node but different LUNs, 4 OSS 
>> with external journals, client.
>> 
>> The system for all nodes (servers and client) is:
>> 
>>  - CentOS-6.2, but with the wc1 kernel for CentOS-6.1 = 
>> 2.6.32-131.6.1.el6_lustre.g65156ed.x86_64
>> 
>>  (Is this asking for trouble?  Things worked under 1.8.x with 
>> sticking to a previous minor-version OS kernel.)
>>  
>>  - Failover for the MGS/MDS pair and 2 OSS pairs.
>> 
>> I had been using quota with 1.8.6 on CentOS-5.6 without problems. The 2.1 fs 
>> is on new hardware to migrate to.
>> 
>>  
>> Best, Michael
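Johann's correction quoted above (use the mdd.* namespace on 2.x) can be sketched as a shell session. The tunefs.lustre line is a hedged sketch, not verified against this system; the helper function below is hypothetical and merely illustrates the transparent rename that the 2.2 patch is described as performing:

```shell
# Corrected invocation for 2.1 (hedged sketch; device path from the report):
#   tunefs.lustre --param mdd.quota_type=ug /dev/mapper/mdt1
#   mount /mnt/mdt1
#
# The 2.2 patch referenced above is described as interpreting the old name
# transparently; a toy illustration of that mapping:
translate_quota_param() {
    printf '%s\n' "$1" | sed 's/^mdt\.quota_type=/mdd.quota_type=/'
}
translate_quota_param 'mdt.quota_type=ug'   # prints mdd.quota_type=ug
```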



[Lustre-discuss] Lustre-2.1, CentOS-6.2: unknown param quota_type=ug

2012-01-04 Thread Michael Sternberg
Hello,

I cannot get quota to work on a clean new file system with Lustre-2.1 on 
CentOS-6.2.

# tunefs.lustre --param mdt.quota_type=ug /dev/mapper/mdt1
# mount /mnt/mdt1
mount.lustre: mount /dev/mapper/mdt1 at /mnt/mdt1 failed: Function not 
implemented

This logs:

LustreError: 3957:0:(obd_config.c:1140:class_process_proc_param()) 
carbonfs-MDT: unknown param quota_type=ug

Am I missing some component or parameter syntax? I went by 
http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#dbdoclet.50438217_31982
 .


I actually got "Function not implemented" already with mkfs.lustre --param 
mdt.quota_type=ug2 ...  I decided to defer quota setup until later because I 
had a note to this effect from a much earlier install. So I got everything up 
and running without quota: MGS and MDS on the same node but on different LUNs, 
4 OSSs with external journals, and a client.

The system for all nodes (servers and client) is:

- CentOS-6.2, but with the wc1 kernel for CentOS-6.1 = 
2.6.32-131.6.1.el6_lustre.g65156ed.x86_64

(Is this asking for trouble?  Things worked under 1.8.x with 
sticking to a previous minor-version OS kernel.)

- Failover for the MGS/MDS pair and 2 OSS pairs.

I had been using quota with 1.8.6 on CentOS-5.6 without problems. The 2.1 file 
system is on new hardware that we will migrate to.


Best, Michael



[root@mds03 ~]# rpm -qa | grep lustre
kernel-2.6.32-131.6.1.el6_lustre.g65156ed.x86_64
lustre-ldiskfs-3.3.0-2.6.32_131.6.1.el6_lustre.g65156ed.x86_64_g9d71fe8.x86_64
lustre-modules-2.1.0-2.6.32_131.6.1.el6_lustre.g65156ed.x86_64_g9d71fe8.x86_64
lustre-2.1.0-2.6.32_131.6.1.el6_lustre.g65156ed.x86_64_g9d71fe8.x86_64

[root@mds03 ~]# lsmod | grep quota
lquota                198183  5 mdd
ptlrpc               1070449  13 cmm,mdt,mdd,mds,mgs,mgc,lustre,lov,osc,lquota,mdc,fid,fld
obdclass              804693  29 cmm,osd_ldiskfs,mdt,mdd,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc
lvfs                   28053  18 cmm,osd_ldiskfs,mdt,mdd,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc,obdclass
libcfs                338203  21 cmm,osd_ldiskfs,mdt,mdd,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ksocklnd,ko2iblnd,ptlrpc,obdclass,lnet,lvfs

[root@oss03 ~]# lsmod | grep quota
lquota                198183  5 obdfilter
ptlrpc               1070449  10 obdfilter,ost,mgc,lustre,lov,osc,lquota,mdc,fid,fld
obdclass              804693  24 obdfilter,fsfilt_ldiskfs,ost,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc
lvfs                   28053  14 obdfilter,fsfilt_ldiskfs,ost,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc,obdclass
libcfs                338203  17 obdfilter,fsfilt_ldiskfs,ost,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc,obdclass,lvfs,ksocklnd,ko2iblnd,lnet


Syslog says upon the "mount /mnt/mdt1"

Jan  5 01:04:05 mds03 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Jan  5 01:04:05 mds03 kernel: LDISKFS-fs (dm-1): warning: maximal mount count reached, running e2fsck is recommended
Jan  5 01:04:05 mds03 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
Jan  5 01:04:05 mds03 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Jan  5 01:04:05 mds03 kernel: LDISKFS-fs (dm-1): warning: maximal mount count reached, running e2fsck is recommended
Jan  5 01:04:05 mds03 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
Jan  5 01:04:05 mds03 kernel: Lustre: Enabling ACL
Jan  5 01:04:05 mds03 kernel: Lustre: carbonfs-MDT: used disk, loading
Jan  5 01:04:05 mds03 kernel: LustreError: 3957:0:(obd_config.c:1140:class_process_proc_param()) carbonfs-MDT: unknown param quota_type=ug
Jan  5 01:04:05 mds03 kernel: LustreError: 3957:0:(obd_config.c:1361:class_config_llog_handler()) Err -38 on cfg command:
Jan  5 01:04:05 mds03 kernel: Lustre:cmd=cf00f 0:carbonfs-MDT  1:mdt.quota_type=ug
Jan  5 01:04:05 mds03 kernel: LustreError: 15c-8: MGC172.17.120.3@o2ib: The configuration from log 'carbonfs-MDT' failed (-38). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jan  5 01:04:05 mds03 kernel: LustreError: 3901:0:(obd_mount.c:1192:server_start_targets()) failed to start server carbonfs-MDT: -38
Jan  5 01:04:05 mds03 kernel: LustreError: 3901:0:(obd_mount.c:1719:server_fill_super()) Unable to start targets: -38
Jan  5 01:04:05 mds03 kernel: Lustre: Failing over carbonfs-MDT
Jan  5 01:04:05 mds03 kernel: Lustre: Skipped 2 previous similar messages
Jan  5 01:04:05 mds03 kernel: Lustre: 3901:0:(quota_master.c:793:close_quota_files()) quota[0] is off already
Jan  5 01:04:05 mds03 kernel: Lustre: 3901:0:(quota_master.c:793:close_quota_files()) Skipped 1 

[Lustre-discuss] Solved: panic on jbd:journal_dirty_metadata

2010-08-03 Thread Michael Sternberg
Hello Wojciech,

Confirmed - I built and installed the patch as well, and the problem hasn't 
occurred again here either - Thank you!

For reference, I'm using the released kernel and e2fsprogs rpm plus three 
rebuilt rpms. The patch only affects obdfilter.ko in lustre-modules.  "nm 
/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/obdfilter.ko" 
produced identical output before and after the patch, which I found reassuring.

# rpm -qa | grep -e e2fs -e lustre | sort
e2fsprogs-1.41.10.sun2-0redhat
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3_
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3_
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3_


With best regards,
Michael

 

On Jul 25, 2010, at 18:08 , Wojciech Turek wrote:

> Hi Michael,
> 
> Our OSTs were also nearly full when the problem occurred; after installing 
> the patch we didn't have a single occurrence of that problem.
> 
> Cheers
> 
> Wojciech
> 
> On 24 July 2010 17:06, Michael Sternberg  wrote:
> Wojciech,
> 
> Thank you very much for your pointer.  Perhaps the fact that the OSTs are 
> nearly full contributes(?). I also see higher usage.
> 
> In any case, I'll attempt compilation with the patch applied.
> 
> 
> With best regards,
> Michael
> 
> 
> On Jul 22, 2010, at 9:16 , Wojciech Turek wrote:
> 
> > Hi Michael,
> >
> > This looks like the problem we had some time ago after upgrading to 1.8.3
> >
> >   https://bugzilla.lustre.org/show_bug.cgi?id=22889
> >
> > Best regards
> > Wojciech
> >
> > On 20 July 2010 00:00, Michael Sternberg  wrote:
> > Hello,
> >
> > I use OSSs with external journal partitions and since lustre-1.8.1 about 
> > one to two times a week I get frustrating panics on OSSs as follows:
> >
> >:libcfs:cfs_alloc ...
> >:lvfs:lprocfs_counter_add ...
> >...
> >
> >RIP [] :jbd:journal_dirty_metadata+0x7f/0x1e3
> >  RSP 
> >  <0>Kernel panic - not syncing: Fatal exception
> >


Re: [Lustre-discuss] panic on jbd:journal_dirty_metadata

2010-07-24 Thread Michael Sternberg
Wojciech,

Thank you very much for your pointer.  Perhaps the fact that the OSTs are 
nearly full contributes(?). I also see higher usage.

In any case, I'll attempt compilation with the patch applied.


With best regards,
Michael


On Jul 22, 2010, at 9:16 , Wojciech Turek wrote:

> Hi Michael,
> 
> This looks like the problem we had some time ago after upgrading to 1.8.3
> 
>   https://bugzilla.lustre.org/show_bug.cgi?id=22889
> 
> Best regards
> Wojciech
> 
> On 20 July 2010 00:00, Michael Sternberg  wrote:
> Hello,
> 
> I use OSSs with external journal partitions and since lustre-1.8.1 about one 
> to two times a week I get frustrating panics on OSSs as follows:
> 
>:libcfs:cfs_alloc ...
>:lvfs:lprocfs_counter_add ...
>...
> 
>RIP [] :jbd:journal_dirty_metadata+0x7f/0x1e3
>  RSP 
>  <0>Kernel panic - not syncing: Fatal exception
> 



Re: [Lustre-discuss] panic on jbd:journal_dirty_metadata

2010-07-19 Thread Michael Sternberg
Hello,

I use OSSs with external journal partitions and since lustre-1.8.1 about one to 
two times a week I get frustrating panics on OSSs as follows:

:libcfs:cfs_alloc ...
:lvfs:lprocfs_counter_add ...
...

RIP [] :jbd:journal_dirty_metadata+0x7f/0x1e3
  RSP 
  <0>Kernel panic - not syncing: Fatal exception

(full graphical screenshot attached, hoping it passes through)

Clients sometimes report:

Message from syslogd@ at Mon Jul 19 04:11:46 2010 ...
login4 kernel: journal commit I/O error


I have recently updated to 1.8.3, where I e2fsck'd and re-initialized the 
external journals, but still get those panics. I use 2 OSS with heartbeat 
failover, each one "owns" and normally serves 2 OSTs (4 OST total), coming from 
4 LUNs on a RAID unit with dual controllers. All OSTs use ldiskfs (pre-ext4 
proper, if I understand correctly) with the  journals located on partitions of 
2 further LUNs. I use a variant of the script at bug 20807 to account for 
different device numbers of the external journals on the two OSSs.  Failover 
usually works, eventually, after a load peak of up to 100 on the OSS taking 
over, and messages about hung threads (see below).


Is there anything I could do besides giving up on external journals?  My data 
stores are RAID1, and the journal disks are a single pair of disks also in 
RAID1.

I had difficulties locating further information on googling 
"journal_dirty_metadata" pertaining to lustre/ldiskfs specifically. There are 
old discussions at:

https://bugzilla.redhat.com/show_bug.cgi?id=183119  (2007/2008, kernel 2.4.7)
http://oss.oracle.com/pipermail/ocfs2-users/2010-January/004113.html  (ahem)


With best regards,
Michael


[r...@mds01 ~]# cat /proc/fs/lustre/version
lustre: 1.8.3
kernel: patchless_client
build:  1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3

[r...@mds01 ~]# uname -r
2.6.18-164.11.1.el5_lustre.1.8.3

[r...@mds01 ~]# lctl dl
 0 UP mgs MGS MGS 725
 1 UP mgc mgc172.17.12...@o2ib 9642cdcd-4955-ca05-4e85-8a9f6d10c027 5
 2 UP mdt MDS MDS_uuid 3
 7 UP lov sandbox-mdtlov sandbox-mdtlov_UUID 4
 8 UP mds sandbox-MDT sandbox-MDT_UUID 719
 9 UP osc sandbox-OST-osc sandbox-mdtlov_UUID 5
10 UP osc sandbox-OST0001-osc sandbox-mdtlov_UUID 5

[r...@mds01 ~]# ssh mds02 lctl dl
 0 UP mgc mgc172.17.12...@o2ib 12c07f8c-f1e7-f739-9983-3c3aa2ec492a 5
 1 UP mdt MDS MDS_uuid 3
 2 UP lov carbonfs-mdtlov carbonfs-mdtlov_UUID 4
 3 UP mds carbonfs-MDT carbonfs-MDT_UUID 719
 4 UP osc carbonfs-OST0001-osc carbonfs-mdtlov_UUID 5
 5 UP osc carbonfs-OST-osc carbonfs-mdtlov_UUID 5

[r...@oss01 ~]# lctl dl
 0 UP mgc mgc172.17.12...@o2ib a89ba7f9-2a12-ff77-0321-151a1addf043 5
 1 UP ost OSS OSS_uuid 3
 2 UP obdfilter sandbox-OST sandbox-OST_UUID 721
 3 UP obdfilter carbonfs-OST carbonfs-OST_UUID 721
 4 UP obdfilter sandbox-OST0001 sandbox-OST0001_UUID 721
 5 UP obdfilter carbonfs-OST0001 carbonfs-OST0001_UUID 721

[client]# lctl dl
 0 UP mgc mgc172.17.12...@o2ib dadd88bf-fbad-d933-b02a-a539fd8abfea 5
 1 UP lov sandbox-clilov-8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 4
 2 UP mdc sandbox-MDT-mdc-8101da93c400 
30094b3a-b246-e667-ef8a-f6690e4d051c 5
 3 UP osc sandbox-OST-osc-8101da93c400 
30094b3a-b246-e667-ef8a-f6690e4d051c 5
 4 UP osc sandbox-OST0001-osc-8101da93c400 
30094b3a-b246-e667-ef8a-f6690e4d051c 5
 5 UP lov carbonfs-clilov-81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 
4
 6 UP mdc carbonfs-MDT-mdc-81041dc0f800 
4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
 7 UP osc carbonfs-OST0001-osc-81041dc0f800 
4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
 8 UP osc carbonfs-OST-osc-81041dc0f800 
4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5


--
"thread hung" messages:
--
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: Recovery period over 
after 1:05, of 359 clients 358 recovered and 0 were evicted.
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: sending delayed replies to recovered clients
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: received MDS connection from 172.17.12...@o2ib
Jul 19 04:03:36 oss01 kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 172.17.0@o2ib  ns: filter-carbonfs-OST0001_UUID lock: 810178b8d600/0x80a4d28c4aff67ec lrc: 3/0,0 mode: PW/PW res: 152472/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x4701b34288b619f7 expref: 12 pid: 6806 timeout 4356934523
Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) carbonfs-OST0001: exp 8101ecf94e00 already connecting
Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) Skipped 38 previous similar messages
Jul 19 0

Re: [Lustre-discuss] A Failed client soft lockup one OSS

2010-03-26 Thread Michael Sternberg
PS: The syslog snippet I posted is slightly out of order; I merged the logs 
from login2 and oss01 and did a simple sort(1).


Michael



Re: [Lustre-discuss] A Failed client soft lockup one OSS

2010-03-26 Thread Michael Sternberg
+1 on this one, in my case using lustre-1.8.2 on RHEL-5.4 over o2ib, with 
patchless clients.

My OSS complains about hung service threads:

Service thread pid 16590 was inactive for 200.00s. The thread might be 
hung, or it might only be slow and will resume later. Dumping the stack trace 
for debugging purposes:
. . .
Service thread pid 16590 completed after 2403.00s. This indicates the 
system was overloaded (too many service threads, or there were not enough 
hardware resources).

In this case, n337 (172.16.1.87) went dead and another client (login2) 
suffered. The solution was to reboot n337, the initially hung client.


NB:  I take the overload warning at its word.  The OST is one giant RAID5, 
which I have scheduled to split into several RAID 1+0 sets next week.


regards,
Michael


On Mar 26, 2010, at 2:29 , Lu Wang wrote:
>   We found a bug in Lustre 1.8.1.1: sometimes a dead client can cause a 
> soft lockup on an OSS. The affected OSS may reach high system CPU usage, and 
> then become unreachable through "lctl ping" from time to time.


Mar 13 18:45:13 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has 
failed due to network error (7s prior to deadline).
Mar 13 18:45:13 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 3 previous similar 
messages
Mar 13 18:45:13 oss01 kernel: LustreError: 
16590:0:(events.c:66:request_out_callback()) Skipped 59 previous similar 
messages
Mar 13 18:45:13 oss01 kernel: LustreError: 
16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
r...@810126fc x1329698739270352/t0 
o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527520 ref 2 
fl Rpc:/0/0 rc 0/0
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) 
Error sending PUT to 12345-172.16.1...@tcp: -113
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) 
Skipped 57 previous similar messages
Mar 13 18:45:13 oss01 kernel:   r...@810126fc x1329698739270352/t0 
o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527520 ref 1 
fl Rpc:/0/0 rc 0/0
Mar 13 18:45:22 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has 
failed due to network error (7s prior to deadline).
Mar 13 18:45:22 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 454325 previous 
similar messages
Mar 13 18:45:22 oss01 kernel:   r...@810126fc x1329698739270352/t0 
o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527529 ref 1 
fl Rpc:/2/0 rc 0/0
Mar 13 18:45:29 login2 -- MARK --
Mar 13 18:45:41 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has 
failed due to network error (7s prior to deadline).
Mar 13 18:45:41 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 903192 previous 
similar messages
Mar 13 18:45:41 oss01 kernel:   r...@810126fc x1329698739270352/t0 
o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527548 ref 1 
fl Rpc:/2/0 rc 0/0
Mar 13 18:46:18 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has 
failed due to network error (7s prior to deadline).
Mar 13 18:46:18 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 1804452 previous 
similar messages
Mar 13 18:46:18 oss01 kernel:   r...@810126fc x1329698739270352/t0 
o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527585 ref 1 
fl Rpc:/2/0 rc 0/0
Mar 13 18:46:28 oss01 kernel: LustreError: 
16590:0:(events.c:66:request_out_callback()) Skipped 3611781 previous similar 
messages
Mar 13 18:46:28 oss01 kernel: LustreError: 
16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
r...@810126fc x1329698739270352/t0 
o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527595 ref 2 
fl Rpc:/2/0 rc 0/0
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) 
Error sending PUT to 12345-172.16.1...@tcp: -113
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) 
Skipped 3611782 previous similar messages
Mar 13 18:47:27 oss01 kernel: LustreError: 138-a: carbonfs-OST: A client on 
nid 172.16.1...@tcp was evicted due to a lock blocking callback to 
172.16.1...@tcp timed out: rc -107
Mar 13 18:47:33 oss01 kernel: Lustre: 
16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has 
failed due to network error (7s prior to deadline).
Mar 13 1

Re: [Lustre-discuss] concurrent open() fails sporadically

2009-10-28 Thread Michael Sternberg
On Oct 28, 2009, at 16:40 , David Singleton wrote:
> Michael Sternberg wrote:
>> I'm seeing open() failures when attempting concurrent access in a  
>> lustre fs.
>>
>> The following Fortran program fails sporadically when run under
>> mpirun, even on the same host.  Note that there is no MPI statement;
>> the mpirun simply keeps the startup times very close together:
>
> See https://bugzilla.lustre.org/show_bug.cgi?id=17545

Thanks - was preparing to file one when a search for "concurrent open"  
didn't return a hit.

Good to see: "Landed on 1.8.2".


Michael



Re: [Lustre-discuss] concurrent open() fails sporadically

2009-10-28 Thread Michael Sternberg
On Oct 28, 2009, at 15:47 , Brian J. Murrell wrote:

> On Wed, 2009-10-28 at 15:38 -0500, Michael Sternberg wrote:
>> I'm seeing open() failures when attempting concurrent access in a  
>> lustre fs.
>> [..]
>> A C version never failed (thus far):
>
> This might be indicative.  Maybe not.  Fortran might just be  
> exposing a
> race condition that the C version is not.
> [..]
> What would be ideal is an strace of the fortran program failing so  
> that
> we can see what the system calls did.


Great suggestion!  Turns out the file in question has mode 0440, but  
since the open() is not otherwise specified, Fortran first tries to  
open read-write, and only then read-only.

I'm using:

mpirun -np 2 bash -c 'strace -tt ./a.out 2> strace7-$$.err' > strace7.out

Here's a failure case where the first process fails and the second
succeeds. The difference is that in the first process the initial
open(.., O_RDWR) returns ENOENT (fatal), rather than EACCES (which would
trigger a retry). If the timestamps can be trusted, the failing open()
comes 0.1 ms *after* the succeeding PID's open(.., O_RDONLY).


$ tail -n 15 strace7*
==> strace7-10831.err <==
17:27:42.630621 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
17:27:42.630686 fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
17:27:42.630753 open("test.dat", O_RDWR) = -1 ENOENT (No such file or directory)
17:27:42.631044 write(2, "At line ", 8At line ) = 8
17:27:42.63 write(2, "2", 12)= 1
17:27:42.631171 write(2, " of file ", 9 of file ) = 9
17:27:42.631248 write(2, "test.f", 6test.f)   = 6
17:27:42.631322 write(2, "\n", 1
)   = 1
17:27:42.631385 write(2, "Fortran runtime error: ", 23Fortran runtime error: ) = 23
17:27:42.631443 write(2, "No such file or directory", 25No such file or directory) = 25
17:27:42.631500 write(2, "\n", 1
)   = 1
17:27:42.631563 close(0)= 0
17:27:42.631615 exit_group(2)   = ?

==> strace7-10832.err <==
17:27:42.629790 fstat(2, {st_mode=S_IFREG|0664, st_size=5542, ...}) = 0
17:27:42.629984 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff8a624490) = -1 ENOTTY (Inappropriate ioctl for device)
17:27:42.630076 stat("test.dat", {st_mode=S_IFREG|0440, st_size=805891, ...}) = 0
17:27:42.630163 fstat(2, {st_mode=S_IFREG|0664, st_size=5813, ...}) = 0
17:27:42.630235 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
17:27:42.630299 fstat(0, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
17:27:42.630364 open("test.dat", O_RDWR) = -1 EACCES (Permission denied)
17:27:42.630648 open("test.dat", O_RDONLY) = 3
17:27:42.630921 fstat(3, {st_mode=S_IFREG|0440, st_size=805891, ...}) = 0
17:27:42.630998 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff8a623240) = -1 ENOTTY (Inappropriate ioctl for device)
17:27:42.631055 close(3)= 0
17:27:42.631133 write(1, " OK", 3)  = 3
17:27:42.631193 write(1, "\n", 1)   = 1
17:27:42.631252 close(0)= 0
17:27:42.631331 exit_group(0)   = ?


A workaround for my user is to either "chmod u+w datafile" or, more  
cleanly, be explicit in the Fortran open() by saying ACTION='READ'.


With best regards,
Michael
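The mode/open interaction captured in the strace output above can be reproduced outside Fortran and outside Lustre. A minimal shell sketch (temporary file; variable and message names are mine):

```shell
# Reproduce the 0440 behaviour from the strace logs: gfortran first tries
# a read-write open when no ACTION is given, then falls back to read-only.
# Run as a non-root user to see the EACCES branch.
tmp=$(mktemp -d)
printf 'data\n' > "$tmp/test.dat"
chmod 0440 "$tmp/test.dat"

# Read-write open, analogous to gfortran's first open("test.dat", O_RDWR):
if ( exec 3<> "$tmp/test.dat" ) 2>/dev/null; then
    rw_status=ok      # root bypasses mode bits
else
    rw_status=denied  # EACCES, the expected non-root result
fi
echo "read-write open: $rw_status"

# Read-only fallback, which mode 0440 permits for the owner:
if cat "$tmp/test.dat" > /dev/null 2>&1; then
    ro_status=ok
else
    ro_status=failed
fi
echo "read-only open: $ro_status"
rm -rf "$tmp"
```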



[Lustre-discuss] concurrent open() fails sporadically

2009-10-28 Thread Michael Sternberg
Greetings,


I'm seeing open() failures when attempting concurrent access in a  
lustre fs.

The following Fortran program fails sporadically when run under  
mpirun, even on the same host.  Note that there is no MPI statement;  
the mpirun simply keeps the startup times very close together:


$ cat test.f
      program test
      open(1, file = 'test.dat', status = 'old')
      close(1)
      write(*,*) "OK"
      end

$ gfortran test.f
$ mpirun -np 8 a.out
  OK
  OK
  OK
  OK
  OK
  OK
  OK
  OK
$ mpirun -np 8 a.out
  OK
  OK
  OK
  OK
  OK
  OK
At line 2 of file test.f
Fortran runtime error: No such file or directory
  OK


The "status= 'old'" seems to be the trigger.   A C version never  
failed (thus far):


$ cat test.c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (fopen("test.dat", "r") == NULL) {
        perror("test.dat");
    } else {
        char hostname[20];
        gethostname(hostname, sizeof hostname);
        printf("%s: OK\n", hostname);
    }
    return 0;
}



I run 2.6.18-92.1.17.el5_lustre.1.6.7.1smp on RHEL-5.3.  The error  
shows up with both gfortran-4.1.2 20080704 (Red Hat 4.1.2-44) and  
Intel Fortran 10.1 20090817.   The data file size is some 800K.   
Nothing from lustre shows up in syslog on the clients or servers.

The error is quite unexpected for such a basic operation.  Where  
should I look for parameters to tweak?

I have mounted on the client:
mds01...@o2ib:mds02...@o2ib:/sandbox on /sandbox type lustre (rw)

on the MDS:
/dev/dm-2 on /mnt/mdt-sandbox type lustre (rw)
and OSS:
/dev/dm-2 on /mnt/ost0-sandbox type lustre (rw)

The MGS/MDS sit on the same disk, /dev/dm-1 (which also serves /home)


With best regards,
Michael



Re: [Lustre-discuss] autofs parser tantrum

2009-05-03 Thread Michael Sternberg
Hello Daire,

Thank you for your note.  I ultimately decided to use static mounts;  
will see ...

I've been very happy with the 1.6.x releases, even more so because tcp  
nids now work for me (I use a custom tcp port, and never got clients  
to connect previously.)


Regards, Michael



On Apr 27, 2009, at 5:58 , Daire Byrne wrote:
> We had to patch the autofs package to add "@" as a special  
> character. I've never tried the failover MDS mount case but I'm sure  
> it will work fine as long as autofs knows to ignore the '@' and ':'  
> correctly and pass that on to mount.
>
> Daire



[Lustre-discuss] autofs parser tantrum

2009-04-24 Thread Michael Sternberg
Hi,

I'm struggling with the autofs(5) parser (on RHEL-5.2), which does not  
understand the lustre mount spec.  This is not a lustre problem, but I  
wonder if someone has used automount successfully to mount a lustre  
file system.


My /etc/auto.master is:

   /-  /etc/auto.direct

and the direct map is (/etc/auto.direct):

  /sandbox   -fstype=lustre  md...@tcp:md...@tcp:/sandbox


Upon a mount request I get the following in the autofs log, showing  
how the parser mangles the mount spec:

   lookup_mount: lookup(file): looking up /sandbox
   lookup_mount: lookup(file): /sandbox -> -rw md...@tcp:md...@tcp:/ 
sandbox
   parse_mount: parse(sun): expanded entry: -rw  md...@tcp:md...@tcp// 
sandbox
  ^^

If I leave out the failover server, the parser complains of "invalid  
location":

   lookup_mount: lookup(file): looking up /sandbox
   lookup_mount: lookup(file): /sandbox -> -fstype=lustre 
md...@tcp:/sandbox
   parse_mount: parse(sun): expanded entry: -fstype=lustre 
md...@tcp:/sandbox
   parse_mount: parse(sun): gathered options: fstype=lustre
   parse(sun): invalid location md...@tcp:/sandbox


What's a workable way to automount lustre, or is this misguided?

The reason I want to use automount is that the client is the  
management node of an HPC cluster.  It needs the mount only  
occasionally, such as for creating home directories and debugging.   
Since the node is critical for the cluster (there's actually two, in  
high-availability), I'd like to shield it (them) from any lustre  
recovery outages.


Regards, Michael
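A static mount, the route Michael reports taking in his follow-up, sidesteps the autofs parser entirely. A sketch of the corresponding /etc/fstab line (the NIDs, mount point, and options below are hypothetical, not taken from this thread):

```
# /etc/fstab sketch; failover MGS NIDs and the option set are assumptions
mds01@tcp0:mds02@tcp0:/sandbox  /sandbox  lustre  defaults,_netdev  0 0
```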


Re: [Lustre-discuss] NAME field in lustre-client rpms

2009-04-23 Thread Michael Sternberg

On Apr 23, 2009, at 12:44 , Brian J. Murrell wrote:

> On Thu, 2009-04-23 at 12:18 -0500, Michael Sternberg wrote:
>> Still, how could one serve both the patched and patchless lustre rpms
>> out of the same yum repository?
> 
> See bug 13908.




https://bugzilla.lustre.org/show_bug.cgi?id=13908

--- Comment #37 From Brian Murrell 2009-04-22 10:57:38 ---
(From update of attachment 22951) Landed to lustre-build HEAD, which means 
landed for lustre releases/branches 1.6.8, 1.8.1 and 2.0 (HEAD).

--- Comment #38 From Brian Murrell 2009-04-22 10:57:58 ---
Landed, closing.


Very nice - fix is coming already - thank you!


Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] NAME field in lustre-client rpms

2009-04-23 Thread Michael Sternberg
I was trying to install a patchless client of lustre-1.6.7.1 on RHEL-5.3 from a local yum repository and noticed an oddity with the rpm metadata: the NAME fields of the "lustre-client" rpms are the same as in the patched-kernel rpms.  I looked back at all the versions I've used (1.6.x), and saw that this has always been the case there, so apparently that's by design, e.g.:

   lustre-1.6.7.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.x86_64.rpm
   lustre-modules-1.6.7.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.x86_64.rpm
vs.
   lustre-client-1.6.7.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.x86_64.rpm
   lustre-client-modules-1.6.7.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.x86_64.rpm

# rpm -qip lustre-[1mc]*rpm  | grep Name
Name: lustre   Relocations: (not relocatable)
Name: lustre   Relocations: (not relocatable)
Name: lustre-modules   Relocations: (not relocatable)
Name: lustre-modules   Relocations: (not relocatable)

This means that the patched and patchless rpms collide in a yum repository, and only one will be available for installation (the last by collation, I believe).

This is the first time I've tried to use the patchless client.  However, I've put it on the back burner (since there is no support yet for the RHEL-5.3 kernel) and pulled it out of my yum repo.

Still, how could one serve both the patched and patchless lustre rpms out of the same yum repository?
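Lacking a metadata fix, one workaround is to keep the colliding rpms in two separate repositories and choose the tree by file name, since the file names (unlike the NAME tags) do differ.  A sketch only; the `route_rpm` helper and the /repo layout are my own invention, and each subtree would still need its own createrepo run:

```shell
# route_rpm BASEDIR RPMFILE -- print which repo subtree an rpm belongs in,
# keyed off the lustre-client- file name prefix (the NAME tags collide,
# the file names do not)
route_rpm() {
    case "$(basename "$2")" in
        lustre-client-*) echo "$1/patchless" ;;
        *)               echo "$1/patched"   ;;
    esac
}

# e.g.: for f in *.rpm; do cp "$f" "$(route_rpm /repo "$f")"; done
#       createrepo /repo/patched; createrepo /repo/patchless
```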


Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Upgrade-Procedure

2009-01-06 Thread Michael Sternberg
On Jan 6, 2009, at 11:21 , Thomas Roth wrote:
> I've just upgraded a 1 MDT - 2 OST - 2 Clients - test cluster from
> Lustre version 1.6.5.1 to 1.6.6.
> However, I did not follow the manual
> (http://manual.lustre.org/manual/LustreManual16_HTML/UpgradingLustre.html#50548855_pgfId-1289726):
> I did not use the tunefs.lustre command on MGS/MDT or OSTs.
> My data seem to have survived; the quota had to be switched back on
> with lfs quotacheck.
>
> My question: is this treatment with tunefs.lustre (in particular with
> the somewhat scary "writeconf" parameter) actually necessary at all?


I also did not perform this step.  I assumed since the manual at docs.sun.com wasn't updated, there were no specific steps required.

http://www.sun.com/software/products/lustre/get.jsp
 links (eventually) to
http://docs.sun.com/app/docs/doc/820-3681
 holding
Lustre_1.6_man_v1.14 September 2008

But lustre.org (linked by Thomas) holds the newer Lustre_1.6_man_v1.15 November 2008.  However, the example therein is still the previous one, as it refers to kernel 2.6.18-8.1.14.el5_lustre.1.6.4.2smp, i.e. lustre 1.6.4.

Hmm - could *not running* tunefs.lustre have been the cause of my hiccup with lov_objids that I mentioned last week?  As I wrote then, my data was also still intact (at least it was before I opened write access - ahem.)


Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Solved: incorrect num_osts; LBUG

2009-01-04 Thread Michael Sternberg
I wrote:

On Jan 2, 2009, at 19:24 , Michael Sternberg wrote:
> While performing an LFSCK after upgrading to lustre-1.6.6, the MDS claims
> to have 512 OSTs [while there is only 1]:
>
>   MDS: num_osts = 512


Well, I found an interesting section "How to fix bad LAST_ID on an OST" in the operations manual, Appendix D, which pointed more or less in the right direction.

Briefly, /lov_objid within the MDS ldiskfs is constructed from /O/0/LAST_ID in all the OSTs' ldiskfs.  In my case, with only one OST, the first and only entry agrees between these two files, and the rest of lov_objid was padded with NULs to 4KB (block size?).  I do not understand what caused this, but, given prior e2fsck output indicating length 8, I decided to simply chop off the NULs, which indeed made e2fsck happy.  lfsck on a client found a handful of empty orphans, but a backup on the client side still matches the contents of the rest of the fs (phew!).
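The "chop off the NULs" step can be sketched as a tiny helper (my own wording of the manual's fix, not a supported tool; the function name is invented).  With one OST, lov_objid should hold exactly one 8-byte objid entry, so truncate a backup copy to 8 bytes and only copy it back with the MDT stopped and after verifying:

```shell
# trim_lov_objid FILE NUM_OSTS -- write FILE.trimmed containing only the
# first NUM_OSTS 8-byte objid entries of FILE, dropping the NUL padding
trim_lov_objid() {
    dd if="$1" of="$1.trimmed" bs=8 count="$2" 2>/dev/null
}

# e.g., against a backup copy of the MDT's lov_objid, with one OST:
#   trim_lov_objid /root/lov_objid.bak 1
#   # verify the 8-byte result, then copy it back into the ldiskfs mount
```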


Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] incorrect num_osts; LBUG

2009-01-02 Thread Michael Sternberg
While performing an LFSCK after upgrading to lustre-1.6.6, the MDS claims to have 512 OSTs:

MDS: num_osts = 512

I am in the middle of an update, in a currently asymmetric heartbeat config, with a passive MDS and OSS on lustre-1.6.5.1, and the currently active MDS and OSS (where num_osts comes out wrong) on lustre-1.6.6.

The rest of the e2fsck output looks perfectly normal and in line with previous runs.  I run two lustre file systems, /home and /sandbox, off the same MGS, and e2fsck for BOTH file systems reports 512 OSTs.  Running e2fsck on the 1.6.5.1 servers gives the same blown-up OST list.

Does this mean my MGS is trashed?


In context:

================================================================
# e2fsck -n -v --mdsdb /tmp/mdsdb-sandbox /dev/dm-2; date
e2fsck 1.40.11.sun1 (17-June-2008)
device /dev/dm-2 mounted by lustre per /proc/fs/lustre/mds/sandbox-MDT/mntdev
Warning!  /dev/dm-2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
sandbox-MDT has been mounted 47 times without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
MDS: ost_idx 0 max_id 1192499
MDS: ost_idx 1 max_id 0
MDS: ost_idx 2 max_id 0
MDS: ost_idx 3 max_id 0
MDS: ost_idx 4 max_id 0
...
MDS: ost_idx 509 max_id 0
MDS: ost_idx 510 max_id 0
MDS: ost_idx 511 max_id 0
MDS: got 4096 bytes = 512 entries in lov_objids
MDS: max_files = 36420
MDS: num_osts = 512
mds info db file written
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Pass 6: Acquiring information for lfsck
MDS: max_files = 36420
MDS: num_osts = 512
MDS: 'sandbox-MDT_UUID' mdt idx 0: compat 0x4 rocomp 0x1 incomp 0x4

36420 inodes used (0.05%)
3 non-contiguous inodes (0.0%)
  # of inodes with ind/dind/tind blocks: 1/0/0
  9283672 blocks used (12.67%)
0 bad blocks
1 large file

32592 regular files
 3818 directories
0 character device files
0 block device files
0 fifos
1 link
1 symbolic link (1 fast symbolic link)
0 sockets

36412 files
================================================================



I did an LFSCK just prior to the upgrade, which showed num_osts = 1.
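The 512 figure is just arithmetic over the file size: e2fsck treats lov_objids as an array of 8-byte entries, so a file NUL-padded to 4096 bytes reads as 4096/8 = 512 OSTs.  A quick sanity check (the helper name is mine):

```shell
# lov_entry_count FILE -- number of 8-byte objid slots e2fsck would see
lov_entry_count() {
    echo $(( $(wc -c < "$1") / 8 ))
}

# a healthy single-OST lov_objids is 8 bytes (1 entry); padded to 4096
# bytes it reads as 512 entries, matching the e2fsck output above
```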


This is on RHEL-5.2 with the latest e2fsprogs:

# e2fsck -V
e2fsck 1.40.11.sun1 (17-June-2008)
Using EXT2FS Library version 1.40.11.sun1, 17-June-2008


I run two MDS and OSS each in heartbeat; I encountered an LBUG during a heartbeat hiccup (output appended).

Kernels for the updated and active MDS and OSS, and some clients are:
2.6.18-92.1.10.el5_lustre.1.6.6smp

the passive MDS and OSS, and remaining clients are:
2.6.18-53.1.14.el5_lustre.1.6.5.1smp



Regards, Michael


Jan  2 15:41:20 mds02 kernel: LustreError: 14500:0:(mgs_handler.c:194:mgs_setup()) ASSERTION(!lvfs_check_rdonly(lvfs_sbdev(mnt->mnt_sb))) failed
Jan  2 15:41:20 mds02 kernel: LustreError: 14500:0:(tracefile.c:450:libcfs_assertion_failed()) LBUG
Jan  2 15:41:20 mds02 kernel: Lustre: 14500:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 14500
Jan  2 15:41:20 mds02 kernel: mount.lustre  R  running task   0 14500  14499 (NOTLB)
Jan  2 15:41:20 mds02 kernel:  0040 0020 8102105a5568 88966410
Jan  2 15:41:20 mds02 kernel:  888d3b80 810200b242c0
Jan  2 15:41:20 mds02 kernel:  8102105a5598 80143a09
Jan  2 15:41:20 mds02 kernel: Call Trace:
Jan  2 15:41:20 mds02 kernel:  [] dump_trace+0x211/0x23a
Jan  2 15:41:20 mds02 kernel:  [] :ptlrpc:lprocfs_rd_pool_state+0x0/0x200
Jan  2 15:41:20 mds02 kernel:  [] :obdclass:lprocfs_wr_atomic+0x0/0x60
Jan  2 15:41:20 mds02 kernel:  [] show_trace+0x34/0x47
Jan  2 15:41:20 mds02 kernel:  [] _show_stack+0xdb/0xea
Jan  2 15:41:20 mds02 kernel:  [] :libcfs:lbug_with_loc+0x7a/0xc0
Jan  2 15:41:20 mds02 kernel:  [] :libcfs:libcfs_assertion_failed+0x54/0x60
Jan  2 15:41:20 mds02 kernel:  [] :mgs:mgs_setup+0x301/0x800
Jan  2 15:41:20 mds02 kernel:  [] :obdclass:class_setup+0x942/0xc70
Jan  2 15:41:20 mds02 kernel:  [] :obdclass:class_process_config+0x14bd/0x19e0
Jan  2 15:41:20 mds02 kernel:  [] :obdclass:do_lcfg+0x924/0xb20
Jan  2 15:41:20 mds02 kernel:  [] :obdclass:lustre_start_simple+0x130/0x1d0
Jan  2 15:41:20 mds02 kernel:  [] :obdclass:server_start_mgs+0x223/0x320
Jan  2 15:41:20 mds02 OpenSM[14448]: Entering MASTER state
Jan  2 15:41:20 mds02 kernel:  [] :obdclass:server_fill_super+0x18a2/0x1fb0
Jan  2 15:41:21 mds02 kernel:  [] :sunrpc:rpc_call_sync+0x9e/0xa8
Jan  2 15:41:21 mds02 kernel:  [] recalc_sigpending+0xe/0x25
Jan  2 15:41:21 mds02 attrd: [14276]: info: main: Starting mainloop...
Jan  2 15:

Re: [Lustre-discuss] Interoperability between lustre 1.4.11 server and lustre 1.6 client

2008-09-02 Thread Michael Sternberg
Thanks for the confirmation.

I have a similar issue, but with different versions *between servers*: is it acceptable to run the MDS/MGS with lustre-1.6.5.1 but the OSSs with lustre-1.6.4.3?

An MDS update to 1.6.5.1 and RHEL-5.2 went fine, but the same on the OSS fails, so I am considering a rollback of the OSS to 1.6.4.3 and RHEL-5.1, which is working.
The clients are also a mix of the two versions mentioned.


Best regards, Michael


On Apr 28, 2008, at 2:53 , Peter Kjellstrom wrote:

> On Wednesday 23 April 2008, ashok bharat bayana wrote:
>> hi!
>> Is there an interoperability between lustre 1.4.11 server and lustre 1.6
>> patchless client? i mean can we make lustre 1.4.11 server talk to lustre
>> 1.6 patchless client? I want help in this regard..
>
> FWIW, we've been running 1.6 clients with 1.4 servers without problems for a
> while. Note though that we don't run patchless client. Here are the exact
> versions we currently use:
> client : lustre-1.6.4.3-2.6.9_67.0.4.EL_lustre.1.6.4.3smp
> servers: lustre-1.4.11.1-2.6.9_55.0.9.EL_lustre.1.4.11.1smp
>
> /Peter
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] OSS: bad header in inode - invalid magic

2008-07-02 Thread Michael Sternberg
On Jul 2, 2008, at 9:26 , Brian J. Murrell wrote:
> On Wed, 2008-07-02 at 09:01 -0500, Michael Sternberg wrote:
>> Ah - I tried, but ran into a symbol error:
>>
>> [EMAIL PROTECTED] ~]# wget 
>> http://downloads.lustre.org/public/tools/e2fsprogs/latest/e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
>> [EMAIL PROTECTED] ~]# rpm -Fvh ./
>> e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
>> [EMAIL PROTECTED] ~]# e2fsck -fp /dev/dm-3
>> e2fsck: symbol lookup error: e2fsck: undefined symbol:  
>> ext2fs_mmp_update
>
> Ahhh.  This isn't good.  Can you file a bug in our bugzilla about  
> that?

Done:   https://bugzilla.lustre.org/show_bug.cgi?id=16265


Update to CentOS-5.2 in progress ...


Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] OSS: bad header in inode - invalid magic

2008-07-02 Thread Michael Sternberg
Hi,

On Jul 2, 2008, at 8:35 , Brian J. Murrell wrote:
> On Tue, 2008-07-01 at 19:52 -0500, Michael Sternberg wrote:
>> Jul  1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3):
>> ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic
>> - magic 0, entries 0, max 0(0), depth 0(0)
>
> I would suggest making sure you have the *latest* e2fsprogs from Sun and
> running an e2fsck on that volume.  You may have to use flags to force a
> thorough check.  Check the manpage.


Ah - I tried, but ran into a symbol error:

[EMAIL PROTECTED] ~]# wget http://downloads.lustre.org/public/tools/e2fsprogs/latest/e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
[EMAIL PROTECTED] ~]# rpm -Fvh ./e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
[EMAIL PROTECTED] ~]# e2fsck -fp /dev/dm-3
e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2fs_mmp_update

This is on RHEL5.1.


>> Message from syslogd@ at Tue Jul  1 16:00:02 2008 ...
>> login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake()) ASSERTION(pc != NULL) failed
>> Message from syslogd@ at Tue Jul  1 16:00:02 2008 ...
>> login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG
>
> This looks like bug 13888 fixed in 1.6.5.

OK, will go forward with 1.6.5.1, hoping OFED-1.3 etc. issues are solved.


Thank you very much,
Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] OSS: bad header in inode - invalid magic

2008-07-01 Thread Michael Sternberg

Hi,

I repeatedly encounter "invalid magic" in one particular inode of one of my OSS volumes (1 of 4, each 5 TB), with the consequence of lustre remounting R/O.


I run 2.6.18-53.1.13.el5_lustre.1.6.4.3smp on RHEL5.1 on a cluster with approx. 150 client nodes.


The error appears on the OSS as:

Jul  1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul  1 15:43:58 oss01 kernel: Remounting filesystem read-only
Jul  1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul  1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 45 credits: rc = -30
Jul  1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) Skipped 6 previous similar messages
Jul  1 15:43:58 oss01 kernel: LustreError: 25462:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Jul  1 15:43:58 oss01 kernel: LustreError: 19569:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30

[... many repeats]


Three login nodes signaled the same wall(8) message, about 10-15 minutes apart:


Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake()) ASSERTION(pc != NULL) failed

Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG




Twice in the past, I followed this recovery procedure from the Manual and the Wiki:



http://wiki.lustre.org/index.php?title=Fsck_Support#Using_e2fsck_on_a_backing_filesystem%7Cusing
Using e2fsck on a backing filesystem
-- nice walk-through


http://manual.lustre.org/manual/LustreManual16_HTML/Failover.html#50446391_pgfId-1287654
8.4.1 Starting/Stopping a Resource

[i.e., simply umounting the device on the OSS - is this correct?]


http://manual.lustre.org/manual/LustreManual16_HTML/LustreInstallation.html#50446385_43530
4.2.1.5 Stopping a Server


In other words:
umount the OSS
perform fsck on the block device
remount the OSS

So, last time I did:

[EMAIL PROTECTED] ~]# umount /mnt/ost2
[EMAIL PROTECTED] ~]# e2fsck -fp /dev/dm-3

lustre-OST0002: recovering journal
lustre-OST0002: ext3 recovery flag is clear, but journal has data.
lustre-OST0002: Run journal anyway

lustre-OST0002: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)

[EMAIL PROTECTED] ~]# mount -t ldiskfs /dev/dm-3 /mnt/ost2
[EMAIL PROTECTED] ~]# umount /mnt/ost2
[EMAIL PROTECTED] ~]# e2fsck -fp /dev/dm-3
	lustre-OST0002: 342355/427253760 files (4.2% non-contiguous), 139324997/1708984375 blocks



To my surprise, there were no errors.  I did the same today after the error above, but left out the "-p" flag; still, fsck did not find an error (except the journal replay??):


[EMAIL PROTECTED] ~]# e2fsck -f /dev/dm-3
e2fsck 1.40.4.cfs1 (31-Dec-2007)
lustre-OST0002: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

lustre-OST0002: * FILE SYSTEM WAS MODIFIED *
	lustre-OST0002: 343702/427253760 files (4.4% non-contiguous), 137003893/1708984375 blocks

[EMAIL PROTECTED] ~]#

I haven't mounted back yet for fear this would stall the system again in a couple of days.



How can I locate the "bad" inode - should I try?  Is this an inode of the lustre FS or the underlying ext3 on the OST?

Are there version dependencies of e2fsck with lustre?  I am running lustre-1.6.4.3 and e2fsck-1.40.4.



I would appreciate any pointers.


Thank you for your attention and help.
Michael

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] o2ib module prevents shutdown

2008-04-18 Thread Michael Sternberg
Hello Wojciech,

Sorry for the delayed response.  lustre_rmmod worked in a manual test to remove the modules after ib0 was down.  I have yet to try this as part of the init.d shutdown scripts; an alternate solution with a script didn't quite work.
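For reference, the ordering that worked in the manual test, written out as a dry-run plan (this function is only a sketch of my own; it echoes the commands from the thread instead of running them):

```shell
# lustre_ib_shutdown_plan -- the teardown order that avoided the stuck
# ko2iblnd refcount: unmount lustre, unload lustre/LNET, then stop IB
lustre_ib_shutdown_plan() {
    echo 'umount -a -t lustre'         # 1. unmount all lustre filesystems
    echo 'lustre_rmmod'                # 2. unload lustre modules, stop LNET
    echo '/etc/init.d/openibd stop'    # 3. only now remove the IB drivers
}

lustre_ib_shutdown_plan
```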

Thanks for the hint!


Regards, Michael

On Apr 15, 2008, at 12:33 , Wojciech Turek wrote:
> Hi,
>
> This usually happens when you try to remove IB card drivers before
> stopping lustre network. What I do is after clean umount I run the
> lustre_rmmod script which removes all lustre modules and stops the
> lustre network. Then you can safely remove the IB card driver and
> nothing should get stuck.
>
> Cheers,
>
> Wojciech
>
> On 15 Apr 2008, at 18:22, Michael Sternberg wrote:
>
>>
>> On Apr 15, 2008, at 12:15, Brian J. Murrell wrote:
>>> On Tue, 2008-04-15 at 12:07 -0500, Michael Sternberg wrote:
>>>> Hello,
>>>>
>>>> Not sure if this is the right forum:  I'm encountering difficulties
>>>> with o2ib which prevents an LNET shutdown from proceeding:
>>>>
>>>>    Unloading OpenIB kernel modules:NET: Unregistered protocal family 27
>>>>Failed to unload rdma_cm
>>>>Failed to unload rdma_cm
>>>>Failed to unload ib_cm
>>>>Failed to unload ib_sa
>>>>LustreError: 131-3: Received notification of device removal
>>>>Please shutdown LNET to allow this to proceed
>>>>
>>>> This happens on server and client nodes alike.  We run RHEL5.1 and
>>>> OFED 1.2, kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp from CFS/Sun.
>>>>
>>>> I narrowed it down to module ko2iblnd, which I attempt to remove
>>>> first (added to PRE_UNLOAD_MODULES in /etc/init.d/openibd), but it
>>>> doesn't work.  Strangely, in "lsmod" the use count of the module is
>>>> one, but I don't see where it's used.
>>>
>>> To ask what might sound like a stupid question, but you do have  
>>> all of
>>> your lustre filesystems unmounted before you try to unload ko2iblnd,
>>> yes?  Can you show us what's in /proc/mounts when you try to unload
>>> ko2iblnd but it shows a refcount > 0?
>>
>> No problem with the question - anything that helps:
>>
>> # cat /proc/mounts
>> rootfs / rootfs rw 0 0
>> /dev/root / ext3 rw,data=ordered 0 0
>> /dev /dev tmpfs rw 0 0
>> /proc /proc proc rw 0 0
>> /sys /sys sysfs rw 0 0
>> /proc/bus/usb /proc/bus/usb usbfs rw 0 0
>> devpts /dev/pts devpts rw 0 0
>> /dev/sda1 /boot ext3 rw,data=ordered 0 0
>> tmpfs /dev/shm tmpfs rw 0 0
>> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
>> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
>> 172.16.100.3:/drbd/exports/opt /opt nfs rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.100.3 0 0
>> /etc/auto.misc /misc autofs
>> rw,fd=6,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0
>> -hosts /net autofs
>> rw,fd=11,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0
>>
>>
>> This was even after:
>>
>>  # ifconfig ib0 down
>>
>> I also have:
>>
>>  # grep lnet /etc/modprobe.conf
>>  options lnet networks="o2ib0,tcp0(eth0)" accept_port=6988
>>
>> (the accept_port spec doesn't work either on a tcp-only node, but
>> that's a separate issue, or so I believe.)
>>
>>
>> Regards, Michael
>>
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Download Problems with lustre 1.6.4.3

2008-04-18 Thread Michael Sternberg
I had success the other day at Sun's official download site - looks like the issues it had are fixed.


Regards, Michael


On Apr 16, 2008, at 14:28 , Frank Leers wrote:
> ...OK, so should there be anything here?  The directory seems empty, or
> possibly the files have incorrect permissions set, or we changed our
> minds :-/
>
> http://downloads.lustre.org/public/lustre/v1.6/
>
> -frank
>
> On Thu, 2008-03-20 at 05:46 -0400, Oleg Drokin wrote:
>> Hello!
>>
>> On Mar 20, 2008, at 4:55 AM, Patrick Winnertz wrote:
>>> I've tried just some minutes ago to download the sourcecode of
>>> lustre from
>>> [1], but it failed with this error:
>>
>> Please try http://downloads.lustre.org
>> This is the new downloads site that went into commission just yesterday,
>> so no wide announcements were made yet, it seems.
>>
>> Bye,
>> Oleg
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] o2ib module prevents shutdown

2008-04-15 Thread Michael Sternberg

On Apr 15, 2008, at 12:15, Brian J. Murrell wrote:
> On Tue, 2008-04-15 at 12:07 -0500, Michael Sternberg wrote:
>> Hello,
>>
>> Not sure if this is the right forum:  I'm encountering difficulties
>> with o2ib which prevents an LNET shutdown from proceeding:
>>
>>  Unloading OpenIB kernel modules:NET: Unregistered protocal family 27
>>  Failed to unload rdma_cm
>>  Failed to unload rdma_cm
>>  Failed to unload ib_cm
>>  Failed to unload ib_sa
>>  LustreError: 131-3: Received notification of device removal
>>  Please shutdown LNET to allow this to proceed
>>
>> This happens on server and client nodes alike.  We run RHEL5.1 and
>> OFED 1.2, kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp from CFS/Sun.
>>
>> I narrowed it down to module ko2iblnd, which I attempt to remove
>> first (added to PRE_UNLOAD_MODULES in /etc/init.d/openibd), but it
>> doesn't work.  Strangely, in "lsmod" the use count of the module is
>> one, but I don't see where it's used.
>
> To ask what might sound like a stupid question, but you do have all of
> your lustre filesystems unmounted before you try to unload ko2iblnd,
> yes?  Can you show us what's in /proc/mounts when you try to unload
> ko2iblnd but it shows a refcount > 0?

No problem with the question - anything that helps:

# cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/sda1 /boot ext3 rw,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
172.16.100.3:/drbd/exports/opt /opt nfs rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.100.3 0 0
/etc/auto.misc /misc autofs rw,fd=6,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,fd=11,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0


This was even after:

# ifconfig ib0 down

I also have:

# grep lnet /etc/modprobe.conf
options lnet networks="o2ib0,tcp0(eth0)" accept_port=6988

(the accept_port spec doesn't work either on a tcp-only node, but that's a separate issue, or so I believe.)


Regards, Michael

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] o2ib module prevents shutdown

2008-04-15 Thread Michael Sternberg
Hello,

Not sure if this is the right forum: I'm encountering difficulties with o2ib which prevent an LNET shutdown from proceeding:

Unloading OpenIB kernel modules:NET: Unregistered protocal family 27
Failed to unload rdma_cm
Failed to unload rdma_cm
Failed to unload ib_cm
Failed to unload ib_sa
LustreError: 131-3: Received notification of device removal
Please shutdown LNET to allow this to proceed

This happens on server and client nodes alike.  We run RHEL5.1 and OFED 1.2, kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp from CFS/Sun.

I narrowed it down to module ko2iblnd, which I attempt to remove first (added to PRE_UNLOAD_MODULES in /etc/init.d/openibd), but it doesn't work.  Strangely, in "lsmod" the use count of the module is one, but I don't see where it's used.


# umount /mnt/lustre
# ifconfig ib0 down
# modprobe -r ko2iblnd
FATAL: Module ko2iblnd is in use.
# lsmod | grep ko2
ko2iblnd  143136  1
lnet  258088  5 lustre,ksocklnd,ko2iblnd,ptlrpc,obdclass
libcfs  189784  12 osc,mgc,lustre,lov,lquota,mdc,ksocklnd,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
rdma_cm  65940  4 ko2iblnd,ib_iser,rdma_ucm,ib_sdp
ib_core  88576  16 ko2iblnd,ib_iser,rdma_ucm,ib_ucm,ib_srp,ib_sdp,rdma_cm,ib_cm,iw_cm,ib_local_sa,ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad


I'd be grateful for any hints.



Regards, Michael



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Download Problems with lustre 1.6.4.3

2008-03-20 Thread Michael Sternberg
> On Thu, Mar 20, 2008 at 2:46 AM, Oleg Drokin  sun.com> wrote:
> >  On Mar 20, 2008, at 4:55 AM, Patrick Winnertz wrote:
> >  > I've tried just some minutes ago to download the sourcecode of
> >  > lustre from
> >  > [1], but it failed with this error:
> >
> >  Please try http://downloads.lustre.org
> >  This is the new downloads site that went into commission just yesterday,
> >  so no wide announcements were made yet, it seems.
>
> I've been attempting to convince sun that their download website sucks
> and they need to change it, I could paste the entire email thread but
> that would be cruel, to you guys.

I had the *very same* experience, receiving from SDLC talk about firewall settings etc., culminating in an off-topic and canned referral to the Java development community -- very frustrating.
I found the RPMs I was seeking at the download site Oleg mentioned.  Feels like arriving at an oasis of sanity.  -- Thank you very much!

Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss