Re: [lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-13 Thread Andreas Dilger
One thing to check, if you are not seeing any benefit from running e2fsck, is to
make sure you are using the latest e2fsprogs-1.45.2.wc1.
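
For example, to see what is installed (the exact package naming may differ
depending on where e2fsprogs came from):

  rpm -q e2fsprogs   # should report something like e2fsprogs-1.45.2.wc1-...
  e2fsck -V          # prints the e2fsck version that actually gets run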

You could also try upgrading the server to Lustre 2.10.8.

Based on the kernel version, it looks like RHEL 6.7, which should still work
with 2.10 (the previous LTS branch); 2.10 has a lot more fixes than 2.8.0.

Cheers, Andreas

On Mar 5, 2020, at 00:48, Torsten Harenberg <harenb...@physik.uni-wuppertal.de> wrote:

Dear all,

I know it's daring to ask for help with such an old system.

We still run a CentOS 6 based Lustre 2.8.0 system
(kernel-2.6.32-573.12.1.el6_lustre.x86_64,
lustre-2.8.0-2.6.32_573.12.1.el6_lustre.x86_64.x86_64).

It's out of warranty and about to be replaced. The approval process for
the new grant took over a year, and we're currently preparing an EU-wide
tender; all of that has taken much more time than we expected.

The problem is:

One OSS server keeps running into kernel panics. The panic seems to be
related to one of the OSS mount points: if we mount that server's LUNs (all
data is on a 3PAR SAN) on a different server, that one panics, too.

We always run file system checks after such a panic, but these show only
the minor issues you would expect after a crashed machine, like:

[QUOTA WARNING] Usage inconsistent for ID 2901:actual (757747712, 217)
!= expected (664182784, 215)

We would love to avoid upgrading these old machines to CentOS 7, but by now
these crashes happen really often; yesterday it panicked after only 30 minutes.

Now we're running out of ideas.

If anyone has an idea how we could identify the source of the problem,
we would really appreciate it.

Kind regards

 Torsten


--
Dr. Torsten Harenberg 
harenb...@physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
Gaussstr. 20  Fax : +49 (0)202 439-2811
42097 Wuppertal


Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-12 Thread Torsten Harenberg
Dear all,

On 10.03.20 at 08:18, Torsten Harenberg wrote:
> During the last days (since Thursday), our Lustre instance was
> surprisingly stable. We lowered the load a bit by limiting the number of
> running jobs, which might also have helped to stabilize the system.
> 
> We enabled kdump, so if another crash happens anytime soon, we hope
> to at least get a dump that hints at where the problem is.

Now it crashed again. But we got a backtrace and a dump.

The backtrace is:

<4>general protection fault:  [#1] SMP
<4>last sysfs file:
/sys/devices/pci:00/:00:01.0/:04:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:14/state
<4>CPU 13
<4>Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U)
osd_ldiskfs(U) ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U)
lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic
crc32c_intel libcfs(U) autofs4 bonding ipt_REJECT nf_conntrack_ipv4
nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6
iTCO_wdt iTCO_vendor_support hpilo hpwdt serio_raw lpc_ich mfd_core
ioatdma dca ses enclosure sg bnx2x ptp pps_core libcrc32c mdio
power_meter acpi_ipmi ipmi_si ipmi_msghandler shpchp ext4 jbd2 mbcache
dm_round_robin sd_mod crc_t10dif qla2xxx scsi_transport_fc scsi_tgt
pata_acpi ata_generic ata_piix dm_multipath dm_mirror dm_region_hash
dm_log dm_mod hpsa [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 18657, comm: ll_ost_io02_077 Not tainted
2.6.32-573.12.1.el6_lustre.x86_64 #1 HP ProLiant DL360p Gen8
<4>RIP: 0010:[]  []
ldiskfs_ext_insert_extent+0xb3/0x10c0 [ldiskfs]
<4>RSP: 0018:8806fa8136c0  EFLAGS: 00010246
<4>RAX:  RBX: 0002 RCX: dead00200200
<4>RDX: 8806fa813800 RSI: 88196f62a2c0 RDI: 880106fc3901
<4>RBP: 8806fa813790 R08:  R09: 8807ff69f3c0
<4>R10: 0009 R11: 0002 R12: 88196f62a240
<4>R13:  R14: 0002 R15: 88196f62a2c0
<4>FS:  () GS:88009a5a()
knlGS:
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
<4>CR2: 00426820 CR3: 01a8d000 CR4: 000407e0
<4>DR0:  DR1:  DR2: 
<4>DR3:  DR6: 0ff0 DR7: 0400
<4>Process ll_ost_io02_077 (pid: 18657, threadinfo 8806fa81,
task 8806faffaab0)
<4>Stack:
<4> 8806fa8136f0 811beb8b 0002 dead00200200
<4> 881f80925c00 880106fc39b0 8806fa813790 a0be3dd1
<4> 88196f62a240 880106fc39b0 8806fa8137e8 fa8137d4
<4>Call Trace:
<4> [] ? __mark_inode_dirty+0x3b/0x160
<4> [] ? ldiskfs_mb_new_blocks+0x241/0x640 [ldiskfs]
<4> [] ldiskfs_ext_new_extent_cb+0x5d9/0x6d0 [osd_ldiskfs]
<4> [] ? call_rwsem_wake+0x18/0x30
<4> [] ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
<4> [] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [osd_ldiskfs]
<4> [] osd_ldiskfs_map_nblocks+0x7d/0x110 [osd_ldiskfs]
<4> [] osd_ldiskfs_map_inode_pages+0x278/0x2e0
[osd_ldiskfs]
<4> [] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
<4> [] osd_write_commit+0x39b/0x9a0 [osd_ldiskfs]
<4> [] ofd_commitrw_write+0x664/0xfa0 [ofd]
<4> [] ofd_commitrw+0x5bf/0xb10 [ofd]
<4> [] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
<4> [] obd_commitrw+0x114/0x380 [ptlrpc]
<4> [] tgt_brw_write+0xc70/0x1540 [ptlrpc]
<4> [] ? enqueue_task+0x66/0x80
<4> [] ? check_preempt_curr+0x6d/0x90
<4> [] ? try_to_wake_up+0x24e/0x3e0
<4> [] ? lustre_swab_niobuf_remote+0x0/0x30 [ptlrpc]
<4> [] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
<4> [] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
<4> [] ptlrpc_main+0xd21/0x1800 [ptlrpc]
<4> [] ? pick_next_task_fair+0xd0/0x130
<4> [] ? schedule+0x176/0x3a0
<4> [] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
<4> [] kthread+0x9e/0xc0
<4> [] child_rip+0xa/0x20
<4> [] ? kthread+0x0/0xc0
<4> [] ? child_rip+0x0/0x20
<4>Code: 48 85 c9 0f 84 05 10 00 00 4d 85 ff 74 0a f6 45 8c 08 0f 84 33
07 00 00 45 31 ed 4c 89 e8 66 2e 0f 1f 84 00 00 00 00 00 49 63 de <44>
0f b7 49 02 48 8d 14 dd 00 00 00 00 49 89 df 49 c1 e7 06 49
<1>RIP  [] ldiskfs_ext_insert_extent+0xb3/0x10c0 [ldiskfs]
<4> RSP 
[root@lustre3 127.0.0.1-2020-03-11-19:14:18]#


I still have to read up on how to load the vmcore into gdb (I am not
experienced with kernel debugging).
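
From what I have read so far, the usual tool for this is crash(8) rather than
plain gdb; roughly something like the following, assuming I can find the
kernel-debuginfo package matching our Lustre kernel (untested on my side yet):

  yum install crash
  # needs the debuginfo vmlinux that matches the running kernel
  crash /usr/lib/debug/lib/modules/2.6.32-573.12.1.el6_lustre.x86_64/vmlinux \
        /var/crash/127.0.0.1-2020-03-11-19:14:18/vmcore
  # inside crash: "bt" prints the backtrace of the panicking task, "log" the ring buffer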

But if you already have a guess from reading the trace, I would be very happy
to take any advice.

By the way: this time we mounted the OSTs exactly the other way round from
usual, and now the other machine crashed, so it seems to have something to do
with the content on the LUNs rather than being a server hardware problem.

Thanks again

  Torsten


-- 
Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
Gaussstr. 20  Fax : +49 (0)202 439-2811
42097 Wuppertal




Re: [lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-10 Thread Torsten Harenberg
Dear all,

thanks for all of your replies.

On 09.03.20 at 13:32, Andreas Dilger wrote:
> It would be better to run a full e2fsck, since that not only rebuilds
> the quota tables, but also ensures that the values going into the quota
> tables are correct.  Since the time taken by "tune2fs -O quota" is
> almost the same as running e2fsck, it is better to do it the right way.

We already ran e2fsck -f on all LUNs after every crash, so it seems that
was all we could do, right?

During the last days (since Thursday), our Lustre instance was
surprisingly stable. We lowered the load a bit by limiting the number of
running jobs, which might also have helped to stabilize the system.

We enabled kdump, so if another crash happens anytime soon, we hope
to at least get a dump that hints at where the problem is.
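
For reference, enabling kdump on CentOS 6 boils down to the following (the
crashkernel reservation size depends on the machine):

  yum install kexec-tools
  # add "crashkernel=128M" (or crashkernel=auto) to the kernel line in /boot/grub/grub.conf
  chkconfig kdump on
  # takes effect after a reboot with the crashkernel= parameter; dumps end up under /var/crash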

Thanks again

  Torsten

-- 
Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
Gaussstr. 20  Fax : +49 (0)202 439-2811
42097 Wuppertal





Re: [lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-09 Thread Andreas Dilger


On Mar 5, 2020, at 09:11, Mohr Jr, Richard Frank <rm...@utk.edu> wrote:



On Mar 5, 2020, at 2:48 AM, Torsten Harenberg <harenb...@physik.uni-wuppertal.de> wrote:

[QUOTA WARNING] Usage inconsistent for ID 2901:actual (757747712, 217)
!= expected (664182784, 215)

I assume you are running ldiskfs as the backend?  If so, have you tried
regenerating the quota info for the OST?  I believe the command is “tune2fs -O
^quota <device>” to clear quotas and then “tune2fs -O quota <device>” to
re-enable/regenerate them.  I don’t know if that would work, but it might be
worth a shot.

Just to clarify, the "tune2fs -O ^quota; tune2fs -O quota" trick is not really 
the best way to do this, even though this is widely circulated.

It would be better to run a full e2fsck, since that not only rebuilds the quota 
tables, but also ensures that the values going into the quota tables are 
correct.  Since the time taken by "tune2fs -O quota" is almost the same as 
running e2fsck, it is better to do it the right way.
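
Concretely, with the OST unmounted, something like this (the device path below
is only an example):

  e2fsck -fy /dev/mapper/ostNNNN
  # forced full check; it recomputes the per-ID usage and rewrites the quota files,
  # which is where the "[QUOTA WARNING] Usage inconsistent ..." messages come from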

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-05 Thread Mohr Jr, Richard Frank


> On Mar 5, 2020, at 2:48 AM, Torsten Harenberg wrote:
> 
> [QUOTA WARNING] Usage inconsistent for ID 2901:actual (757747712, 217)
> != expected (664182784, 215)

I assume you are running ldiskfs as the backend?  If so, have you tried
regenerating the quota info for the OST?  I believe the command is “tune2fs -O
^quota <device>” to clear quotas and then “tune2fs -O quota <device>” to
re-enable/regenerate them.  I don’t know if that would work, but it might be
worth a shot.
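
In other words, with the OST unmounted, roughly the following (the device name
is a placeholder):

  tune2fs -O ^quota /dev/mapper/ostNNNN   # drop the quota feature (removes the quota files)
  tune2fs -O quota  /dev/mapper/ostNNNN   # re-add it, which regenerates the usage tables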

—
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
University of Tennessee





Re: [lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-05 Thread Degremont, Aurelien
I understand you want to avoid deploying kdump, but you should focus on saving 
your console history somewhere. It will be difficult to help without the panic 
message.
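
One lightweight option is netconsole, which streams the kernel console to
another host over UDP; the addresses, ports and MAC below are placeholders:

  # on the OSS that panics
  modprobe netconsole netconsole=6665@10.0.0.31/eth0,6666@10.0.0.10/aa:bb:cc:dd:ee:ff
  # on the receiving host (exact netcat syntax depends on the variant installed)
  nc -u -l 6666 | tee oss-console.log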

For 2.10, maybe I was a bit optimistic. I think you should be able to build the
RPMs from sources, but pre-built packages are not available.
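
The build itself should look roughly like this on an el6 machine that already
has the matching Lustre-patched kernel and its kernel-devel tree installed
(untested for this exact combination, so treat it as a sketch):

  # source tarball from https://downloads.whamcloud.com/public/lustre/lustre-2.10.8/
  tar xzf lustre-2.10.8.tar.gz && cd lustre-2.10.8
  ./configure --with-linux=/usr/src/kernels/2.6.32-573.12.1.el6_lustre.x86_64
  make rpms   # should produce the lustre, lustre-modules and lustre-osd-ldiskfs server RPMs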


Aurélien

On 05/03/2020 11:47, « Torsten Harenberg » wrote:

Hi Aurélien,

thanks for your quick reply, really appreciate it.

On 05.03.20 at 10:20, Degremont, Aurelien wrote:
> - What is the exact error message when the panic happens? Could you
> copy/paste a few log messages from this panic message?

a running ssh shell only shows

kernel panic - not synced.

No trace, nothing. I will try to capture the console next time that happens.

> - Did you try searching for this pattern on jira.whamcloud.com, to see
> if this is an already-known bug?

Yes, though not deeply. It was difficult, as kdump is not installed on
that machine and I was a bit afraid to touch the system too deeply, since we
bought it pre-installed; the kdump tools are not installed and /var/crash is
empty, so I fear kdump was never configured :-( So all I know is
that the kernel panics :-(

> - It seems related to quota. Is disabling quota an option for you?

Not easily; the system is always full, and enforcing group quotas has so far
been our way of keeping the system from physically reaching 100%. But of course
we could talk to the users and ask them to behave nicely.

> - Lustre 2.10.8 supports CentOS 6 and was an LTS Lustre version; it got a
> lot more fixes and is very stable. It could be an easy upgrade path for you
> before getting your new system?

That would be great!! Where do you find the server RPMs for this
version? I couldn't find them on whamcloud:

https://downloads.whamcloud.com/public/lustre/lustre-2.10.8/

only contains server RPMs for el7 as far as I can see.

Thanks again

 Torsten


-- 
Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
Gaussstr. 20  Fax : +49 (0)202 439-2811
42097 Wuppertal





Re: [lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-05 Thread Torsten Harenberg
Hi Aurélien,

thanks for your quick reply, really appreciate it.

On 05.03.20 at 10:20, Degremont, Aurelien wrote:
> - What is the exact error message when the panic happens? Could you
> copy/paste a few log messages from this panic message?

a running ssh shell only shows

kernel panic - not synced.

No trace, nothing. I will try to capture the console next time that happens.

> - Did you try searching for this pattern on jira.whamcloud.com, to see if
> this is an already-known bug?

Yes, though not deeply. It was difficult, as kdump is not installed on
that machine and I was a bit afraid to touch the system too deeply, since we
bought it pre-installed; the kdump tools are not installed and /var/crash is
empty, so I fear kdump was never configured :-( So all I know is
that the kernel panics :-(

> - It seems related to quota. Is disabling quota an option for you?

Not easily; the system is always full, and enforcing group quotas has so far
been our way of keeping the system from physically reaching 100%. But of course
we could talk to the users and ask them to behave nicely.

> - Lustre 2.10.8 supports CentOS 6 and was an LTS Lustre version; it got a lot
> more fixes and is very stable. It could be an easy upgrade path for you
> before getting your new system?

That would be great!! Where do you find the server RPMs for this
version? I couldn't find them on whamcloud:

https://downloads.whamcloud.com/public/lustre/lustre-2.10.8/

only contains server RPMs for el7 as far as I can see.

Thanks again

 Torsten


-- 
Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
Gaussstr. 20  Fax : +49 (0)202 439-2811
42097 Wuppertal





Re: [lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-05 Thread Degremont, Aurelien
Hello Torsten,

- What is the exact error message when the panic happens? Could you copy/paste
a few log messages from this panic message?
- Did you try searching for this pattern on jira.whamcloud.com, to see if
this is an already-known bug?
- It seems related to quota. Is disabling quota an option for you?
- Lustre 2.10.8 supports CentOS 6 and was an LTS Lustre version; it got a lot
more fixes and is very stable. It could be an easy upgrade path for you before
getting your new system?

Aurélien

On 05/03/2020 08:49, « lustre-discuss on behalf of Torsten Harenberg » wrote:




Dear all,

I know it's daring to ask for help with such an old system.

We still run a CentOS 6 based Lustre 2.8.0 system
(kernel-2.6.32-573.12.1.el6_lustre.x86_64,
lustre-2.8.0-2.6.32_573.12.1.el6_lustre.x86_64.x86_64).

It's out of warranty and about to be replaced. The approval process for
the new grant took over a year, and we're currently preparing an EU-wide
tender; all of that has taken much more time than we expected.

The problem is:

One OSS server keeps running into kernel panics. The panic seems to be
related to one of the OSS mount points: if we mount that server's LUNs (all
data is on a 3PAR SAN) on a different server, that one panics, too.

We always run file system checks after such a panic, but these show only
the minor issues you would expect after a crashed machine, like:

[QUOTA WARNING] Usage inconsistent for ID 2901:actual (757747712, 217)
!= expected (664182784, 215)

We would love to avoid upgrading these old machines to CentOS 7, but by now
these crashes happen really often; yesterday it panicked after only 30 minutes.

Now we're running out of ideas.

If anyone has an idea how we could identify the source of the problem,
we would really appreciate it.

Kind regards

  Torsten


--
Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
Gaussstr. 20  Fax : +49 (0)202 439-2811
42097 Wuppertal





[lustre-discuss] old Lustre 2.8.0 panicking continuously

2020-03-04 Thread Torsten Harenberg
Dear all,

I know it's daring to ask for help with such an old system.

We still run a CentOS 6 based Lustre 2.8.0 system
(kernel-2.6.32-573.12.1.el6_lustre.x86_64,
lustre-2.8.0-2.6.32_573.12.1.el6_lustre.x86_64.x86_64).

It's out of warranty and about to be replaced. The approval process for
the new grant took over a year, and we're currently preparing an EU-wide
tender; all of that has taken much more time than we expected.

The problem is:

One OSS server keeps running into kernel panics. The panic seems to be
related to one of the OSS mount points: if we mount that server's LUNs (all
data is on a 3PAR SAN) on a different server, that one panics, too.

We always run file system checks after such a panic, but these show only
the minor issues you would expect after a crashed machine, like:

[QUOTA WARNING] Usage inconsistent for ID 2901:actual (757747712, 217)
!= expected (664182784, 215)

We would love to avoid upgrading these old machines to CentOS 7, but by now
these crashes happen really often; yesterday it panicked after only 30 minutes.

Now we're running out of ideas.

If anyone has an idea how we could identify the source of the problem,
we would really appreciate it.

Kind regards

  Torsten


-- 
Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
Gaussstr. 20  Fax : +49 (0)202 439-2811
42097 Wuppertal


