Re: [Lustre-discuss] Metadata storage in test script files

2012-05-07 Thread Chris Gearing
On Mon, May 7, 2012 at 7:33 PM, Nathan Rutman  wrote:

>
> On May 4, 2012, at 7:46 AM, Chris Gearing wrote:
>
> > Hi Roman,
> >
> > I think we may have rat-holed here and perhaps it's worth just
> > re-stating what I'm trying to achieve here.
> >
> > We have a need to be able to test in a more directed and targeted
> > manner, to be able to focus on a unit of code like lnet or an attribute
> > of capability like performance. However, since starting work on the
> > Lustre test infrastructure it has become clear to me that knowledge
> > about the capability, functionality and purpose of individual tests is
> > very general and held only in the heads of Lustre engineers. Because we
> > are talking about targeting tests, we require knowledge about the
> > capability, functionality and purpose of the tests, not the outcome of
> > running the tests; or, to put it another way, what the tests can do,
> > not what they have done.
> >
> > One key fact about cataloguing the capabilities of the tests is that
> > for almost every imaginable case the capability of a test only changes
> > if the test itself changes, so the rate of change of the data in the
> > catalogue is at most the same as, and actually much less than, the rate
> > of change of the test code itself. The only exception to this could be
> > that a test suddenly discovers a new bug which has to have a new ticket
> > attached to it, although this should be very rare if we manage our
> > development process properly.
> >
> > This requirement leads to the conclusion that we need to catalogue all
> > of the tests within the current test-framework, and a catalogue equates
> > to a database; hence we need a database of the capability, functionality
> > and purpose of the individual tests. With this requirement in mind it
> > would be easy to create a database using something like mysql that could
> > be used by applications like the Lustre test system, but an approach
> > like that would make the database very difficult to share and even
> > harder to attach to the Lustre tree, which is where it belongs.
> >
> > So the question I want to solve is how to catalogue the capabilities of
> > the individual tests in a database, store that data as part of the
> > Lustre source and, as a bonus, make the data readable and even carefully
> > editable by people as well as machines. To focus on the last point, I
> > do not think we should constrain ourselves to something that can be read
> > by machine using just bash; we do have access to structured languages
> > and should make use of that fact.
> >
> I think we all agree 100% on the above...
>
> > The solution to all of this seemed to be to store the catalogue about
> > the tests as part of the tests themselves
> ... but not necessarily that conclusion.
>
>
> , this provides for human and
> > machine accessibility, implicit version control and certainty that
> > whatever happens to the Lustre source the data goes with it. It is also
> > the case that by keeping the catalogue with its subject, maintenance of
> > the catalogue is more likely to happen than if the two are separate.
>
> I agree with all those.  But there are some difficulties with this as well:
> 1. bash isn't a great language to encapsulate this metadata
>

The thing to focus on, I think, is the data captured, not the format. A
parser for YAML encapsulated in the source, or anywhere else, is a small
amount of effort compared to capturing the data in the first place. If we
capture the data and it's machine readable, then changing the format is easy.

There are many advantages today to keeping the source and the metadata in
the same place, one being that when reviewing new or updated tests the
reviewers can, and because of the locality will be encouraged to, make sure
the metadata matches the new or revised test. If the two are not together
then they have very little chance of being kept in sync.
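
For illustration only, an embedded metadata block and the extraction step for
tooling could look something like the sketch below; the TEST_METADATA
delimiter, the Keywords field and the file name are placeholders of mine, not
the proposed format:

: <<'TEST_METADATA'
Name:
  before_upgrade_create_data
Summary:
  Copies lustre source into a node specific directory
Keywords:
  - upgrade
  - slow
TEST_METADATA
test_before_upgrade_create_data() {
    :   # test body goes here
}

# pull every embedded block out of a suite (e.g. conf-sanity.sh) into one YAML stream
sed -n "/<<'TEST_METADATA'/,/^TEST_METADATA\$/p" conf-sanity.sh | grep -v TEST_METADATA

Because the block is a no-op here-doc, bash ignores it at run time while any
YAML-aware tool can consume the extracted stream.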

> 2. this further locks us into the current test implementation - there's not
> much possibility to start writing tests in another language if we're
> parsing through looking for bash-formatted metadata. Sure, multiple parsers
> could be written...
>

I don't think it is a lock-in at all; the data is machine readable, and
moving to a new format, when and if we need it, will be easy. Let's focus
on capturing the data so we increase our knowledge; once we have the data
we can manipulate it however we want. Keeping the data and the metadata
together, in my opinion, increases the chance of the data being captured
and kept up to date, given today's methods and tools.

> 3. difficulty changing md of groups of tests en masse - e.g. add "slow"
> keyword to a set of tests
>

The data can be read and written by machine, and the libraries/applications
to do this would be written. Referring back to the description of the
metadata, we would not be making sweeping changes to test metadata, because
the metadata should only change when the test changes [exceptions will
always apply, but we should not optimize for exceptions].
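
As a purely illustrative example of that kind of machine processing, and
building on the placeholder TEST_METADATA sketch above, listing every test
tagged "slow" could be a one-liner:

sed -n "/<<'TEST_METADATA'/,/^TEST_METADATA\$/p" conf-sanity.sh \
  | awk '/^Name:/ { getline; name = $1 } /- slow/ { print name }'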

Re: [Lustre-discuss] recovery from multiple disks failure on the same md

2012-05-07 Thread Mark Hahn
> I'd also recommend starting periodic scrubbing: we do this once per month
> with low priority (~5 MB/s) with little impact on the users.

Yes. And if you think a rebuild might overstress marginal disks,
throttling via the dev.raid.speed_limit_max sysctl can help.
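
Something like this (the values are only examples; the sysctls take KB/s):

# cap md resync/rebuild bandwidth; purely illustrative values
sysctl -w dev.raid.speed_limit_max=5000
sysctl -w dev.raid.speed_limit_min=1000
# add the same keys to /etc/sysctl.conf to make the limits persistent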


Re: [Lustre-discuss] recovery from multiple disks failure on the same md

2012-05-07 Thread Adrian Ulrich
Hi,


> An OST (RAID 6: 8+2, 1 spare) had 2 disk failures at almost the same time.
> While recovering it, another disk failed, so the recovery procedure seems
> to have halted,

So did the md array stop itself on the 3rd disk failure (or at least turn
read-only)?

If it did you might be able to get it running again without catastrophic 
corruption.


This is what I would try (without any warranty!); a command-level sketch
follows the list:


 -> Forget about the 2 syncing spares

 -> Take the 3rd failed disk and attach it to some PC

 -> Copy as much data as possible to a new spare using dd_rescue
(-r might help)

 -> Put the drive with the fresh copy (= the good, new drive) into the array 
and assemble + start it.
Use --force if mdadm complains about outdated metadata.
(and starting it as 'readonly' for now would also be a good idea)

 -> Add a new spare to the array and sync it as fast as possible to get at 
least 1 parity disk.

 -> Run 'fsck -n /dev/mdX' to see how badly damaged your filesystem is.
If you think that fsck can fix the errors (and will not cause more
damage), run it without '-n'.

 -> Add the 2nd parity disk, sync it, mount the filesystem and pray.
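
Roughly, in commands (device names other than md12 are placeholders; adapt
carefully before running anything):

# 1. copy whatever is still readable from the 3rd failed disk onto a new one
dd_rescue /dev/sd_bad /dev/sd_copy       # '-r' retries the tail in reverse if needed
# 2. force-assemble the array read-only, listing every surviving member plus the fresh copy
mdadm --assemble --force --readonly /dev/md12 /dev/sd_copy /dev/sd_m1 /dev/sd_m2
# 3. once it looks sane, go read-write and add one spare to get a parity disk back
mdadm --readwrite /dev/md12
mdadm --add /dev/md12 /dev/sd_spare
# 4. read-only filesystem check first; only drop '-n' if the damage looks fixable
fsck -n /dev/md12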


The amount of data corruption will be linked to the success of dd_rescue: You 
are probably lucky if it only failed to read a few sectors.


And I agree with Kevin:

If you have a support contract: ask them to fix it.
(...and if you have enough hardware + time: create a backup of ALL drives in
the failed RAID via 'dd' before touching anything!)


I'd also recommend starting periodic scrubbing: we do this once per month with
low priority (~5 MB/s) with little impact on the users.


Regards and good luck,
 Adrian


Re: [Lustre-discuss] Metadata storage in test script files

2012-05-07 Thread Nathan Rutman

On May 4, 2012, at 7:46 AM, Chris Gearing wrote:

> Hi Roman,
> 
> I think we may have rat-holed here and perhaps it's worth just 
> re-stating what I'm trying to achieve here.
> 
> We have a need to be able to test in a more directed and targeted
> manner, to be able to focus on a unit of code like lnet or an attribute
> of capability like performance. However, since starting work on the
> Lustre test infrastructure it has become clear to me that knowledge
> about the capability, functionality and purpose of individual tests is
> very general and held only in the heads of Lustre engineers. Because we
> are talking about targeting tests, we require knowledge about the
> capability, functionality and purpose of the tests, not the outcome of
> running the tests; or, to put it another way, what the tests can do,
> not what they have done.
> 
> One key fact about cataloguing the capabilities of the tests is that
> for almost every imaginable case the capability of a test only changes
> if the test itself changes, so the rate of change of the data in the
> catalogue is at most the same as, and actually much less than, the rate
> of change of the test code itself. The only exception to this could be
> that a test suddenly discovers a new bug which has to have a new ticket
> attached to it, although this should be very rare if we manage our
> development process properly.
> 
> This requirement leads to the conclusion that we need to catalogue all
> of the tests within the current test-framework, and a catalogue equates
> to a database; hence we need a database of the capability, functionality
> and purpose of the individual tests. With this requirement in mind it
> would be easy to create a database using something like mysql that could
> be used by applications like the Lustre test system, but an approach
> like that would make the database very difficult to share and even
> harder to attach to the Lustre tree, which is where it belongs.
> 
> So the question I want to solve is how to catalogue the capabilities of
> the individual tests in a database, store that data as part of the
> Lustre source and, as a bonus, make the data readable and even carefully
> editable by people as well as machines. To focus on the last point, I
> do not think we should constrain ourselves to something that can be read
> by machine using just bash; we do have access to structured languages
> and should make use of that fact.
> 
I think we all agree 100% on the above...

> The solution to all of this seemed to be to store the catalogue about 
> the tests as part of the tests themselves
... but not necessarily that conclusion.

> , this provides for human and
> machine accessibility, implicit version control and certainty that
> whatever happens to the Lustre source the data goes with it. It is also
> the case that by keeping the catalogue with its subject, maintenance of
> the catalogue is more likely to happen than if the two are separate.

I agree with all those.  But there are some difficulties with this as well:
1. bash isn't a great language to encapsulate this metadata
2. this further locks us into the current test implementation - there's not much
possibility to start writing tests in another language if we're parsing through
looking for bash-formatted metadata. Sure, multiple parsers could be written...
3. difficulty changing md of groups of tests en masse - e.g. add "slow" keyword
to a set of tests
4. no inheritance of characteristics - each test must explicitly list every
piece of md.  This not only blows up the amount of md, it is also a source of
typos, etc. causing problems.
5. no automatic modification of characteristics.  In particular, one piece of
md I would like to see is "maximum allowed test time" for each test.  Ideally,
this could be measured and adjusted automatically based on historical and
ongoing run data.  But it would be dangerous to allow automatic modification of
the script itself.

To address those problems, I think a database-type approach is exactly right,
or perhaps a YAML file with hierarchical inheritance.
To some degree, this is an "evolution vs. revolution" question, and I prefer to
come down on the side of the revolution-enabling design, despite the problems
you list.  Basically, I believe the separated MD model allows for the
replacement of test-framework, and this, to my mind, is the main driver for
adding the MD at all.
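
For example, a separate catalogue file with a defaults section might look
something like the sketch below; the file name, keys and values are
placeholders, not a proposed schema:

cat > test-catalogue.yaml <<'EOF'
defaults:                    # inherited by every test unless overridden
  max_test_time: 600         # seconds; could be tuned from historical run data
  keywords: []
some-suite:
  test_1:
    summary: "placeholder description"
    keywords: [lnet]
  test_2:
    summary: "placeholder description"
    max_test_time: 1800      # per-test override
    keywords: [performance, slow]
EOF

A loader that merges the defaults into each entry would give the inheritance
and automatic adjustment of points 4 and 5 without ever touching the scripts.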


> 
> My original use of the term test metadata is intended as a more modern 
> term for catalogue or the [test] library.
> 
> So to refresh everybody's mind, I'd like to suggest that we place test
> metadata in the source code itself using the following format, where the
> here-doc is inserted into the copy above the test function itself.
> 
> ===
> < Name:
>   before_upgrade_create_data
> Summary:
>   Copies lustre source into a node specific directory and then

Re: [Lustre-discuss] recovery from multiple disks failure on the same md

2012-05-07 Thread Kevin Van Maren

On May 6, 2012, at 10:13 PM, Tae Young Hong wrote:

Hi,

I found a terrible situation on our Lustre system.
An OST (RAID 6: 8+2, 1 spare) had 2 disk failures at almost the same time.
While recovering it, another disk failed, so the recovery procedure seems to
have halted, and the spare disk which was in resync fell back into "spare"
status. (I guess the resync procedure was more than 95% finished.)
Right now we have just 7 disks for this md. Is there any possibility of
recovering from this situation?



It might be possible, but not something I've done.  If the array has not been
written to since a drive failed, you might be able to power-cycle the failed
drives (to reset the firmware) and force re-add them (without a rebuild)?  If
the array _has_ been modified (most likely), you could write a sector of zeros
to the bad sector, which will corrupt just that stripe, and force re-add the
last failed drive and attempt the rebuild again.
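
Roughly, the first option looks like this (device names are placeholders, and
only if the array really has not been written since the failure):

# stop the partially assembled array, then force-assemble with all members,
# including the last-failed disk, so no rebuild is needed
mdadm --stop /dev/md12
mdadm --assemble --force /dev/md12 /dev/sd_m1 /dev/sd_m2 /dev/sd_last_failed
cat /proc/mdstat    # confirm the array is back before mounting the OST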

Certainly if you have a support contract I'd recommend you get professional 
assistance.



Unfortunately, the failure mode you encountered is all too common.  Because the 
Linux SW RAID code does not read the parity blocks unless there is a problem, 
hard drive failures are NOT independent: drives appear to fail more often 
during a rebuild than at any other time.  The only way to work around this 
problem is to periodically do a "verify" of the MD array.

A verify allows a drive that is failing in the 20% of the space that contains
parity to fail _before_ the data becomes unreadable, rather than _after_ the
data becomes unreadable.  Don't do it on a degraded array, but it is a good way
to ensure healthy arrays are really healthy.

"echo check > /sys/block/mdX/md/sync_action" to force a verify.  Parity 
mis-matches will be reported (not corrected), but drive failures can be dealt 
with sooner, rather than letting them stack up.  Do "man md" and see the 
"sync_action" section.

Also note that Lustre 1.8.7 has a fix to the SW RAID code (corruption when 
rebuilding under load).  Oracle's release called the patch 
md-avoid-corrupted-ldiskfs-after-rebuild.patch, while Whamcloud called it 
raid5-rebuild-corrupt-bug.patch

Kevin



The following is the detailed log.
#1 the original configuration before any failure

 Number   Major   Minor   RaidDevice State
    0       8      176        0      active sync   /dev/sdl
    1       8      192        1      active sync   /dev/sdm
    2       8      208        2      active sync   /dev/sdn
    3       8      224        3      active sync   /dev/sdo
    4       8      240        4      active sync   /dev/sdp
    5      65        0        5      active sync   /dev/sdq
    6      65       16        6      active sync   /dev/sdr
    7      65       32        7      active sync   /dev/sds
    8      65       48        8      active sync   /dev/sdt
    9      65       96        9      active sync   /dev/sdw

   10      65       64        -      spare         /dev/sdu

#2 a disk(sdl) failed, and resync started after adding spare disk(sdu)
May  7 04:53:33 oss07 kernel: sd 1:0:10:0: SCSI error: return code = 0x0802
May  7 04:53:33 oss07 kernel: sdl: Current: sense key: Medium Error
May  7 04:53:33 oss07 kernel: Add. Sense: Unrecovered read error
May  7 04:53:33 oss07 kernel:
May  7 04:53:33 oss07 kernel: Info fld=0x74241ace
May  7 04:53:33 oss07 kernel: end_request: I/O error, dev sdl, sector 1948523214
... ...
May  7 04:54:15 oss07 kernel: RAID5 conf printout:
May  7 04:54:16 oss07 kernel:  --- rd:10 wd:9 fd:1
May  7 04:54:16 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:16 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:16 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:16 oss07 kernel:  disk 4, o:1, dev:sdp
May  7 04:54:16 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:16 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:16 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:16 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:16 oss07 kernel:  disk 9, o:1, dev:sdw
May  7 04:54:16 oss07 kernel: RAID5 conf printout:
May  7 04:54:16 oss07 kernel:  --- rd:10 wd:9 fd:1
May  7 04:54:16 oss07 kernel:  disk 0, o:1, dev:sdu
May  7 04:54:16 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:16 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:16 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:16 oss07 kernel:  disk 4, o:1, dev:sdp
May  7 04:54:16 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:16 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:16 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:16 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:16 oss07 kernel:  disk 9, o:1, dev:sdw
May  7 04:54:16 oss07 kernel: md: syncing RAID array md12


#3 another disk(sdp) failed
May  7 04:54:42 oss07 kernel: end_request: I/O error, dev sdp, sector 1949298688
May  7 04:54:42 oss07 kernel: mptbase: ioc1: LogInfo(0x3108): 
Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x)
May  7 04:54:4

Re: [Lustre-discuss] Not sure how we should configure our RAID arrays (HW limitation)

2012-05-07 Thread Kevin Van Maren
The 512K stripe size should be fine for Lustre, and 128KB per disk is enough to 
get good performance from the underlying hard drive.

I don't know anything about the E18s beyond what you've posted, so I can't
guess which configuration is more optimal. I would suggest you create the
RAID arrays, format the LUNs for Lustre, and run the Lustre iokit to see how
the various configurations perform (3 * 4+2, 2 * 8+1, 2 * 7+2).  Then please
post the results (with mkfs, etc. command lines) here so others can benefit
from your experiments and/or suggest additional tunings.
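
For example, for the 3 * 4+2 layout with a 128KB per-disk chunk, something like
the following could be a starting point (fsname, MGS NID and device are
placeholders; stride = 128KB / 4KB blocks = 32, stripe-width = 32 * 4 data
disks = 128):

mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 \
  --mkfsoptions="-E stride=32,stripe-width=128" /dev/md0

Then obdfilter-survey (or sgpdd-survey against the raw LUNs) from the iokit
gives numbers that can be compared across the three layouts.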

Kevin


On May 4, 2012, at 3:14 PM, Frank Riley wrote:

>> How about doing 3 4+2 RAIDs?  12 usable disks, instead of 14 or 16, but still
>> better than 8 with RAID1.  Doing 4*128KB, resulting in 2 full-stripe writes 
>> for
>> each 1MB IO is not that bad.
> 
> Yes, of course. I had thought of this option earlier but forgot to include 
> it. Thanks for reminding me. So using a stripe width of 512K will not harm 
> performance that much? Note also that the E18s have two active/active 
> controllers in them so that means one controller will be handling I/O 
> requests for 2 arrays, which will reduce performance somewhat. Would this 
> affect your decision between 3 4+2 (512K) or 2 7+2 (896K)?




Re: [Lustre-discuss] Transparent Huge Pages?

2012-05-07 Thread Johann Lombardi
On Sat, May 05, 2012 at 12:07:15AM -0600, Andreas Dilger wrote:
> On 2012-05-03, at 8:30 AM, Kent Engström wrote:
> > A quick question: does the Lustre client version 1.8.x and/or 2.x
> > running on RHEL 6 support Transparent Huge Pages? Or do you need to turn
> > that feature off?
> 
> I don't know for sure, but you could check the kernel configs we use, in 
> lustre/kernel_patches/kernel_configs/ to see if it is enabled by default.

AFAIK, transparent huge pages support is enabled by default in the RHEL6
kernel, so we already have this feature enabled on RHEL6 Lustre clients.
That said, we had to disable page migration (see LU-130) since we need to
implement our own page migration handler.
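
To check or disable it on a client (the sysfs path varies; some RHEL6 kernels
expose it under redhat_transparent_hugepage instead):

cat /sys/kernel/mm/transparent_hugepage/enabled            # e.g. "[always] madvise never"
echo never > /sys/kernel/mm/transparent_hugepage/enabled   # disable at runtime if needed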

Cheers,
Johann
-- 
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com