Re: [CentOS] weird XFS problem

2012-01-22 Thread Boris Epstein
On Sun, Jan 22, 2012 at 9:06 AM, Boris Epstein borepst...@gmail.com wrote:

 Hello all,

 I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
 backups. The backups are run via rsync/rsnapshot and are large in terms of
 the number of files: over 10 million each.

 Now the machine is not particularly powerful: it is a 64-bit machine, dual
 core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
 following problem: once in a while that XFS partition starts generating
 multiple I/O errors, files that had content become 0 bytes, directories
 disappear, etc. Every time a reboot fixes that, however. So far I've looked
 at the logs but could not find a cause or precipitating event.

 Hence the question: has anyone experienced anything along those lines?
 What could be the cause of this?

 Thanks.

 Boris.


Correction to the above: the XFS partition is 26 TB, not 16 TB (not that it
should matter in the context of this particular situation).

Also, here's something else I have discovered. Apparently there is a
potential intermittent RAID disk problem. At least I found the following in
the system log:

Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
Drive ECC error reported:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
Source drive error occurred:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004):
Rebuild failed:unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B):
Rebuild paused:unit=0.

...

Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
(0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
(0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B):
Rebuild started:unit=0.
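
For what it's worth, the current state of the array can be checked from the
running system with 3ware's tw_cli utility, roughly like this (a sketch - the
/c6 controller ID is only a guess based on the scsi6 prefix in the log, so use
whatever "tw_cli show" actually reports):

tw_cli show              # list the controllers the CLI can see
tw_cli /c6 show          # unit and port status for the card
tw_cli /c6 show alarms   # recent AEN / alarm history
tw_cli /c6/u0 show       # detail for unit 0, including any rebuild progress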

Even if a disk is misbehaving, in a RAID6 that should not be causing I/O
errors. Plus, why does the problem never appear straight after a reboot,
and why is it always fixed by a reboot?

Be that as it may, I am still puzzled.

Boris.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Miguel Medalha

 Now the machine is not particularly powerful: it is a 64-bit machine, dual
 core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
 following problem: once in a while that XFS partition starts generating
 multiple I/O errors, files that had content become 0 bytes, directories
 disappear, etc. Every time a reboot fixes that, however. So far I've looked
 at the logs but could not find a cause or precipitating event.

Is the CentOS you are running a 64 bit one?

I am asking because the use of XFS under a 32-bit OS is NOT recommended.
If you search this list's archives you will find some discussion of
this subject.



Re: [CentOS] weird XFS problem

2012-01-22 Thread Miguel Medalha
 Correction to the above: the XFS partition is 26TB, not 16 TB (not that it
 should matter in the context of this particular situation).

Yes, it does matter:

Read this:

*[CentOS] 32-bit kernel+XFS+16.xTB filesystem = potential disaster*
http://lists.centos.org/pipermail/centos/2011-April/109142.html


Re: [CentOS] weird XFS problem

2012-01-22 Thread Boris Epstein
On Sun, Jan 22, 2012 at 2:27 PM, Miguel Medalha miguelmeda...@sapo.pt wrote:

 Correction to the above: the XFS partition is 26TB, not 16 TB (not that it
 should matter in the context of this particular situation).


 Yes, it does matter:

 Read this:

 *[CentOS] 32-bit kernel+XFS+16.xTB filesystem = potential disaster*
 http://lists.centos.org/pipermail/centos/2011-April/109142.html


Miguel,

Thanks, but based on the uname output:

uname -a
Linux nrims-bs 2.6.18-274.12.1.el5xen #1 SMP Tue Nov 29 14:18:21 EST 2011
x86_64 x86_64 x86_64 GNU/Linux

this is clearly a 64-bit OS so the 32-bit limitations ought not to apply.

Boris.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Miguel Medalha


 uname -a
 Linux nrims-bs 2.6.18-274.12.1.el5xen #1 SMP Tue Nov 29 14:18:21 EST 
 2011 x86_64 x86_64 x86_64 GNU/Linux

 this is clearly a 64-bit OS so the 32-bit limitations ought not to apply.


Ok! Since you didn't inform us in your initial post, I thought I should 
ask you in order to eliminate that possible cause.



Re: [CentOS] weird XFS problem

2012-01-22 Thread Miguel Medalha

Nevertheless, it seems to me that you should have more than 3GB of RAM 
on a 64 bit system...
Since the width of the binary word is 64 bit in this case, 3GB 
correspond to 1.5GB on a 32 bit system...
If you have a 64 bit system you should give it space to work properly.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Miguel Medalha

 Nevertheless, it seems to me that you should have more than 3GB of RAM
 on a 64 bit system...
 Since the width of the binary word is 64 bit in this case, 3GB
 correspond to 1.5GB on a 32 bit system...
 If you have a 64 bit system you should give it space to work properly.

... and the fact that a reboot seems to fix the problem could also point 
in that direction.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Boris Epstein
On Sun, Jan 22, 2012 at 2:35 PM, Miguel Medalha miguelmeda...@sapo.pt wrote:


 Nevertheless, it seems to me that you should have more than 3GB of RAM on
 a 64 bit system...
 Since the width of the binary word is 64 bit in this case, 3GB correspond
 to 1.5GB on a 32 bit system...
 If you have a 64 bit system you should give it space to work properly.


Don't worry, you asked exactly the right question - but, unfortunately, it
is not a 32-bit OS that is the culprit here, so the situation is more
involved than that.

You are right - it would indeed be desirable to have more than 3 GB of RAM
on that system. However, it is not obvious to me why having that little RAM
should cause I/O failures. That it would make the machine slow is to be
expected - especially given that I had to jack the swap up to some 40 GB.
But I do not see why I should get outright failures due solely to not
having more RAM.
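
For reference, whether the box is actually under memory pressure when the
errors start can be checked with the standard tools - nothing here is
specific to this machine:

free -m       # RAM and swap usage, in megabytes
swapon -s     # how much of the 40 GB of swap is really in use
vmstat 5 5    # sustained non-zero si/so columns mean the box is actively swapping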

Boris.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Boris Epstein
On Sun, Jan 22, 2012 at 2:37 PM, Miguel Medalha miguelmeda...@sapo.pt wrote:


  Nevertheless, it seems to me that you should have more than 3GB of RAM
 on a 64 bit system...
 Since the width of the binary word is 64 bit in this case, 3GB
 correspond to 1.5GB on a 32 bit system...
 If you have a 64 bit system you should give it space to work properly.


 ... and the fact that a reboot seems to fix the problem could also point
 in that direction.


That is entirely possible. It does seem to me that some sort of resource
accumulation is indeed occurring on the system - and I hope there is a way
to stop it, because filesystem I/O should be a self-balancing process.
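
One rough way to catch that kind of accumulation is to log a few kernel
memory counters at intervals and see whether anything grows steadily between
reboots (the log path below is arbitrary):

# append selected /proc/meminfo counters every 10 minutes
while true; do
    date >> /var/tmp/meminfo.log
    grep -E 'MemFree|Buffers|Cached|Dirty|Slab' /proc/meminfo >> /var/tmp/meminfo.log
    sleep 600
done

slabtop can be used interactively in the same spirit, to watch the kernel
slab caches (inodes, dentries, XFS structures) directly.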

Boris.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Miguel Medalha



 You are right - it would indeed be desirable to have more than 3 GB of 
 RAM on that system. However it is not obvious to me that having that 
 little RAM should cause I/O failure? Why? That it would make the 
 machine slow is to be expected - and especially so given that I had to 
 jack the swap up to some 40 GB. But I do not necessarily see why I 
 should have outright failures due solely to not having more RAM.


If I were you, I would be monitoring the system's memory usage. Maybe
some software component has a memory leak which keeps worsening until a
reboot cleans it.
Also, I wouldn't rule out the possibility of a physical memory problem.
Can you test it?
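
Memtest86+ from boot media is the thorough way, and it does require downtime.
If the machine cannot be taken down right away, a userspace tester such as
memtester (assuming it is installed or can be built for CentOS 5) can at
least exercise part of the RAM while the system is up - the size and pass
count below are arbitrary:

# test 2 GB of RAM for one pass; run as root so the memory can be locked
memtester 2G 1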


Re: [CentOS] weird XFS problem

2012-01-22 Thread Boris Epstein
On Sun, Jan 22, 2012 at 2:43 PM, Miguel Medalha miguelmeda...@sapo.pt wrote:




 You are right - it would indeed be desirable to have more than 3 GB of
 RAM on that system. However it is not obvious to me that having that little
 RAM should cause I/O failure? Why? That it would make the machine slow is
 to be expected - and especially so given that I had to jack the swap up to
 some 40 GB. But I do not necessarily see why I should have outright
 failures due solely to not having more RAM.


 If I were you, I would be monitoring the system's memory usage. Maybe some
 software component has a memory leak which keeps worsening until a reboot
 cleans it.
 Also, I wouldn't discard the possibility of a physical memory problem. Can
 you test it?


Miguel, thanks!

All that you are saying makes perfect sense. I have tried monitoring the
system to see if any memory hogs emerge and have found no obvious culprits
thus far. I.e., there are processes running that consume large volumes of
RAM, but none that seem to keep growing over time - or at least I have
failed to locate such processes so far.
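
For reference, a quick way to rank processes by resident memory, so that
snapshots taken hours apart can be compared, is something like:

# top 15 processes by resident set size
ps aux --sort=-rss | head -15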

As for testing the RAM - that is always worth doing when in doubt. Too bad
you have to take the machine down in order to do it, and for that reason I
haven't done it yet - though it is on the short list of things to try.

Boris.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Joseph L. Casale
I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
backups. The backups are run via rsync/rsnapshot and are large in terms of
the number of files: over 10 million each.

Now the machine is not particularly powerful: it is a 64-bit machine, dual
core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
following problem: once in a while that XFS partition starts generating
multiple I/O errors, files that had content become 0 bytes, directories
disappear, etc. Every time a reboot fixes that, however. So far I've looked
at the logs but could not find a cause or precipitating event.

Hence the question: has anyone experienced anything along those lines? What
could be the cause of this?

In every situation like this that I have seen, it was hardware that never had
adequate memory provisioned.

Another consideration is that you almost certainly won't be able to run a
repair on that fs with so little RAM.

Finally, it would be interesting to know how you architected the storage
hardware: hardware RAID, BBC, drive cache status, barrier status, etc.
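
For reference, most of that can be gathered with something along these lines
(the controller and unit IDs are guesses - substitute whatever tw_cli
reports, and check the CLI guide for the exact attribute names supported by
that firmware):

grep xfs /proc/mounts       # mount options, e.g. whether nobarrier is set
dmesg | grep -i barrier     # XFS logs a message if it has to disable barriers
tw_cli /c6/u0 show          # unit type and status
tw_cli /c6/u0 show cache    # controller write-cache setting for the unit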


Re: [CentOS] weird XFS problem

2012-01-22 Thread Boris Epstein
On Sun, Jan 22, 2012 at 2:56 PM, Joseph L. Casale jcas...@activenetwerx.com
 wrote:

 I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
 backups. The backups are run via rsync/rsnapshot and are large in terms of
 the number of files: over 10 million each.
 
 Now the machine is not particularly powerful: it is a 64-bit machine, dual
 core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
 following problem: once in a while that XFS partition starts generating
 multiple I/O errors, files that had content become 0 bytes, directories
 disappear, etc. Every time a reboot fixes that, however. So far I've
 looked at the logs but could not find a cause or precipitating event.
 
 Hence the question: has anyone experienced anything along those lines?
 What
 could be the cause of this?

 In every situation like this that I have seen, it was hardware that never
 had
 adequate memory provisioned.

 Another consideration is that you almost certainly won't be able to run a
 repair on that fs with so little RAM.

 Finally, it would be interesting to know how you architected the storage
 hardware.
 Hardware raid, BBC, drive cache status, barrier status etc...


Joseph,

If I remember correctly I pretty much went with the defaults when I created
this XFS on top of a 16-drive RAID6 configuration.

Now as far as memory goes - I think for the purposes of xfs_repair, RAM and
swap ought to be interchangeable, and I've got plenty of swap on this
system. I also host a 5 TB XFS in a file there; I ran xfs_repair on it and
it completed within no more than 5 minutes. That filesystem is roughly 20%
of the size of the larger one.
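
One low-risk way to gauge whether a repair of the big filesystem is even
feasible with the current RAM plus swap is a dry run in no-modify mode, with
the filesystem unmounted (the mount point and device name below are made up -
substitute the real ones):

umount /backups           # hypothetical mount point
xfs_repair -n /dev/sdb1   # -n checks only, nothing is written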

I should try to collect the info you mentioned, though - that was a good
thought, some clue might be contained in there for sure.

Thanks for your input.

Boris.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Ross Walker
On Jan 22, 2012, at 10:00 AM, Boris Epstein borepst...@gmail.com wrote:

 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
 Drive ECC error reported:port=4, unit=0.
 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
 Source drive error occurred:port=4, unit=0.
 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004):
 Rebuild failed:unit=0.
 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B):
 Rebuild paused:unit=0.

From 3ware's site:
004h Rebuild failed

The 3ware RAID controller was unable to complete a rebuild operation. This 
error can be caused by drive errors on either the source or the destination of 
the rebuild. However, due to ATA drives' ability to reallocate sectors on write 
errors, the rebuild failure is most likely caused by the source drive of the 
rebuild detecting some sort of read error. The default operation of the 3ware 
RAID controller is to abort a rebuild if an error is encountered. If it is 
desired to continue on error, you can set the Continue on Source Error During 
Rebuild policy for the unit on the Controller Settings page in 3DM.

026h Drive ECC error reported

This AEN may be sent when a drive returns the ECC error response to a 3ware 
RAID controller command. The AEN may or may not be associated with a host 
command. Internal operations such as Background Media Scan post this AEN 
whenever drive ECC errors are detected.

Drive ECC errors are an indication of a problem with grown defects on a 
particular drive. For redundant arrays, this typically means that dynamic 
sector repair would be invoked (see AEN 023h). For non-redundant arrays (JBOD, 
RAID 0 and degraded arrays), drive ECC errors result in the 3ware RAID 
controller returning failed status to the associated host command.


Sounds awfully like a hardware error on one of the drives. Replace the failed 
drive and try rebuilding.
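
If it does come to swapping the disk, the tw_cli sequence is roughly the one
below - a sketch from memory of the 3ware CLI guide, so double-check the
syntax against the documentation for that controller and firmware; the /c6,
p4 and u0 numbers are simply taken from the log above:

tw_cli /c6/p4 remove                 # take the suspect drive offline
# physically replace the drive on port 4, then:
tw_cli /c6 rescan                    # detect the new drive
tw_cli /c6/u0 start rebuild disk=4   # rebuild unit 0 onto the new drive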

-Ross 


Re: [CentOS] weird XFS problem

2012-01-22 Thread Ross Walker
On Jan 22, 2012, at 4:41 PM, Ross Walker rswwal...@gmail.com wrote:

 On Jan 22, 2012, at 10:00 AM, Boris Epstein borepst...@gmail.com wrote:
 
 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
 Drive ECC error reported:port=4, unit=0.
 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
 Source drive error occurred:port=4, unit=0.
 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004):
 Rebuild failed:unit=0.
 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B):
 Rebuild paused:unit=0.
 
 From 3ware's site:
 004h Rebuild failed
 
 The 3ware RAID controller was unable to complete a rebuild operation. This 
 error can be caused by drive errors on either the source or the destination 
 of the rebuild. However, due to ATA drives' ability to reallocate sectors on 
 write errors, the rebuild failure is most likely caused by the source drive 
 of the rebuild detecting some sort of read error. The default operation of 
 the 3ware RAID controller is to abort a rebuild if an error is encountered. 
 If it is desired to continue on error, you can set the Continue on Source 
 Error During Rebuild policy for the unit on the Controller Settings page in 
 3DM.
 
 026h Drive ECC error reported
 
 This AEN may be sent when a drive returns the ECC error response to an 3ware 
 RAID controller command. The AEN may or may not be associated with a host 
 command. Internal operations such as Background Media Scan post this AEN 
 whenever drive ECC errors are detected.
 
 Drive ECC errors are an indication of a problem with grown defects on a 
 particular drive. For redundant arrays, this typically means that dynamic 
 sector repair would be invoked (see AEN 023h). For non-redundant arrays 
 (JBOD, RAID 0 and degraded arrays), drive ECC errors result in the 3ware RAID 
 controller returning failed status to the associated host command.
 
 Sounds awfully like a hardware error on one of the drives. Replace the failed 
 drive and try rebuilding.
 

This error code does not bode well.

02Dh Source drive error occurred

If an error is encountered during a rebuild operation, this AEN is generated if 
the error was on a source drive of the rebuild. Knowing if the error occurred 
on the source or the destination of the rebuild is useful for troubleshooting.



It's possible the whole RAID6 is corrupt.

-Ross




Re: [CentOS] weird XFS problem

2012-01-22 Thread Keith Keller
On 2012-01-22, Boris Epstein borepst...@gmail.com wrote:

 Also, here's something else I have discovered. Apparently there is a
 potential intermittent RAID disk problem. At least I found the following in
 the system log:

 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
 Drive ECC error reported:port=4, unit=0.
 Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
 Source drive error occurred:port=4, unit=0.

Which 3ware controller is this?  I have had lots of problems with the
3ware 9550SX controller and WD-EA[RD]S drives in a similar
configuration.  (Yes, I know all about the EARS drives, but they work
mostly fine with the 3ware 9650 controller, so I suspect some weird
interaction between the cheap drives and the old not-so-great
controller.  I also suspect an intermittently failing port, which I'll
be testing more later this week.)

 Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
 (0x04:0x000F): SMART threshold exceeded:port=9.
 Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
 (0x04:0x000F): SMART threshold exceeded:port=9.
 Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B):
 Rebuild started:unit=0.

What does your RAID look like?  Are you using the 3ware's RAID6 (in
which case it's not a 9550) or mdraid?  Are the 3ware errors in the logs
across a large number of ports or just a few?  Have you used the drive
tester for your drives to verify that they're still good?  On all my
other systems, when the controller has reported a failure, and I've run
it through the tester, it's reported a failure.  (Often when my 9550
reports a failure the drive passes all tests.)
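
As an aside, smartmontools can usually query the drives straight through the
3ware card while the system is running, which gives a second opinion next to
the controller's own verdict (the port number is just an example; /dev/twa0
is the usual device node for the 3w-9xxx driver):

smartctl -H -d 3ware,4 /dev/twa0   # quick health verdict for the drive on port 4
smartctl -a -d 3ware,4 /dev/twa0   # full SMART attributes and error log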

If you happen to have real RAID drive models, you may also try
contacting LSI support.  They will steadfastly refuse to help if you
have desktop-edition drives, but can be at least somewhat helpful if you
have enterprise drives.

--keith


-- 
kkel...@wombat.san-francisco.ca.us




Re: [CentOS] weird XFS problem

2012-01-22 Thread Boris Epstein
On Sun, Jan 22, 2012 at 1:34 PM, Keith Keller 
kkel...@wombat.san-francisco.ca.us wrote:

 On 2012-01-22, Boris Epstein borepst...@gmail.com wrote:
 
  Also, here's something else I have discovered. Apparently there is a
  potential intermittent RAID disk problem. At least I found the following in
  the system log:
 
  Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR
 (0x04:0x0026):
  Drive ECC error reported:port=4, unit=0.
  Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR
 (0x04:0x002D):
  Source drive error occurred:port=4, unit=0.

 Which 3ware controller is this?  I have had lots of problems with the
 3ware 9550SX controller and WD-EA[RD]S drives in a similar
 configuration.  (Yes, I know all about the EARS drives, but they work
 mostly fine with the 3ware 9650 controller, so I suspect some weird
 interaction between the cheap drives and the old not-so-great
 controller.  I also suspect an intermittently failing port, which I'll
 be testing more later this week.)

  Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
  (0x04:0x000F): SMART threshold exceeded:port=9.
  Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
  (0x04:0x000F): SMART threshold exceeded:port=9.
  Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B):
  Rebuild started:unit=0.

 What does your RAID look like?  Are you using the 3ware's RAID6 (in
 which case it's not a 9550) or mdraid?  Are the 3ware errors in the logs
 across a large number of ports or just a few?  Have you used the drive
 tester for your drives to verify that they're still good?  On all my
 other systems, when the controller has reported a failure, and I've run
 it through the tester, it's reported a failure.  (Often when my 9550
 reports a failure the drive passes all tests.)

 If you happen to have real RAID drive models, you may also try
 contacting LSI support.  They will steadfastly refuse to help if you
 have desktop-edition drives, but can be at least somewhat helpful if you
 have enterprise drives.

 --keith


 --
 kkel...@wombat.san-francisco.ca.us




Keith, thanks!

The RAID is done at the controller level. Yes, I believe the controller is
a 3ware 9xxx series - I don't recall the details right now.
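
I should be able to pin down the exact model without opening the case -
either of these ought to report it:

lspci | grep -i 3ware
tw_cli show     # lists each controller with its model number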

What are you referring to as drive tester?

Boris.


Re: [CentOS] weird XFS problem

2012-01-22 Thread Keith Keller
On 2012-01-22, Boris Epstein borepst...@gmail.com wrote:

 The RAID is on the controller level. Yes, I believe the controller is a
 3Ware 9xxx series - I don't recall the details right now.

The details are important in this context--the 9550 is the problematic
one (at least for me, though I've seen others with similar issues).  But
if it's a hardware RAID6, it's a later controller, as the 9550 doesn't
support RAID6.  I have had some issues with the WD-EARS drives with 96xx
controllers, but much less frequently.

 What are you referring to as drive tester?

Some drive vendors distribute their own bootable CD image, with which
you can run tests specific to their drives, which can return proper
error codes to help determine whether there is actually a problem on the
drive.  Seagate used to require you give them the diagnostic code their
tester returned in order for them to accept a drive for an RMA; I don't
think they do that any more, but they still distribute their tester.
But it's a good way to get another indicator of a problem; if both the
controller and the drive tester report an error, it's very likely that
you have a bad drive; if the tester says the drive is fine, and does
this for a few drives the controller reports as failed, you can suspect
something behind the drives as a problem.  (This is how I came to
suspect the 9550: it would say my drives had failed, but the WD tester
repeatedly said they were fine.)

The latest version of UBCD has the latest versions of these various
testers; I recall WD, Seagate, and Hitachi testers, and I'm pretty sure
there are others.
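
As a middle ground, the drives' own built-in self-tests can also be kicked
off through the controller without booting a tester CD - again assuming the
3w-9xxx device node, with an example port number:

smartctl -t long -d 3ware,9 /dev/twa0       # start the drive's extended self-test
# wait for the duration smartctl estimates, then:
smartctl -l selftest -d 3ware,9 /dev/twa0   # read back the self-test results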

--keith

-- 
kkel...@wombat.san-francisco.ca.us

