Re: OSD and MON memory usage

2012-11-28 Thread Cláudio Martins

On Wed, 28 Nov 2012 13:00:17 -0800 Samuel Just sam.j...@inktank.com wrote:
> What replication level are you using?

 Hi,

 The replication level is 3.

Thanks

Cláudio



Re: OSD and MON memory usage

2012-11-28 Thread Cláudio Martins

On Wed, 28 Nov 2012 13:08:08 -0800 Samuel Just sam.j...@inktank.com wrote:
> Can you post the output of ceph -s?

 'ceph -s' right now gives

   health HEALTH_WARN 923 pgs degraded; 8666 pgs down; 9606 pgs peering; 7 pgs 
recovering; 406 pgs recovery_wait; 3769 pgs stale; 9606 pgs stuck inactive; 
3769 pgs stuck stale; 11052 pgs stuck unclean; recovery 121068/902868 degraded 
(13.409%); 4824/300956 unfound (1.603%); 2/18 in osds are down
   monmap e1: 1 mons at {0=193.136.128.202:6789/0}, election epoch 1, quorum 0 0
   osdmap e7669: 62 osds: 16 up, 18 in
pgmap v47643: 12480 pgs: 35 active, 1223 active+clean, 129 stale+active, 
321 active+recovery_wait, 198 stale+active+clean, 236 peering, 2 
active+remapped, 2 stale+active+recovery_wait, 6126 down+peering, 249 
active+degraded, 2 stale+active+recovering+degraded, 598 stale+peering, 7 
active+clean+scrubbing, 29 active+recovery_wait+remapped, 2067 
stale+down+peering, 618 stale+active+degraded, 52 
active+recovery_wait+degraded, 61 remapped+peering, 365 down+remapped+peering, 
2 stale+active+recovery_wait+degraded, 45 stale+remapped+peering, 108 
stale+down+remapped+peering, 5 active+recovering; 1175 GB data, 1794 GB used, 
25969 GB / 27764 GB avail; 121068/902868 degraded (13.409%); 4824/300956 
unfound (1.603%)
   mdsmap e1: 0/0/1 up



 The cluster has been in this state since the last attempt to get it
going. I added about 100GB of swap on each machine to avoid the OOM
killer. Running like this resulted in the machines thrashing wildly and
reaching a load average of ~2000, and after a while the OSDs started
dying/committing suicide, but *not* from OOM. Some of the few that remain
have bloated to around 1.9GB of memory usage.

 If you want, I can try to restart the whole thing tomorrow and collect
fresh log output from the dying OSDs, or any other action or debug info
that you might find useful.


Thanks!

Cláudio



Re: OSD and MON memory usage

2012-11-28 Thread Cláudio Martins

On Thu, 29 Nov 2012 00:13:25 +0100 Sylvain Munaut 
s.mun...@whatever-company.com wrote:
> Hi,
>
>>  If you want, I can try to restart the whole thing tomorrow and collect
>> fresh log output from the dying OSDs, or any other action or debug info
>> that you might find useful.
>
> Is the clock synchronized on all machines ?
>


 Yup. All machines synched by ntp.

Cláudio


> What you describe (growing mem, recovery that doesn't seem to end)
> seems pretty similar to what I experienced when clocks of OSD were off
> ...
>



Re: OSD and MON memory usage

2012-11-27 Thread Cláudio Martins

On Fri, 23 Nov 2012 16:46:00 + Joao Eduardo Luis joao.l...@inktank.com 
wrote:
> On 11/16/2012 05:24 PM, Cláudio Martins wrote:
>>  As for the monitor daemon on this cluster (running on a dedicated
>> machine), it is currently using 3.2GB of memory, and it got to that
>> point again in a matter of minutes after being restarted. Would it be
>> good if we tested with the changes from the wip-mon-leaks-fix branch?
>
> Following up on this, wip-mon-leaks-fix was merged into master a couple
> of days ago. If you have the chance to check if that fixes your memory
> consumption issues on the monitor, it would be much appreciated!
>

 Hi João,

 I've had a chance to test it and it does indeed seem to make a big
difference on mon memory usage.

 As for the OSD memory usage issue, it's still looking really bad. I'm
preparing to do more testing and send more info about this, but a lot
of unrelated stuff cropped up this week and things are going slowly on
this front. I hope to talk more about this before the weekend.

Thanks!

Cláudio



OSD and MON memory usage

2012-11-16 Thread Cláudio Martins

 Hi,

 We're testing ceph using a recent build from the 'next' branch (commit
b40387d) and we've run into some interesting problems related to memory
usage.

 The setup consists of 64 OSDs (4 boxes, each with 16 disks, most of
them 2TB, some 1.5TB, XFS filesystems, Debian Wheezy). After the
initial mkcephfs, a 'ceph -s' reports 12480 pgs total.

 For generating some load we used

rados -p rbd bench 28000 write -t 25

and left it running overnight.

 After several hours most of the OSDs had eaten up around 1GB or more
of memory each, which caused thrashing on the servers (12GB of RAM
per box), and eventually the OOM killer was invoked, killing many OSDs
and even the SSH daemons. This seems to have caused a domino effect,
and in the morning only around 18 of the OSDs were still up.

 After a hard reboot of the boxes that were unresponsive, we are now in
a situation in which there is simply not enough memory for the cluster
to recover. That is, after restarting the OSDs, within 2 to 3 minutes
many of them are using 1~1.5GB of RAM and the thrashing starts all over
again, the OOM killer comes in and things go downhill again. Effectively
the cluster is not able to recover, no matter how many times we restart
the daemons.

 We're not using any non-default options in the OSD section of the
config file. We checked that there is free space for logging on the
system partitions.

 While I know that 12GB per machine can hardly be called too much RAM,
the question I put forward is: is it reasonable for an OSD to consume so
much memory in normal usage, or even in recovery situations, when there
are just around ~200 PGs per OSD and only around ~3TB of objects created
by rados bench?

 Is there a rule of thumb to estimate the amount of memory consumed as
a function of PG count, object count and perhaps the number of PGs
trying to recover at any given instant? One of my concerns here is also
to understand whether memory consumption during recovery is bounded and
deterministic at all, or whether we're simply hitting a severe memory leak
in the OSDs.
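
 Just as a back-of-the-envelope example (the numbers are purely
illustrative, not taken from any documentation): with ~200 PGs per OSD,
each PG would have to account for something on the order of 5 to 7 MB of
resident memory to explain the 1~1.5GB per daemon that we see during
recovery, which seems like a lot if per-PG state is supposed to be small.
A formula along the lines of base + k1*PGs + k2*objects_being_recovered,
with roughly known constants, is the kind of rule of thumb I was hoping
exists.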

 As for the monitor daemon on this cluster (running on a dedicated
machine), it is currently using 3.2GB of memory, and it got to that
point again in a matter of minutes after being restarted. Would it be
good if we tested with the changes from the wip-mon-leaks-fix branch?

 We would appreciate any advice on the best way to understand if the
OSDs are leaking memory or not.

 We will gladly provide any config or debug info that you might be
interested in, or run any tests.

 Thanks in advance

Best regards

Cláudio



Re: some snapshot problems

2012-11-09 Thread Cláudio Martins

On Thu, 8 Nov 2012 09:30:55 -0800 (PST) Sage Weil s...@inktank.com wrote:
> Lots of different snapshots:
>
>  - librados lets you do 'selfmanaged snaps' in its API, which let an
> application control which snapshots apply to which objects.
>  - you can create a 'pool' snapshot on an entire librados pool.  this
> cannot be used at the same time as rbd, fs, or the above 'selfmanaged'
> snaps.

 Could you please clarify this? You mean that if a given pool has a
snapshot created, a subsequent 'rbd snap create' on an image placed in
that very same pool would fail?

 A quick search through the RADOS and RBD documentation didn't turn
anything up about this restriction, but I apologize if I missed it.

 Thanks

Cláudio



Re: OSD deadlock with cephfs client and OSD on same machine

2012-11-05 Thread Cláudio Martins

On Fri, 1 Jun 2012 11:35:37 +0200 Amon Ott a@m-privacy.de wrote:
 
> After backporting syncfs() support into Debian stable libc6 2.11 and
> recompiling Ceph with it, our test cluster is now running with syncfs().
>

 Hi,

 We're running OSDs on top of Debian wheezy, which unfortunately ships
libc6 2.13 (still without a syncfs() wrapper). By chance, do you still
have that patch to backport syncfs()?
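
 (For reference, and in case it helps anyone else stuck on glibc 2.13: my
understanding is that even without the libc wrapper the syscall itself is
available on kernels >= 2.6.39, so a stopgap along the lines of the sketch
below should work. This is just an illustration I put together, not the
actual backport patch; the syscall number must be checked per
architecture, and the path is only an example.)

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SYS_syncfs
#define SYS_syncfs 306          /* x86_64 value -- verify for your arch */
#endif

/* syncfs() stand-in for glibc versions that lack the wrapper */
static int my_syncfs(int fd)
{
        return syscall(SYS_syncfs, fd);
}

int main(void)
{
        /* any open fd on the filesystem to be synced */
        int fd = open("/srv/osd.0", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (my_syncfs(fd) < 0)
                perror("syncfs");
        close(fd);
        return 0;
}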

 Thanks in advance

Best regards

Cláudio



Re: bobtail timing

2012-10-31 Thread Cláudio Martins

On Wed, 31 Oct 2012 20:17:49 -0700 (PDT) Sage Weil s...@inktank.com wrote:
> On Thu, 1 Nov 2012, Cláudio Martins wrote:
>> On Wed, 31 Oct 2012 14:38:28 -0700 (PDT) Sage Weil s...@inktank.com wrote:
>>> On Wed, 31 Oct 2012, Noah Watkins wrote:
>>>> Which branch is the freeze taken against? master?
>>>
>>> Right.  Basically, every 3-4 weeks:
>>>
>>>  - next is tagged as v0.XX
>>>    - and is merged back into master
>>>  - next branch is reset to current master
>>>  - testing branch is reset to just-tagged v0.XX
>>>
>>
>>  Hmm, interesting. But doesn't that mean that when the real v0.XX is
>> later officially _released_, its top commit might not be the commit
>> that was tagged as v0.XX? Assuming that issues are found after the
>> testing branch is reset to v0.XX, fixes would go on top of v0.XX, right?
>>
>>  Am I missing something, or people checking out a v0.XX with git might
>> not be getting the real v0.XX that was released as tarballs?
>
> The releases and tarballs contain *exactly* the content that is tagged
> v0.X.  The branches may accumulate additional fixes after that, which
> will later be tagged with a v0.X.Y point release.  Since we've started
> maintaining a stable release, we haven't done point releases for
> the development 'testing' releases, although if there are important bugs
> we may need to do so in the future.
>
> The 'stable' branch tracks the last stable release (currently argonaut)
> and is where bug fixes accumulate until the next release.  For example,
> stable currently contains v0.48.2 and several additional commits (mostly
> backports of provisioning scripts to support the ceph-deploy tool that
> we're working on).
>

 Ok, it makes perfect sense now. I didn't realise that the tarball is
released as soon as the branch is tagged, sorry about that.
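
 (In other words, if I'm reading it right: checking out the tag itself,
e.g. 'git checkout v0.48.2', always gives exactly what went into the
tarball, while something like 'git log v0.48.2..stable' only shows the
extra fixes queued up for the next point release.)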

Thanks for the clarification.

Best regards

Cláudio



Re: [PATCH 0/4] rbd: finish up basic format 2 support

2012-10-10 Thread Cláudio Martins

On Tue, 09 Oct 2012 13:57:09 -0700 Alex Elder el...@inktank.com wrote:
> This series includes updates for two patches posted previously.
>
>   -Alex

 Greetings,

 We're gearing up to test v0.52 (specifically the RBD stuff) on our
cluster. After reading this series of posts about rbd format 2 patches
I began wondering if we should start testing these patches as well or
not. To put it simply, what I'd like to know is:

 Is it enough to use the 3.6 vanilla kernel client to take full
advantage of the rbd changes in v0.52 (i.e. new RBD cloning features)?

 Would we gain anything from applying any of these patches on top of
v3.6 and using format 2, assuming that we stick with v0.52 on the
server side, or is this strictly v0.53-and-beyond stuff?


 I apologize if this is a dumb question, but from looking at the v0.52
changelog, at doc/rbd/* and the list archives, it isn't clear to me how
this fits in with v0.52.

 Thanks in advance

Best regards

Cláudio



Re: some thoughts about scrub

2011-02-01 Thread Cláudio Martins

On Tue, 1 Feb 2011 17:20:34 +0800 Henry Chang henry.cy.ch...@gmail.com wrote:
 
> Yeah. I expect that scrub can both detect disk errors and check data
> integrity (based on the checksum) in the background. For disk errors,
> I would like CEPH to mark the OSD down/failed and notify the sys
> admin immediately. For data errors, I expect that CEPH can repair
> them automatically (by fetching a right copy from other replicas).
>

 I suppose the best approach would be for this to be configurable with
per-OSD granularity -- something like an io_error_threshold config
variable. I would set it to something like 50 or 100, but you could set
it to 1, and the OSD would put itself down or out after that many I/O
errors had propagated up to the OSD daemon. I guess that even if that
OSD becomes unresponsive for a while it won't be much trouble, since
Ceph will mark it down and should recover later, or else the OSD will
soon take itself out due to the error threshold.
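
 To make it concrete, I'm imagining something along these lines in
ceph.conf (purely hypothetical -- no such option exists today, the name
and syntax are only meant to illustrate the idea):

[osd]
        ; proposed knob: mark this OSD down/out after N I/O errors
        io error threshold = 50

 A particularly flaky disk could then get a lower threshold in its own
[osd.N] section.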

 What do you think?

Cheers

Cláudio



Re: some thoughts about scrub

2011-01-31 Thread Cláudio Martins

On Mon, 31 Jan 2011 12:56:36 -0800 Colin McCabe cmcc...@alumni.cmu.edu wrote:
> Case #1:
> The hard disk that the FileStore is reading from could be dying.
> In my experience, hard disks that are dying will tend to experience
> long delays in reading from the filesystem. Occasionally you will be
> unable to read some files, and you'll get EIO instead. When a hard
> disk is dying, all you want to do is get your data off there as soon
> as possible. You don't want to bother trying to fix the files on the
> disk. That disk is toast.
>

 In my experience, with recent disks not every read error means that
the disk is going to die anytime soon. I manage several dozen
Western Digital drives (Caviar Black 2TB) in Linux RAID6 arrays. When
running an MD background check, MD will report a read error from
time to time on some drives. It will recover the data for that block and
rewrite it - but the bad block won't show up as reallocated or pending
in the SMART reports for that drive. Later, the same drive will get
through several entire background checks just fine and will go some time
before acting up again.

 I have also seen some big Hitachi drives throw uncorrected
errors (while reallocating the bad sectors, unlike the WD drives), but
otherwise work just fine for months.

 So, granted, I may have flaky drives, but since they are not currently
causing significant hangs or timeouts on the array, why should I just
replace all of them? Even a flaky drive is a useful drive if it
holds a known good copy of your blocks for some time, just in case
your other good drive dies at the wrong time.

 So, I do agree that, as Brian Chrisman pointed out, background scrub
is always important, as it helps prevent your data redundancy from going
bad without you knowing about it. I also agree that notifying the sys
admin is important in either case.

 But I also think that Ceph should try to correct the errors it finds
through scrub, because some of today's drives may throw uncorrected
errors even while they are still useful - I'd rather have more copies of
my data, even if they're slightly unreliable, since I should always be
able to tell the bad ones apart via BTRFS checksums. Besides, I think this
model of always trying to correct errors fits well with Ceph's
goal of working with unreliable, commodity hardware, so it makes no
sense to just bail out and force the operator to swap every flaky drive.

 Best regards

Cláudio



Re: Multiple disks per server.

2010-05-05 Thread Cláudio Martins

On Tue, 04 May 2010 14:18:25 +0200 Mickaël Canévet cane...@embl.fr wrote:
> Hi,
>
> I'm testing ceph on 4 old servers.
>
> As there is more than one disk per server available for data (2 with 6
> disks and 2 with 10 disks, for a total of 32 disks over 4 nodes), I was
> wondering how to define OSDs.
>
> I have a choice between one OSD per disk (32 OSDs on the cluster) or one
> OSD per server with one btrfs filesystem over all disks of the server (4
> OSDs on the cluster). Which one is the best solution?
>
> In the first case, if I lose one disk, I lose only a small part of the
> available space. In the other case, if I lose one disk, I lose the whole
> server (as the btrfs filesystem is striped across all disks), which is
> much more space.

 Hi,

 I too am facing a similar dilemma:

 Scenario 1:
 I can set up an MD raid6 array on each OSD box and thus afford up
to 2 simultaneous disk failures without Ceph noticing anything wrong.
When the 3rd drive fails, a long time will be spent redistributing data
across the cluster (though much less time than a plain 25TB raid6
rebuild). This setup should be quite simple, and a 16-disk raid6
should generally give nice performance. I would probably use
2-way data replication (in the Ceph config) for this case.

 Scenario 2:

 I can try to configure 1 OSD per disk. As soon as a drive fails, there
will be data redistribution across the remaining OSDs - but this should
be quite fast, as only the content of a single drive (or slightly more)
has to be redistributed across the cluster (worst case). In this case I
would use 3-way replication for added protection against simultaneous
double drive failures and to compensate for the OSDs not having a raid
array underneath them.
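
 For reference, the layout I have in mind for Scenario 2 would be roughly
the following in ceph.conf (hostnames and paths are made up, and I still
need to double-check the exact option names):

[osd.0]
        host = store1
        osd data = /srv/osd.0        ; first disk, its own filesystem
[osd.1]
        host = store1
        osd data = /srv/osd.1        ; second disk on the same box
[osd.2]
        host = store2
        osd data = /srv/osd.2

 i.e. one [osd.N] section per physical disk, with several sections sharing
the same host.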

 I can see several potential advantages in Scenario 2:

 * Greater simplicity and ease of administration, as there's no need to
worry about RAID arrays, their configuration and their possible bugs.
You have one less layer in the stack to worry about, and that has to be
good news.

 * You can replace failed drives with different drives without worrying
about wasted capacity when they are bigger (as you would with RAID),
and you can even take advantage of older, smaller drives that would
otherwise go in the trash can. Overall, this gives you more freedom
when upgrading hardware.

 * Degradation of available cluster capacity and bandwidth would be
much more gradual. In fact, assuming that you don't have many power
supplies or mainboards burning up, your cluster will maintain redundancy
as drives fail. That is, as long as the remaining drives can hold
(amount_of_data * replication_level), your cluster will probably stay in
a good, fully redundant state. That should make for better sleep at night.

 * Workloads with small, scattered writes should perform better. In a RAID
array those can force entire stripes to be read, requiring data
chunks to be fetched from a lot of disks just to recompute the redundancy
chunks. This should be quite an advantage for big mail server
workloads, which is one of the workloads I'm interested in.

 * Large write performance should be no worse than with raid, since Ceph
also spreads chunks across OSDs.


 Having said that, there are some aspects about how Ceph would behave
in Scenario 2 that I still have to investigate:

 * If multiple OSDs per node is a well supported option. Do multiple OSDs
per node play well with each other and with a node's resources?

 * If there are issues with network ports/addresses when setting up more
than 1 OSD per node.

 * OSD behaviour when getting I/O errors from its drive -- this is
really the most complex and important one, and the one I wish I could
hear your opinions about:

  Usually, in a RAID array, when there is a fatal failure, the upper
layers will just get permanent I/O errors and you can assume that
storage area is dead and go on with life.
 However, this is frequently not true when you consider single drives
as in Scenario 2, at least for reads: the drive may return read errors
for a small region but still be quite ok for the remaining data.

 So, ideally, a Ceph OSD receiving a read error from the filesystem
would request a copy of the object in question from another OSD and try
to rewrite it several times before giving up and declaring the drive
dead (1). This is actually what Linux MD does on recent kernels, and I
know from experience that it increases array survivability a lot.
 Background data scrubbing would help a lot with the above, and I guess
BTRFS checksumming will simplify things here.
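
 Very roughly, the per-object logic I'm imagining is something like the
sketch below (a standalone toy with made-up placeholder functions, just to
illustrate the control flow -- not real OSD code):

#include <stdbool.h>
#include <stdio.h>

#define MAX_REWRITE_ATTEMPTS 3  /* arbitrary, for illustration only */

/* Stubs standing in for real OSD operations (names are made up). */
static bool fetch_from_replica(const char *oid)
{ printf("fetching %s from a replica\n", oid); return true; }
static bool rewrite_object(const char *oid)
{ printf("rewriting %s locally\n", oid); return true; }
static bool verify_object(const char *oid)
{ printf("re-reading %s\n", oid); return true; }
static void mark_osd_failed(void)
{ printf("giving up on this disk\n"); }

static void handle_read_error(const char *oid)
{
        for (int i = 0; i < MAX_REWRITE_ATTEMPTS; i++) {
                if (!fetch_from_replica(oid))
                        break;          /* no good copy available */
                if (rewrite_object(oid) && verify_object(oid))
                        return;         /* repaired; keep using the drive */
        }
        /* repeated failures: treat the drive as dying (or, per note (1)
         * below, first try allocating a new backing file for the object) */
        mark_osd_failed();
}

int main(void)
{
        handle_read_error("some_object");
        return 0;
}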


 Sorry for the huge email, but I hope the points I raised are valid
ones for making Ceph more robust, and I'd like to hear what you think
about them.

Notes:
  (1) Better yet, if the error repeats, the OSD could leave the old backing
file alone and try to allocate a new one for that object, thus avoiding
declaring the drive completely dead too early.

 Best regards and thanks

Cláudio
