On Thu, 17 Jul 2014, Alphe Salas wrote:
> On 07/17/2014 12:35 PM, Sage Weil wrote:
> > On Thu, 17 Jul 2014, Alphe Salas wrote:
> > > Hello,
> > > I would like to know if there is anything planned to correct the "forever
> > > growing" effect when using RBD images. My experience shows that the
> > > replicas of an RBD image are never discarded and never overwritten. Say
> > > my physical capacity is 30 TB and I create a 13 TB image (half the real
> > > space, minus ~25% headroom to tolerate failed OSDs). Once I have written
> > > the full 13 TB, 26 TB of real space is used (replica count 2). If I then
> > > delete 8 TB of those 13 TB, the real space used is unchanged. If I write
> > > back 4 TB, Ceph collapses: it is nearfull, and I have to buy another
> > > 30 TB and add it to the cluster to contain the problem. Even then, soon
> > > my cluster holds more useless replicas of "deleted" data than useful
> > > data and its replicas.
> > >
> > > Usually when I raise this problem with the dev team, they tell me that
> > > the real problem is the lack of trim in XFS, but my own analysis suggests
> > > that the real problem is Ceph's internal handling of data: Ceph never
> > > discards any replicas and never "cleans" itself to keep only records of
> > > the data in use.
>
> >
> > You are correct that if XFS (or whatever FS you are using) does not issue
> > discard/trim, then deleting data inside the fs on top of RBD won't free
> > any space. Note that you usually have to explicitly enable this via a
> > mount option; most (all?) kernels still leave this off by default.
> >
> > Are you taking RBD snapshots? If not, then there will never be more than
> > the rbd image size * num_replicas space used (ignoring the few % of file
> > system overhead for the moment).
> >
> > If you are taking snapshots, then yes.. you will see more space used until
> > the snapshot is deleted because we will keep old copies of objects around.
>
> I am not using snapshots. I don't have enough space left to write to the
> disks after a few rounds of write/delete/write/delete, so I can't afford
> fancy features like snapshots. I use a regular format-1 RBD image, which
> cannot even be snapshotted.
>
> I tried to activate the XFS trim feature, but it showed no change at all.
> (The discard mount option seems to have no real effect; tried on Ubuntu
> 14.04.)
I believe you have to have mounted with -o discard at the time the data is
deleted; simply enabling the option later won't help. This is what
the fstrim utility is for; see
http://man7.org/linux/man-pages/man8/fstrim.8.html
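As a sketch (the device and mount point names below are hypothetical), the two approaches look like this:

```shell
# Online discard: the filesystem issues discards as data is deleted.
# /dev/rbd0 and /mnt/rbd0 are hypothetical names.
mount -o discard /dev/rbd0 /mnt/rbd0

# Batched discard: trim already-freed blocks on an already-mounted
# filesystem (useful when it was not mounted with -o discard).
fstrim -v /mnt/rbd0
```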
> Like I said, what actually seems to grow is the replica side of the data.
> The replicas are not overwritten when the real data is overwritten, so
> slowly I see the real on-disk weight of my data in the Ceph cluster grow,
> grow, grow, and never settle at a stable size.
This is simply not true. RADOS objects are overwritten in place. If you
create a 10 TB image and write it 100x with dd, you will still only
consume 10 TB * num_replicas. If you are seeing something other
than this, ignore everything else in this email and go figure out what
else is writing files to the underlying volumes.
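If in doubt, this is easy to check directly; a rough sketch (pool and image names are hypothetical, and the loop can run as long as you like):

```shell
# Hypothetical pool "rbd" and image "test": create, map, and
# overwrite the whole image several times.
rbd create rbd/test --size 10240       # 10 GB image
rbd map rbd/test                       # maps to e.g. /dev/rbd0
for i in 1 2 3; do
    dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct
done
# Raw usage should plateau at image size * num_replicas:
rados df -p rbd
ceph df
```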
> Here is the trick: which layer of XFS are we talking about? The one inside
> the RBD image, or the one below the RBD image?
>
> I already saw a bug ticket from 2009 in the Ceph bug tracker stating that
> XFS trim is not taken into consideration by Ceph. That ticket doesn't seem
> to have been resolved.
>
> And if I have XFS as the format on the low-end (OSD) side of the Ceph
> cluster and ext4 inside the RBD image, how will trim work?
I assume you are using kvm/qemu? It may be that older versions aren't
passing through trims; Josh would know more. Or maybe the trim sizes are
too small to let rados effectively deallocate entire objects. Logs might
help there.
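For reference, enabling discard passthrough on the qemu side looks roughly like this (pool and image names are hypothetical; discard=unmap needs a bus that supports it, such as virtio-scsi, and a reasonably recent qemu):

```shell
qemu-system-x86_64 \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=rbd:rbd/vmdisk,format=raw,if=none,id=drive0,discard=unmap \
    -device scsi-hd,drive=drive0,bus=scsi0.0
```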
But, as I said, if you see more data written than the size of your image
then stop worrying about trim and sort that out first...
> The low-level XFS (on the OSD disks) has mount options that are not managed
> by the user; mounting happens automatically when the OSD is activated.
> Given that, how do I activate trim? Do I have to get my hands into the
> udev-level scripts?
Trim on the underlying XFS volumes isn't necessary or important. When RBD
gets a discard, it will either delete, truncate, or punch holes in the
underlying XFS object files the image maps to.
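The hole-punching part can be sketched locally (the filename is hypothetical): on a filesystem that supports it, punching a hole frees allocated blocks while the file's logical size stays the same, which is what lets a partially discarded 4 MB object shrink on disk.

```shell
# Create a fully allocated 4 MB file (the default RBD object size).
dd if=/dev/zero of=obj.bin bs=1M count=4 conv=fsync 2>/dev/null
du -k obj.bin        # allocated blocks, roughly 4096 KB

# Punch out the first 2 MB, as an OSD would on a partial discard.
fallocate --punch-hole --offset 0 --length $((2 * 1024 * 1024)) obj.bin
du -k obj.bin        # allocation drops by ~2 MB where supported
ls -l obj.bin        # logical size is still 4 MB

rm obj.bin
```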
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html