Seems like a message bus would be nice. Each opener of an RBD could
subscribe for messages on the bus for that RBD. Anytime the map is modified
a message could be put on the bus to update the others. That opens up a
whole other can of worms though.
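
A rough sketch of the idea (plain Python, purely illustrative -- this is
not librbd's actual watch/notify machinery, and the names are made up):

    import threading
    from collections import defaultdict

    class ObjectMapBus:
        """Toy message bus: one topic per RBD image."""
        def __init__(self):
            self._lock = threading.Lock()
            self._subscribers = defaultdict(list)  # image name -> callbacks

        def subscribe(self, image, callback):
            # Each opener of the image registers a callback here.
            with self._lock:
                self._subscribers[image].append(callback)

        def publish(self, image, update):
            # Called whenever the map for `image` is modified; every
            # subscribed opener is told to refresh its cached copy.
            with self._lock:
                callbacks = list(self._subscribers[image])
            for cb in callbacks:
                cb(update)

    # bus = ObjectMapBus()
    # bus.subscribe("vol-x318644f-0", lambda u: print("refresh map:", u))
    # bus.publish("vol-x318644f-0", {"object": 0xe6ad, "exists": True})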

Robert LeBlanc

Sent from a mobile device; please excuse any typos.
On Jan 6, 2015 5:35 PM, "Josh Durgin" <josh.dur...@inktank.com> wrote:

> On 01/06/2015 04:19 PM, Robert LeBlanc wrote:
>
>> The bitmap certainly sounds like it would help shortcut a lot of the code
>> that Xiaoxi mentions. Is the idea that the client caches the bitmap
>> for the RBD so it knows which OSDs to contact (thus saving a round trip
>> to the OSD), or is it only for the OSD to know which objects exist on its
>> disk?
>>
>
> It's purely at the rbd level, so librbd caches it and maintains its
> consistency. The idea is that since it's kept consistent, librbd can do
> things like delete exactly the objects that exist without any
> extra communication with the osds. Many things that were
> O(size of image) become O(written objects in image).
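>
> As a rough illustration of why (toy Python only, not the actual librbd
> object map code; the names here are made up), compare deleting with and
> without a cached record of which objects were ever written:
>
>     class ImageObjectMap:
>         def __init__(self, num_objects):
>             self.num_objects = num_objects  # image size / object size
>             self.written = set()            # indices of objects that exist
>
>         def mark_written(self, idx):
>             # updated by the single writer on every write it issues
>             self.written.add(idx)
>
>         def objects_to_delete(self):
>             # with the map: O(written objects in image)
>             return sorted(self.written)
>
>         def objects_to_delete_without_map(self):
>             # without the map: O(size of image) -- a delete is issued
>             # for every possible object, whether or not it exists
>             return range(self.num_objects)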
>
> The only restriction is that keeping the object map consistent requires
> a single writer, so this does not work for the rare case of e.g. ocfs2
> on top of rbd, where there are multiple clients writing to the same
> rbd image at once.
>
> Josh
>
>> On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin <josh.dur...@inktank.com>
>> wrote:
>>
>>> On 01/06/2015 10:24 AM, Robert LeBlanc wrote:
>>>
>>>>
>>>> Can't this be done in parallel? If the OSD doesn't have an object then
>>>> it is a noop and should be pretty quick. The number of outstanding
>>>> operations could be limited to 100 or 1,000, which would balance speed
>>>> against the performance impact when there is data to be trimmed. I'm
>>>> not a big fan of a "--skip-trimming" option, as there is the potential
>>>> to leave some orphan objects that may not be cleaned up correctly.
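>>>>
>>>> For illustration, bounding the in-flight deletes could look something
>>>> like this (toy asyncio Python, not rbd's actual code; delete_object is
>>>> a stand-in for sending the real delete request to an OSD):
>>>>
>>>>     import asyncio
>>>>
>>>>     async def trim_objects(object_indices, delete_object, max_in_flight=100):
>>>>         # Cap outstanding delete ops so trimming doesn't swamp the cluster.
>>>>         sem = asyncio.Semaphore(max_in_flight)
>>>>
>>>>         async def delete_one(idx):
>>>>             async with sem:
>>>>                 # Deleting a nonexistent object is effectively a noop.
>>>>                 await delete_object(idx)
>>>>
>>>>         await asyncio.gather(*(delete_one(i) for i in object_indices))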
>>>>
>>>
>>>
>>> Yeah, a --skip-trimming option seems a bit dangerous. This trimming
>>> actually has been parallelized since dumpling (10 ops at once by
>>> default, changeable via --rbd-concurrent-management-ops).
>>>
>>> What will really help without being dangerous is keeping a map of
>>> object existence [1]. This will avoid any unnecessary trimming
>>> automatically, and it should be possible to add to existing images.
>>> It should be in hammer.
>>>
>>> Josh
>>>
>>> [1] https://github.com/ceph/ceph/pull/2700
>>>
>>>
>>>  On Tue, Jan 6, 2015 at 8:09 AM, Jake Young <jak3...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Monday, January 5, 2015, Chen, Xiaoxi <xiaoxi.c...@intel.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> When you shrink the RBD, most of the time is spent in
>>>>>> librbd/internal.cc::trim_image(). In this function, the client
>>>>>> iterates over all unnecessary objects (whether or not they actually
>>>>>> exist) and deletes them.
>>>>>>
>>>>>>
>>>>>>
>>>>>> So in this case, when Edwin shrinks his RBD from 650PB to 650GB,
>>>>>> there are [ (650PB * 1024TB/PB * 1024GB/TB - 650GB) * 1024MB/GB ] /
>>>>>> 4MB/object = 174,482,880,000 objects that need to be deleted. That
>>>>>> will definitely take a long time, since for each one the rbd client
>>>>>> needs to send a delete request to an OSD, and the OSD needs to look
>>>>>> up the object context and delete it (or discover that it doesn't
>>>>>> exist at all). The time needed to trim an image is proportional to
>>>>>> the size being trimmed.
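>>>>>>
>>>>>> For reference, that arithmetic spelled out (plain Python):
>>>>>>
>>>>>>     MB_PER_GB = 1024
>>>>>>     GB_PER_PB = 1024 * 1024
>>>>>>     image_mb  = 650 * GB_PER_PB * MB_PER_GB  # grown size: 650 PB in MB
>>>>>>     target_mb = 650 * MB_PER_GB              # intended size: 650 GB in MB
>>>>>>     object_mb = 4                            # default 4 MB objects (order 22)
>>>>>>     print((image_mb - target_mb) // object_mb)   # 174482880000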
>>>>>>
>>>>>>
>>>>>>
>>>>>> Making another image of the correct size, copying your VM's file
>>>>>> system to the new image, and then deleting the old one will NOT help
>>>>>> in general, because deleting the old volume takes exactly as long as
>>>>>> shrinking it: both need to call trim_image().
>>>>>>
>>>>>>
>>>>>>
>>>>>> The solution in my mind is that we could provide a "--skip-trimming"
>>>>>> flag to skip the trimming. When the administrator is absolutely sure
>>>>>> that no writes have taken place in the area being shrunk (that is, no
>>>>>> objects have been created in that range), they can use this flag to
>>>>>> skip the time-consuming trim.
>>>>>>
>>>>>>
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> That sounds like a good solution. It would be like doing an "undo grow image".
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> From: Jake Young [mailto:jak3...@gmail.com]
>>>>>> Sent: Monday, January 5, 2015 9:45 PM
>>>>>> To: Chen, Xiaoxi
>>>>>> Cc: Edwin Peer; ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sunday, January 4, 2015, Chen, Xiaoxi <xiaoxi.c...@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>> You could use rbd info <volume_name> to see the block_name_prefix.
>>>>>> The object names are of the form <block_name_prefix>.<sequence_number>,
>>>>>> so for example rb.0.ff53.3d1b58ba.00000000e6ad should be the 0xe6ad-th
>>>>>> object of the volume with block_name_prefix rb.0.ff53.3d1b58ba.
>>>>>>
>>>>>>        $ rbd info huge
>>>>>>           rbd image 'huge':
>>>>>>            size 1024 TB in 268435456 objects
>>>>>>            order 22 (4096 kB objects)
>>>>>>            block_name_prefix: rb.0.8a14.2ae8944a
>>>>>>            format: 1
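>>>>>>
>>>>>> As a rough illustration of that naming scheme (plain Python; the
>>>>>> helper is made up, it just encodes the convention described above):
>>>>>>
>>>>>>     def object_name(block_name_prefix, byte_offset, order=22):
>>>>>>         # Each object covers 2**order bytes (order 22 -> 4096 kB objects);
>>>>>>         # the suffix is the object index as 12 hex digits.
>>>>>>         seq = byte_offset >> order
>>>>>>         return "%s.%012x" % (block_name_prefix, seq)
>>>>>>
>>>>>>     # object_name("rb.0.ff53.3d1b58ba", 0xe6ad << 22)
>>>>>>     # -> 'rb.0.ff53.3d1b58ba.00000000e6ad'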
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>>>>>> Behalf Of
>>>>>> Edwin Peer
>>>>>> Sent: Monday, January 5, 2015 3:55 AM
>>>>>> To: ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
>>>>>>
>>>>>> Also, which rbd objects are of interest?
>>>>>>
>>>>>> <snip>
>>>>>> ganymede ~ # rados -p client-disk-img0 ls | wc -l
>>>>>> 1672636
>>>>>> </snip>
>>>>>>
>>>>>> And, all of them have cryptic names like:
>>>>>>
>>>>>> rb.0.ff53.3d1b58ba.00000000e6ad
>>>>>> rb.0.6d386.1d545c4d.000000011461
>>>>>> rb.0.50703.3804823e.000000001c28
>>>>>> rb.0.1073e.3d1b58ba.00000000b715
>>>>>> rb.0.1d76.2ae8944a.00000000022d
>>>>>>
>>>>>> which seem to bear no resemblance to the actual image names that the
>>>>>> rbd
>>>>>> command line tools understand?
>>>>>>
>>>>>> Regards,
>>>>>> Edwin Peer
>>>>>>
>>>>>> On 01/04/2015 08:48 PM, Jake Young wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sunday, January 4, 2015, Dyweni - Ceph-Users
>>>>>>> <6exbab4fy...@dyweni.com <mailto:6exbab4fy...@dyweni.com>> wrote:
>>>>>>>
>>>>>>>       Hi,
>>>>>>>
>>>>>>>       If it's the only thing in your pool, you could try deleting the
>>>>>>>       pool instead.
>>>>>>>
>>>>>>>       I found that to be faster in my testing; I had created 500TB
>>>>>>> when
>>>>>>>       I meant to create 500GB.
>>>>>>>
>>>>>>>       Note for the Devs: It would be nice if rbd create/resize would
>>>>>>>       accept sizes with units (e.g. MB, GB, TB, PB, etc).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>       On 2015-01-04 08:45, Edwin Peer wrote:
>>>>>>>
>>>>>>>           Hi there,
>>>>>>>
>>>>>>>           I did something stupid while growing an rbd image. I
>>>>>>> accidentally
>>>>>>>           mistook the units of the resize command for bytes instead
>>>>>>> of
>>>>>>>           megabytes
>>>>>>>           and grew an rbd image to 650PB instead of 650GB. This all
>>>>>>> happened
>>>>>>>           instantaneously enough, but trying to rectify the mistake
>>>>>>> is
>>>>>>>           not going
>>>>>>>           nearly as well.
>>>>>>>
>>>>>>>           <snip>
>>>>>>>           ganymede ~ # rbd resize --size 665600 --allow-shrink
>>>>>>>           client-disk-img0/vol-x318644f-0
>>>>>>>           Resizing image: 1% complete...
>>>>>>>           </snip>
>>>>>>>
>>>>>>>           It took a couple days before it started showing 1% complete
>>>>>>>           and has
>>>>>>>           been stuck on 1% for a couple more. At this rate, I should
>>>>>>> be
>>>>>>>           able to
>>>>>>>           shrink the image back to the intended size in about 2016.
>>>>>>>
>>>>>>>           Any ideas?
>>>>>>>
>>>>>>>           Regards,
>>>>>>>           Edwin Peer
>>>>>>>
>>>>>>>
>>>>>>> You can just delete the rbd header. See Sebastien's excellent blog:
>>>>>>>
>>>>>>> http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/
>>>>>>>
>>>>>>> Jake
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sorry, I misunderstood.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The simplest approach to me is to make another image of the correct
>>>>>> size
>>>>>> and copy your VM's file system to the new image, then delete the old
>>>>>> one.
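>>>>>>
>>>>>> For the first step, creating the correctly sized image could look
>>>>>> roughly like this (a sketch using the python rados/rbd bindings; the
>>>>>> pool and image names are placeholders, and the exact signatures should
>>>>>> be double-checked against your installed version):
>>>>>>
>>>>>>     import rados, rbd
>>>>>>
>>>>>>     cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>>>>>>     cluster.connect()
>>>>>>     ioctx = cluster.open_ioctx('client-disk-img0')
>>>>>>     try:
>>>>>>         # 650 GB, given in bytes
>>>>>>         rbd.RBD().create(ioctx, 'vol-x318644f-0-new', 650 * 1024**3)
>>>>>>     finally:
>>>>>>         ioctx.close()
>>>>>>         cluster.shutdown()
>>>>>>
>>>>>> The copying itself would then be done from inside the VM, as described
>>>>>> below.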
>>>>>>
>>>>>>
>>>>>>
>>>>>> The safest thing to do would be to mount the new file system from the
>>>>>> VM
>>>>>> and do all the formatting / copying from there (the same way you'd
>>>>>> move
>>>>>> a
>>>>>> physical server's root disk to a new physical disk).
>>>>>>
>>>>>>
>>>>>>
>>>>>> I would not attempt to hack the rbd header. You open yourself up to
>>>>>> some
>>>>>> unforeseen problems.
>>>>>>
>>>>>>
>>>>>>
>>>>>> That is, unless one of the ceph developers can confirm that there is a
>>>>>> safe way to shrink an image, assuming we know that the file system has
>>>>>> not grown since the disk was grown.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Jake
>>>>>>
>>>>>
>>>
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
