Re: [ceph-users] RBD image "lightweight snapshots"

2018-09-05 Thread Alex Elder
On 08/09/2018 08:15 AM, Sage Weil wrote:
> On Thu, 9 Aug 2018, Piotr Dałek wrote:
>> Hello,
>>
>> At OVH we're heavily utilizing snapshots for our backup system. We think
>> there's an interesting optimization opportunity regarding snapshots I'd like
>> to discuss here.
>>
>> The idea is to introduce a concept of a "lightweight" snapshots - such
>> snapshot would not contain data but only the information about what has
>> changed on the image since it was created (so basically only the object map
>> part of snapshots).
>>
>> Our backup solution (which seems to be a pretty common practice) is as
>> follows:
>>
>> 1. Create snapshot of the image we want to backup
>> 2. If there's a previous backup snapshot, export diff and apply it on the
>> backup image
>> 3. If there's no older snapshot, just do a full backup of image
>>
>> This introduces one big issue: it enforces a COW snapshot on the image,
>> meaning that original image access latencies and consumed space increase.
>> "Lightweight" snapshots would remove these inefficiencies - no COW
>> performance or storage overhead.
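
For reference, that workflow maps onto the rbd CLI roughly as follows -- a
sketch only, with placeholder names ("vms/vm1" as the source image,
"backup/vm1" as the backup target, and made-up snapshot names):

    # 1. take a new snapshot of the source image
    rbd snap create vms/vm1@backup-today

    # 2. if an earlier backup snapshot exists, ship only the delta
    rbd export-diff --from-snap backup-yesterday vms/vm1@backup-today - \
        | rbd import-diff - backup/vm1

    # 3. otherwise, do a full copy as the new baseline
    rbd export vms/vm1@backup-today - | rbd import - backup/vm1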
> 
> The snapshot in step 1 would be lightweight, you mean?  And you'd do the backup 
> some (short) time later based on a diff with changed extents?
> 
> I'm pretty sure this will export a garbage image.  I mean, it will usually 
> be non-garbage, but the result won't be crash consistent, and in some 
> (many?) cases won't be usable.
> 
> Consider:
> 
> - take reference snapshot
> - back up this image (assume for now it is perfect)
> - write A to location 1
> - take lightweight snapshot
> - write B to location 1
> - backup process copies location 1 (B) to target
> 
> That's the wrong data.  Maybe that change is harmless, but maybe location
> 1 belongs to the filesystem journal, and you have some records that now
> reference location 10, which has an A-era value or hasn't been written at
> all yet, and now your file system journal won't replay and you can't
> mount...

Forgive me if I'm misunderstanding; this just caught my attention.

The goal here seems to be to reduce the storage needed to do backups of an
RBD image, and I think there's something to that.

This seems to be no different from any other incremental backup scheme.  It's
layered, and it's ultimately based on an "epoch" complete backup image (what
you call the reference snapshot).

If you're using that model, it would be useful to be able to back up only
the data present in a second snapshot that's the child of the reference
snapshot.  (And so on, with snapshot 2 building on snapshot 1, etc.)
RBD internally *knows* this information, but I'm not sure how (or whether)
it's formally exposed.

Restoring an image in this scheme requires restoring the epoch, then the
incrementals, in order.  The cost to restore is higher, but the cost
of incremental backups is significantly smaller than doing full ones.
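
(As a concrete sketch of that scheme: at the CLI level the delta between two
snapshots is what "rbd export-diff --from-snap <older> <image>@<newer>" emits,
and restoring means replaying the chain in order on top of the epoch copy.
Names below are placeholders; note that import-diff checks, by name, that the
diff's starting snapshot exists on the destination image.)

    # recreate the image from the epoch (full) backup
    rbd import full-epoch.img restore/vm1
    rbd snap create restore/vm1@snap0      # marker the first diff will look for

    # then apply each incremental, oldest first
    rbd import-diff diff-snap0-to-snap1 restore/vm1
    rbd import-diff diff-snap1-to-snap2 restore/vm1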

I'm not sure how the "lightweight" snapshot would work though.  Without
references to objects there's no guarantee the data taken at the time of
the snapshot still exists when you want to back it up.

-Alex

> 
> sage
>  
>> At first glance, it seems like it could be implemented as an extension to
>> the current RBD snapshot system, leaving out the machinery required for
>> copy-on-write. In theory it could even co-exist with regular snapshots.
>> Removal of these "lightweight" snapshots would be instant (or near instant).
>>
>> So what do others think about this?
>>
>> -- 
>> Piotr Dałek
>> piotr.da...@corp.ovh.com
>> https://www.ovhcloud.com
>>



Re: [ceph-users] kernel BUG at net/ceph/osd_client.c:2103

2013-08-05 Thread Alex Elder
On 08/04/2013 08:07 PM, Olivier Bonvalet wrote:
> 
> Hi,
> 
> I've just upgraded a Xen Dom0 (Debian Wheezy with Xen 4.2.2) from Linux
> 3.9.11 to Linux 3.10.5, and now I have kernel panic after launching some
> VM which use RBD kernel client. 

A crash like this was reported last week.  I started looking at it
but I don't believe I ever sent out my findings.

The problem is that, while formatting the write request, it's
exhausting the space available in the front buffer for the
request message.  The size of that buffer is established at
request creation time, when rbd_osd_req_create() gets called
inside rbd_img_request_fill().

I think this is another unfortunate result of not setting the
image request pointer early enough.  Sort of related to this:

commit d2d1f17a0dad823a4cb71583433d26cd7f734e08
Author: Josh Durgin 
Date:   Wed Jun 26 12:56:17 2013 -0700

rbd: send snapshot context with writes

That is, when the osd request gets created, the object request
has not been associated with the image request yet.  And as a
result, the size set aside for the front of the osd write request
message does not take into account the bytes required to hold the
snapshot context.
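
(If it helps to check whether a given kernel tree already carries that
commit, something like the following works against whichever git tree the
hash above came from -- just a quick sketch, since the hash may only exist
in the ceph-client tree rather than mainline:

    git tag --contains d2d1f17a0dad823a4cb71583433d26cd7f734e08

which lists the tags, if any, whose history includes it.)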

It's possible a simple fix will be to move the call to
rbd_img_obj_request_add() in rbd_img_request_fill() even
further up, just after verifying the obj_request allocated
via rbd_obj_request_create() is non-null.

I haven't actually verified that this will work, though; it's a
hint at what might.

-Alex


> 
> 
> In kernel logs, I have :
> 
> Aug  5 02:51:22 murmillia kernel: [  289.205652] kernel BUG at 
> net/ceph/osd_client.c:2103!
> Aug  5 02:51:22 murmillia kernel: [  289.205725] invalid opcode:  [#1] 
> SMP 
> Aug  5 02:51:22 murmillia kernel: [  289.205908] Modules linked in: cbc rbd 
> libceph libcrc32c xen_gntdev ip6table_mangle ip6t_REJECT ip6table_filter 
> ip6_tables xt_DSCP iptable_mangle xt_LOG xt_physdev ipt_REJECT xt_tcpudp 
> iptable_filter ip_tables x_tables bridge loop coretemp ghash_clmulni_intel 
> aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt 
> iTCO_vendor_support gpio_ich microcode serio_raw sb_edac edac_core evdev 
> lpc_ich i2c_i801 mfd_core wmi ac ioatdma shpchp button dm_mod hid_generic 
> usbhid hid sg sd_mod crc_t10dif crc32c_intel isci megaraid_sas libsas ahci 
> libahci ehci_pci ehci_hcd libata scsi_transport_sas igb scsi_mod i2c_algo_bit 
> ixgbe usbcore i2c_core dca usb_common ptp pps_core mdio
> Aug  5 02:51:22 murmillia kernel: [  289.210499] CPU: 2 PID: 5326 Comm: 
> blkback.3.xvda Not tainted 3.10-dae-dom0 #1
> Aug  5 02:51:22 murmillia kernel: [  289.210617] Hardware name: Supermicro 
> X9DRW-7TPF+/X9DRW-7TPF+, BIOS 2.0a 03/11/2013
> Aug  5 02:51:22 murmillia kernel: [  289.210738] task: 880037d01040 ti: 
> 88003803a000 task.ti: 88003803a000
> Aug  5 02:51:22 murmillia kernel: [  289.210858] RIP: 
> e030:[]  [] 
> ceph_osdc_build_request+0x2bb/0x3c6 [libceph]
> Aug  5 02:51:22 murmillia kernel: [  289.211062] RSP: e02b:88003803b9f8  
> EFLAGS: 00010212
> Aug  5 02:51:22 murmillia kernel: [  289.211154] RAX: 880033a181c0 RBX: 
> 880033a182ec RCX: 
> Aug  5 02:51:22 murmillia kernel: [  289.211251] RDX: 880033a182af RSI: 
> 8050 RDI: 880030d34888
> Aug  5 02:51:22 murmillia kernel: [  289.211347] RBP: 2000 R08: 
> 88003803ba58 R09: 
> Aug  5 02:51:22 murmillia kernel: [  289.211444] R10:  R11: 
>  R12: 880033ba3500
> Aug  5 02:51:22 murmillia kernel: [  289.211541] R13: 0001 R14: 
> 88003847aa78 R15: 88003847ab58
> Aug  5 02:51:22 murmillia kernel: [  289.211644] FS:  7f775da8c700() 
> GS:88003f84() knlGS:
> Aug  5 02:51:22 murmillia kernel: [  289.211765] CS:  e033 DS:  ES:  
> CR0: 80050033
> Aug  5 02:51:22 murmillia kernel: [  289.211858] CR2: 7fa21ee2c000 CR3: 
> 2be14000 CR4: 00042660
> Aug  5 02:51:22 murmillia kernel: [  289.211956] DR0:  DR1: 
>  DR2: 
> Aug  5 02:51:22 murmillia kernel: [  289.212052] DR3:  DR6: 
> 0ff0 DR7: 0400
> Aug  5 02:51:22 murmillia kernel: [  289.212148] Stack:
> Aug  5 02:51:22 murmillia kernel: [  289.212232]  2000 
> 00243847aa78  880039949b40
> Aug  5 02:51:22 murmillia kernel: [  289.212577]  2201 
> 880033811d98 88003803ba80 88003847aa78
> Aug  5 02:51:22 murmillia kernel: [  289.212921]  880030f24380 
> 880002a38400 2000 a029584c
> Aug  5 02:51:22 murmillia kernel: [  289.213264] Call Trace:
> Aug  5 02:51:22 murmillia kernel: [  289.213358]  [] ? 
> rbd_osd_req_format_write+0x71/0x7c [rbd]
> Aug  5 02:51:22 murmillia kernel: [  289.213459]  [] ? 
> rbd_img_request_fill+0x695/0x736 [rbd]
>

Re: [ceph-users] Kernel panic on rbd map when cluster is out of monitor quorum

2013-05-17 Thread Alex Elder
On 05/17/2013 03:49 PM, Joe Ryner wrote:
> Hi All,
> 
> I have had an issue recently while working on my ceph clusters.  The
> following issue seems to be true on bobtail and cuttlefish.  I have
> two production clusters in two different data centers and a test
> cluster.  We are using ceph to run virtual machines.  I use rbd as
> block devices for sanlock.

Also, do you have any of the information that the kernel
might have dumped when it panicked?

That might be helpful in identifying the problem.
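
(A quick way to gather that, as a sketch -- exact log locations vary by
distro, and on Fedora the systemd journal is the usual place:

    dmesg | tail -n 200                              # if the machine is still up
    grep -iE 'BUG|Oops|Call Trace' /var/log/messages # or the systemd journal

For a hard panic that never makes it to disk, a serial console, netconsole,
or kdump capture is usually needed.)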

-Alex

> I am running Fedora 18.
> 
> I have been moving monitors around and in the process I got the
> cluster out of quorum, so ceph stopped responding.  During this time
> I decided to reboot a ceph node that performs an rbd map during
> startup.  The system boots ok but the service script that is
> performing the rbd map doesn't finish and eventually the system will
> OOPS and then finally panic.  I was able to disable the rbd map
> during boot and finally got the cluster back in quorum and everything
> settled down nicely.
> 
> Question, has anyone seen this behavior of crashing/panic?  I have
> seen this happen on both of my production clusters. Secondly, the
> ceph command hangs when the cluster is out of quorum, is there a
> timeout available?
> 
> Thanks, Joe
> 



Re: [ceph-users] Backporting the kernel client

2013-04-29 Thread Alex Elder
On 04/29/2013 08:16 AM, Juha Aatrokoski wrote:
> I'm probably not the only one who would like to run a
> distribution-provided kernel (which for Debian Wheezy/Ubuntu Precise is
> 3.2) and still have a recent-enough Ceph kernel client. So I'm wondering
> whether it's feasible to backport the kernel client to an earlier
> kernel. The plan is as follows:
> 
> 1) Grab the Ceph files from https://github.com/ceph/ceph-client (and put
> them over the older kernel sources). If I got it right the files are:
> include/keys/ceph-type.h include/linux/ceph/* fs/ceph/* net/ceph/*
> drivers/block/rbd.c drivers/block/rbd_types.h

That is the correct and complete list of source files.

(I didn't know about include/keys/ceph-type.h.)

> 2) Make (trivial) adjustments to the source code to account for changed
> kernel interfaces.
> 
> 3) Compile as modules and install the new Ceph modules under /lib/modules.

> 4) Reboot to a standard distribution kernel with up-to-date Ceph client.

That's roughly correct.  There may be some little details
(like running "depmod") but you've got it.
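
A rough sketch of steps 1)-4) in shell form (the paths are placeholders; this
assumes the distro kernel source is set up to build, with the CEPH_FS,
CEPH_LIB and BLK_DEV_RBD config options enabled as modules):

    # copy the client sources over the older tree
    cp -a ceph-client/net/ceph            linux-3.2/net/
    cp -a ceph-client/fs/ceph             linux-3.2/fs/
    cp -a ceph-client/include/linux/ceph  linux-3.2/include/linux/
    cp    ceph-client/include/keys/ceph-type.h   linux-3.2/include/keys/
    cp    ceph-client/drivers/block/rbd.c \
          ceph-client/drivers/block/rbd_types.h  linux-3.2/drivers/block/

    # build and install the modules (modules_install/depmod as root)
    cd linux-3.2
    make oldconfig && make modules
    make modules_install && depmod -a

Building only the ceph directories may be possible with "make M=<dir>", but
the full "make modules" is the safe route.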

> Now the main questions are:
> 
> Q1: Is the Ceph client contained in the files mentioned in 1), or does
> it include changes elsewhere in the kernel that cannot be built as modules?

They are as you described above.

> Q2: Regarding 2), are there any nontrivial interface changes i.e. Ceph
> actually using newly introduced features instead of just adapting to a
> changed syntax?

I'm not aware of any, but I haven't tried a particular port.

> Any other thoughts or comments on this?

Let me know if you try it and run into trouble; I may be
able to help.

-Alex




Re: [ceph-users] kernel BUG when mapping unexisting rbd device

2013-03-27 Thread Alex Elder
On 03/25/2013 06:03 AM, Dan van der Ster wrote:
> Hi,
> Apologies if this is already a known bug (though I didn't find it).
> 
> If we try to map a device that doesn't exist, we get an immediate and
> reproducible kernel BUG (see the P.S.). We hit this by accident
> because we forgot to add the --pool option.
> 
> This works:
> 
> [root@afs245 /]# rbd map afs254-vicepa --pool afs --id afs --keyring
> /etc/ceph/ceph.client.afs.keyring
> [root@afs245 /]# rbd showmapped
> id pool image         snap device
> 1  afs  afs254-vicepa -    /dev/rbd1
> 
> But this BUGS:
> 
> [root@afs245 /]# rbd map afs254-vicepa
> BUG...
> 
> Any clue?

Yes.  You've found a bug in the kernel rbd client, and I've
posted a fix for review.

The reason you're hitting the bug is that your "map" command
is not supplying the pool name for the image you want to map.
You have no image named "afs254-vicepa" in the default pool (named
"rbd").  To identify what you want to map you need to indicate
which pool it's defined in, using "--pool=afs" (or "--pool afs").

The kernel bug is occurring because it wasn't properly handling
the case where the image was not found.
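
(Until a fixed kernel is in place, the workaround is simply to be explicit
about the pool -- a sketch using the names from this thread:

    # confirm the image lives in the pool you think it does
    rbd ls --pool afs
    # then map with the pool given explicitly, or with the afs/afs254-vicepa shorthand
    rbd map --pool afs afs254-vicepa --id afs \
        --keyring /etc/ceph/ceph.client.afs.keyring
)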

Thanks for calling attention to this, and I'm sorry you hit it.

-Alex


> Cheers,
> Dan, CERN IT

. . .




Re: [ceph-users] kernel BUG when mapping unexisting rbd device

2013-03-26 Thread Alex Elder
On 03/25/2013 06:03 AM, Dan van der Ster wrote:
> Hi,
> Apologies if this is already a known bug (though I didn't find it).
> 
> If we try to map a device that doesn't exist, we get an immediate and
> reproducible kernel BUG (see the P.S.). We hit this by accident
> because we forgot to add the --pool option.

I have begun looking at this.  I'd like to reproduce it myself
so I can more easily troubleshoot it.

> This works:
> 
> [root@afs245 /]# rbd map afs254-vicepa --pool afs --id afs --keyring
> /etc/ceph/ceph.client.afs.keyring
> [root@afs245 /]# rbd showmapped
> id pool image         snap device
> 1  afs  afs254-vicepa -    /dev/rbd1
> 
> But this BUGS:
> 
> [root@afs245 /]# rbd map afs254-vicepa

You are doing this independently of the above command, right?
I.e., are you running the command after mapping it as shown
above, or are you doing it on a fairly pristine system?

Do you know if there is a problem using the default pool (rbd)?

I'll let you know if I am able to reproduce it before I
hear back from you.

Thanks a lot for reporting this.  I've created an issue
to track it.  http://tracker.ceph.com/issues/4559

-Alex

> BUG...
> 
> Any clue?
> 
> Cheers,
> Dan, CERN IT
> 
> 
> Mar 25 11:48:25 afs245 kernel: kernel BUG at mm/slab.c:3130!
> Mar 25 11:48:25 afs245 kernel: invalid opcode:  [#1] SMP
> Mar 25 11:48:25 afs245 kernel: Modules linked in: rbd libceph
> libcrc32c cpufreq_ondemand ipv6 ext2 iTCO_wdt iTCO_vendor_support
> coretemp acpi_cpufreq freq_table mperf kvm_intel kvm crc32c_intel
> ghash_clmulni_intel microcode pcspkr serio_raw i2c_i801 lpc_ich joydev
> e1000e ses enclosure sg ixgbe hwmon dca ptp pps_core mdio ext3 jbd
> mbcache sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64
> xts gf128mul ahci libahci 3w_9xxx mpt2sas scsi_transport_sas raid_class
> video mgag200 ttm drm_kms_helper dm_mirror dm_region_hash dm_log dm_mod
> Mar 25 11:48:25 afs245 kernel: CPU 3
> Mar 25 11:48:25 afs245 kernel: Pid: 7444, comm: rbd Not tainted
> 3.8.4-1.el6.elrepo.x86_64 #1 Supermicro X9SCL/X9SCM/X9SCL/X9SCM
> Mar 25 11:48:25 afs245 kernel: RIP: 0010:[]
> [] cache_alloc_refill+0x270/0x3c0
> Mar 25 11:48:25 afs245 kernel: RSP: 0018:8808028e5c48  EFLAGS: 00010082
> Mar 25 11:48:25 afs245 kernel: RAX:  RBX:
> 88082f000e00 RCX: 88082f000e00
> Mar 25 11:48:25 afs245 kernel: RDX: 8808055fba80 RSI:
> 88082f0028d0 RDI: 88082f002900
> Mar 25 11:48:25 afs245 kernel: RBP: 8808028e5ca8 R08:
> 88082f0028e0 R09: 8808010068c0
> Mar 25 11:48:25 afs245 kernel: R10: dead00200200 R11:
> 0003 R12: 
> Mar 25 11:48:25 afs245 kernel: R13: 880807a71ec0 R14:
> 88082f0028c0 R15: 0004
> Mar 25 11:48:25 afs245 kernel: FS:  7ff85056e760()
> GS:88082fd8() knlGS:
> Mar 25 11:48:25 afs245 kernel: CS:  0010 DS:  ES:  CR0: 
> 80050033
> Mar 25 11:48:25 afs245 kernel: CR2: 00428220 CR3:
> 0007eee7e000 CR4: 001407e0
> Mar 25 11:48:25 afs245 kernel: DR0:  DR1:
>  DR2: 
> Mar 25 11:48:25 afs245 kernel: DR3:  DR6:
> 0ff0 DR7: 0400
> Mar 25 11:48:25 afs245 kernel: Process rbd (pid: 7444, threadinfo
> 8808028e4000, task 8807ef6fb520)
> Mar 25 11:48:25 afs245 kernel: Stack:
> Mar 25 11:48:25 afs245 kernel: 8808028e5d68 8112fd5d
> 8808028e5de8 880800ac7000
> Mar 25 11:48:25 afs245 kernel: 028e5c78 80d0
> 8808028e5fd8 88082f000e00
> Mar 25 11:48:25 afs245 kernel: 1078 0010
> 80d0 80d0
> Mar 25 11:48:25 afs245 kernel: Call Trace:
> Mar 25 11:48:25 afs245 kernel: [] ?
> get_page_from_freelist+0x22d/0x710
> Mar 25 11:48:25 afs245 kernel: [] __kmalloc+0x168/0x340
> Mar 25 11:48:25 afs245 kernel: [] ?
> ceph_parse_options+0x65/0x410 [libceph]
> Mar 25 11:48:25 afs245 kernel: [] ? kzalloc+0x20/0x20 [rbd]
> Mar 25 11:48:25 afs245 kernel: []
> ceph_parse_options+0x65/0x410 [libceph]
> Mar 25 11:48:25 afs245 kernel: [] ?
> kmem_cache_alloc_trace+0x214/0x2e0
> Mar 25 11:48:25 afs245 kernel: [] ? __kmalloc+0x277/0x340
> Mar 25 11:48:25 afs245 kernel: [] ? kzalloc+0xf/0x20 [rbd]
> Mar 25 11:48:25 afs245 kernel: []
> rbd_add_parse_args+0x1fa/0x250 [rbd]
> Mar 25 11:48:25 afs245 kernel: [] rbd_add+0x84/0x2b4 [rbd]
> Mar 25 11:48:25 afs245 kernel: [] bus_attr_store+0x27/0x30
> Mar 25 11:48:25 afs245 kernel: [] 
> sysfs_write_file+0xef/0x170
> Mar 25 11:48:25 afs245 kernel: [] vfs_write+0xb4/0x130
> Mar 25 11:48:25 afs245 kernel: [] sys_write+0x5f/0xa0
> Mar 25 11:48:25 afs245 kernel: [] ?
> __audit_syscall_exit+0x246/0x2f0
> Mar 25 11:48:25 afs245 kernel: []
> system_call_fastpath+0x16/0x1b
> Mar 25 11:48:25 afs245 kernel: Code: 48 8b 00 48 8b 55 b0 8b 4d b8 48
> 8b 75 a8 4c 8b 45 a0 4c 8b 4d c0 a8 40 0f 84 b8 fe ff ff 49 83 cf 01
> e9 af fe f