Re: [ceph-users] Data still in OSD directories after removing
On Thu, May 22, 2014 at 12:56 PM, Olivier Bonvaletwrote: > > Le mercredi 21 mai 2014 à 18:20 -0700, Josh Durgin a écrit : >> On 05/21/2014 03:03 PM, Olivier Bonvalet wrote: >> > Le mercredi 21 mai 2014 à 08:20 -0700, Sage Weil a écrit : >> >> You're certain that that is the correct prefix for the rbd image you >> >> removed? Do you see the objects lists when you do 'rados -p rbd ls - | >> >> grep '? >> > >> > I'm pretty sure yes : since I didn't see a lot of space freed by the >> > "rbd snap purge" command, I looked at the RBD prefix before to do the >> > "rbd rm" (it's not the first time I see that problem, but previous time >> > without the RBD prefix I was not able to check). >> > >> > So : >> > - "rados -p sas3copies ls - | grep rb.0.14bfb5a.238e1f29" return nothing >> > at all >> > - # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.0002f026 >> > error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.0002f026: No such >> > file or directory >> > - # rados stat -p sas3copies rb.0.14bfb5a.238e1f29. >> > error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.: No such >> > file or directory >> > - # ls -al >> > /var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9 >> > -rw-r--r-- 1 root root 4194304 oct. 8 2013 >> > /var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9 >> > >> > >> >> If the objects really are orphaned, teh way to clean them up is via 'rados >> >> -p rbd rm '. I'd like to get to the bottom of how they ended >> >> up that way first, though! >> > >> > I suppose the problem came from me, by doing CTRL+C while "rbd snap >> > purge $IMG". >> > "rados rm -p sas3copies rb.0.14bfb5a.238e1f29.0002f026" don't remove >> > thoses files, and just answer with a "No such file or directory". >> >> Those files are all for snapshots, which are removed by the osds >> asynchronously in a process called 'snap trimming'. 
There's no >> way to directly remove them via rados. >> >> Since you stopped 'rbd snap purge' partway through, it may >> have removed the reference to the snapshot before removing >> the snapshot itself. >> >> You can get a list of snapshot ids for the remaining objects >> via the 'rados listsnaps' command, and use >> rados_ioctx_selfmanaged_snap_remove() (no convenient wrapper >> unfortunately) on each of those snapshot ids to be sure they are all >> scheduled for asynchronous deletion. >> >> Josh
>
> Great : "rados listsnaps" see it :
> # rados listsnaps -p sas3copies rb.0.14bfb5a.238e1f29.0002f026
> rb.0.14bfb5a.238e1f29.0002f026:
> cloneid  snaps  size     overlap
> 41554    35746  4194304  []
>
> So, I have to write a wrapper to rados_ioctx_selfmanaged_snap_remove(), and find a way to obtain a list of all "orphan" objects ?
>
> I also tried to recreate the object (rados put) then remove it (rados rm), but the snapshots are still here.
>
> Olivier

Hi, there is certainly an issue with (at least) old FileStore and snapshot chunks: they become completely unreferenced (even for the listsnaps example above) but are still present in omap and on the filesystem after complete image and snapshot removal. Given that the control flow was never interrupted, i.e. the snap deletion command always exited successfully, as did the image removal itself, what can be done for those poor data chunks? This leakage over long runs (about eight months in the given case) can be quite problematic to handle, as the orphans consume almost as much space as the 'active' rest of the storage on the affected OSDs. Since the chunks are still referenced in omap for some reason, they must not be deleted directly, so my question narrows down to a possibly existing workaround for this. Thanks!

3.1b0_head$ find . -type f -name '*64ba14d3dd381*' -mtime +90
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__23116_25FB11B0__3
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__241e9_25FB11B0__3
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__2507f_25FB11B0__3
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__25dfd_25FB11B0__3
3.1b0_head$ find . -type f -name '*64ba14d3dd381*snap*'
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__snapdir_25FB11B0__3
./DIR_0/DIR_B/DIR_1/DIR_2/DIR_3/rbd\udata.64ba14d3dd381.00010eb3__snapdir_2B8321B0__3
./DIR_0/DIR_B/DIR_1/DIR_4/DIR_6/rbd\udata.64ba14d3dd381.0001c715__snapdir_F5D641B0__3
./DIR_0/DIR_B/DIR_1/DIR_4/DIR_9/rbd\udata.64ba14d3dd381.0002b694__snapdir_CC4941B0__3
./DIR_0/DIR_B/DIR_1/DIR_5/DIR_9/rbd\udata.64ba14d3dd381.0001b6f7__snapdir_08B951B0__3

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
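To sketch the first half of Josh's suggestion, a small parser for the plain-text `rados listsnaps` output shown above could collect the snapshot ids per clone, which would then be fed to rados_ioctx_selfmanaged_snap_remove() through a thin C wrapper (that function name comes from the librados C API mentioned in the thread; the parser itself is an illustrative assumption, not an existing tool):

```python
def parse_listsnaps(output):
    """Parse plain-text `rados listsnaps` output into {cloneid: [snap ids]}.

    Assumes the format shown above: an object-name header ending in ':',
    a column header, then one row per clone with a numeric cloneid,
    a comma-separated snaps column, size, and overlap.
    """
    clones = {}
    for line in output.splitlines():
        fields = line.split()
        # skip the object-name header and the column-header line
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        cloneid = int(fields[0])
        snaps = [int(s) for s in fields[1].split(',') if s.isdigit()]
        clones[cloneid] = snaps
    return clones

# sample taken from the listsnaps output quoted in the thread
sample = """rb.0.14bfb5a.238e1f29.0002f026:
cloneid snaps size overlap
41554 35746 4194304 []"""

ids = parse_listsnaps(sample)
```

Each collected snap id would then be handed to the C wrapper for scheduling asynchronous deletion.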
Re: [ceph-users] OSD on XFS ENOSPC at 84% data / 5% inode and inode64?
On Thu, Nov 26, 2015 at 1:29 AM, Laurent GUERBY wrote:
> Hi,
>
> After our trouble with ext4/xattr soft lockup kernel bug we started
> moving some of our OSD to XFS, we're using ubuntu 14.04 3.19 kernel
> and ceph 0.94.5.
>
> We have two out of 28 rotational OSD running XFS and
> they both get restarted regularly because they're terminating with
> "ENOSPC":
>
> 2015-11-25 16:51:08.015820 7f6135153700 0 filestore(/var/lib/ceph/osd/ceph-11) error (28) No space left on device not handled on operation 0xa0f4d520 (12849173.0.4, or op 4, counting from 0)
> 2015-11-25 16:51:08.015837 7f6135153700 0 filestore(/var/lib/ceph/osd/ceph-11) ENOSPC handling not implemented
> 2015-11-25 16:51:08.015838 7f6135153700 0 filestore(/var/lib/ceph/osd/ceph-11) transaction dump:
> ...
> {
>     "op_num": 4,
>     "op_name": "write",
>     "collection": "58.2d5_head",
>     "oid": "53e4fed5\/rbd_data.11f20f75aac8266.000a79eb\/head\/\/58",
>     "length": 73728,
>     "offset": 4120576,
>     "bufferlist length": 73728
> },
>
> (Writing the last 73728 bytes = 72 kbytes of 4 Mbytes if I'm reading
> this correctly)
>
> Mount options:
>
> /dev/sdb1 /var/lib/ceph/osd/ceph-11 xfs rw,noatime,attr2,inode64,noquota
>
> Space and Inodes:
>
> Filesystem  Type  1K-blocks   Used        Available  Use%  Mounted on
> /dev/sdb1   xfs   1947319356  1624460408  322858948  84%   /var/lib/ceph/osd/ceph-11
>
> Filesystem  Type  Inodes    IUsed    IFree     IUse%  Mounted on
> /dev/sdb1   xfs   48706752  1985587  46721165  5%    /var/lib/ceph/osd/ceph-11
>
> We're only using rbd devices, so max 4 MB/object write, how
> can we get ENOSPC for a 4MB operation with 322 GB free space?
>
> The most surprising thing is that after the automatic restart
> disk usage keeps increasing and we no longer get ENOSPC for a while.
>
> Did we miss a needed XFS mount option? Did other ceph users
> encounter this issue with XFS?
>
> We have no such issue with ~96% full ext4 OSD (after setting the right
> value for the various ceph "fill" options).
> Thanks in advance,
>
> Laurent

Hi, from the given numbers one can conclude that you are facing some kind of XFS preallocation bug: the average on-disk file size (used space divided by inode count) is four to five times smaller than the 4 MB rbd object size, yet the filesystem still hits ENOSPC. At a glance it could be avoided by specifying a relatively small allocsize= mount option, of course at some cost to overall performance; appropriate benchmarks can be found through ceph-users/ceph-devel. Also, do you plan to keep the overcommit ratio that high forever?
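For reference, the rough arithmetic behind the preallocation suspicion, using the df numbers quoted above (an illustrative back-of-envelope check, not a diagnostic):

```python
# df figures from the report, in 1 KiB blocks
used_kib = 1624460408   # space used on /dev/sdb1
files = 1985587         # inodes in use

# average space consumed per file on the OSD
avg_file_kib = used_kib / files        # ~818 KiB per file

# how much smaller that is than a full 4 MiB rbd object
ratio = (4 * 1024) / avg_file_kib      # ~5x
```

With files averaging well under 4 MiB, running out of space at 84% usage points at speculative allocation beyond what the files actually hold.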
Re: [ceph-users] XFS and nobarriers on Intel SSD
On Mon, Sep 7, 2015 at 12:54 PM, Paul Mansfieldwrote: > > > On 04/09/15 20:55, Richard Bade wrote: >> We have a Ceph pool that is entirely made up of Intel S3700/S3710 >> enterprise SSD's. >> >> We are seeing some significant I/O delays on the disks causing a “SCSI >> Task Abort” from the OS. This seems to be triggered by the drive >> receiving a “Synchronize cache command”. > > I've heard from other sources that the new Intel 3610 and 3710 have been > afflicted by a bug, possibly now fixed with new firmware, that might be > the cause of your problem. > The person who first reported it said that they upgraded from 3600 units > and never had a problem but started seeing issues with 3610 model. > > When they look at their log they see this > > Aug 9 21:50:39 cetacea kernel: [177609.957939] ata2.00: exception Emask > 0x0 SAct 0x6000 SErr 0x0 action 0x6 frozen > Aug 9 21:50:39 cetacea kernel: [177609.958480] ata2.00: failed command: > READ FPDMA QUEUED > Aug 9 21:50:39 cetacea kernel: [177609.958995] ata2.00: cmd > 60/00:68:00:08:db/04:00:0a:00:00/40 tag 13 ncq 524288 in > Aug 9 21:50:39 cetacea kernel: [177609.958995] res > 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) > Aug 9 21:50:39 cetacea kernel: [177609.960074] ata2.00: status: { DRDY } > Aug 9 21:50:39 cetacea kernel: [177609.960628] ata2.00: failed command: > READ FPDMA QUEUED > Aug 9 21:50:39 cetacea kernel: [177609.961198] ata2.00: cmd > 60/f0:70:00:0c:db/00:00:0a:00:00/40 tag 14 ncq 122880 in > Aug 9 21:50:39 cetacea kernel: [177609.961198] res > 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > Aug 9 21:50:39 cetacea kernel: [177609.962405] ata2.00: status: { DRDY } > Aug 9 21:50:39 cetacea kernel: [177609.963001] ata2: hard resetting link > Aug 9 21:50:40 cetacea kernel: [177610.281881] ata2: SATA link up 6.0 > Gbps (SStatus 133 SControl 300) > Aug 9 21:50:40 cetacea kernel: [177610.282865] ata2.00: configured for > UDMA/133 > Aug 9 21:50:40 cetacea kernel: [177610.282887] ata2.00: 
device reported > invalid CHS sector 0
> Aug 9 21:50:40 cetacea kernel: [177610.282890] ata2.00: device reported invalid CHS sector 0
> Aug 9 21:50:40 cetacea kernel: [177610.282896] ata2: EH complete

Intel had a mess with consecutive NCQ command handling [1] on the 3x10 series and recently issued a firmware fix [2]. LSI controllers apparently have a different bug, as people report bus resets which are different from the ones seen on the C602. The firmware release fixed the problem for me, as did the complete NCQ disablement mentioned in the thread below.

1. https://communities.intel.com/thread/77801
2. https://downloadcenter.intel.com/download/23931/Intel-Solid-State-Drive-Data-Center-Tool
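As a side note, the complete NCQ disablement mentioned above can be tried per device through sysfs without a reboot. A sketch (config fragment) with /dev/sda as a placeholder device; the setting does not persist across reboots:

```shell
# Check the current NCQ queue depth for the drive
cat /sys/block/sda/device/queue_depth

# Reduce it to 1, which effectively disables NCQ on that device
echo 1 | sudo tee /sys/block/sda/device/queue_depth
```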
Re: [ceph-users] Slow requests during ceph osd boot
On Wed, Jul 15, 2015 at 12:15 PM, Jan Schermer j...@schermer.cz wrote: We have the same problems, we need to start the OSDs slowly. The problem seems to be CPU congestion. A booting OSD will use all available CPU power you give it, and if it doesn’t have enough nasty stuff happens (this might actually be the manifestation of some kind of problem in our setup as well). It doesn’t do that always - I was restarting our hosts this weekend and most of them came up fine with simple “service ceph start”, some just sat there spinning the CPU and not doing any real world (and the cluster was not very happy about that). Jan On 15 Jul 2015, at 10:53, Kostis Fardelas dante1...@gmail.com wrote: Hello, after some trial and error we concluded that if we start the 6 stopped OSD daemons with a delay of 1 minute, we do not experience slow requests (threshold is set on 30 sec), althrough there are some ops that last up to 10s which is already high enough. I assume that if we spread the delay more, the slow requests will vanish. The possibility of not having tuned our setup to the most finest detail is not zeroed out but I wonder if at any way we miss some ceph tuning in terms of ceph configuration. We run firefly latest stable version. Regards, Kostis On 13 July 2015 at 13:28, Kostis Fardelas dante1...@gmail.com wrote: Hello, after rebooting a ceph node and the OSDs starting booting and joining the cluster, we experience slow requests that get resolved immediately after cluster recovers. It is improtant to note that before the node reboot, we set noout flag in order to prevent recovery - so there are only degraded PGs when OSDs shut down- and let the cluster handle the OSDs down/up in the lightest way. Is there any tunable we should consider in order to avoid service degradation for our ceph clients? 
Regards, Kostis

As far as I've seen this problem, the main issue for regular disk-backed OSDs is IOPS starvation during some interval after reading maps from the filestore and marking itself 'in': even if in-memory caches are still hot, I/O will degrade significantly for a short period. A possible workaround for an otherwise healthy cluster and a node-wide restart is to set the norecover flag; it greatly reduces the chance of hitting slow operations. Of course this is applicable only to a non-empty cluster with tens of percent of average utilization on rotating media. I first pointed out this issue a couple of years ago (it *does* break the 30s I/O SLA for a returning OSD, whereas refilling the same OSD from scratch would not violate that SLA, at the cost of a far longer completion time for the refill). From the UX side, it would be great to introduce some kind of recovery throttler for newly started OSDs, as recovery_delay_start does not prevent immediate recovery procedures.
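The norecover workaround described above could look like the following operator sequence (a sketch of standard Ceph CLI flags, to be adapted to your restart procedure):

```shell
# Before taking the node down: avoid rebalancing and hold off recovery
ceph osd set noout
ceph osd set norecover

# ... reboot the node, wait for all its OSDs to boot and rejoin ...

# Once the OSDs are back in and serving I/O, let recovery proceed
ceph osd unset norecover
ceph osd unset noout
```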
Re: [ceph-users] Unexpected issues with simulated 'rack' outage
The question is: is this behavior indeed expected? The answer may well be yes if you are using a large number of placement groups, and 16k is indeed a large number. Peering may take a long time, effectively blocking I/O requests during that period. Do you have a ceph -w log of this transition to share?
Re: [ceph-users] Unexpected issues with simulated 'rack' outage
http://pastebin.com/HfUPDTK4

Yes, you are experiencing I/O issues because of slow peering. You may put the monstores behind faster storage if they are served from rotating disks right now, or greatly decrease the number of placement groups if possible: with 100 OSDs I would try something like 4096 or 8192, though it may impact data placement flatness. I've seen a couple of off-list reports where slow peering on a large number of placement groups caused persistent problems; for example, when a user added new OSDs in the middle of a slow-going peering process, it stood still forever. If none of those suggestions helps, please feel free to report this problem to the bug tracker; possibly it would bump a very nice blueprint initiative for reducing overall peering time (https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Faster_Peering).
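A quick way to see why 16k placement groups is heavy for ~100 OSDs is the per-OSD PG load, assuming size-3 replication (the numbers are illustrative rule-of-thumb arithmetic):

```python
def pgs_per_osd(pg_num, size, osds):
    """Average number of PG instances each OSD must carry and peer."""
    return pg_num * size / osds

# roughly the situation from the thread: ~16k PGs, size 3, ~100 OSDs
current = pgs_per_osd(16384, 3, 100)   # ~491 PGs per OSD
smaller = pgs_per_osd(4096, 3, 100)    # ~123
medium = pgs_per_osd(8192, 3, 100)     # ~246
```

The commonly cited comfortable range is on the order of 100-200 PGs per OSD, which is why 4096 or 8192 look more reasonable here.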
Re: [ceph-users] clock skew detected
On Wed, Jun 10, 2015 at 4:11 PM, Pavel V. Kaygorodov pa...@inasan.ru wrote: Hi! Immediately after a reboot of the mon.3 host its clock was unsynchronized and a "clock skew detected on mon.3" warning appeared. But now (more than 1 hour of uptime) the clock is synced, yet the warning is still showing. Is this ok? Or do I have to restart the monitor after clock synchronization? Pavel.

The quorum should report OK after a five-minute interval, but there is a bug preventing the quorum from doing so, at least on the oldest supported stable versions of Ceph. I've never reported it because of its almost zero importance, but things are what they are: the theoretical behavior should be different, and the warning should disappear without a restart.
Re: [ceph-users] rbd cache + libvirt
On Tue, Jun 9, 2015 at 7:59 AM, Alexandre DERUMIER aderum...@odiso.com wrote: host conf : rbd_cache=true : guest cache=none : result : cache (wrong)

Thanks Alexandre, so you are confirming that this exact case misbehaves?
Re: [ceph-users] rbd cache + libvirt
On Tue, Jun 9, 2015 at 11:51 AM, Alexandre DERUMIER aderum...@odiso.com wrote: Thanks Alexandre, so you are confirming that this exact case misbehaves? The rbd_cache value from ceph.conf always overrides the cache value from qemu. My personal opinion is that this is wrong: the qemu value should override the ceph.conf value. I don't know what happens in a live migration, for example, if rbd_cache in ceph.conf is different on the source and target host?

Yes, you are right. The destination process in a live migration behaves as an independently launched copy; it does not inherit those kinds of parameters from the source emulator.
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 10:43 PM, Josh Durgin jdur...@redhat.com wrote: On 06/08/2015 11:19 AM, Alexandre DERUMIER wrote: Hi, looking at the latest version of QEMU, it seems that it has already been this behaviour since the addition of rbd_cache parsing in rbd.c by Josh in 2012 http://git.qemu.org/?p=qemu.git;a=blobdiff;f=block/rbd.c;h=eebc3344620058322bb53ba8376af4a82388d277;hp=1280d66d3ca73e552642d7a60743a0e2ce05f664;hb=b11f38fcdf837c6ba1d4287b1c685eb3ae5351a8;hpb=166acf546f476d3594a1c1746dc265f1984c5c85 I'll do tests on my side tomorrow to be sure. It seems like we should switch the order so ceph.conf is overridden by qemu's cache settings. I don't remember a good reason to have it the other way around. Josh

Erm, doesn't this code *already* represent the right priorities? The cache=none setting should set BDRV_O_NOCACHE, which effectively disables the cache in the mentioned snippet.
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 6:50 PM, Jason Dillaman dilla...@redhat.com wrote: Hmm ... looking at the latest version of QEMU, it appears that the RBD cache settings are changed prior to reading the configuration file instead of overriding the value after the configuration file has been read [1]. Try specifying the path to a new configuration file via the conf=/path/to/my/new/ceph.conf QEMU parameter where the RBD cache is explicitly disabled [2]. [1] http://git.qemu.org/?p=qemu.git;a=blob;f=block/rbd.c;h=fbe87e035b12aab2e96093922a83a3545738b68f;hb=HEAD#l478 [2] http://ceph.com/docs/master/rbd/qemu-rbd/#usage

Actually the mentioned snippet presumes the *expected* behavior, with cache=xxx driving the overall cache behavior. Probably the pass itself (from cache=none to the proper bitmask values in the backend properties) is broken in some way. CCing qemu-devel for a possible bug.
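Jason's workaround from [2] could look roughly like this; the file name and pool/image are placeholders, and the conf= drive parameter is the one documented on the qemu-rbd usage page he links:

```shell
# A separate ceph.conf with the RBD cache explicitly disabled
cat > /etc/ceph/ceph-nocache.conf <<'EOF'
[client]
rbd cache = false
EOF

# Point QEMU at it for the one disk that must run uncached
qemu-system-x86_64 \
  -drive file=rbd:libvirt-pool/www-pa2-webmutu:conf=/etc/ceph/ceph-nocache.conf,format=raw,cache=none,if=virtio
```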
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 1:24 PM, Arnaud Virlet avir...@easter-eggs.com wrote:

Hi, actually we use libvirt VMs with a ceph rbd pool for storage. By default we want to have disk cache=writeback for all disks in libvirt. In /etc/ceph/ceph.conf we have rbd cache = true, and in each VM's XML we set cache=writeback for all disks in the VM configuration. We want to use one ocfs2 volume on our rbd pool. For this volume we want to set cache=none. When we set cache=none in the libvirt template for these hosts, it doesn't work.

Can you please describe this more specifically? Does libvirt produce a launch string with cache=none, or are you measuring the cache (mis)presence in some other way? The rbd_cache setting and cache=xxx for qemu should show conjugate behavior.

The only way that works is when we set rbd cache = false in /etc/ceph/ceph.conf. How can I set cache=none just for one volume specifically without modifying the default settings in ceph.conf?

Regards, Arnaud
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 2:48 PM, Arnaud Virlet avir...@easter-eggs.com wrote:

Thanks for your reply,

On 06/08/2015 12:31 PM, Andrey Korolyov wrote: On Mon, Jun 8, 2015 at 1:24 PM, Arnaud Virlet avir...@easter-eggs.com wrote: Hi, actually we use libvirt VMs with a ceph rbd pool for storage. By default we want to have disk cache=writeback for all disks in libvirt. In /etc/ceph/ceph.conf we have rbd cache = true, and in each VM's XML we set cache=writeback for all disks in the VM configuration. We want to use one ocfs2 volume on our rbd pool. For this volume we want to set cache=none. When we set cache=none in the libvirt template for these hosts, it doesn't work. Can you please describe this more specifically? Does libvirt produce a launch string with cache=none, or are you measuring the cache (mis)presence in some other way? The rbd_cache setting and cache=xxx for qemu should show conjugate behavior.

With rbd_cache = true in ceph.conf and cache = none in the XML, libvirt produces a launch string with cache=none for the ocfs2 volume.

Do I understand you right that you are using a certain template engine for both OCFS- and RBD-backed volumes within a single VM config, and that it does not allow per-disk cache mode separation in the suggested way?

The only way that works is when we set rbd cache = false in /etc/ceph/ceph.conf. How can I set cache=none just for one volume specifically without modifying the default settings in ceph.conf?

Regards, Arnaud
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 3:44 PM, Arnaud Virlet avir...@easter-eggs.com wrote:

On 06/08/2015 01:59 PM, Andrey Korolyov wrote: Do I understand you right that you are using a certain template engine for both OCFS- and RBD-backed volumes within a single VM config, and that it does not allow per-disk cache mode separation in the suggested way?

My VM has 3 disks on an RBD backend. Disks 1 and 2 have cache=writeback, disk 3 (for ocfs2) has cache=none in my VM XML file. When I start the VM, libvirt produces a launch string with cache=writeback for disks 1/2, and with cache=none for disk 3. Even with cache = none for disk 3, it seems it doesn't take effect without setting rbd cache = false in ceph.conf.

It is very strange and contradicts what should happen. Could you post the resulting qemu argument string, by any chance? Also, please share the method you are using to determine whether a disk uses the emulator cache or not.
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 6:36 PM, Arnaud Virlet avir...@easter-eggs.com wrote: On 06/08/2015 03:17 PM, Andrey Korolyov wrote: On Mon, Jun 8, 2015 at 3:44 PM, Arnaud Virlet avir...@easter-eggs.com wrote: On 06/08/2015 01:59 PM, Andrey Korolyov wrote: Am I understand you right that you are using certain template engine for both OCFS- and RBD-backed volumes within a single VM` config and it does not allow per-disk cache mode separation in a suggested way? My VM has 3 disks on RBD backend. disks 1 and 2 have cache=writeback, disk 3 ( for ocfs2 ) has cache=none in my VM XML file. When I start the VM, libvirt produce a launch string with cache=wtriteback for disk 1/2, and with cache=none for disk 3. Even if cache = none for disk 3, it seems doesn't take effect without set rbd cache = false in ceph.conf. It is very strange and contradictive to what it should be. Could you post a resulting qemu argument string, by a chance? Also please share a method which you are using to determine if disk uses emulator cache or not. 
Here my qemu arguments strings for the related VM: /usr/bin/qemu-system-x86_64 -name www-pa2-01 -S -machine pc-i440fx-1.6,accel=kvm,usb=off -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 3542c57c-dd47-44cd-933f-7dae0b949012 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/www-pa2-01.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=c,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=rbd:libvirt-pool/www-pa2-01:id=libvirt:key=:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=rbd:libvirt-pool/www-pa2-01-data:id=libvirt:key=XXX:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,if=none,id=drive-virtio-disk1,format=raw,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=rbd:libvirt-pool/www-pa2-webmutu:id=libvirt:key=XXX:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,if=none,id=drive-virtio-disk2,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk2,id=virtio-disk2 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:90:53:6b,bus=pci.0,addr=0x3 -netdev tap,fd=35,id=hostnet1,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:7b:b9:85,bus=pci.0,addr=0x8 -netdev tap,fd=37,id=hostnet2,vhost=on,vhostfd=38 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:2e:ce:f6,bus=pci.0,addr=0xa -chardev 
pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:5 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

For the disk without cache: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=rbd:libvirt-pool/www-pa2-webmutu:id=libvirt:key=XXX:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,if=none,id=drive-virtio-disk2,format=raw,cache=none

I don't really have a method to determine whether a disk uses the emulator's cache or not. When I was testing whether my ocfs2 cluster works correctly, I realized that with rbd cache = true in ceph.conf and cache=none in the XML file, my ocfs2 cluster doesn't work: cluster members don't see when members join or leave the cluster. But with rbd cache = false in ceph.conf and cache = none in the XML, the OCFS2 cluster works; cluster members see the other members when they join or leave.

Thanks, can you please also add a description of how your OCFS2 cluster is set up? Honestly, there is no chance of intersecting with any kind of software bug here, because of the different entities you are using (userspace storage emulation versus a kernel filesystem).
[ceph-users] Snap operation throttling (again)
Hello, this question has been brought up many times before and has been solved in various ways (snap trimmer, scheduler priorities, and a persistent fix for a ReplicatedPG issue), but it seems that current Ceph versions may still suffer during rollback operations on large images and at large scale. Given the CFQ scheduler for rotating media and ~10 percent utilization as initial preconditions, the rollback of a quarter-terabyte image over 100 OSDs may result in a significant latency impact and, for such a configuration, break the 30s request completion barrier. Although recent improvements did very well in terms of congestion control on slow media for many kinds of non-client ops, this exact issue remains. I think it can be solved by another sleeper knob, but I am unsure where its proper place should be. Thanks for suggestions!
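For what it's worth, one existing knob in this family is the OSD snap trim sleep; a ceph.conf fragment such as the following (the value is illustrative) injects a delay between trim operations on each OSD, though it does not specifically cover the rollback path discussed here:

```ini
[osd]
; seconds to sleep between snap trimming operations;
; larger values trade trim throughput for client latency
osd snap trim sleep = 0.1
```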
Re: [ceph-users] Preliminary RDMA vs TCP numbers
On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy somnath@sandisk.com wrote: Hi, Please find the preliminary performance numbers of TCP vs RDMA (XIO) implementation (on top of SSDs) in the following link. http://www.slideshare.net/somnathroy7568/ceph-on-rdma The attachment didn't go through it seems, so I had to use slideshare. Mark, if we have time, I can present it in tomorrow's performance meeting. Thanks & Regards, Somnath

Those numbers are really impressive (for small block sizes at least)! What TCP settings are you using? For example, the difference can shrink at scale due to less intensive per-connection acceleration with CUBIC on a larger number of nodes, though I do not believe that was the main reason for the observed TCP catch-up on a relatively flat workload such as fio generates.
[ceph-users] Rebalance after empty bucket addition
Hello, after reaching a certain ceiling of the host/PG ratio, moving an empty bucket in causes a small rebalance:

ceph osd crush add-bucket 10.10.2.13
ceph osd crush move 10.10.2.13 root=default rack=unknownrack

I have two pools: one is very large and keeps a proper amount of PGs per OSD, but the other in fact contains fewer PGs than the number of active OSDs, and after insertion of the empty bucket it goes into a rebalance, even though the actual placement map has not changed. Keeping in mind that this case is very far from any sane production configuration, is this expected behavior? Thanks!
Re: [ceph-users] New Intel 750 PCIe SSD
On Thu, Apr 2, 2015 at 8:03 PM, Mark Nelson mnel...@redhat.com wrote: Thought folks might like to see this: http://hothardware.com/reviews/intel-ssd-750-series-nvme-pci-express-solid-state-drive-review

Quick summary:
- PCIe SSD based on the P3700
- 400GB for $389!
- 1.2GB/s writes and 2.4GB/s reads
- power loss protection
- 219TB write endurance

So basically looks extremely attractive on paper except for the write endurance. I suspect this is not actually using HET cells (a summary I read said it was). How far beyond Intel's endurance rating the card can go is the big question. Mark

All the characteristics are awesome except endurance: it would burn out in a couple of months, and replacement requires a node poweroff...
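A back-of-envelope check of the endurance concern, using the figures from the summary above; the sustained average load is an assumption for illustration, not a measurement:

```python
# Figures from the review summary above
rated_write_tb = 219      # rated write endurance, TB
full_speed_gbs = 1.2      # sequential write speed, GB/s

# Assumed steady journal-style average load (illustrative)
avg_load_gbs = 0.1        # 100 MB/s

# Lifetime if writing flat out at the rated sequential speed
hours_flat_out = rated_write_tb * 1024 / full_speed_gbs / 3600   # ~52 hours

# Lifetime at the assumed ~100 MB/s average
days_at_avg = rated_write_tb * 1024 / avg_load_gbs / 3600 / 24   # ~26 days
```

Even at a quarter of that assumed load the card lasts only a few months, which is where the "couple of months" estimate comes from.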
Re: [ceph-users] OSD slow requests causing disk aborts in KVM
On Fri, Feb 6, 2015 at 12:16 PM, Krzysztof Nowicki krzysztof.a.nowi...@gmail.com wrote: Hi all, I'm running a small Ceph cluster with 4 OSD nodes, which serves as a storage backend for a set of KVM virtual machines. The VMs use RBD for disk storage. On the VM side I'm using virtio-scsi instead of virtio-blk in order to gain DISCARD support. Each OSD node is running on a separate machine, using a 3TB WD Black drive + Samsung SSD for journal. The machines used for OSD nodes are not equal in spec. Three of them are small servers, while one is a desktop PC. The last node is the one causing trouble. During high loads caused by remapping due to one of the other nodes going down I've experienced some slow requests. To my surprise however these slow requests caused aborts from the block device on the VM side, which ended up corrupting files. What I wonder is whether such behaviour (aborts) is normal in case slow requests pile up. I always thought that these requests would be delayed but eventually they'd be handled. Are there any tunables that would help me avoid such situations? I would really like to avoid VM outages caused by such corruption issues. I can attach some logs if needed. Best regards, Chris

Hi, this is the inevitable payoff for using a SCSI backend on storage capable of sufficiently slow operations. There were some argonaut/bobtail-era discussions on the ceph ML; maybe those readings will be interesting for you. AFAIR the SCSI disk would abort after ~70s of not receiving an ack for a pending operation.
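If the aborts rather than the delays are the main pain point, the guest-side SCSI command timeout can be raised so that slow requests are retried instead of aborting the disk. A sketch (config fragment), with sda as a placeholder for the affected device inside the VM:

```shell
# Current SCSI command timeout in seconds (commonly 30 on Linux guests)
cat /sys/block/sda/device/timeout

# Raise it so temporary cluster slowness does not trigger task aborts;
# does not persist across reboots (a udev rule would be needed for that)
echo 120 | sudo tee /sys/block/sda/device/timeout
```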
Re: [ceph-users] Monitor Restart triggers half of our OSDs marked down
Yep, it's a silly bug and I'm surprised we haven't noticed until now! http://tracker.ceph.com/issues/10762 https://github.com/ceph/ceph/pull/3631 Thanks! sage Thanks Sage, is dumpling missing from the backport list on purpose? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd resize (shrink) taking forever and a day
On Sun, Jan 4, 2015 at 10:43 PM, Edwin Peer ep...@isoho.st wrote: Thanks Jake, however, I need to shrink the image, not delete it as it contains a live customer image. Is it possible to manually edit the RBD header to make the necessary adjustment? Technically speaking, yes - the size information is contained in the omap attributes of the header, but I'd recommend testing this approach somewhere else first. Even if it works, it will leave a lot of stray files in the filestore. If you are running a VM on top of it with a recent qemu, it's easy to tell the emulator the desired block geometry (size), then launch a drive-mirror job and eventually switch the backing image (during a power cycle, for example). Although the second approach may work, I am not absolutely sure that the drive-mirror job will respect the geometry override, so better to check this too. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
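For illustration only, a way to locate and dump the header object's omap before attempting anything (pool and image names are placeholders; this assumes a format 2 image, and actually rewriting the size key by hand is exactly the risky part being cautioned against):

```shell
# Locate an RBD format-2 header object and dump its omap (names are placeholders).
POOL=rbd
IMAGE=myimage
# block_name_prefix from 'rbd info' is rbd_data.<id>; the header is rbd_header.<id>.
HDR="rbd_header.$(rbd info ${POOL}/${IMAGE} 2>/dev/null \
    | awk -F'rbd_data.' '/block_name_prefix/{print $2}')"
echo "header object: ${HDR}"
# rados -p "$POOL" listomapvals "$HDR"   # shows the binary 'size' key, among others
```

Dumping is harmless; writing the key back with `rados setomapval` is the part that needs to be rehearsed elsewhere first, as the reply says.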
Re: [ceph-users] Is there an negative relationship between storage utilization and ceph performance?
Hi, you can reduce the reserved space for ext4 via tune2fs and gain a little more space, up to 5%. By the way, if you are using Centos7, it reserves a ridiculously high disk percentage for ext4 (at least during installation). Performance should probably be compared with a smaller allocsize mount option for XFS (32..512k), and the comparison should be made over long runs (like weeks of small writes from a bunch of clients, to reach a bad enough fragmentation ratio). When I last measured comparable results, it was the bobtail era, and XFS started to lose operation speed at about 40% real allocation. If you prefer to use rados bench, real-life fragmentation may be achieved by running multiple benches with a small block size simultaneously in different pools over the same set of OSDs. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
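The tune2fs knob mentioned above, as a sketch (the device path is a placeholder; -m sets the reserved-blocks percentage, 5% being the mkfs default):

```shell
# Drop the ext4 reserved-blocks percentage from the default 5% to 1%
# on an OSD data partition (device path is a placeholder).
DEV=/dev/sdb1
RESERVED_PCT=1
# tune2fs -m ${RESERVED_PCT} ${DEV}
# tune2fs -l ${DEV} | grep -i 'reserved block count'   # verify afterwards
echo "would set reserved blocks to ${RESERVED_PCT}% on ${DEV}"
```

Setting it to 0 is also possible, but leaving 1% keeps a little slack for root-owned recovery operations.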
Re: [ceph-users] Marking a OSD a new in the OSDMap
On Wed, Dec 31, 2014 at 8:20 PM, Wido den Hollander w...@42on.com wrote: On 12/31/2014 05:54 PM, Andrey Korolyov wrote: On Wed, Dec 31, 2014 at 7:34 PM, Wido den Hollander w...@42on.com wrote: Hi, Is there a way to set a OSD to exists,new in the OSDMap? I want to re-install a OSD and re-use the ID and it's key. I know I can remove the OSD and re-add it, but that triggers balancing and I want to prevent that by simply marking the OSD a new and booting it with a freshly formatted XFS filesystem. Yes, you can call mkfs with existing UUID: --osd-uuid xxx. Ah, but will that also mark the OSD as new in the OSDMap? Will it receive all old maps? Yes, technically I do not understand why most replacement guides are going through out-in plus optional rm-add procedure, it works perfectly with freshly formatted filestore using existing uuid and key. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
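A sketch of the replace-in-place procedure under discussion (the OSD id, device path and uuid-extraction one-liner are assumptions; verify where your release prints the uuid in `ceph osd dump` before relying on the awk):

```shell
# Re-provision an OSD on a fresh filesystem while keeping its id, uuid and key,
# so the OSDMap entry is reused and no rebalancing is triggered.
ID=12                                                # assumed OSD id being replaced
UUID=$(ceph osd dump 2>/dev/null | awk -v o="osd.${ID}" '$1==o{print $NF}')
UUID=${UUID:-00000000-0000-0000-0000-000000000000}   # placeholder when run off-cluster
# mkfs.xfs -f /dev/sdX1                              # fresh filestore (placeholder device)
# mount /dev/sdX1 /var/lib/ceph/osd/ceph-${ID}
# ceph-osd -i ${ID} --mkfs --osd-uuid "${UUID}"      # same uuid => same OSDMap identity
# Reuse the existing key (ceph auth get osd.${ID}) so the daemon can authenticate.
echo "would recreate osd.${ID} with uuid ${UUID}"
```

The OSD will then boot, receive the maps it missed, and backfill its own PGs rather than triggering a cluster-wide reshuffle.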
Re: [ceph-users] Improving Performance with more OSD's?
On Mon, Dec 29, 2014 at 12:47 PM, Tomasz Kuzemko tomasz.kuze...@ovh.net wrote: On Sun, Dec 28, 2014 at 02:49:08PM +0900, Christian Balzer wrote: You really, really want size 3 and a third node for both performance (reads) and redundancy. How does it benefit read performance? I thought all reads are made only from the active primary OSD. -- Tomasz Kuzemko tomasz.kuze...@ovh.net You`ll have chunks of primary data scattered between three devices instead of two, as each pg will have a random acting set (until you decide to pin primary). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] xfs/nobarrier
A power supply means bigger capex and less redundancy, as the emergency procedure in case of power failure is less deterministic than with a controlled battery-backed cache. A cache battery is smaller and way more predictable for health measurement than a UPS (if it passes the internal check, it will *always* be enough to keep the memory powered for a while, but a UPS requires periodic battle testing if you want to know that it will still be able to ride out a power failure - with two power lanes it should be safe enough - simply because the device itself has a more complex structure than a battery with a single voltage stabilizer). Anyway, in my experience XFS nobarrier does not bring enough of a performance boost to be worth enabling. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] xfs/nobarrier
On Sat, Dec 27, 2014 at 4:31 PM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: On Sat, 27 Dec 2014 04:59:51 PM you wrote: Power supply means bigger capex and less redundancy, as the emergency procedure in case of power failure is less deterministic than with controlled battery-backed cache. Yes, the whole auto shut-down procedure is rather more complex and fragile for a UPS than a controller cache Anyway XFS nobarrier does not bring enough performance boost to be enabled by my experience. It makes a non-trivial difference on my (admittedly slow) setup, with write bandwidth going from 35 MB/s to 51 MB/s Are you able to separate log with data in your setup and check the difference? If your devices are working strictly under their upper limits for bw/IOPS, separating meta and data bytes may help a lot, at least for synchronous clients. So, depending on type of your benchmark (sync/async/IOPS-/bandwidth-hungry) you may win something just for crossing journal and data between disks (and increase failure domain for a single disk as well :) ). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
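For reference, separating the XFS log from the data is a mkfs/mount-time option; a sketch with placeholder devices:

```shell
# Put the XFS metadata log on an SSD partition, data on the spinner.
DATA_DEV=/dev/sdb1      # placeholder: HDD data partition
LOG_DEV=/dev/sdc1       # placeholder: SSD log partition
# mkfs.xfs -f -l logdev=${LOG_DEV},size=128m ${DATA_DEV}
# mount -o logdev=${LOG_DEV} ${DATA_DEV} /var/lib/ceph/osd/ceph-0
echo "data=${DATA_DEV} log=${LOG_DEV}"
```

Note the filesystem then depends on both devices: losing the log device makes the filesystem unmountable, which is the failure-domain trade-off mentioned below.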
Re: [ceph-users] xfs/nobarrier
On Sun, Dec 28, 2014 at 1:25 AM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: On Sat, 27 Dec 2014 06:02:32 PM you wrote: Are you able to separate log with data in your setup and check the difference? Do you mean putting the OSD journal on a separate disk? I have the journals on SSD partitions, which has helped a lot, previously I was getting 13 MB/s No, I meant the XFS journal, as we are speaking about filestore fs performance. It's not a good SSD - Samsung 840 EVO :( one of my plans for the new year is to get SSDs with better seq write speed and IOPS I've been trying to figure out if adding more OSD's will improve my performance, I only have 2 OSD's (one per node) Erm, yes. Two OSDs cannot be considered even for a performance-measurement testbed setup, and neither can three or any other very small number. This explains the numbers you are getting and the impact of the nobarrier option. So, depending on type of your benchmark (sync/async/IOPS-/bandwidth-hungry) you may win something just for crossing journal and data between disks (and increase failure domain for a single disk as well ). One does tend to focus on raw sequential read/writes for benchmarking, but my actual usage is solely for hosting KVM images, so really random R/W is probably more important. Ok, then my suggestion may not help as much as it could. -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird scrub problem
On Sat, Dec 27, 2014 at 4:09 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Dec 23, 2014 at 4:17 AM, Samuel Just sam.j...@inktank.com wrote: Oh, that's a bit less interesting. The bug might be still around though. -Sam On Mon, Dec 22, 2014 at 2:50 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote: You'll have to reproduce with logs on all three nodes. I suggest you open a high priority bug and attach the logs. debug osd = 20 debug filestore = 20 debug ms = 1 I'll be out for the holidays, but I should be able to look at it when I get back. -Sam Thanks Sam, although I am not sure if it makes not only a historical interest (the mentioned cluster running cuttlefish), I`ll try to collect logs for scrub. Same stuff: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15447.html https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14918.html Looks like issue is still with us, though it requires meta or file structure corruption to show itself. I`ll check if it can be reproduced via rsync -X sec pg subdir - pri pg subdir or vice-versa. Mine case shows slightly different pathnames for same objects with same checksums, may be a root reason then. As every case mentioned, including mine, happened in oh-shit-hardware-is-broken case, I suggest that the incurable corruption happens during primary backfill from active replica at the recovery time. Recovery/backfill from corrupted primary copy results to crash (attached) of primary OSD, for example it can be triggered by purging one of secondary copies (top of cuttlefish branch for line numbers). Although as secondaries preserve same data with same checksums, it is possible to destroy both meta record and pg directory and refill primary back. 
The interesting point is that the corrupted primary was completely refilled after the hardware failure, but it looks like it survived long enough after the failure event to spread the corruption to the copies; I simply can not imagine a better explanation.

Thread 1 (Thread 0x7f193190d700 (LWP 64087)):
#0  0x7f194a47ab7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00857d59 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  signal handler called
#4  0x7f1948879405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7f194887cb5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7f194917789d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7f1949175996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7f19491759c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7f1949175bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0090436a in ceph::__ceph_assert_fail (assertion=0x9caf67 "r >= 0", file=<optimized out>, line=7115, func=0x9d1900 "void ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)") at common/assert.cc:77
#11 0x0065de69 in ReplicatedPG::scan_range (this=this@entry=0x4df6000, begin=..., min=min@entry=32, max=max@entry=64, bi=bi@entry=0x4df6d40) at osd/ReplicatedPG.cc:7115
#12 0x0066f5c6 in ReplicatedPG::recover_backfill (this=this@entry=0x4df6000, max=max@entry=1) at osd/ReplicatedPG.cc:6923
#13 0x0067c18d in ReplicatedPG::start_recovery_ops (this=0x4df6000, max=1, prctx=<optimized out>) at osd/ReplicatedPG.cc:6561
#14 0x006f2340 in OSD::do_recovery (this=0x2ba7000, pg=pg@entry=0x4df6000) at osd/OSD.cc:6104
#15 0x00735361 in OSD::RecoveryWQ::_process (this=<optimized out>, pg=0x4df6000) at osd/OSD.h:1248
#16 0x008faeba in ThreadPool::worker (this=0x2ba75e0, wt=0x7be1540) at common/WorkQueue.cc:119
#17 0x008fc160 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:316
#18 0x7f194a472e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x7f19489353dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#20 0x0000000000000000 in ?? ()
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Weird scrub problem
Hello, I am currently facing some strange problem, most probably a bug (osd.3 is the primary holder):

ceph pg scrub 4.458
2014-12-22 19:19:00.238077 osd.3 [ERR] 4.458 osd.34 missing 6f0df458/rbd_data.cbba8a759d2.0a5b/head//4
2014-12-22 19:19:00.238079 osd.3 [ERR] 4.458 osd.10 missing 6f0df458/rbd_data.cbba8a759d2.0a5b/head//4

The checksum is exactly the same on every OSD included in this complaint, and so is the file count:

find 4.458_* -name \*cbba8a759d2* -exec md5sum {} \; | sort
\106f7a2b6c0d71d52031d2bd92ea9111  4.458_head/DIR_8/DIR_5/DIR_4/DIR_2/rbd\\udata.cbba8a759d2.00db__head_73442458__4
\2db655db8c7b0a1cccbec79bbc9fc923  4.458_head/DIR_8/DIR_5/DIR_4/DIR_F/rbd\\udata.cbba8a759d2.0a5b__head_6F0DF458__4
\6505c6c8ccb3c103618c5ef0a22d3414  4.458_head/DIR_8/DIR_5/DIR_4/DIR_C/rbd\\udata.cbba8a759d2.14ec__head_B37BC458__4

Extended attributes look relatively fine, or at least not suspicious. Automatic repair fails on these - what should I try next? Unfortunately I cannot go through export+rm+import on most of those images just to delete the appropriate prefix. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird scrub problem
On Mon, Dec 22, 2014 at 11:50 PM, Samuel Just sam.j...@inktank.com wrote: So 4.458_head/DIR_8/DIR_5/DIR_4/DIR_F/rbd\\udata.cbba8a759d2.0a5b__head_6F0DF458__4 is present on osd 3, osd 34, and osd 10? -Sam Yes, exactly, and have same checksum. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird scrub problem
On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote: You'll have to reproduce with logs on all three nodes. I suggest you open a high priority bug and attach the logs. debug osd = 20 debug filestore = 20 debug ms = 1 I'll be out for the holidays, but I should be able to look at it when I get back. -Sam Thanks Sam; although I am not sure whether this is of more than historical interest (the mentioned cluster is running cuttlefish), I'll try to collect scrub logs. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD commits suicide
On Tue, Nov 18, 2014 at 10:04 PM, Craig Lewis cle...@centraldesktop.com wrote: That would probably have helped. The XFS deadlocks would only occur when there was relatively little free memory. Kernel 3.18 is supposed to have a fix for that, but I haven't tried it yet. Looking at my actual usage, I don't even need 64k inodes. 64k inodes should make things a bit faster when you have a large number of files in a directory. Ceph will automatically split directories with too many files into multiple sub-directories, so it's kinda pointless. I may try the experiment again, but probably not. It took several weeks to reformat all of the OSDS. Even on a single node, it takes 4-5 days to drain, format, and backfill. That was months ago, and I'm still dealing with the side effects. I'm not eager to try again. On Mon, Nov 17, 2014 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis cle...@centraldesktop.com wrote: I did have a problem in my secondary cluster that sounds similar to yours. I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs options xfs = -i size=64k). This showed up with a lot of XFS: possible memory allocation deadlock in kmem_alloc in the kernel logs. I was able to keep things limping along by flushing the cache frequently, but I eventually re-formatted every OSD to get rid of the 64k inodes. After I finished the reformat, I had problems because of deep-scrubbing. While reformatting, I disabled deep-scrubbing. Once I re-enabled it, Ceph wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would be doing a deep-scrub. I'm manually deep-scrubbing now, trying to spread out the schedule a bit. Once this finishes in a few day, I should be able to re-enable deep-scrubbing and keep my HEALTH_OK. Would you mind to check suggestions by following mine hints or hints from mentioned URLs from there http://marc.info/?l=linux-mmm=141607712831090w=2 with 64k again? 
As for me, I am not observing the lock loop after setting min_free_kbytes to half a gigabyte per OSD. Even if your locks have a different nature, it may be worth trying anyway. Thanks, I perfectly understand this. But if you have a low enough OSD/node ratio, it can be possible to check the problem at node scale. By the way, I do not see a real reason for using a lower allocsize except on a cluster designed for object storage. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD commits suicide
On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis cle...@centraldesktop.com wrote: I did have a problem in my secondary cluster that sounds similar to yours. I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs options xfs = -i size=64k). This showed up with a lot of XFS: possible memory allocation deadlock in kmem_alloc in the kernel logs. I was able to keep things limping along by flushing the cache frequently, but I eventually re-formatted every OSD to get rid of the 64k inodes. After I finished the reformat, I had problems because of deep-scrubbing. While reformatting, I disabled deep-scrubbing. Once I re-enabled it, Ceph wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would be doing a deep-scrub. I'm manually deep-scrubbing now, trying to spread out the schedule a bit. Once this finishes in a few days, I should be able to re-enable deep-scrubbing and keep my HEALTH_OK. Would you mind checking the suggestions, by following my hints or the hints from the URLs mentioned in http://marc.info/?l=linux-mmm=141607712831090w=2 with 64k again? As for me, I am not observing the lock loop after setting min_free_kbytes to half a gigabyte per OSD. Even if your locks have a different nature, it may be worth trying anyway. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
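The "half a gigabyte per OSD" rule of thumb, spelled out (the OSD count is an assumption for the example):

```shell
# Reserve roughly 512MB of free memory per OSD on the node.
OSDS=8                                   # assumed number of OSDs on this host
MIN_FREE_KB=$(( OSDS * 512 * 1024 ))     # 4194304 kB for 8 OSDs
echo "vm.min_free_kbytes = ${MIN_FREE_KB}"
# sysctl -w vm.min_free_kbytes=${MIN_FREE_KB}                        # apply now
# echo "vm.min_free_kbytes=${MIN_FREE_KB}" >> /etc/sysctl.d/90-ceph.conf  # persist
```

A larger min_free_kbytes keeps a bigger reserve of free pages, which gives atomic and high-order allocations headroom and reduces the chance of hitting the deadlock path described above.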
Re: [ceph-users] isolate_freepages_block and excessive CPU usage by OSD process
On Sat, Nov 15, 2014 at 9:45 PM, Vlastimil Babka vba...@suse.cz wrote: On 11/15/2014 06:10 PM, Andrey Korolyov wrote: On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka vba...@suse.cz wrote: On 11/15/2014 12:48 PM, Andrey Korolyov wrote: Hello, I had found recently that the OSD daemons under certain conditions (moderate vm pressure, moderate I/O, slightly altered vm settings) can go into loop involving isolate_freepages and effectively hit Ceph cluster performance. I found this thread Do you feel it is a regression, compared to some older kernel version or something? No, it`s just a rare but very concerning stuff. The higher pressure is, the more chance to hit this particular issue, although absolute numbers are still very large (e.g. room for cache memory). Some googling also found simular question on sf: http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph but there are no perf info unfortunately so I cannot say if the issue is the same or not. Well it would be useful to find out what's doing the high-order allocations. With 'perf -g -a' and then 'perf report -g' determine the call stack. Order and allocation flags can be captured by enabling the page_alloc tracepoint. Thanks, please give me some time to go through testing iterations, so I`ll collect appropriate perf.data. https://lkml.org/lkml/2012/6/27/545, but looks like that the significant decrease of bdi max_ratio did not helped even for a bit. Although I have approximately a half of physical memory for cache-like stuff, the problem with mm persists, so I would like to try suggestions from the other people. In current testing iteration I had decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and background ratio to 15 and 10 correspondingly (because default values are too spiky for mine workloads). The host kernel is a linux-stable 3.10. Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying it, or at least 3.17. 
Lot of patches went to reduce compaction overhead for (especially for transparent hugepages) since 3.10. Heh, I may say that I limited to pushing knobs in 3.10, because it has a well-known set of problems and any major version switch will lead to months-long QA procedures, but I may try that if none of mine knob selection will help. I am not THP user, the problem is happening with regular 4k pages and almost default VM settings. Also it worth to mean OK that's useful to know. So it might be some driver (do you also have mellanox?) or maybe SLUB (do you have it enabled?) is trying high-order allocations. Yes, I am using mellanox transport there and SLUB allocator, as SLAB had some issues with allocations with uneven node fill-up on a two-head system which I am primarily using. that kernel messages are not complaining about allocation failures, as in case in URL from above, compaction just tightens up to some limit Without the warnings, that's why we need tracing/profiling to find out what's causing it. and (after it 'locked' system for a couple of minutes, reducing actual I/O and derived amount of memory operations) it goes back to normal. Cache flush fixing this just in a moment, so should large room for That could perhaps suggest a poor coordination between reclaim and compaction, made worse by the fact that there are more parallel ongoing attempts and the watermark checking doesn't take that into account. min_free_kbytes. Over couple of days, depends on which nodes with certain settings issue will reappear, I may judge if my ideas was wrong. Non-default VM settings are: vm.swappiness = 5 vm.dirty_ratio=10 vm.dirty_background_ratio=5 bdi_max_ratio was 100%, right now 20%, at a glance it looks like the situation worsened, because unstable OSD host cause domino-like effect on other hosts, which are starting to flap too and only cache flush via drop_caches is helping. 
Unfortunately there is no slab info from the exhausted state due to the sporadic nature of this bug; I will try to catch it next time. slabtop (normal state):

 Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
 Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
 Active / Total Caches (% used)     : 86 / 132 (65.2%)
 Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
 751232  721707  96%    0.06K  11738       64     46952K kmalloc-64
 251636  226228  89%    0.55K   8987       28    143792K radix_tree_node
 121696   45710  37%    0.25K   3803       32     30424K kmalloc-256
 113022   80618  71%    0.19K   2691       42     21528K dentry
 112672   35160  31%    0.50K   3521       32     56336K kmalloc-512
  73136   72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
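The profiling that was suggested earlier in the thread would look roughly like this (durations are arbitrary; the commands are left commented since they need root and a live workload):

```shell
# Capture where the CPU time goes while the isolate_freepages stall is happening,
# then trace the order and gfp flags of every page allocation.
PROFILE_SECS=30
TRACE_SECS=10
# perf record -g -a -- sleep ${PROFILE_SECS}     # system-wide call graphs
# perf report -g                                 # find who calls isolate_freepages
# perf record -e kmem:mm_page_alloc -a -- sleep ${TRACE_SECS}
# perf script | head                             # shows order= and gfp_flags= per alloc
echo "profile ${PROFILE_SECS}s, trace allocations for ${TRACE_SECS}s"
```

The page_alloc tracepoint output is what identifies the high-order allocator (driver, SLUB, THP) that keeps compaction busy.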
Re: [ceph-users] Ceph and Compute on same hardware?
On Wed, Nov 12, 2014 at 5:30 PM, Haomai Wang haomaiw...@gmail.com wrote: Actually, our production cluster(up to ten) all are that ceph-osd ran on compute-node(KVM). The primary action is that you need to constrain the cpu and memory. For example, you can alloc a ceph cpu-set and memory group, let ceph-osd run with it within limited cores and memory. The another risk is the network. Because compute-node and ceph-osd shared the same kernel network stack, it exists some risks that VM may ran out of network resources such as conntracker in netfilter framework. On Wed, Nov 12, 2014 at 10:23 PM, Mark Nelson mark.nel...@inktank.com wrote: Technically there's no reason it shouldn't work, but it does complicate things. Probably the biggest worry would be that if something bad happens on the compute side (say it goes nuts with network or memory transfers) it could slow things down enough that OSDs start failing heartbeat checks causing ceph to go into recovery and maybe cause a vicious cycle of nastiness. You can mitigate some of this with cgroups and try to dedicate specific sockets and memory banks to Ceph/Compute, but we haven't done a lot of testing yet afaik. Mark On 11/12/2014 07:45 AM, Pieter Koorts wrote: Hi, A while back on a blog I saw mentioned that Ceph should not be run on compute nodes and in the general sense should be on dedicated hardware. Does this really still apply? An example, if you have nodes comprised of 16+ cores 256GB+ RAM Dual 10GBE Network 2+8 OSD (SSD log + HDD store) I understand that Ceph can use a lot of IO and CPU in some cases but if the nodes are powerful enough does it not make it an option to run compute and storage on the same hardware to either increase density of compute or save money on additional hardware? What are the reasons for not running Ceph on the Compute nodes. 
Thanks Pieter ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best Regards, Wheat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Yes, the essential part is resource management, which can be either dynamic or static. In Flops we implemented dynamic resource control, which allows packing VMs and OSDs more densely than static cg-based jails can allow (and it requires deep orchestration modifications for every open source cloud orchestrator, unfortunately). As long as you are able to manage strong traffic isolation for the storage and VM segments, there is absolutely no problem (it can be static limits from linux-qos or tricky flow management via OpenFlow, depending on what your orchestration allows). The possibility of putting compute and storage roles together without significant impact on performance characteristics was one of the key features which led us to select Ceph as a storage backend three years ago. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
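A minimal sketch of the static cg-based isolation mentioned above, using the cgroup v1 cpuset controller (the core list, NUMA node and mount point are assumptions):

```shell
# Pin all ceph-osd processes to a dedicated set of cores so a runaway VM
# cannot starve them into failing heartbeats.
CG=/sys/fs/cgroup/cpuset/ceph            # assumes cgroup v1 cpuset mounted here
CORES=0-3                                # assumed cores reserved for the OSDs
# mkdir -p ${CG}
# echo ${CORES} > ${CG}/cpuset.cpus
# echo 0        > ${CG}/cpuset.mems      # single NUMA node assumed
# for p in $(pidof ceph-osd); do echo $p > ${CG}/tasks; done
echo "would pin ceph-osd to cores ${CORES}"
```

A matching memory cgroup (memory.limit_in_bytes) for the compute side covers the other half of the contention; network isolation still has to be handled separately, as noted above.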
[ceph-users] dumpling to giant test transition
Hello, after doing a single-step transition, the test cluster is hanging in an unclean state, both before and after crush tunables adjustment: status: http://xdel.ru/downloads/transition-stuck/cephstatus.txt osd dump: http://xdel.ru/downloads/transition-stuck/cephosd.txt query for a single pg in active+remapped state: http://xdel.ru/downloads/transition-stuck/remappedpg.txt query for a single pg in active+undersized+degraded state: http://xdel.ru/downloads/transition-stuck/degradedpg.txt As one can see, there is an empty value for backfill_targets in both sets of pgs, which clearly indicates some problem with placement calculation (a two-node, two-OSD cluster should have enough targets for backfilling in this case). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] dumpling to giant test transition
Yes, that's the right guess - the crushmap is imagining both OSDs on a single host, although the daemon on the second host was able to act as up/in when the crushmap placed it on a different host; this looks like a weird placement miscalculation during the update. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
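One way to check what the crushmap actually computes, offline and independent of the running daemons (rule id and replica count are assumptions for this two-OSD case):

```shell
# Dump the in-use crush map and simulate placements without touching the cluster.
RULE=0                                       # assumed replicated rule id
REPLICAS=2                                   # two-OSD pool from the report above
# ceph osd getcrushmap -o /tmp/cm
# crushtool -i /tmp/cm --test --show-mappings --rule ${RULE} --num-rep ${REPLICAS} | head
# crushtool -d /tmp/cm -o /tmp/cm.txt        # decompile to inspect the host buckets
echo "testing rule ${RULE} with ${REPLICAS} replicas"
```

Mappings that repeatedly return a single OSD (or an empty set) confirm the map itself, not the daemons, is mis-placing both OSDs under one host.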
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
Hi Haomai, all. Today after unexpected power failure one of kv stores (placed on ext4 with default mount options) refused to work. I think that it may be interesting to revive it because it is almost first time among hundreds of power failures (and their simulations) when data store got broken. Strace: http://xdel.ru/downloads/osd1.strace.gz Debug output with 20-everything level: http://xdel.ru/downloads/osd1.out ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks for Andrey, The attachment OSD.1's log is only these lines? I really can't find the detail infos from it? Maybe you need to improve debug_osd to 20/20? On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Hi Haomai, all. Today after unexpected power failure one of kv stores (placed on ext4 with default mount options) refused to work. I think that it may be interesting to revive it because it is almost first time among hundreds of power failures (and their simulations) when data store got broken. Strace: http://xdel.ru/downloads/osd1.strace.gz Debug output with 20-everything level: http://xdel.ru/downloads/osd1.out -- Best Regards, Wheat Unfortunately that`s all I`ve got. Updated osd1.out to show an actual cli args and entire output - it ends abruptly without last newline and without any valuable output. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
On Wed, Oct 29, 2014 at 1:28 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks! You mean osd.1 exited abruptly without a ceph callback trace? Anyone has some ideas about this log? @sage @gregory On Wed, Oct 29, 2014 at 6:19 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks for Andrey, The attachment OSD.1's log is only these lines? I really can't find the detail infos from it? Maybe you need to improve debug_osd to 20/20? On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Hi Haomai, all. Today after unexpected power failure one of kv stores (placed on ext4 with default mount options) refused to work. I think that it may be interesting to revive it because it is almost first time among hundreds of power failures (and their simulations) when data store got broken. Strace: http://xdel.ru/downloads/osd1.strace.gz Debug output with 20-everything level: http://xdel.ru/downloads/osd1.out -- Best Regards, Wheat Unfortunately that`s all I`ve got. Updated osd1.out to show an actual cli args and entire output - it ends abruptly without last newline and without any valuable output. -- Best Regards, Wheat With a log file specified, it adds just the following line at the very end: 2014-10-29 13:29:57.437776 7ffa562c9840 -1 ** ERROR: osd init failed: (22) Invalid argument The stdout printing seems a bit broken and does not print this line at all (and the store output is definitely not detailed enough to draw any conclusions, or even to file a bug). CCing Sage/Greg. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
On Wed, Oct 29, 2014 at 1:37 PM, Haomai Wang haomaiw...@gmail.com wrote: maybe you can run it directly with debug_osd=20/20 and get ending logs ceph-osd -i 1 -c /etc/ceph/ceph.conf -f On Wed, Oct 29, 2014 at 6:34 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Oct 29, 2014 at 1:28 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks! You mean osd.1 exited abrptly without ceph callback trace? Anyone has some ideas about this log? @sage @gregory On Wed, Oct 29, 2014 at 6:19 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks for Andrey, The attachment OSD.1's log is only these lines? I really can't find the detail infos from it? Maybe you need to improve debug_osd to 20/20? On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Hi Haomai, all. Today after unexpected power failure one of kv stores (placed on ext4 with default mount options) refused to work. I think that it may be interesting to revive it because it is almost first time among hundreds of power failures (and their simulations) when data store got broken. Strace: http://xdel.ru/downloads/osd1.strace.gz Debug output with 20-everything level: http://xdel.ru/downloads/osd1.out -- Best Regards, Wheat Unfortunately that`s all I`ve got. Updated osd1.out to show an actual cli args and entire output - it ends abruptly without last newline and without any valuable output. -- Best Regards, Wheat With log-file specified, it adds just following line at very end: 2014-10-29 13:29:57.437776 7ffa562c9840 -1 ** ERROR: osd init failed: (22) Invalid argument the stdout printing seems a bit broken and do not print this at all (and store output part is definitely is not detailed enough to make any conclusions, and even file a bug). CCing Sage/Greg. -- Best Regards, Wheat -f does not print the last line to stderr either. 
OK, it looks like a very minor, separate bug, but I remember it appearing long before; since the bug remains, it probably does not bother anyone - stderr output is less commonly used for debugging purposes.
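For the archives: the 20/20 verbosity Haomai asks for can also be made persistent in ceph.conf instead of being passed at each start. A sketch (stock log path assumed; adjust for your deployment and revert after debugging, as these levels are very chatty):

```ini
[osd]
; 20/20 = maximum log-file / in-memory verbosity for the OSD subsystem
debug osd = 20/20
; message-layer logging can also help when an OSD dies before init completes
debug ms = 1
log file = /var/log/ceph/$cluster-$name.log
```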
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
On Sun, Oct 26, 2014 at 7:40 AM, Haomai Wang haomaiw...@gmail.com wrote: On Sun, Oct 26, 2014 at 3:12 AM, Andrey Korolyov and...@xdel.ru wrote: Thanks Haomai. Turns out that the master` recovery is too buggy right now (recovery speed degrades over a time, OSD (non-kv) is going out of cluster with no reason, misplaced object calculation is wrong and so on), so I am sticking to giant with rocksdb now. So far no major problems are revealed. Hmm, do you mean kvstore has problem on osd recovery? I'm eager to know the operations about how to produce this situation. Could you give more detail? -- Best Regards, Wheat I'm not sure whether kv triggered any of those; it's just a side effect of deploying the master branch (and the OSDs which showed problems were not only those in the kv subset). It looks like both giant and master expose some problem with pg recalculation under tight-IO conditions for the MONs (each MON shares a disk with one of the OSDs, and post-peering recalculation may take some minutes when kv-based OSDs are involved; likewise, recalculation from active+remapped to active+degraded(+...) takes tens of minutes; the same 'non-optimal' setup worked well before, and all recalculations were done in a matter of tens of seconds, so I will investigate this a bit later). Giant crashed on non-KV daemons during nightly recovery, so there is more critical stuff to fix right now, because kv so far has not exposed any crashes by itself.
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
Thanks Haomai. It turns out that master's recovery is too buggy right now (recovery speed degrades over time, non-kv OSDs drop out of the cluster for no reason, misplaced-object calculation is wrong, and so on), so I am sticking with giant plus rocksdb for now. So far no major problems have been revealed.
[ceph-users] Continuous OSD crash with kv backend (firefly)
Hi, during recovery testing on the latest firefly with the leveldb backend we found that the OSDs on a given host may all crash at once, leaving the attached backtrace. Otherwise, recovery goes more or less smoothly for hours. The timestamps show how the issue is correlated between different processes on the same node:
core.ceph-osd.25426.node01.1414148261
core.ceph-osd.25734.node01.1414148263
core.ceph-osd.25566.node01.1414148345
The question is about the kv backend state in Firefly - is it considered stable enough to run a production test against, or should we move to giant/master for this? Thanks! GNU gdb (GDB) 7.4.1-debian Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details. This GDB was configured as x86_64-linux-gnu. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /usr/bin/ceph-osd...Reading symbols from /usr/lib/debug/usr/bin/ceph-osd...done. done.
[New LWP 10182] [New LWP 10183] [New LWP 10699] [New LWP 10184] [New LWP 10703] [New LWP 10704] [New LWP 10702] [New LWP 10708] [New LWP 10707] [New LWP 10710] [New LWP 10700] [New LWP 10717] [New LWP 10765] [New LWP 10705] [New LWP 10706] [New LWP 10701] [New LWP 10712] [New LWP 10735] [New LWP 10713] [New LWP 10750] [New LWP 10718] [New LWP 10711] [New LWP 10716] [New LWP 10715] [New LWP 10785] [New LWP 10766] [New LWP 10796] [New LWP 10720] [New LWP 10725] [New LWP 10736] [New LWP 10709] [New LWP 10730] [New LWP 11541] [New LWP 10770] [New LWP 11573] [New LWP 10778] [New LWP 10804] [New LWP 11561] [New LWP 9388] [New LWP 9398] [New LWP 11538] [New LWP 10790] [New LWP 11586] [New LWP 10798] [New LWP 9910] [New LWP 10726] [New LWP 21823] [New LWP 10815] [New LWP 9397] [New LWP 11248] [New LWP 10723] [New LWP 11253] [New LWP 10728] [New LWP 10791] [New LWP 9389] [New LWP 10724] [New LWP 10780] [New LWP 11287] [New LWP 11592] [New LWP 10816] [New LWP 10812] [New LWP 10787] [New LWP 20622] [New LWP 21822] [New LWP 10751] [New LWP 10768] [New LWP 10767] [New LWP 11874] [New LWP 10733] [New LWP 10811] [New LWP 11574] [New LWP 11873] [New LWP 10771] [New LWP 11551] [New LWP 10799] [New LWP 10729] [New LWP 18254] [New LWP 10792] [New LWP 10803] [New LWP 9912] [New LWP 11293] [New LWP 20623] [New LWP 14805] [New LWP 10773] [New LWP 11298] [New LWP 11872] [New LWP 10763] [New LWP 10783] [New LWP 10769] [New LWP 11300] [New LWP 10777] [New LWP 10764] [New LWP 10802] [New LWP 10749] [New LWP 14806] [New LWP 10806] [New LWP 10805] [New LWP 18255] [New LWP 10181] [New LWP 11277] [New LWP 9913] [New LWP 10800] [New LWP 10801] [New LWP 11555] [New LWP 11871] [New LWP 10748] [New LWP 9915] [New LWP 10779] [New LWP 11294] [New LWP 9916] [New LWP 10757] [New LWP 10734] [New LWP 10786] [New LWP 10727] [New LWP 19063] [New LWP 11279] [New LWP 9905] [New LWP 9911] [New LWP 10772] [New LWP 10722] [New LWP 9914] [New LWP 10789] [New LWP 11540] [New LWP 9917] [New LWP 11289] [New LWP 
10714] [New LWP 10721] [New LWP 10719] [New LWP 10788] [New LWP 10782] [New LWP 10784] [New LWP 10776] [New LWP 10774] [New LWP 10737] [New LWP 19064] [Thread debugging using libthread_db enabled] Using host libthread_db library /lib/x86_64-linux-gnu/libthread_db.so.1. Core was generated by `/usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.con'. Program terminated with signal 6, Aborted. #0 0x7ff9ad91eb7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) Thread 135 (Thread 0x7ff99a492700 (LWP 19064)): #0 0x7ff9ad91ad84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x00c496da in Wait (mutex=..., this=0x108cd110) at ./common/Cond.h:55 #2 Pipe::writer (this=0x108ccf00) at msg/Pipe.cc:1730 #3 0x00c5485d in Pipe::Writer::entry (this=optimized out) at msg/Pipe.h:61 #4 0x7ff9ad916e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #5 0x7ff9ac4a43dd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #6 0x in ?? () Thread 134 (Thread 0x7ff975e1d700 (LWP 10737)): #0 0x7ff9ac498a13 in poll () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00c3e73c in Pipe::tcp_read_wait (this=this@entry=0x4a53180) at msg/Pipe.cc:2282 #2 0x00c3e9d0 in Pipe::tcp_read (this=this@entry=0x4a53180, buf=optimized out, buf@entry=0x7ff975e1cccf \377, len=len@entry=1) at msg/Pipe.cc:2255 #3 0x00c5095f in Pipe::reader (this=0x4a53180) at msg/Pipe.cc:1421 #4 0x00c5497d in Pipe::Reader::entry (this=optimized out) at msg/Pipe.h:49 #5 0x7ff9ad916e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #6 0x7ff9ac4a43dd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #7 0x in ?? () Thread 133 (Thread 0x7ff972dda700 (LWP
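The interactive gdb session above can also be driven non-interactively; a command sketch for dumping all thread backtraces from each of the three cores into per-core files (core names taken from the report; the ceph debug-symbol package is assumed to be installed, and this obviously requires the actual core files):

```shell
# For every core dumped around the crash window, record the backtraces of
# all threads into a text file for side-by-side comparison.
for core in core.ceph-osd.25426.node01.1414148261 \
            core.ceph-osd.25734.node01.1414148263 \
            core.ceph-osd.25566.node01.1414148345; do
    gdb --batch -ex 'thread apply all bt' /usr/bin/ceph-osd "$core" > "$core.bt.txt"
done
```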
Re: [ceph-users] Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation
Sorry, I see the problem: osd.0 10.6.0.1:6800/32051 clashes with existing osd: different fsid (ours: d0aec02e-8513-40f1-bf34-22ec44f68466 ; theirs: 16cbb1f8-e896-42cd-863c-bcbad710b4ea). Anyway, it is clearly a bug; the fsid should be silently discarded there if the OSD itself contains no epochs.
Re: [ceph-users] Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation
Heh, it looks like the osd process is unable to reach any of the mon members. Since mkfs completes just fine (which requires the same mon set to be working), I suspect a bug there. [Attachments: osd0-monc10.log.gz, mon0-dbg.log.gz]
Re: [ceph-users] Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation
It is not so easy. When I added the fsid under the selected osd's section and reformatted the store/journal, it aborted at start in FileStore::_do_transaction (see attachment). On the next launch, the fsid in the mon store for this OSD magically changes to something else and I am kicking against the same doorstep again (if I shut down the osd process and recreate the journal with the new fsid, or recreate the entire filestore too, it will abort; otherwise it simply will not join, due to the *next* mismatch). As far as I can see, the problem lies in the behavior of legacy clusters which inherited their fsid from a filesystem created by a third party, not as a result of ceph-deploy's work, so it is not fixed at all after such an update. Any suggestions? The trace is attached if someone is interested in it. On Thu, Oct 23, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Sorry, I see the problem. osd.0 10.6.0.1:6800/32051 clashes with existing osd: different fsid (ours: d0aec02e-8513-40f1-bf34-22ec44f68466 ; theirs: 16cbb1f8-e896-42cd-863c-bcbad710b4ea). Anyway it is clearly a bug and fsid should be silently discarded there if OSD contains no epochs itself. [Attachment: abrt-at-start.txt.gz]
Re: [ceph-users] Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation
On Thu, Oct 23, 2014 at 9:18 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: Let me re-CC the list as this may be worth for the archives. On 10/23/2014 04:19 PM, Andrey Korolyov wrote: Doing off-list post again. So I was inaccurate in an initial bug description: - mkfs goes just well - on first start OSD is crashing with ABRT and trace from previous message, changing fsid before in the mon store - on next start it refuses to join due to fsid mismatch, not crashing any more. On Thu, Oct 23, 2014 at 5:56 PM, Andrey Korolyov and...@xdel.ru wrote: It is not so easy.. When I added fsid under selected osd` section and reformatted the store/journal, it aborted at start in FileStore::_do_transaction (see attach). On next launch, fsid in the mon store for this OSD magically changes to the something else and I am kicking again same doorstep (if I shut down osd process, recreate journal with new fsid inserted in fsid or recreate entire filestore too, it will abort, otherwise simply not join due to *next* mismatch). As far as I can see problem is in behavior of legacy clusters which are inherited fsid from filesystem created by third-party, not as a result of ceph-deploy work, so it is not fixed at all after such an update. Any suggestions? I'm not sure what you mean by 'changing fsid in the mon store', but I suspect you have a few misconceptions about 'fsid' and the 'osd uuid'. The error you have below, regarding the osd fsid, refers to the osd's uuid, which is passed to '--mkfs' using '--osd-uuid X'. 'X' is also the uuid you would pass when adding the osd to the monitors using 'ceph osd create uuid'. Then there's the cluster 'fsid', which refers to the cluster. This 'fsid' is kept in the monmap and is used to identify the cluster the monitors belong to and to allow clients (such as the osd) to correctly contact the monitors of the cluster they too belong to. Yes, I am referring to it. 
The problem is that I called the monmap the "mon store", which is a bit incorrect in terms of the documentation. Changing the 'fsid' option in ceph.conf results in changing the perceived value the clients and daemons have of the cluster fsid. If this value is different from the monmap's you're bound to have trouble. If you only change the 'fsid' option in the 'osd' section of ceph.conf, you're basically telling the osds that they belong to a different cluster, which will probably cause issues when they contact the monitors to obtain the monmap during mkfs. What you clearly want is to remove the contents of the osd data directory, generate a uuid 'X', run 'ceph osd create X', save the value it will return (it will be used as the OSD's id) and then run ceph-osd --mkfs with --osd-uuid X. Also, I don't believe that the 'clashing' message is a bug. IMO we should assume that it's the operator's responsibility to remove the data if it's no longer of any use, instead of just assuming what the operator may have meant when running mkfs repeatedly over a given osd store. Thanks, I see - reusing the existing UUID from 'osd dump' worked well. The problem probably stemmed from previous experience with OSD recreation, which did not require a UUID to be specified when re-formatting the OSD (and I believe there is some inconsistency anyway - if I am specifying an existing osd id in the mkfs call, why not just fetch and reuse its UUID for the filestore?). The crash with SIGABRT takes place only with debug_ms set to 10 or higher, so I am probably hitting an independent bug there. Hope this helps. -Joao Trace is attached if someone is interested in it. On Thu, Oct 23, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Sorry, I see the problem. osd.0 10.6.0.1:6800/32051 clashes with existing osd: different fsid (ours: d0aec02e-8513-40f1-bf34-22ec44f68466 ; theirs: 16cbb1f8-e896-42cd-863c-bcbad710b4ea). Anyway it is clearly a bug and fsid should be silently discarded there if OSD contains no epochs itself. 
-- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
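Joao's recommended workflow, condensed into a command sketch for the archives (requires a live cluster and the appropriate admin keyring; the data path is the default layout, so adjust as needed - this is illustrative, not a verified recipe):

```shell
# Generate a fresh uuid and register it with the monitors;
# 'ceph osd create <uuid>' prints the osd id it allocates.
UUID=$(uuidgen)
ID=$(ceph osd create "$UUID")

# Wipe the stale store, then mkfs with the SAME uuid so the monitors
# and the on-disk store agree on the osd fsid.
rm -rf /var/lib/ceph/osd/ceph-"$ID"/*
ceph-osd -i "$ID" --mkfs --mkkey --osd-uuid "$UUID"
```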
[ceph-users] Fwd: Latest firefly: osd not joining cluster after re-creation
Hello, on a small test cluster, the following sequence resulted in the inability of a freshly formatted OSD to join back:
- update the cluster sequentially from cuttlefish to dumpling to firefly,
- execute the tunables change, wait for recovery completion,
- shut down a single osd, reformat its filestore and journal,
- start it back (auth caps and key remained the same).
Version is 5a10b95f7968ecac1f2af4abf9fb91347a290544. Any ideas why this may happen are very welcome. I suspect some resource request starting from line 29499 of the strace (probably earlier, but this line makes a clear separation between the init stage and the loop in the log), which keeps asking for the resource all the way down, may be the root cause (something just after journal and collections initialization), but I have no idea what it may be. Thanks! Strace: http://xdel.ru/downloads/osd0.out.gz [Attachments: osd0.stdout.gz, ceph.conf.gz]
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS
On Fri, Aug 29, 2014 at 10:37 AM, Somnath Roy somnath@sandisk.com wrote: Thanks Haomai! Here is some of the data from my setup.
Setup: 32-core cpu with HT enabled, 128 GB RAM, one SSD (both journal and data) - one OSD. 5 client machines with 12-core cpus, each running two instances of ceph_smalliobench (10 clients total). Network is 10GbE.
Workload: small workload - 20K objects of 4K size, io_size also 4K random read. The intent is to serve the ios from memory so that it can uncover the performance problems within a single OSD.
Results from Firefly: single-client throughput is ~14K iops, but as the number of clients increases the aggregated throughput does not increase: 10 clients ~15K iops. ~9-10 cpu cores are used.
Results with latest master: single client is ~14K iops, and it scales as the number of clients increases: 10 clients ~107K iops. ~25 cpu cores are used.
More realistic workload - let's see how it performs while 90% of the ios are served from disks.
Setup: 40-cpu-core server as a cluster node (single-node cluster) with 64 GB RAM. 8 SSDs - 8 OSDs. One similar node for monitor and rgw. Another node for the client running fio/vdbench. 4 rbds configured with the 'noshare' option. 40GbE network.
Workload: 8 SSDs are populated, so 8 * 800GB = ~6.4 TB of data. Io_size = 4K random read.
Results from Firefly: aggregated output while 4 rbd clients stress the cluster in parallel is ~20-25K IOPS; cpu cores used ~8-10 (may be less, can't remember precisely).
Results from latest master: aggregated output while 4 rbd clients stress the cluster in parallel is ~120K IOPS; cpu is 7% idle, i.e. ~37-38 cpu cores. Hope this helps.
Thanks Roy, the results are very promising! Just two questions - do the numbers above refer to HT cores, or did you renormalize the result to real ones? And what was the percentage of I/O time/utilization in this test (if you measured it)?
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS
On Thu, Aug 28, 2014 at 10:48 PM, Somnath Roy somnath@sandisk.com wrote: Nope, this will not be back-ported to Firefly I guess. Thanks Regards Somnath Thanks for sharing this - the first thing that came to mind when I looked at this thread was your patches :) If Giant incorporates them, both the RDMA support and those patches should give a huge performance boost for RDMA-enabled Ceph back networks.
Re: [ceph-users] Strange qemu-rbd I/O behavior when booting Windows VM
On Fri, Jun 13, 2014 at 7:09 AM, Ke-fei Lin k...@kfei.net wrote: Hi list, I deployed a Windows 7 VM with qemu-rbd disk, and got an unexpected booting phase performance. I discovered that when booting the Windows VM up, there are consecutive ~2 minutes that `ceph -w` gives me an interesting log like: ... 567 KB/s rd, 567 op/s, ... 789 KB/s rd, 789 op/s and so on. e.g. 2014-06-05 15:47:43.125441 mon.0 [INF] pgmap v18095: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 765 kB/s rd, 765 op/s 2014-06-05 15:47:44.240662 mon.0 [INF] pgmap v18096: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 568 kB/s rd, 568 op/s ... (skipped) 2014-06-05 15:50:02.441523 mon.0 [INF] pgmap v18186: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 412 kB/s rd, 412 op/s Which shows the number of rps is always the same as the number of ops, i.e. every operation is nearly 1KB, and I think this leads a very long boot time (takes 2 mins to enter desktop). But I can't understand why, is it an issue of my Ceph cluster? Or just some special I/O patterns in Windows VM booting process? In addition, I know that there are no qemu-rbd caching benefits during boot phase since the cache is not persistent (please corrects me), so is it possible to enlarge the read_ahead size in qemu-rbd driver? And does this make any sense? And finally, how can I tune up my Ceph cluster for this workload (booting Windows VM)? Any advice and suggestions will be greatly appreciated. Context: 4 OSDs (7200rpm/750GB/SATA) with replication factor 2. The system disk in Windows VM is NTFS formatted with default 4K block size. 
$ uname -a Linux ceph-consumer 3.11.0-22-generic #38~precise1-Ubuntu SMP Fri May 16 20:47:57 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux $ ceph --version ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) $ dpkg -l | grep rbd ii librbd-dev 0.80.1-1precise RADOS block device client library (development files) ii librbd1 0.80.1-1precise RADOS block device client library $ virsh version Compiled against library: libvir 0.9.8 Using library: libvir 0.9.8 Using API: QEMU 0.9.8 Running hypervisor: QEMU 1.7.1 () ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Hi, If you are able to leave only this VM in cluster scope to check, you`ll perhaps may use virsh domblkstat accumulated values to compare real number of operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Strange qemu-rbd I/O behavior when booting Windows VM
On Fri, Jun 13, 2014 at 5:50 PM, Ke-fei Lin k...@kfei.net wrote: 2014-06-13 21:23 GMT+08:00 Andrey Korolyov and...@xdel.ru: On Fri, Jun 13, 2014 at 7:09 AM, Ke-fei Lin k...@kfei.net wrote: Hi list, I deployed a Windows 7 VM with qemu-rbd disk, and got an unexpected booting phase performance. I discovered that when booting the Windows VM up, there are consecutive ~2 minutes that `ceph -w` gives me an interesting log like: ... 567 KB/s rd, 567 op/s, ... 789 KB/s rd, 789 op/s and so on. e.g. 2014-06-05 15:47:43.125441 mon.0 [INF] pgmap v18095: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 765 kB/s rd, 765 op/s 2014-06-05 15:47:44.240662 mon.0 [INF] pgmap v18096: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 568 kB/s rd, 568 op/s ... (skipped) 2014-06-05 15:50:02.441523 mon.0 [INF] pgmap v18186: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 412 kB/s rd, 412 op/s Which shows the number of rps is always the same as the number of ops, i.e. every operation is nearly 1KB, and I think this leads a very long boot time (takes 2 mins to enter desktop). But I can't understand why, is it an issue of my Ceph cluster? Or just some special I/O patterns in Windows VM booting process? In addition, I know that there are no qemu-rbd caching benefits during boot phase since the cache is not persistent (please corrects me), so is it possible to enlarge the read_ahead size in qemu-rbd driver? And does this make any sense? And finally, how can I tune up my Ceph cluster for this workload (booting Windows VM)? Any advice and suggestions will be greatly appreciated. Context: 4 OSDs (7200rpm/750GB/SATA) with replication factor 2. The system disk in Windows VM is NTFS formatted with default 4K block size. 
$ uname -a Linux ceph-consumer 3.11.0-22-generic #38~precise1-Ubuntu SMP Fri May 16 20:47:57 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux $ ceph --version ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) $ dpkg -l | grep rbd ii librbd-dev 0.80.1-1precise RADOS block device client library (development files) ii librbd1 0.80.1-1precise RADOS block device client library $ virsh version Compiled against library: libvir 0.9.8 Using library: libvir 0.9.8 Using API: QEMU 0.9.8 Running hypervisor: QEMU 1.7.1 () ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Hi, If you are able to leave only this VM in cluster scope to check, you`ll perhaps may use virsh domblkstat accumulated values to compare real number of operations. Thanks, Andrey. I tried `virsh domblkstat vm hda` (only this VM in whole cluster) and got these values: hda rd_req 70682 hda rd_bytes 229894656 hda wr_req 1067 hda wr_bytes 12645888 hda flush_operations 0 (These values became stable after ~2 mins) While the output of `ceph -w` is attached at: http://pastebin.com/Uhdj9drV Any advices? Thanks, poor man`s analysis shows that it can be true - assuming median heartbeat value as 1.2s, overall read ops are about 40k, which is close enough to what qemu stats saying, regarding floating heartbeat interval. Because ceph -w never had such value as a precise measurement tool, I may suggest to measure block stats difference on smaller intervals, about 1s or so, and compare values then. By the way, what driver do you use in qemu for a block device? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Strange qemu-rbd I/O behavior when booting Windows VM
In my belief, a lot of sequential small reads will be aggregated after all when targeting filestore contents (of course, only if issuing the next one does not depend on the status of the previous read; otherwise they'll be separated in time in such a way that the rotating-media scheduler will not be able to combine the requests) - am I wrong? If so, this case only affects OSD CPU consumption (at very large scale). Ke-fei, is there any real reason behind staying on IDE rather than LSI SCSI emulation or virtio? On Fri, Jun 13, 2014 at 8:11 PM, Sage Weil s...@inktank.com wrote: Right now, no. We could add a minimum read size to librbd when caching is enabled... that would not be particularly difficult. sage On Fri, 13 Jun 2014, Ke-fei Lin wrote: 2014-06-13 22:04 GMT+08:00 Andrey Korolyov and...@xdel.ru: On Fri, Jun 13, 2014 at 5:50 PM, Ke-fei Lin k...@kfei.net wrote: Thanks, Andrey. I tried `virsh domblkstat vm hda` (only this VM in the whole cluster) and got these values:
hda rd_req 70682
hda rd_bytes 229894656
hda wr_req 1067
hda wr_bytes 12645888
hda flush_operations 0
(These values became stable after ~2 mins.) The output of `ceph -w` is attached at: http://pastebin.com/Uhdj9drV Any advice? Thanks, a poor man's analysis shows that it can be true - assuming a median heartbeat value of 1.2s, overall read ops are about 40k, which is close enough to what the qemu stats say, allowing for the floating heartbeat interval. Because `ceph -w` was never meant to be a precise measurement tool, I suggest measuring the block-stats difference over smaller intervals, about 1s or so, and comparing the values then. By the way, which driver do you use in qemu for the block device? OK, this time I captured the blkstat difference over a smaller interval (less than 1s), and a simple calculation gives me this result: (19531264-19209216)/(38147-37518) = 512 ... (20158976-19531264)/(39373-38147) = 512 Which means that at the beginning of the boot phase, every read request from the VM is just *512 bytes*. 
Maybe this is why `ceph -w` shows me every operation is about 1KB (in my first post)? So this seems to be an inherent problem of the Windows VM boot process, but can I do something in my Ceph cluster's configuration to improve it? By the way, the relevant part of my VM definition is:

    <emulator>/usr/bin/kvm</emulator>
    <disk type='network' device='disk'>
      <driver name='qemu' type='rbd' cache='writeback'/>
      <source protocol='rbd' name='libvirt-pool/test-rbd-1'>
        <host name='10.0.0.5' port='6789'/>
      </source>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>

Thanks.
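The arithmetic above generalizes to any pair of `virsh domblkstat` samples; a small sketch (the sample numbers are taken from the thread):

```python
def avg_read_size(sample0, sample1):
    """Average bytes per read request between two (rd_req, rd_bytes) samples."""
    reqs0, bytes0 = sample0
    reqs1, bytes1 = sample1
    return (bytes1 - bytes0) / (reqs1 - reqs0)

# (rd_req, rd_bytes) pairs captured less than a second apart during VM boot:
print(avg_read_size((37518, 19209216), (38147, 19531264)))  # 512.0
print(avg_read_size((38147, 19531264), (39373, 20158976)))  # 512.0
```

Every boot-phase read really is one 512-byte sector, which matches the one-op-per-kilobyte ratio reported by `ceph -w`.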
Re: [ceph-users] Backfilling, latency and priority
On Thu, Jun 12, 2014 at 5:02 PM, David da...@visions.se wrote: Thanks Mark! Well, our workload has more IOs and quite low throughput, perhaps 10MB/s - 100MB/s. It’s a quite mixed workload, but mostly small files (http / mail / sql). During the recovery we had ranged between 600-1000MB/s throughput. So the only way to currently ”fix” this is to have enough IO to handle both recovery and client IOs? What’s the easiest/best way to add more IOs to a current cluster if you don’t want to scale? Add more RAM to OSD servers or add a SSD backed r/w cache tier? RAM usable only as read cache, SSD holds both types of operations. Dealing with a lot of small operations is very hard because the way cluster behaves changes dramatically with scale or with involved caching methods, therefore workloads which worked very reliable on certain number of OSDs may choke on 5 times higher count, so there almost nothing I can suggest to you except try-n-check variant. Kind Regards, David Majchrzak 12 jun 2014 kl. 14:42 skrev Mark Nelson mark.nel...@inktank.com: On 06/12/2014 03:44 AM, David wrote: Hi, We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs). We lost an OSD and the cluster started to backfill the data to the rest of the OSDs - during which the latency skyrocketed on some OSDs and connected clients experienced massive IO wait. I’m trying to rectify the situation now and from what I can tell, these are the settings that might help. osd client op priority osd recovery op priority osd max backfills osd recovery max active 1. Does a high priority value mean it has higher priority? (if the other one has lower value) Or does a priority of 1 mean highest priority? 2. I’m running with default on these settings. Does anyone else have any experience changing those? We did some investigation into this a little while back. I suspect you'll see some benefit by reducing backfill/recovery priority and max concurrent operations, but you have to be careful. 
We found that the higher the number of concurrent client IOs (past the saturation point), the greater relative proportion of throughput is used by client IO. That makes it hard to nail down specific priority and concurrency settings. If your workload requires high throughput and low latency with few client IOs (ie below the saturation point), you may need to overly favour client IO. If you are over-saturating the cluster with many concurrent IOs, you may want to give client IO less priority. If you overly favor client IO when over-saturating the cluster, recovery can take much much longer and client throughput may actually be lower in aggregate. Obviously this isn't ideal, but seems to be what's going on right now. Mark Kind Regards, David Majchrzak
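For the archives, the four knobs David lists can be set cluster-wide in ceph.conf; a conservative sketch favouring client IO (higher priority values mean higher priority; the exact numbers are illustrative and defaults differ per release, so test against your own workload before deploying):

```ini
[osd]
; fewer concurrent backfills / recovery ops per OSD lowers recovery impact
osd max backfills = 2
osd recovery max active = 2
; higher value = higher priority; keep client ops well above recovery ops
osd client op priority = 63
osd recovery op priority = 3
```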
Re: [ceph-users] v0.67.9 Dumpling released
On 06/04/2014 06:06 PM, Sage Weil wrote: On Wed, 4 Jun 2014, Dan Van Der Ster wrote: Hi Sage, all, On 21 May 2014, at 22:02, Sage Weil s...@inktank.com wrote: * osd: allow snap trim throttling with simple delay (#6278, Sage Weil) Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = ? ? This is a coarse hack to make the snap trimming slow down and let client IO run by simply sleeping between work. I would start with something smallish (.01 = 10ms) after deleting some snapshots and see what effect it has on request latency. Unfortunately it's not a very intuitive knob to adjust, but it is an interim solution until we figure out how to better prioritize this (and other) background work. In short, if you do see a performance degradation after removing snaps, adjust this up or down and see how it changes that. If you don't see a degradation, then you're lucky and don't need to do anything. :) You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'. sage Hi, we have had the same mechanism for almost half a year and it has worked nicely, except in cases when multiple background snap deletions hit their ends - latencies may spike regardless of a very large sleep gap between snap operations. Do you have any thoughts on reducing this particular impact?
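To make Sage's suggestion survive OSD restarts, the same value can also go into ceph.conf (runtime injection as shown above remains the quickest way to experiment with the knob):

```ini
[osd]
; seconds to sleep between snap-trim work units; 0 (the default) = no throttling
osd snap trim sleep = 0.01
```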
Re: [ceph-users] v0.67.9 Dumpling released
On 06/04/2014 07:22 PM, Sage Weil wrote: On Wed, 4 Jun 2014, Andrey Korolyov wrote: On 06/04/2014 06:06 PM, Sage Weil wrote: On Wed, 4 Jun 2014, Dan Van Der Ster wrote: Hi Sage, all, On 21 May 2014, at 22:02, Sage Weil s...@inktank.com wrote: * osd: allow snap trim throttling with simple delay (#6278, Sage Weil) Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = ? ? This is a coarse hack to make the snap trimming slow down and let client IO run by simply sleeping between work. I would start with something smallish (.01 = 10ms) after deleting some snapshots and see what effect it has on request latency. Unfortunately it's not a very intuitive knob to adjust, but it is an interim solution until we figure out how to better prioritize this (and other) background work. In short, if you do see a performance degradation after removing snaps, adjust this up or down and see how it changes that. If you don't see a degradation, then you're lucky and don't need to do anything. :) You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'. sage Hi, we had the same mechanism for almost a half of year and it working nice except cases when multiple background snap deletions are hitting their ends - latencies may spike not regarding very large sleep gap for snap operations. Do you have any thoughts on reducing this particular impact? This isn't ringing any bells. 
If this is something you can reproduce with osd logging enabled we should be able to tell what is causing the spike, though... sage Ok, would debug level 10 be enough there? At 20, all timings are likely to be distorted by the logging operations themselves, even on tmpfs.
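The two adjustment methods Sage mentions can be collected into a small helper. This is a dry-run sketch: it only prints the commands rather than executing them, so it can be reviewed before use on a real cluster; the sleep value and OSD ids are illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch: print (not run) the commands from Sage's mail that set
# osd_snap_trim_sleep on running OSDs. SLEEP and OSD_IDS are illustrative
# assumptions; adjust them for your cluster before running for real.
print_snap_trim_cmds() {
    SLEEP=0.01        # 10ms, the "smallish" starting point suggested above
    OSD_IDS="0 1 2"   # hypothetical OSD ids

    # Method 1: per daemon, via the admin socket on each OSD host.
    for id in $OSD_IDS; do
        echo "ceph daemon osd.$id config set osd_snap_trim_sleep $SLEEP"
    done

    # Method 2: cluster-wide, via injectargs from any admin node.
    echo "ceph tell osd.* injectargs -- --osd-snap-trim-sleep $SLEEP"
}
print_snap_trim_cmds
```

As the mail says, the knob is not intuitive: raise or lower it while watching request latency after snapshot deletions.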
Re: [ceph-users] Fatigue for XFS
On 05/06/2014 01:23 AM, Dave Chinner wrote: On Tue, May 06, 2014 at 12:59:27AM +0400, Andrey Korolyov wrote: On Tue, May 6, 2014 at 12:36 AM, Dave Chinner da...@fromorbit.com wrote: On Mon, May 05, 2014 at 11:49:05PM +0400, Andrey Korolyov wrote: Hello, We are currently exploring issue which can be related to Ceph itself or to the XFS - any help is very appreciated. First, the picture: relatively old cluster w/ two years uptime and ten months after fs recreation on every OSD, one of daemons started to flap approximately once per day for couple of weeks, with no external reason (bandwidth/IOPS/host issues). It looks almost the same every time - OSD suddenly stop serving requests for a short period, gets kicked out by peers report, then returns in a couple of seconds. Of course, small but sensitive amount of requests are delayed by 15-30 seconds twice, which is bad for us. The only thing which correlates with this kick is a peak of I/O, not too large, even not consuming all underlying disk utilization, but alone in the cluster and clearly visible. Also there are at least two occasions *without* correlated iowait peak. So, actual numbers and traces are the only thing that tell us what is happening during these events. See here: http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F If it happens at almost the same time every day, then I'd be looking at the crontabs to find what starts up about that time. output of top will also probably tell you what process is running, too. topio might be instructive, and blktrace almost certainly will be I have two versions - we`re touching some sector on disk which is about to be marked as dead but not displayed in SMART statistics or (I Doubt it - SMART doesn't cause OS visible IO dispatch spikes. 
believe so) some kind of XFS fatigue, which can be more likely in this case, since near-bad sector should be touched more frequently and related impact could leave traces in dmesg/SMART from my experience. I I doubt that, too, because XFS doesn't have anything that is triggered on a daily basis inside it. Maybe you've got xfs_fsr set up on a cron job, though... would like to ask if anyone has a simular experience before or can suggest to poke existing file system in some way. If no suggestion appear, I`ll probably reformat disk and, if problem will remain after refill, replace it, but I think less destructive actions can be done before. Yeah, monitoring and determining the process that is issuing the IO is what you need to find first. Cheers, Dave. -- Dave Chinner da...@fromorbit.com Thanks Dave, there are definitely no cron set for specific time (though most of lockups happened in a relatively small interval which correlates with the Ceph snapshot operations). OK. FWIW, Ceph snapshots on XFS may not be immediately costly in terms of IO - they can be extremely costly after one is taken when the files in the snapshot are next written to. If you are snapshotting files that are currently being written to, then that's likely to cause immediate IO issues... In at least one case no Ceph snapshot operations (including delayed removal) happened and at least two when no I/O peak was observed. We observed and eliminated weird lockups related to the openswitch behavior before - we`re combining storage and compute nodes, so quirks in the OVS datapath caused very interesting and weird system-wide lockups on (supposedly) spinlock, and we see 'pure' Ceph lockups on XFS at time with 3.4-3.7 kernels, all of them was correlated with very high context switch peak. Until we determine what is triggering the IO, the application isn't really a concern. 
Current issue is seemingly nothing to do with spinlock-like bugs or just a hardware problem, we even rebooted problematic node to check if the memory allocator may stuck at the border of specific NUMA node, with no help, but first reappearance of this bug was delayed by some days then. Disabling lazy allocation via specifying allocsize did nothing too. It may look like I am insisting that it is XFS bug, where Ceph version is more likely to appear because of way more complicated logic and operation behaviour, but persistence on specific node across relaunching of Ceph storage daemon suggests bug relation to the unlucky byte sequence more than anything else. If it finally appear as Ceph bug, it`ll ruin our expectations from two-year of close experience with this product and if it is XFS bug, we haven`t see anything like this before, thought we had a pretty collection of XFS-related lockups on the earlier kernels. Long experience with triaging storage performance issues has taught me to ignore what anyone *thinks* is the cause of the problem; I rely on the data that is gathered to tell me what the problem is. I find that hard data has a nasty habit of busting assumptions, expectations, speculations
Re: [ceph-users] Ceph and low latency kernel
On Mon, May 26, 2014 at 10:53 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, sorry it was a bit poorly defined. I'm talking about things like this: http://lwn.net/Articles/551179/ Stefan Not sure if Ceph can gain any advantage from it, as common Ceph operations barely seem to hit the performance area that patch targets, but it would be awesome if you are able to test it :) DCTCP/D2TCP and maybe some other congestion control algorithms designed for low-latency high-speed networks can definitely give you a speed bump for spiky workloads. On 25.05.2014 11:11, Andrey Korolyov wrote: Hi, which one are you talking about? The -rt patchset makes absolutely no difference for Ceph, though a very specific workload (which I was unable to imagine at the time) can benefit from it a little. The Wind River variant means much more, because it brings rt to virtualized envs - in combination with storage nodes you may achieve much better deadlines for tasks like gaming servers and so on, but I have not tried it. On Sun, May 25, 2014 at 1:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, has anybody ever tried to use a low latency kernel for ceph? Does it make any difference? Greets, Stefan
Re: [ceph-users] Ceph and low latency kernel
Hi, which one are you talking about? The -rt patchset makes absolutely no difference for Ceph, though a very specific workload (which I was unable to imagine at the time) can benefit from it a little. The Wind River variant means much more, because it brings rt to virtualized envs - in combination with storage nodes you may achieve much better deadlines for tasks like gaming servers and so on, but I have not tried it. On Sun, May 25, 2014 at 1:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, has anybody ever tried to use a low latency kernel for ceph? Does it make any difference? Greets, Stefan
Re: [ceph-users] sparse copy between pools
On 05/14/2014 02:13 PM, Erwin Lubbers wrote: Hi, I'm trying to copy a sparse provisioned rbd image from pool A to pool B (both are replicated three times). The image has a disk size of 8 GB and contains around 1.4 GB of data. I use: rbd cp PoolA/Image PoolB/Image After copying, ceph -s tells me that 24 GB of extra disk space is in use. Then I delete the original pool A image and only 8 GB of space is freed. Does Ceph not sparse copy the image using cp? Is there another way to do so? I'm using 0.67.7 dumpling on this cluster. I believe the fix in http://tracker.ceph.com/projects/ceph/repository/revisions/824da2029613a6f4b380b6b2f16a0bd0903f7e3c/diff/src/librbd/internal.cc should have gone into dumpling as a backport; github shows that it did not. Josh, would you mind adding your fix there too? Regards, Erwin
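Until such a backport lands, a commonly suggested workaround is export/import instead of 'rbd cp', since export writes zero runs as holes in the destination file and import skips fully-zero blocks. Whether import preserves sparseness on a given Ceph version is an assumption to verify first; the pool, image, and path names below are illustrative, and the sketch only prints the commands (dry run).

```shell
#!/bin/sh
# Dry-run sketch of a possible sparse-copy workaround for 'rbd cp'.
# ASSUMPTIONS: 'rbd import' on your version skips fully-zero blocks, and
# /tmp has room for the exported file. Names are illustrative.
print_sparse_copy_cmds() {
    echo "rbd export PoolA/Image /tmp/Image.raw"
    echo "rbd import /tmp/Image.raw PoolB/Image"
    echo "rm /tmp/Image.raw"
}
print_sparse_copy_cmds
```

Checking actual space usage with 'ceph df' (or 'ceph -s') before and after the import is the only reliable way to confirm the copy stayed sparse.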
Re: [ceph-users] Migrate whole clusters
Anyway, replacing the whole set of monitors means downtime for every client, so I doubt the phrase 'no outage' still applies there. On Fri, May 9, 2014 at 9:46 PM, Kyle Bader kyle.ba...@gmail.com wrote: Let's assume a test cluster up and running with real data on it. Which is the best way to migrate everything to a production (and larger) cluster? I'm thinking to add production MONs to the test cluster, after that, add production OSDs to the test cluster, wait for a full rebalance and then start to remove test OSDs and test mons. This should migrate everything with no outage. It's possible and I've done it; this was around the argonaut/bobtail timeframe on a pre-production cluster. If your cluster has a lot of data then it may take a long time or be disruptive, so make sure you've tested that your recovery tunables are suitable for your hardware configuration. -- Kyle
Re: [ceph-users] v0.80 Firefly released
Mike, would you mind to write your experience if you`ll manage to get this flow through first? I hope I`ll be able to conduct some tests related to 0.80 only next week, including maintenance combined with primary pointer relocation - one of most crucial things remaining in Ceph for the production performance. On Wed, May 7, 2014 at 10:18 PM, Mike Dawson mike.daw...@cloudapt.com wrote: On 5/7/2014 11:53 AM, Gregory Farnum wrote: On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi, Sage Weil wrote: * *Primary affinity*: Ceph now has the ability to skew selection of OSDs as the primary copy, which allows the read workload to be cheaply skewed away from parts of the cluster without migrating any data. Can you please elaborate a bit on this one? I found the blueprint [1] but still don't quite understand how it works. Does this only change the crush calculation for reads? i.e writes still go to the usual primary, but reads are distributed across the replicas? If so, does this change the consistency model in any way. It changes the calculation of who becomes the primary, and that primary serves both reads and writes. In slightly more depth: Previously, the primary has always been the first OSD chosen as a member of the PG. For erasure coding, we added the ability to specify a primary independent of the selection ordering. This was part of a broad set of changes to prevent moving the EC shards around between different members of the PG, and means that the primary might be the second OSD in the PG, or the fourth. Once this work existed, we realized that it might be useful in other cases, because primaries get more of the work for their PG (serving all reads, coordinating writes). So we added the ability to specify a primary affinity, which is like the CRUSH weights but only impacts whether you become the primary. So if you have 3 OSDs that each have primary affinity = 1, it will behave as normal. 
If two have primary affinity = 0, the remaining OSD will be the primary. Etc. Is it possible (and/or advisable) to set primary affinity low while backfilling / recovering an OSD in an effort to prevent unnecessary slow reads that could be directed to less busy replicas? I suppose if the cost of setting/unsetting primary affinity is low and clients are starved for reads during backfill/recovery from the osd in question, it could be a win. Perhaps the workflow for maintenance on osd.0 would be something like: - Stop osd.0, do some maintenance on osd.0 - Read primary affinity of osd.0, store it for later - Set primary affinity on osd.0 to 0 - Start osd.0 - Enjoy a better backfill/recovery experience. RBD clients happier. - Reset primary affinity on osd.0 to previous value If the cost of setting primary affinity is low enough, perhaps this strategy could be automated by the ceph daemons. Thanks, Mike Dawson -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
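Mike's proposed maintenance flow for osd.0 can be sketched as a dry-run script using the firefly 'ceph osd primary-affinity' command. The affinity values and the upstart-style start/stop commands are illustrative assumptions; in a real run the previous affinity would be read back from 'ceph osd dump' rather than assumed to be 1, and the script below only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the maintenance workflow proposed above for osd.0.
# ASSUMPTIONS: upstart-style service commands, previous affinity was 1
# (really it should be read from 'ceph osd dump' and restored).
print_maintenance_cmds() {
    echo "stop ceph-osd id=0"                  # take osd.0 down for maintenance
    echo "ceph osd primary-affinity osd.0 0"   # stop acting as primary -> replicas serve reads
    echo "start ceph-osd id=0"                 # rejoin; backfill with fewer client reads
    echo "# ... wait for backfill/recovery to finish ..."
    echo "ceph osd primary-affinity osd.0 1"   # restore the previous affinity
}
print_maintenance_cmds
```

The win depends on the cost of the resulting primary reshuffle being lower than the read latency saved during backfill, which is exactly the open question in the thread.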
[ceph-users] Explicit F2FS support (was: v0.80 Firefly released)
Hello, first of all, congratulations to Inktank and thank you for your awesome work! Although exploiting native f2fs abilities, as with btrfs, sounds awesome for a matter of performance, I wonder when kv db is able to practically give users with 'legacy' file systems ability to conduct CoW operations as fast as on the log-based fs, with small or no performance impact, what`s the primary idea behind introducing interface bounded to the specific filesystem in same time? Of course I believe that f2fs will outperform almost every competitor at its field - non-rotating media operations, but I would be grateful if someone can shed light on this development choice. On Wed, May 7, 2014 at 5:05 AM, Sage Weil s...@inktank.com wrote: We did it! Firefly v0.80 is built and pushed out to the ceph.com repositories. This release will form the basis for our long-term supported release Firefly, v0.80.x. The big new features are support for erasure coding and cache tiering, although a broad range of other features, fixes, and improvements have been made across the code base. Highlights include: * *Erasure coding*: support for a broad range of erasure codes for lower storage overhead and better data durability. * *Cache tiering*: support for creating 'cache pools' that store hot, recently accessed objects with automatic demotion of colder data to a base tier. Typically the cache pool is backed by faster storage devices like SSDs. * *Primary affinity*: Ceph now has the ability to skew selection of OSDs as the primary copy, which allows the read workload to be cheaply skewed away from parts of the cluster without migrating any data. * *Key/value OSD backend* (experimental): An alternative storage backend for Ceph OSD processes that puts all data in a key/value database like leveldb. This provides better performance for workloads dominated by key/value operations (like radosgw bucket indices). 
* *Standalone radosgw* (experimental): The radosgw process can now run in a standalone mode without an apache (or similar) web server or fastcgi. This simplifies deployment and can improve performance. We expect to maintain a series of stable releases based on v0.80 Firefly for as much as a year. In the meantime, development of Ceph continues with the next release, Giant, which will feature work on the CephFS distributed file system, more alternative storage backends (like RocksDB and f2fs), RDMA support, support for pyramid erasure codes, and additional functionality in the block device (RBD) like copy-on-read and multisite mirroring. This release is the culmination of a huge collective effort by about 100 different contributors. Thank you everyone who has helped to make this possible! Upgrade Sequencing -- * If your existing cluster is running a version older than v0.67 Dumpling, please first upgrade to the latest Dumpling release before upgrading to v0.80 Firefly. Please refer to the :ref:`Dumpling upgrade` documentation. * Upgrade daemons in the following order: 1. Monitors 2. OSDs 3. MDSs and/or radosgw If the ceph-mds daemon is restarted first, it will wait until all OSDs have been upgraded before finishing its startup sequence. If the ceph-mon daemons are not restarted prior to the ceph-osd daemons, they will not correctly register their new capabilities with the cluster and new features may not be usable until they are restarted a second time. * Upgrade radosgw daemons together. There is a subtle change in behavior for multipart uploads that prevents a multipart request that was initiated with a new radosgw from being completed by an old radosgw. 
Notable changes since v0.79 --- * ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng) * ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng) * librados: fix inconsistencies in API error values (David Zafman) * librados: fix watch operations with cache pools (Sage Weil) * librados: new snap rollback operation (David Zafman) * mds: fix respawn (John Spray) * mds: misc bugs (Yan, Zheng) * mds: misc multi-mds fixes (Yan, Zheng) * mds: use shared_ptr for requests (Greg Farnum) * mon: fix peer feature checks (Sage Weil) * mon: require 'x' mon caps for auth operations (Joao Luis) * mon: shutdown when removed from mon cluster (Joao Luis) * msgr: fix locking bug in authentication (Josh Durgin) * osd: fix bug in journal replay/restart (Sage Weil) * osd: many many many bug fixes with cache tiering (Samuel Just) * osd: track omap and hit_set objects in pg stats (Samuel Just) * osd: warn if agent cannot enable due to invalid (post-split) stats (Sage Weil) * rados bench: track metadata for multiple runs separately (Guang Yang) * rgw: fixed subuser modify (Yehuda Sadeh) * rpm: fix redhat-lsb dependency (Sage Weil, Alfredo Deza) For the complete release notes, please see:
Re: [ceph-users] RBD on Mac OS X
You can certainly do this using an iSCSI re-export; AFAIK no working native RBD implementation for OS X exists. On Tue, May 6, 2014 at 3:28 PM, Pavel V. Kaygorodov pa...@inasan.ru wrote: Hi! I want to use ceph for Time Machine backups on Mac OS X. Is it possible to map RBD or mount CephFS on a Mac directly, for example, using osxfuse? Or is the only way to do this to set up an intermediate linux server? Pavel.
[ceph-users] Fatigue for XFS
Hello, We are currently exploring an issue which may be related to Ceph itself or to XFS - any help is much appreciated. First, the picture: on a relatively old cluster with two years of uptime and ten months since fs recreation on every OSD, one of the daemons started to flap approximately once per day for a couple of weeks, with no external reason (bandwidth/IOPS/host issues). It looks almost the same every time - the OSD suddenly stops serving requests for a short period, gets kicked out by peer reports, then returns in a couple of seconds. Of course, a small but sensitive number of requests are delayed by 15-30 seconds twice, which is bad for us. The only thing which correlates with this kick is a peak of I/O, not too large - not even consuming all of the underlying disk's utilization - but alone in the cluster and clearly visible. Also there are at least two occurrences *without* a correlated iowait peak. I have two theories - either we are touching some sector on disk which is about to be marked as dead but not displayed in SMART statistics, or (I believe so) some kind of XFS fatigue, which seems more likely in this case, since a near-bad sector should be touched more frequently and the related impact would leave traces in dmesg/SMART, in my experience. I would like to ask if anyone has had a similar experience before, or can suggest poking the existing file system in some way. If no suggestions appear, I'll probably reformat the disk and, if the problem remains after the refill, replace it, but I think less destructive actions can be tried first. XFS is running on 3.10 with almost default create and mount options; the ceph version is the latest cuttlefish (this rack should be upgraded, I know).
Re: [ceph-users] Fatigue for XFS
On Tue, May 6, 2014 at 12:36 AM, Dave Chinner da...@fromorbit.com wrote: On Mon, May 05, 2014 at 11:49:05PM +0400, Andrey Korolyov wrote: Hello, We are currently exploring issue which can be related to Ceph itself or to the XFS - any help is very appreciated. First, the picture: relatively old cluster w/ two years uptime and ten months after fs recreation on every OSD, one of daemons started to flap approximately once per day for couple of weeks, with no external reason (bandwidth/IOPS/host issues). It looks almost the same every time - OSD suddenly stop serving requests for a short period, gets kicked out by peers report, then returns in a couple of seconds. Of course, small but sensitive amount of requests are delayed by 15-30 seconds twice, which is bad for us. The only thing which correlates with this kick is a peak of I/O, not too large, even not consuming all underlying disk utilization, but alone in the cluster and clearly visible. Also there are at least two occasions *without* correlated iowait peak. So, actual numbers and traces are the only thing that tell us what is happening during these events. See here: http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F If it happens at almost the same time every day, then I'd be looking at the crontabs to find what starts up about that time. output of top will also probably tell you what process is running, too. topio might be instructive, and blktrace almost certainly will be I have two versions - we`re touching some sector on disk which is about to be marked as dead but not displayed in SMART statistics or (I Doubt it - SMART doesn't cause OS visible IO dispatch spikes. believe so) some kind of XFS fatigue, which can be more likely in this case, since near-bad sector should be touched more frequently and related impact could leave traces in dmesg/SMART from my experience. 
I I doubt that, too, because XFS doesn't have anything that is triggered on a daily basis inside it. Maybe you've got xfs_fsr set up on a cron job, though... would like to ask if anyone has a simular experience before or can suggest to poke existing file system in some way. If no suggestion appear, I`ll probably reformat disk and, if problem will remain after refill, replace it, but I think less destructive actions can be done before. Yeah, monitoring and determining the process that is issuing the IO is what you need to find first. Cheers, Dave. -- Dave Chinner da...@fromorbit.com Thanks Dave, there are definitely no cron set for specific time (though most of lockups happened in a relatively small interval which correlates with the Ceph snapshot operations). In at least one case no Ceph snapshot operations (including delayed removal) happened and at least two when no I/O peak was observed. We observed and eliminated weird lockups related to the openswitch behavior before - we`re combining storage and compute nodes, so quirks in the OVS datapath caused very interesting and weird system-wide lockups on (supposedly) spinlock, and we see 'pure' Ceph lockups on XFS at time with 3.4-3.7 kernels, all of them was correlated with very high context switch peak. Current issue is seemingly nothing to do with spinlock-like bugs or just a hardware problem, we even rebooted problematic node to check if the memory allocator may stuck at the border of specific NUMA node, with no help, but first reappearance of this bug was delayed by some days then. Disabling lazy allocation via specifying allocsize did nothing too. It may look like I am insisting that it is XFS bug, where Ceph version is more likely to appear because of way more complicated logic and operation behaviour, but persistence on specific node across relaunching of Ceph storage daemon suggests bug relation to the unlucky byte sequence more than anything else. 
If it finally appear as Ceph bug, it`ll ruin our expectations from two-year of close experience with this product and if it is XFS bug, we haven`t see anything like this before, thought we had a pretty collection of XFS-related lockups on the earlier kernels. So, my understanding is that we hitting neither very rare memory allocator bug in case of XFS or age-related Ceph issue, both are very unlikely to exist - but I cannot imagine nothing else. If it helps, I may collect a series of perf events during next appearance or exact iostat output (mine graphics can say that the I/O was not choked completely when peak appeared, that`s all). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replace OSD drive without remove/re-add OSD
On Sat, May 3, 2014 at 4:01 AM, Indra Pramana in...@sg.or.id wrote: Sorry forgot to cc the list. On 3 May 2014 08:00, Indra Pramana in...@sg.or.id wrote: Hi Andrey, I actually wanted to try this (instead of remove and readd OSD) to avoid remapping of PGs to other OSDs and the unnecessary I/O load. Are you saying that doing this will also trigger remapping? I thought it will just do recovery to replace missing PGs as a result of the drive replacement? Thank you. Yes, remapping will take place, though it is a bit counterintuitive and I suspect that the roots are the same as with double data placement recalculation with out + rm procedure. Actually Inktank people may answer the question with more details I suppose. Also I think that preserving of the collections may eliminate remap during such kind of refill, though it is not trivial thing to do and I had not experimented with this. On 2 May 2014 21:02, Andrey Korolyov and...@xdel.ru wrote: On 05/02/2014 03:27 PM, Indra Pramana wrote: Hi, May I know if it's possible to replace an OSD drive without removing / re-adding back the OSD? I want to avoid the time and the excessive I/O load which will happen during the recovery process at the time when: - the OSD is removed; and - the OSD is being put back into the cluster. I read David Zafman's comment on this thread, that we can set noout, take OSD down, replace the drive, and then bring the OSD back up and unset noout. http://www.spinics.net/lists/ceph-users/msg05959.html May I know if it's possible to do this? - ceph osd set noout - sudo stop ceph-osd id=12 - Replace the drive, and once done: - sudo start ceph-osd id=12 - ceph osd unset noout The cluster was built using ceph-deploy, can we just replace a drive like that without zapping and preparing the disk using ceph-deploy? There will be absolutely no quirks except continuous remapping with peering along entire recovery process. If your cluster may meet this well, there is absolutely no problem to go through this flow. 
Otherwise, in longer out+in flow, there are only two short intensive recalculations which can be done at the scheduled time, comparing with peering during remap, which can introduce unnecessary I/O spikes. Looking forward to your reply, thank you. Cheers. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
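The noout drive-replacement flow from Indra's mail can be sketched as a dry-run script. It only prints the commands (the physical drive swap and data restoration are obviously manual), and the OSD id and upstart-style service commands are illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch of the noout drive-replacement flow discussed above:
# set noout, stop the OSD, swap the drive, start the OSD, unset noout.
# ASSUMPTIONS: osd id 12 (as in the thread), upstart-style commands.
print_replace_cmds() {
    echo "ceph osd set noout"
    echo "stop ceph-osd id=12"
    echo "# ... physically replace the drive and restore the OSD's data dir ..."
    echo "start ceph-osd id=12"
    echo "ceph osd unset noout"
}
print_replace_cmds
```

As Andrey notes, this trades two short intensive recalculations (the out+in flow) for continuous remapping with peering over the whole recovery, so the choice depends on which I/O pattern the cluster tolerates better.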
Re: [ceph-users] Unable to bring cluster up
Gandalf, regarding this one and the previous mail about memory consumption - there are too many PGs, which is why memory consumption is as high as you are observing. The dead loop of osd-never-goes-up is probably because of the suicide timeout of internal queues. It may not be good, but it is expected. OSD behaviour ultimately depends on all kinds of knobs you can change; e.g. I recently found a funny issue where collection warm-up (bringing collections into the RAM cache) actually slows down OSD rejoin (typical post-peering I/O delays), compared with the regular situation where collections are read from disk upon OSD launch. On Tue, Apr 29, 2014 at 9:22 PM, Gregory Farnum g...@inktank.com wrote: You'll need to go look at the individual OSDs to determine why they aren't on. All the cluster knows is that the OSDs aren't communicating properly. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Apr 29, 2014 at 3:06 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: After a simple service ceph restart on a server, I'm unable to get my cluster up again http://pastebin.com/raw.php?i=Wsmfik2M suddenly, some OSDs go UP and DOWN randomly. I don't see any network traffic on the cluster interface. How can I detect what ceph is doing? From the posted output there is no way to detect if ceph is recovering or not. Showing just a bunch of increasing/decreasing numbers doesn't help. I can see this: 2014-04-29 12:03:49.013808 mon.0 [INF] pgmap v1047121: 98432 pgs: 241 inactive, 33138 peering, 25 remapped, 60067 down+peering, 3489 remapped+peering, 1472 down+remapped+peering; 66261 bytes data, 1647 MB used, 5582 GB / 5583 GB avail so what, is it recovering? Is it sleeping? Why is it not recovering? http://pastebin.com/raw.php?i=2EdugwQa why are all OSDs from host osd12 and osd13 down? Both hosts are up and running.
Re: [ceph-users] OOM-Killer for ceph-osd
For the record, ``rados df'' will give an object count. Would you mind to send out your ceph.conf? I cannot imagine what parameter may raise memory consumption so dramatically, so config at a glance may reveal some detail. Also core dump should be extremely useful (though it`s better to pass the flag to Inktank there). On Mon, Apr 28, 2014 at 1:14 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: I don't know how to count objects but its a test cluster, i have not more than 50 small files 2014-04-27 22:33 GMT+02:00 Andrey Korolyov and...@xdel.ru: What # of objects do you have? After all, such large footprint can be just a bug in your build if you do not have ultimate high object count(~1e8) or any extraordinary configuration parameter. On Mon, Apr 28, 2014 at 12:26 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: So, are you suggesting to lower the pg count ? Actually i'm using the suggested number of OSD*100/Replicas and I have just 2 OSDs per server. 2014-04-24 19:34 GMT+02:00 Andrey Korolyov and...@xdel.ru: On 04/24/2014 08:14 PM, Gandalf Corvotempesta wrote: During a recovery, I'm hitting oom-killer for ceph-osd because it's using more than 90% of avaialble ram (8GB) How can I decrease the memory footprint during a recovery ? You can reduce pg count per OSD for example, it scales down well enough. OSD memory footprint (during recovery or normal operations) depends of number of objects, e.g. commited data and total count of PGs per OSD. Because deleting some data is not an option, I may suggest only one remaining way :) I had raised related question a long ago, it was about post-recovery memory footprint patterns - OSD shrinks memory usage after successful recovery in a relatively long period, up to some days and by couple of fast 'leaps'. Heap has nothing to do with this bug I had not profiled the daemon itself yet. 
Re: [ceph-users] OOM-Killer for ceph-osd
What # of objects do you have? After all, such a large footprint can be just a bug in your build if you do not have an extremely high object count (~1e8) or any extraordinary configuration parameter. On Mon, Apr 28, 2014 at 12:26 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: So, are you suggesting to lower the pg count? Actually I'm using the suggested number, OSDs*100/replicas, and I have just 2 OSDs per server. 2014-04-24 19:34 GMT+02:00 Andrey Korolyov and...@xdel.ru: On 04/24/2014 08:14 PM, Gandalf Corvotempesta wrote: During a recovery, I'm hitting the oom-killer for ceph-osd because it's using more than 90% of available RAM (8GB). How can I decrease the memory footprint during a recovery? You can reduce the PG count per OSD, for example; it scales down well enough. OSD memory footprint (during recovery or normal operations) depends on the number of objects (i.e. committed data) and the total count of PGs per OSD. Since deleting some data is not an option, I may suggest only one remaining way :) I raised a related question long ago, about post-recovery memory footprint patterns - the OSD shrinks memory usage after a successful recovery over a relatively long period, up to some days, in a couple of fast 'leaps'. The heap seems to have nothing to do with this; I have not profiled the daemon itself yet.
Re: [ceph-users] OOM-Killer for ceph-osd
On Mon, Apr 28, 2014 at 1:26 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: 2014-04-27 23:20 GMT+02:00 Andrey Korolyov and...@xdel.ru: For the record, ``rados df'' will give an object count. Would you mind sending out your ceph.conf? I cannot imagine what parameter may raise memory consumption so dramatically, so the config at a glance may reveal some detail. Also a core dump should be extremely useful (though it's better to pass that to Inktank). http://pastie.org/pastes/9118130/text?key=vsjj5g4ybetbxu7swflyvq From the config below, I've removed the single mon definition to hide IPs and hostnames from posting to the ML. [global] auth cluster required = cephx auth service required = cephx auth client required = cephx fsid = 6b9916f9-c209-4f53-98c6-581adcdf0955 osd pool default pg num = 8192 osd pool default pgp num = 8192 osd pool default size = 3 [mon] mon osd down out interval = 600 mon osd mon down reporters = 7 [osd] osd mkfs type = xfs osd journal size = 16384 osd mon heartbeat interval = 30 # Performance tuning filestore merge threshold = 40 filestore split multiple = 8 osd op threads = 8 # Recovery tuning osd recovery max active = 5 osd max backfills = 2 osd recovery op priority = 2 Nothing looks wrong, except the heartbeat interval, which probably should be smaller due to recovery considerations. Try ``ceph osd tell X heap release'' and if it does not change memory consumption, file a bug.
Re: [ceph-users] OOM-Killer for ceph-osd
On 04/24/2014 08:14 PM, Gandalf Corvotempesta wrote: During a recovery, I'm hitting the oom-killer for ceph-osd because it's using more than 90% of available RAM (8GB). How can I decrease the memory footprint during a recovery? You can reduce the PG count per OSD, for example; it scales down well enough. OSD memory footprint (during recovery or normal operations) depends on the number of objects (i.e. committed data) and the total count of PGs per OSD. Since deleting some data is not an option, I may suggest only one remaining way :) I raised a related question long ago, about post-recovery memory footprint patterns - the OSD shrinks memory usage after a successful recovery over a relatively long period, up to some days, in a couple of fast 'leaps'. The heap seems to have nothing to do with this; I have not profiled the daemon itself yet.
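The rule of thumb quoted in this thread (OSDs × 100 / replicas, rounded up to a power of two) can be sketched as a small helper. This is only an illustrative calculation of the heuristic mentioned above, not an official Ceph tool:

```python
def suggested_pg_count(num_osds, replicas, per_osd_target=100):
    """Rule-of-thumb PG count: OSDs * target / replicas,
    rounded up to the next power of two (illustrative heuristic)."""
    raw = num_osds * per_osd_target // replicas
    power = 1
    while power < raw:
        power *= 2
    return power
```

For example, 100 OSDs with 3 replicas suggests 4096 PGs; scaling the PG count down per this formula is the lever discussed above for reducing OSD memory footprint.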
Re: [ceph-users] Largest Production Ceph Cluster
On 04/01/2014 03:51 PM, Robert Sander wrote: On 01.04.2014 13:38, Karol Kozubal wrote: I am curious to know what is the largest known ceph production deployment? I would assume it is the CERN installation. Have a look at the slides from Frankfurt Ceph Day: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern Regards Just curious how the CERN folks built the network topology to prevent possible cluster splits, because a split in the middle would cause huge downtime: even a relatively short split is enough for the remaining MON majority to mark half of those 1k OSDs as down.
Re: [ceph-users] Convert version 1 RBD to version 2?
On 03/25/2014 02:08 PM, Graeme Lambert wrote: Hi Stuart, If this helps, these three lines will do it for you. I'm sure you could rustle up a script to go through all of your images and do this for you. rbd export libvirt-pool/my-server - | rbd import --image-format 2 - libvirt-pool/my-server2 rbd rm libvirt-pool/my-server rbd mv libvirt-pool/my-server2 libvirt-pool/my-server Best regards Graeme This will actually commit all the bytes. If one prefers to keep discarded regions sparse, it is necessary to stage the copy on a filesystem (or implement a chunk-reader pipe for the rbd client). On 24/03/14 07:29, Stuart Longland wrote: Hi all, This might be a dumb question, but I'll ask anyway as I don't see it answered. I have a stack of RBD images in the default v1 format. I'd like to use COW-cloning on these. How does one convert them to version 2 format? Is there a tool to do the conversion or do I need to export each one and re-import them? Regards,
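The three commands above can be wrapped in a loop over all images in a pool. A minimal dry-run sketch that just generates the command lines for one image (the pool/image names are hypothetical; it only reproduces the export | import / rm / mv sequence quoted in the message):

```python
def v1_to_v2_commands(pool, image):
    """Generate the shell commands to convert one format-1 RBD image
    to format 2, following the sequence from the message above."""
    tmp = image + "2"  # hypothetical temporary image name
    return [
        f"rbd export {pool}/{image} - | rbd import --image-format 2 - {pool}/{tmp}",
        f"rbd rm {pool}/{image}",
        f"rbd mv {pool}/{tmp} {pool}/{image}",
    ]
```

Printing these for every image from `rbd ls <pool>` (and running them only after review) gives the script Graeme describes.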
Re: [ceph-users] OSD down after PG increase
On 03/13/2014 02:08 AM, Gandalf Corvotempesta wrote: I've increased the PG number on a running cluster. After this operation, all OSDs from one node were marked as down. Now, after a while, I'm seeing that the OSDs are slowly coming up again (sequentially) after rebalancing. Is this an expected behaviour? Hello Gandalf, Yes, if you have an essentially high amount of committed data in the cluster and/or a large number of PGs (tens of thousands). If you have room to experiment with this transition from scratch, you may want to play with the numbers in the OSD queues, since they cause deadlock-like behaviour on operations like increasing the PG count or large pool deletion. If the cluster has no I/O at all at the moment, such behaviour is definitely not expected.
Re: [ceph-users] Impact of disk encryption and recommendations?
Hello, On Mon, Mar 3, 2014 at 2:41 PM, Pieter Koorts pie...@heisenbug.io wrote: Hi Does disk encryption have a major impact on performance for a busy(ish) cluster? What are the thoughts on having encryption enabled for all disks by default? Encryption means stricter requirements for handling a power failure, because container contents may be lost entirely as easily as a regular filesystem may get corrupted by the same event. Therefore an enforced sync policy, along with additional CPU resource consumption and (very possibly) a battery requirement for the disk controller, describes all the difference. - Pieter
Re: [ceph-users] Impact of disk encryption and recommendations?
On 03/03/2014 06:55 PM, Sage Weil wrote: On Mon, 3 Mar 2014, Andrey Korolyov wrote: Hello, On Mon, Mar 3, 2014 at 2:41 PM, Pieter Koorts pie...@heisenbug.io wrote: Hi Does disk encryption have a major impact on performance for a busy(ish) cluster? What are the thoughts on having encryption enabled for all disks by default? Encryption means stricter requirements for handling a power failure, because container contents may be lost entirely as easily as a regular filesystem may get corrupted by the same event. Therefore an enforced sync policy, along with additional CPU resource consumption and (very possibly) a battery requirement for the disk controller, describes all the difference. Hi Andrey, You're talking about dm-crypt, right? How does that affect safety? I assumed that it passes IOs directly up and down the stack without reordering or buffering. Hi, Yes, my point is primarily about dm-crypt containers; the HDD cache matters there. Or, in the worst case, about any FS with default mount options when the volume is laid on it as a regular file (though nobody will do this for Ceph because of the related performance impact). My experience reflects actual results from 'benchmarking' this three years ago, but I don't think much has changed there. It may be interesting to collect fault statistics over different block sizes for crypto containers versus raw storage with default device cache settings.
Re: [ceph-users] Ceph RBD, VMs, btrfs, COW, OSD journals, f2fs, SSDs
Hello, Right now, none of the filesystems whose CoW features can be used by Ceph (btrfs, and zfs in the near future) is recommended for production use, and CoW makes sense only for the filestore mount point, not for the journal. I doubt there can be any performance advantage in using f2fs for the journal over a raw device target, but it is worth comparing against ext4. F2fs should be a great advantage for pure SSD-based storage and for the upcoming kv filestore, compared to the existing best practice of XFS. Probably f2fs will be able to reduce the wear-out factor compared to a regular discard(), but it would take really long to compare properly; I'll be happy if someone is able to make such a comparison. So, if you want to get rid of the locking behaviour when operating on huge snapshots, you may try btrfs, but it barely fits in any near-production environment. XFS is far more stable, but an unfortunate (very huge) snapshot deletion may tear down pool I/O latencies for a very long period. On Mon, Mar 3, 2014 at 1:19 AM, Joshua Dotson j...@wrale.com wrote: Hello, If I'm storing large VM images on Ceph RBD, and I have OSD journals on SSD, should I _not_ be using a copy-on-write file system on the OSDs? I read that large VM images don't play well with COW (e.g. btrfs) [1]. Does Ceph improve this situation? Would btrfs outperform non-COW filesystems in this setting? Also, I'm considering placing my OSD journals on f2fs-formatted partitions on my Samsung SSDs for hardware resiliency (Samsung created both my SSDs and f2fs) [2]. F2FS uses copy on write [3]. Has anyone ever tried this? Thoughts?
[1] https://wiki.archlinux.org/index.php/Btrfs#Copy-On-Write_.28CoW.29 [2] https://www.usenix.org/legacy/event/fast12/tech/full_papers/Min.pdf [3] http://www.dslreports.com/forum/r27846667- Thanks, Joshua
Re: [ceph-users] Constant slow / blocked requests with otherwise healthy cluster
Hey, What number do you have for a replication factor? If it is three, 1.5k IOPS may be a little high for 36 disks, and your OSD ids look a bit suspicious - there should not be 60+ OSDs based on the numbers below. On 11/28/2013 12:45 AM, Oliver Schulz wrote: Dear Ceph Experts, our Ceph cluster suddenly went into a state of OSDs constantly having blocked or slow requests, rendering the cluster unusable. This happened during normal use; there were no updates, etc. All disks seem to be healthy (smartctl, iostat, etc.). A complete hardware reboot including a system update on all nodes has not helped. The network equipment also shows no trouble. We'd be glad for any advice on how to diagnose and solve this, as the cluster is basically at a standstill and we urgently need to get it back into operation. Cluster structure: 6 nodes, 6x 3TB disks plus 1x system/journal SSD per node, one OSD per disk. We're running ceph version 0.67.4-1precise on Ubuntu 12.04.3 with kernel 3.8.0-33-generic (x86_64). ceph status shows something like (it varies): cluster 899509fe-afe4-42f4-a555-bb044ca0f52d health HEALTH_WARN 77 requests are blocked 32 sec monmap e1: 3 mons at {a=134.107.24.179:6789/0,b=134.107.24.181:6789/0,c=134.107.24.183:6789/0}, election epoch 312, quorum 0,1,2 a,b,c osdmap e32600: 36 osds: 36 up, 36 in pgmap v16404527: 14304 pgs: 14304 active+clean; 20153 GB data, 60630 GB used, 39923 GB / 100553 GB avail; 1506KB/s rd, 21246B/s wr, 545op/s mdsmap e478: 1/1/1 up {0=c=up:active}, 1 up:standby-replay ceph health detail shows something like (it varies): HEALTH_WARN 363 requests are blocked 32 sec; 22 osds have slow requests 363 ops are blocked 32.768 sec 1 ops are blocked 32.768 sec on osd.0 8 ops are blocked 32.768 sec on osd.3 37 ops are blocked 32.768 sec on osd.12 [...]
11 ops are blocked 32.768 sec on osd.62 45 ops are blocked 32.768 sec on osd.65 22 osds have slow requests The number and identity of affected OSDs constantly changes (sometimes health even goes to OK for a moment). Cheers and thanks for any ideas, Oliver
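When the set of affected OSDs keeps changing like this, it helps to aggregate the `ceph health detail` output over time. A rough parsing sketch, with the line format taken from the output quoted above:

```python
import re

# matches lines like "37 ops are blocked 32.768 sec on osd.12"
BLOCKED_RE = re.compile(r"(\d+) ops are blocked [\d.]+ sec on (osd\.\d+)")

def blocked_ops_per_osd(health_detail):
    """Sum blocked-op counts per OSD from `ceph health detail` output."""
    totals = {}
    for count, osd in BLOCKED_RE.findall(health_detail):
        totals[osd] = totals.get(osd, 0) + int(count)
    return totals
```

Feeding it successive snapshots and merging the dictionaries would show whether the slowness really wanders across all OSDs or concentrates on a few.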
Re: [ceph-users] Placement groups on a 216 OSD cluster with multiple pools
Of course, but it means that in case of a failure you can no longer trust your data consistency and should recheck it against separately stored checksums or so. I'm leaving aside the fact that Ceph will probably not recover a pool properly with a replication number lower than 2 in many cases. So generally yes, one may use no replication, but it does not make sense: for small amounts of data the savings are very small, and for larger data the cost of recheck/re-upload will be higher than the cost of permanent additional storage. On 11/15/2013 02:27 AM, Nigel Williams wrote: On 15/11/2013 8:57 AM, Dane Elwell wrote: [2] - I realise the dangers/stupidity of a replica size of 0, but some of the data we wish to store just isn't /that/ important. We've been thinking of this too. The application is storing boot images, ISOs, local repository mirrors etc., where recovery is easy, with a slight inconvenience if the data has to be re-fetched. This suggests a neat additional feature for Ceph would be the ability to have metadata attached to zero-replica objects that includes a URL where a copy could be recovered/re-fetched. Then it could all happen auto-magically. We also have users trampolining data between systems in order to buffer fast data streams or handle data surges. This can be zero-replica too.
[ceph-users] Recovery took too long on cuttlefish
Hello, Using 5c65e1ee3932a021cfd900a74cdc1d43b9103f0f with a large amount of committed data and a relatively low PG count, I've observed unexplainably long recovery times for PGs even when the degraded object count is almost zero: 04:44:42.521896 mon.0 [INF] pgmap v24807947: 2048 pgs: 911 active+clean, 1131 active+recovery_wait, 6 active+recovering; 5389 GB data, 16455 GB used, 87692 GB / 101 TB avail; 5839KB/s rd, 2986KB/s wr, 567op/s; 865/4162926 degraded (0.021%); recovering 2 o/s, 9251KB/s At this moment we have a freshly restarted cluster and a large number of PGs in recovery_wait state; after a couple of minutes the picture changes a little: 2013-11-13 05:30:18.020093 mon.0 [INF] pgmap v24809483: 2048 pgs: 939 active+clean, 1105 active+recovery_wait, 4 active+recovering; 5394 GB data, 16472 GB used, 87676 GB / 101 TB avail; 1627KB/s rd, 3866KB/s wr, 1499op/s; 2456/4167201 degraded (0.059%) and after a couple of hours we reach a peak in degraded objects, with PGs still moving to active+clean: 2013-11-13 10:05:36.245917 mon.0 [INF] pgmap v24816326: 2048 pgs: 1191 active+clean, 854 active+recovery_wait, 3 active+recovering; 5467 GB data, 16690 GB used, 87457 GB / 101 TB avail; 16339KB/s rd, 18006KB/s wr, 16025op/s; 23495/4223061 degraded (0.556%) After the peak passed, the degraded object count starts to decrease, and it seems the cluster will reach a completely clean state in the next ten hours.
For comparison, with a PG count ten times higher, recovery goes much faster: 2013-11-05 03:20:56.330767 mon.0 [INF] pgmap v24143721: 27648 pgs: 26171 active+clean, 1474 active+recovery_wait, 3 active+recovering; 7855 GB data, 25609 GB used, 78538 GB / 101 TB avail; 3298KB/s rd, 7746KB/s wr, 3581op/s; 183/6554634 degraded (0.003%) 2013-11-05 04:04:55.779345 mon.0 [INF] pgmap v24145291: 27648 pgs: 27646 active+clean, 1 active+recovery_wait, 1 active+recovering; 7857 GB data, 25615 GB used, 78533 GB / 101 TB avail; 999KB/s rd, 690KB/s wr, 563op/s Recovery and backfill settings were the same during all tests: osd_max_backfills: 1, osd_backfill_full_ratio: 0.85, osd_backfill_retry_interval: 10, osd_recovery_threads: 1, osd_recover_clone_overlap: true, osd_backfill_scan_min: 64, osd_backfill_scan_max: 512, osd_recovery_thread_timeout: 30, osd_recovery_delay_start: 300, osd_recovery_max_active: 5, osd_recovery_max_chunk: 8388608, osd_recovery_forget_lost_objects: false, osd_kill_backfill_at: 0, osd_debug_skip_full_check_in_backfill_reservation: false, osd_recovery_op_priority: 10 Also during recovery some heartbeats may be missed; this is not related to the current situation but has been observed for a very long time (for now, four-second delays between heartbeats seem distributed almost randomly over the time flow): 2013-11-13 14:57:11.316459 mon.0 [INF] pgmap v24826822: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 16098KB/s rd, 4085KB/s wr, 623op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:12.328538 mon.0 [INF] pgmap v24826823: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 3806KB/s rd, 3446KB/s wr, 284op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:13.336618 mon.0 [INF] pgmap v24826824: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB
avail; 11051KB/s rd, 12171KB/s wr, 1470op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:16.317271 mon.0 [INF] pgmap v24826825: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 3610KB/s rd, 3171KB/s wr, 1820op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:17.366554 mon.0 [INF] pgmap v24826826: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 11323KB/s rd, 1759KB/s wr, 13195op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:18.379340 mon.0 [INF] pgmap v24826827: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 38113KB/s rd, 7461KB/s wr, 46511op/s; 15670/4227330 degraded (0.371%)
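The degraded ratio printed in these pgmap lines (e.g. "15670/4227330 degraded (0.371%)") is just the degraded/total object fraction; a quick sanity check of the numbers quoted above:

```python
def degraded_percent(degraded, total):
    """Percentage of degraded objects, as printed in the pgmap status line."""
    return 100.0 * degraded / total
```

This confirms the monitor's own arithmetic, e.g. 865 of 4162926 objects is the 0.021% shown in the first pgmap line of this message.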
Re: [ceph-users] ceph-osd and btrfs results in high disk usage
I observed the same behaviour (higher disk resource consumption than xfs) before, but I wonder how it's possible to get 5K IOPS on a regular (even cache-backed) device. On Sat, Nov 9, 2013 at 7:55 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi, I've deployed two OSDs with btrfs but I'm seeing really crazy disk / fs usage. The xfs ones have a constant 10-20MB/s; the btrfs one has a constant 100MB/s with 5000 iop/s. And it's not btrfs, it's the ceph-osd doing this amount of I/O; at least iotop shows the ceph-osd writing this massive amount of data. Is this correct? Has anybody seen this before? Greets, Stefan
Re: [ceph-users] Ceph Block Storage QoS
On Thu, Nov 7, 2013 at 11:50 PM, Wido den Hollander w...@42on.com wrote: On 11/07/2013 08:42 PM, Gruher, Joseph R wrote: Is there any plan to implement some kind of QoS in Ceph? Say I want to provide service level assurance to my OpenStack VMs and I might have to throttle bandwidth to some to provide adequate bandwidth to others - is anything like that planned for Ceph? Generally with regard to block storage (RBDs), not object or filesystem. Or is there already a better way to do this elsewhere in the OpenStack cloud? I don't know if OpenStack supports it, but in CloudStack we recently implemented the I/O throttling mechanism of Qemu via libvirt. That might be a solution if OpenStack implements it as well? Just a side note - current QEMU implements gentler throttling than the rest of the versions, and it is a very useful thing for handling NBD I/O bursts. Thanks, Joe -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] Disk Density Considerations
On Wed, Nov 6, 2013 at 4:15 PM, Darren Birkett darren.birk...@gmail.com wrote: Hi, I understand from various reading and research that there are a number of things to consider when deciding how many disks one wants to put into a single chassis: 1. Higher density means a larger failure domain (more data to re-replicate if you lose a node) 2. More disks means more CPU/memory horsepower to handle the number of OSDs 3. The network becomes a bottleneck with too many OSDs per node 4. ... We are looking at building high density nodes for small scale 'starter' deployments for our customers (maybe 4 or 5 nodes). High density in this case could mean a 2U chassis with 2x external 45-disk JBOD enclosures attached. That's 90 3TB disks/OSDs to be managed by a single node. That's about 243TB of potential usable space, and so (assuming up to 75% fillage) maybe 182TB of potential data 'loss' in the event of a node failure. On an uncongested, unused, 10Gbps network, my back-of-a-beer-mat calculations say that would take about 45 hours to get the cluster back into an undegraded state - that is, the requisite number of copies of all objects. For such a large number of disks you should consider that cache amortization will not take place even if you are using 1GB controller(s) - only a tiered cache can be an option. Also recovery will take much more time even if you leave room for client I/O in the calculations, because raw disks have very limited IOPS capacity, so recovery will either take much longer than such expectations at a glance or affect regular operations. For S3/Swift it may be acceptable, but for VM images it is not. Assuming that you can shove in a pair of hex-core hyperthreaded processors, you're probably OK with number 2. If you're already considering 10GbE networking for the storage network, there's probably not much you can do about 3 unless you want to spend a lot more money (and the reason we're going so dense is to keep this as a cheap option).
So the main thing would seem to be a real fear of 'losing' so much data in the event of a node failure. Who wants to wait 45 hours (probably much longer, assuming the cluster remains live and has production traffic traversing that network) for the cluster to self-heal? But surely this fear is based on the assumption that in that time you've not identified and replaced the failed chassis. That you would sit for 2-3 days and just leave the cluster to catch up, and not actually address the broken node. Given good data centre processes and a good stock of spare parts, isn't it more likely that you'd have replaced that node and got things back up and running in a matter of hours? In all likelihood, a node crash/failure is not likely to have taken out all, or maybe any, of the disks, and a new chassis can just have the JBODs plugged back in and away you go? I'm sure I'm missing some other pieces, but if you're comfortable with your hardware replacement processes, doesn't number 1 become a non-fear really? I understand that in some ways it goes against the concept of Ceph being self-healing, and that in an ideal world you'd have lots of lower density nodes to limit your failure domain, but when being driven by cost isn't this an OK way to look at things? What other glaringly obvious considerations am I missing with this approach? Darren
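The back-of-a-beer-mat estimate in this thread (182 TB over an uncongested 10 Gbps link taking roughly 45 hours) is easy to reproduce. The numbers here are illustrative only; real recovery is slower because the link is shared with client I/O and the disks themselves limit throughput, as noted above:

```python
def rereplication_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours to re-replicate data_tb terabytes over a link_gbps gigabit/s link,
    optionally scaled by an efficiency factor (1.0 = the link is fully dedicated)."""
    bytes_total = data_tb * 1e12
    bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return bytes_total / bytes_per_sec / 3600
```

182 TB at a full 10 Gbps comes out to roughly 40 hours, in the same ballpark as the 45-hour figure; dropping the efficiency factor to account for live traffic stretches it to days.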
Re: [ceph-users] Disk Density Considerations
On Wed, Nov 6, 2013 at 6:42 PM, Darren Birkett darren.birk...@gmail.com wrote: On 6 November 2013 14:08, Andrey Korolyov and...@xdel.ru wrote: We are looking at building high density nodes for small scale 'starter' deployments for our customers (maybe 4 or 5 nodes). High density in this case could mean a 2U chassis with 2x external 45-disk JBOD enclosures attached. That's 90 3TB disks/OSDs to be managed by a single node. That's about 243TB of potential usable space, and so (assuming up to 75% fillage) maybe 182TB of potential data 'loss' in the event of a node failure. On an uncongested, unused, 10Gbps network, my back-of-a-beer-mat calculations say that would take about 45 hours to get the cluster back into an undegraded state - that is, the requisite number of copies of all objects. For such a large number of disks you should consider that cache amortization will not take place even if you are using 1GB controller(s) - only a tiered cache can be an option. Also recovery will take much more time even if you leave room for client I/O in the calculations, because raw disks have very limited IOPS capacity, so recovery will either take much longer than such expectations at a glance or affect regular operations. For S3/Swift it may be acceptable, but for VM images it is not. Sure, but my argument was that you are never likely to actually let that entire recovery operation complete - you're going to replace the hardware and plug the disks back in and let them catch up by log replay/backfill. Assuming you don't ever actually expect to really lose all data on 90 disks in one go... By tiered caching, do you mean using something like flashcache or bcache? Exactly - just another step to offload the CPU from I/O time.
[ceph-users] Cuttlefish: pool recreation results in cluster crash
Hello, I was able to reproduce the following on top of current cuttlefish: - create a pool, - delete it after all PGs are initialized, - create a new pool with the same name after, say, ten seconds. All OSDs die immediately with the attached trace. The problem exists in bobtail as well. [Attachment: pool-recreate.txt.gz - GNU Zip compressed data]
Re: [ceph-users] Excessive mon memory usage in cuttlefish 0.61.8
On Thu, Sep 19, 2013 at 8:12 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: On 09/19/2013 04:46 PM, Andrey Korolyov wrote: On Thu, Sep 19, 2013 at 1:00 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: On 09/18/2013 11:25 PM, Andrey Korolyov wrote: Hello, I just restarted one of my mons after a month of uptime; its memory commit was ten times higher than before: 13206 root 10 -10 12.8g 8.8g 107m S65 14.0 0:53.97 ceph-mon A normal one looks like: 30092 root 10 -10 4411m 790m 46m S 1 1.2 1260:28 ceph-mon Try running 'ceph heap stats', followed by 'ceph heap release', and then recheck the memory consumption for the monitor. It had shrunk to 350M RSS overnight, so it seems I need to restart this mon again, or try another one, to reproduce the problem over the next night. This monitor was the leader, so I may check against the other ones and see their peak consumption. Was that monitor attempting to join the quorum? No, it had joined long before. As we discussed in IRC, I restarted a non-leader mon; here are some stats from the freshly started mon process (which joined the quorum two minutes ago): ceph heap stats --keyfile admin -m 10.5.0.17:6789 mon.2 tcmalloc heap stats: MALLOC: 26256488 ( 25.0 MiB) Bytes in use by application MALLOC: + 11240284160 (10719.6 MiB) Bytes in page heap freelist MALLOC: + 3184848 (3.0 MiB) Bytes in central cache freelist MALLOC: + 8974848 (8.6 MiB) Bytes in transfer cache freelist MALLOC: + 15560904 ( 14.8 MiB) Bytes in thread cache freelists MALLOC: + 22114456 ( 21.1 MiB) Bytes in malloc metadata MALLOC: MALLOC: = 11316375704 (10792.1 MiB) Actual memory used (physical + swap) MALLOC: + 90226688 ( 86.0 MiB) Bytes released to OS (aka unmapped) MALLOC: MALLOC: = 11406602392 (10878.2 MiB) Virtual address space used MALLOC: MALLOC: 4140 Spans in use MALLOC: 14 Thread heaps in use MALLOC: 8192 Tcmalloc page size Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
and after calling release: # ceph heap release --keyfile admin -m 10.5.0.17:6789 mon.2 releasing free RAM back to system. # ceph heap stats --keyfile admin -m 10.5.0.17:6789 mon.2 tcmalloc heap stats: MALLOC: 38925536 ( 37.1 MiB) Bytes in use by application MALLOC: + 13508608 ( 12.9 MiB) Bytes in page heap freelist MALLOC: + 2992112 (2.9 MiB) Bytes in central cache freelist MALLOC: + 12092416 ( 11.5 MiB) Bytes in transfer cache freelist MALLOC: + 17547056 ( 16.7 MiB) Bytes in thread cache freelists MALLOC: + 22114456 ( 21.1 MiB) Bytes in malloc metadata MALLOC: MALLOC: = 107180184 ( 102.2 MiB) Actual memory used (physical + swap) MALLOC: + 11299422208 (10776.0 MiB) Bytes released to OS (aka unmapped) MALLOC: MALLOC: = 11406602392 (10878.2 MiB) Virtual address space used MALLOC: MALLOC: 4678 Spans in use MALLOC: 14 Thread heaps in use MALLOC: 8192 Tcmalloc page size Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). -Joao -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
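In the tcmalloc output above, the telling number is "Bytes in page heap freelist" - memory tcmalloc holds but the application no longer uses, which is exactly what `heap release` hands back to the OS. A small parser sketch for that stats format (format copied from the output quoted above):

```python
import re

def page_heap_freelist_bytes(heap_stats):
    """Extract the 'Bytes in page heap freelist' value from
    tcmalloc heap stats output, or None if the line is absent."""
    m = re.search(
        r"MALLOC:\s*\+?\s*(\d+)\s*\([^)]*\)\s*Bytes in page heap freelist",
        heap_stats,
    )
    return int(m.group(1)) if m else None
```

Comparing this value before and after `heap release` (~10.7 GiB vs ~13 MiB in the quoted run) shows the "leak" was freelist retention rather than live allocations.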
Re: [ceph-users] Excessive mon memory usage in cuttlefish 0.61.8
On Thu, Sep 19, 2013 at 1:00 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: On 09/18/2013 11:25 PM, Andrey Korolyov wrote: Hello, I just restarted one of my mons after a month of uptime, and its memory commit rose to ten times higher than before: 13206 root 10 -10 12.8g 8.8g 107m S 65 14.0 0:53.97 ceph-mon. A normal one looks like: 30092 root 10 -10 4411m 790m 46m S 1 1.2 1260:28 ceph-mon. Try running 'ceph heap stats', followed by 'ceph heap release', and then recheck the memory consumption for the monitor. It had shrunk to 350M RSS overnight, so it seems I need to restart this mon again, or try another one, to reproduce the problem over the next night. This monitor was the leader, so I may check against the other ones and see their peak consumption. The monstore has a similar size, about 15G per monitor, so the only real problem is the very unusual and terrifying memory consumption. It is also possible that the remaining mons are running 0.61.7; the binary was updated long ago, so it's hard to tell which version is running without doing a dump and disrupting the quorum for a little while. Anyway, I should tame the current one. How big is the cluster? 15GB for the monitor store may not be that surprising if you have a bunch of OSDs and they're not completely healthy, as that will prevent the removal of old maps on the monitor side. ~8.5T committed and 2.5M objects, but the cluster is completely healthy at the moment, though recently I ran a couple of recovery procedures and that may affect mondir data allocation on day-long intervals. This could also be an issue with store compaction. Also, you should check whether the monitors are running 0.61.7 and, if so, update them to 0.61.8. You should be able to get that version using the admin socket if you want to. Just checked, the same 0.61.8.
-Joao -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
[ceph-users] Excessive mon memory usage in cuttlefish 0.61.8
Hello, I just restarted one of my mons after a month of uptime, and its memory commit rose to ten times higher than before: 13206 root 10 -10 12.8g 8.8g 107m S 65 14.0 0:53.97 ceph-mon. A normal one looks like: 30092 root 10 -10 4411m 790m 46m S 1 1.2 1260:28 ceph-mon. The monstore has a similar size, about 15G per monitor, so the only real problem is the very unusual and terrifying memory consumption. It is also possible that the remaining mons are running 0.61.7; the binary was updated long ago, so it's hard to tell which version is running without doing a dump and disrupting the quorum for a little while. Anyway, I should tame the current one.
Re: [ceph-users] Hit suicide timeout on osd start
A little follow-up: one of the cluster nodes (from the not-yet-restarted set) went into some kind of flapping state, exposing CPU consumption peaks and latency spikes every 50 seconds. Even more interesting, when we injected a non-zero debug_ms the latency spikes went away, but the CPU peaks remained. In the picture[0] below, we injected debug_ms 1 (with the log file set to /dev/null) at 19:03 and set it back to 0 at 19:13.
0. http://i.imgur.com/8BBWM7o.png
On Wed, Sep 11, 2013 at 5:05 AM, Andrey Korolyov and...@xdel.ru wrote: Hello, I got the so-famous error on 0.61.8, just from a little disk overload on OSD daemon start. I currently have very large metadata per osd (about 20G); this may be an issue.
#0  0x7f2f46adeb7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00860469 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  signal handler called
#4  0x7f2f44b45405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7f2f44b48b5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7f2f4544389d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7f2f45441996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7f2f454419c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7f2f45441bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0090d2fa in ceph::__ceph_assert_fail (assertion=0xa38ab1 "0 == \"hit suicide timeout\"", file=<optimized out>, line=79, func=0xa38c60 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:77
#11 0x0087914b in ceph::HeartbeatMap::_check (this=this@entry=0x26560e0, h=<optimized out>, who=who@entry=0xa38b40 "is_healthy", now=now@entry=1378860192) at common/HeartbeatMap.cc:79
#12 0x00879956 in ceph::HeartbeatMap::is_healthy (this=this@entry=0x26560e0) at common/HeartbeatMap.cc:130
#13 0x00879f08 in ceph::HeartbeatMap::check_touch_file (this=0x26560e0) at common/HeartbeatMap.cc:141
#14 0x009189f5 in CephContextServiceThread::entry (this=0x2652200) at common/ceph_context.cc:68
#15 0x7f2f46ad6e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x7f2f44c013dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x in ?? ()
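For context on the crash: the assert in frame #10 fires when a worker thread has not touched its heartbeat handle within the suicide grace period, which is exactly what sustained disk overload on OSD start can cause. A rough Python sketch of the check performed in HeartbeatMap::_check (heavily simplified; the class shape and the grace values here are illustrative, not Ceph's actual code or defaults):

```python
import time

class HeartbeatHandle:
    def __init__(self, name, grace, suicide_grace):
        self.name = name
        self.grace = grace                  # warn threshold (seconds)
        self.suicide_grace = suicide_grace  # abort threshold (seconds)
        self.last_touch = time.time()

    def touch(self):
        """Worker thread calls this to prove it is still making progress."""
        self.last_touch = time.time()

def check(handle, now):
    """Return 'healthy'/'unhealthy', or raise past the suicide grace."""
    age = now - handle.last_touch
    if age > handle.suicide_grace:
        # Mirrors ceph's assert(0 == "hit suicide timeout"), which aborts
        raise AssertionError("hit suicide timeout")
    return "unhealthy" if age > handle.grace else "healthy"

h = HeartbeatHandle("osd_op_tp", grace=15, suicide_grace=150)
h.last_touch = time.time() - 20      # thread stalled for 20 seconds
print(check(h, time.time()))         # past grace, below suicide: unhealthy
```

The practical takeaway from the thread is that a slow store (20G of metadata per osd) can keep threads busy long enough to blow past the suicide grace during startup.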
Re: [ceph-users] rbd cp copies of sparse files become fully allocated
On Tue, Sep 10, 2013 at 3:03 AM, Josh Durgin josh.dur...@inktank.com wrote: On 09/09/2013 04:57 AM, Andrey Korolyov wrote: May I also suggest the same for the export/import mechanism? Say, if an image was created by fallocate we may also want to leave holes upon upload, and vice-versa for export. Import and export already omit runs of zeroes. They could detect smaller runs (currently they look at object-size chunks), and export might be more efficient if it used diff_iterate() instead of read_iterate(). Have you observed them misbehaving with sparse images? Did you mean dumpling? When I checked some months ago, cuttlefish did not have such a feature. On Mon, Sep 9, 2013 at 8:45 AM, Sage Weil s...@inktank.com wrote: On Sat, 7 Sep 2013, Oliver Daudey wrote: Hey all, This topic has been partly discussed here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/000799.html Tested on Ceph version 0.67.2. If you create a fresh empty image of, say, 100GB in size on RBD and then use rbd cp to make a copy of it, even though the image is sparse, the command will attempt to read every part of it and take far more time than expected. After reading the above thread, I understand why the copy of an essentially empty sparse image on RBD would take so long, but it doesn't explain why the copy won't be sparse itself. If I use rbd cp to copy an image, the copy will take its full allocated size on disk, even if the original was empty. If I use the QEMU qemu-img tool's convert option to convert the original image to the copy without changing the format, essentially only making a copy, it takes its time as well, but it will be faster than rbd cp and the resulting copy will be sparse. Example commands:
rbd create --size 102400 test1
rbd cp test1 test2
qemu-img convert -p -f rbd -O rbd rbd:rbd/test1 rbd:rbd/test3
Shouldn't rbd cp at least have an option to attempt to sparsify the copy, or copy the sparse parts as sparse? Same goes for rbd clone, BTW. Yep, this is in fact a bug.
Opened http://tracker.ceph.com/issues/6257. Thanks! sage
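Josh's point about import/export omitting runs of zeroes comes down to scanning each chunk before writing it and skipping the all-zero ones, which is what a sparsifying rbd cp would need to do too. A minimal sketch of the idea over a generic chunked copy loop (illustrative only, not rbd's actual implementation):

```python
def copy_sparse(read_chunk, write_chunk, size, chunk_size=4 * 1024 * 1024):
    """Copy 'size' bytes chunk by chunk, skipping all-zero chunks.

    read_chunk(offset, length) -> bytes; write_chunk(offset, data) writes.
    Skipped chunks leave holes in the destination, keeping it sparse.
    Returns the number of bytes actually written.
    """
    zero = bytes(chunk_size)
    written = 0
    for off in range(0, size, chunk_size):
        length = min(chunk_size, size - off)
        data = read_chunk(off, length)
        if data == zero[:length]:
            continue  # hole: allocate nothing in the destination
        write_chunk(off, data)
        written += length
    return written

# Toy backing store: 12 bytes in 4-byte chunks, only the middle chunk has data.
src = bytearray(12)
src[5] = 0xFF
dst = {}
n = copy_sparse(lambda o, l: bytes(src[o:o + l]),
                lambda o, d: dst.__setitem__(o, d),
                len(src), chunk_size=4)
print(n, sorted(dst))  # 4 [4] -> only the one non-zero chunk was written
```

For a fresh empty 100GB image, every chunk compares equal to zeroes, so nothing is written and the copy stays sparse; this is the behavior the thread asks rbd cp to adopt.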
[ceph-users] Hit suicide timeout on osd start
Hello, I got the so-famous error on 0.61.8, just from a little disk overload on OSD daemon start. I currently have very large metadata per osd (about 20G); this may be an issue.
#0  0x7f2f46adeb7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00860469 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  signal handler called
#4  0x7f2f44b45405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7f2f44b48b5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7f2f4544389d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7f2f45441996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7f2f454419c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7f2f45441bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0090d2fa in ceph::__ceph_assert_fail (assertion=0xa38ab1 "0 == \"hit suicide timeout\"", file=<optimized out>, line=79, func=0xa38c60 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:77
#11 0x0087914b in ceph::HeartbeatMap::_check (this=this@entry=0x26560e0, h=<optimized out>, who=who@entry=0xa38b40 "is_healthy", now=now@entry=1378860192) at common/HeartbeatMap.cc:79
#12 0x00879956 in ceph::HeartbeatMap::is_healthy (this=this@entry=0x26560e0) at common/HeartbeatMap.cc:130
#13 0x00879f08 in ceph::HeartbeatMap::check_touch_file (this=0x26560e0) at common/HeartbeatMap.cc:141
#14 0x009189f5 in CephContextServiceThread::entry (this=0x2652200) at common/ceph_context.cc:68
#15 0x7f2f46ad6e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x7f2f44c013dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x in ?? ()
Re: [ceph-users] rbd cp copies of sparse files become fully allocated
May I also suggest the same for the export/import mechanism? Say, if an image was created by fallocate we may also want to leave holes upon upload, and vice-versa for export. On Mon, Sep 9, 2013 at 8:45 AM, Sage Weil s...@inktank.com wrote: On Sat, 7 Sep 2013, Oliver Daudey wrote: Hey all, This topic has been partly discussed here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/000799.html Tested on Ceph version 0.67.2. If you create a fresh empty image of, say, 100GB in size on RBD and then use rbd cp to make a copy of it, even though the image is sparse, the command will attempt to read every part of it and take far more time than expected. After reading the above thread, I understand why the copy of an essentially empty sparse image on RBD would take so long, but it doesn't explain why the copy won't be sparse itself. If I use rbd cp to copy an image, the copy will take its full allocated size on disk, even if the original was empty. If I use the QEMU qemu-img tool's convert option to convert the original image to the copy without changing the format, essentially only making a copy, it takes its time as well, but it will be faster than rbd cp and the resulting copy will be sparse. Example commands:
rbd create --size 102400 test1
rbd cp test1 test2
qemu-img convert -p -f rbd -O rbd rbd:rbd/test1 rbd:rbd/test3
Shouldn't rbd cp at least have an option to attempt to sparsify the copy, or copy the sparse parts as sparse? Same goes for rbd clone, BTW. Yep, this is in fact a bug. Opened http://tracker.ceph.com/issues/6257. Thanks! sage
[ceph-users] Removing osd with zero data causes placement shift
Hello, I had a couple of osds in down+out state and a completely clean cluster, but after issuing ``osd crush remove'' there was some data redistribution: a shift proportional to the osd weight in the crushmap, though roughly half the amount of data movement that 'osd out' causes for an osd of the same weight. This is some sort of non-idempotency kept at least since the bobtail series.
[ceph-users] OSD crash upon pool creation
Hello, Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5 I observed some disaster-like behavior after the ``pool create'' command: every osd daemon in the cluster will die at least once (some will crash several times in a row after being brought back). Please take a look at the backtraces (almost identical) below. Issue #5637 has been created in the tracker. Thanks!
http://xdel.ru/downloads/poolcreate.txt.gz
http://xdel.ru/downloads/poolcreate2.txt.gz
Re: [ceph-users] journal size suggestions
On Wed, Jul 10, 2013 at 3:28 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: Thank you for the response. You are talking of median expected writes, but should I consider the single-disk write speed or the network speed? A single disk is 100MB/s, so 100*30 = 3000MB of journal for each osd? Or should I consider the network speed, which is 1.25GB/s? Why 30 seconds? The default flush frequency is 5 seconds. What do you mean by fine-tuning spinning storage media? Which tuning are you referring to? Since the journal is created on a per-osd basis, you should calculate it with only the disk speed in mind. As I remember, no one referred directly to the flush interval when recommending tens of seconds for this calculation, and neither do I; it's just a safe road to have some capacity over that value. By fine-tuning I meant such things as readahead values, the number of internal XFS partitions, the size of XFS chunks, the hardware controller cache policy (if you have one) and so on. Being honest, filesystem tuning does not affect performance much on general workload types, but it may matter greatly for some specific things, like digits in a benchmark :) . On 09 Jul 2013 at 23:45, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jul 10, 2013 at 1:16 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: Hi, I'm planning a new cluster on a 10GbE network. Each storage node will have a maximum of 12 SATA disks and 2 SSDs as journals. What do you suggest as the journal size for each OSD? Is 5GB enough? Should I just consider the SATA write speed when calculating the journal size, or also the network speed? Hello, As many recommendations have suggested before, you may set the journal size proportional to the amount of median (or peak, if expected) writes multiplied by, say, thirty seconds; that's the safe area, and you should not suffer because of journal size if you follow this calculation. Twelve SATA disks in theory may have enough output to thrash a 10G network, but you'll run out of IOPS long before that almost for sure, and OSD daemons do not work very close to the physical limits when transferring data from/to disk, so fine-tuning of the spinning storage media is still the primary thing to play with in such a configuration.
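The sizing rule discussed above is plain arithmetic; a tiny sketch (an illustrative helper, not an official Ceph formula) using the numbers from this thread, which also shows why sizing for the network instead of the disk inflates the journal pointlessly:

```python
def journal_size_mb(write_rate_mb_s, window_s=30):
    """Journal capacity = sustained write rate x safety window (seconds).

    Per the advice in the thread, write_rate_mb_s should be the disk's
    rate, since the journal is created on a per-osd basis.
    """
    return write_rate_mb_s * window_s

print(journal_size_mb(100))   # 3000 MB for a 100 MB/s SATA disk
print(journal_size_mb(1250))  # 37500 MB if (wrongly) sized for 10GbE
```

With the 5-second default flush interval the 30-second window leaves a generous margin, which matches the "safe road" framing in the reply.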
Re: [ceph-users] journal size suggestions
On Wed, Jul 10, 2013 at 1:16 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: Hi, I'm planning a new cluster on a 10GbE network. Each storage node will have a maximum of 12 SATA disks and 2 SSDs as journals. What do you suggest as the journal size for each OSD? Is 5GB enough? Should I just consider the SATA write speed when calculating the journal size, or also the network speed? Hello, As many recommendations have suggested before, you may set the journal size proportional to the amount of median (or peak, if expected) writes multiplied by, say, thirty seconds; that's the safe area, and you should not suffer because of journal size if you follow this calculation. Twelve SATA disks in theory may have enough output to thrash a 10G network, but you'll run out of IOPS long before that almost for sure, and OSD daemons do not work very close to the physical limits when transferring data from/to disk, so fine-tuning of the spinning storage media is still the primary thing to play with in such a configuration.