Re: [ceph-users] Data still in OSD directories after removing
On Thu, May 22, 2014 at 12:56 PM, Olivier Bonvaletwrote: > > Le mercredi 21 mai 2014 à 18:20 -0700, Josh Durgin a écrit : >> On 05/21/2014 03:03 PM, Olivier Bonvalet wrote: >> > Le mercredi 21 mai 2014 à 08:20 -0700, Sage Weil a écrit : >> >> You're certain that that is the correct prefix for the rbd image you >> >> removed? Do you see the objects lists when you do 'rados -p rbd ls - | >> >> grep '? >> > >> > I'm pretty sure yes : since I didn't see a lot of space freed by the >> > "rbd snap purge" command, I looked at the RBD prefix before to do the >> > "rbd rm" (it's not the first time I see that problem, but previous time >> > without the RBD prefix I was not able to check). >> > >> > So : >> > - "rados -p sas3copies ls - | grep rb.0.14bfb5a.238e1f29" return nothing >> > at all >> > - # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.0002f026 >> > error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.0002f026: No such >> > file or directory >> > - # rados stat -p sas3copies rb.0.14bfb5a.238e1f29. >> > error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.: No such >> > file or directory >> > - # ls -al >> > /var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9 >> > -rw-r--r-- 1 root root 4194304 oct. 8 2013 >> > /var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9 >> > >> > >> >> If the objects really are orphaned, teh way to clean them up is via 'rados >> >> -p rbd rm '. I'd like to get to the bottom of how they ended >> >> up that way first, though! >> > >> > I suppose the problem came from me, by doing CTRL+C while "rbd snap >> > purge $IMG". >> > "rados rm -p sas3copies rb.0.14bfb5a.238e1f29.0002f026" don't remove >> > thoses files, and just answer with a "No such file or directory". >> >> Those files are all for snapshots, which are removed by the osds >> asynchronously in a process called 'snap trimming'. 
There's no >> way to directly remove them via rados. >> >> Since you stopped 'rbd snap purge' partway through, it may >> have removed the reference to the snapshot before removing >> the snapshot itself. >> >> You can get a list of snapshot ids for the remaining objects >> via the 'rados listsnaps' command, and use >> rados_ioctx_selfmanaged_snap_remove() (no convenient wrapper >> unfortunately) on each of those snapshot ids to be sure they are all >> scheduled for asynchronous deletion. >> >> Josh
>
> Great : "rados listsnaps" see it :
> # rados listsnaps -p sas3copies rb.0.14bfb5a.238e1f29.0002f026
> rb.0.14bfb5a.238e1f29.0002f026:
> cloneid  snaps  size     overlap
> 41554    35746  4194304  []
>
> So, I have to write a wrapper to rados_ioctx_selfmanaged_snap_remove(), and find a way to obtain a list of all "orphan" objects ?
>
> I also tried to recreate the object (rados put) then remove it (rados rm), but the snapshots are still here.
>
> Olivier

Hi, there is certainly an issue with (at least) old FileStore and snapshot chunks: they become completely unreferenced (even for the listsnaps example above) but are still present in omap and on the filesystem after complete image and snapshot removal. Given that the control flow was never interrupted, i.e. the snap deletion command always exited successfully, as did the image removal itself, what can be done for those poor data chunks? This leakage over long runs (about eight months in the given case) can be quite problematic to handle, as the orphans consume almost as much space as the 'active' rest of the storage on the affected OSDs. Since the chunks are still referenced in omap for some reason, they must not be deleted directly, so my question narrows down to a possibly existing workaround for this. Thanks!

3.1b0_head$ find . -type f -name '*64ba14d3dd381*' -mtime +90
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__23116_25FB11B0__3
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__241e9_25FB11B0__3
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__2507f_25FB11B0__3
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__25dfd_25FB11B0__3
3.1b0_head$ find . -type f -name '*64ba14d3dd381*snap*'
./DIR_0/DIR_B/DIR_1/DIR_1/rbd\udata.64ba14d3dd381.00020dd7__snapdir_25FB11B0__3
./DIR_0/DIR_B/DIR_1/DIR_2/DIR_3/rbd\udata.64ba14d3dd381.00010eb3__snapdir_2B8321B0__3
./DIR_0/DIR_B/DIR_1/DIR_4/DIR_6/rbd\udata.64ba14d3dd381.0001c715__snapdir_F5D641B0__3
./DIR_0/DIR_B/DIR_1/DIR_4/DIR_9/rbd\udata.64ba14d3dd381.0002b694__snapdir_CC4941B0__3
./DIR_0/DIR_B/DIR_1/DIR_5/DIR_9/rbd\udata.64ba14d3dd381.0001b6f7__snapdir_08B951B0__3

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
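To sketch the first half of Josh's suggestion, a small parser for the plain-text `rados listsnaps` output shown above could collect the snapshot ids per clone, which would then be fed to rados_ioctx_selfmanaged_snap_remove() through a thin C wrapper (that function name comes from the librados C API mentioned in the thread; the parser itself is an illustrative assumption, not an existing tool):

```python
def parse_listsnaps(output):
    """Parse plain-text `rados listsnaps` output into {cloneid: [snap ids]}.

    Assumes the format shown above: an object-name header ending in ':',
    a column header, then one row per clone with a numeric cloneid,
    a comma-separated snaps column, size, and overlap.
    """
    clones = {}
    for line in output.splitlines():
        fields = line.split()
        # skip the object-name header and the column-header line
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        cloneid = int(fields[0])
        snaps = [int(s) for s in fields[1].split(',') if s.isdigit()]
        clones[cloneid] = snaps
    return clones

# sample taken from the listsnaps output quoted in the thread
sample = """rb.0.14bfb5a.238e1f29.0002f026:
cloneid snaps size overlap
41554 35746 4194304 []"""

ids = parse_listsnaps(sample)
```

Each collected snap id would then be handed to the C wrapper for scheduling asynchronous deletion.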
Re: [ceph-users] OSD on XFS ENOSPC at 84% data / 5% inode and inode64?
On Thu, Nov 26, 2015 at 1:29 AM, Laurent GUERBY wrote:
> Hi,
>
> After our trouble with ext4/xattr soft lockup kernel bug we started
> moving some of our OSD to XFS, we're using ubuntu 14.04 3.19 kernel
> and ceph 0.94.5.
>
> We have two out of 28 rotational OSD running XFS and
> they both get restarted regularly because they're terminating with
> "ENOSPC":
>
> 2015-11-25 16:51:08.015820 7f6135153700 0 filestore(/var/lib/ceph/osd/ceph-11) error (28) No space left on device not handled on operation 0xa0f4d520 (12849173.0.4, or op 4, counting from 0)
> 2015-11-25 16:51:08.015837 7f6135153700 0 filestore(/var/lib/ceph/osd/ceph-11) ENOSPC handling not implemented
> 2015-11-25 16:51:08.015838 7f6135153700 0 filestore(/var/lib/ceph/osd/ceph-11) transaction dump:
> ...
> {
>     "op_num": 4,
>     "op_name": "write",
>     "collection": "58.2d5_head",
>     "oid": "53e4fed5\/rbd_data.11f20f75aac8266.000a79eb\/head\/\/58",
>     "length": 73728,
>     "offset": 4120576,
>     "bufferlist length": 73728
> },
>
> (Writing the last 73728 bytes = 72 kbytes of 4 Mbytes if I'm reading
> this correctly)
>
> Mount options:
>
> /dev/sdb1 /var/lib/ceph/osd/ceph-11 xfs rw,noatime,attr2,inode64,noquota
>
> Space and Inodes:
>
> Filesystem  Type  1K-blocks   Used        Available  Use%  Mounted on
> /dev/sdb1   xfs   1947319356  1624460408  322858948  84%   /var/lib/ceph/osd/ceph-11
>
> Filesystem  Type  Inodes    IUsed    IFree     IUse%  Mounted on
> /dev/sdb1   xfs   48706752  1985587  46721165  5%    /var/lib/ceph/osd/ceph-11
>
> We're only using rbd devices, so max 4 MB/object write, how
> can we get ENOSPC for a 4MB operation with 322 GB free space?
>
> The most surprising thing is that after the automatic restart
> disk usage keeps increasing and we no longer get ENOSPC for a while.
>
> Did we miss a needed XFS mount option? Did other ceph users
> encounter this issue with XFS?
>
> We have no such issue with ~96% full ext4 OSD (after setting the right
> value for the various ceph "fill" options).
> Thanks in advance,
>
> Laurent

Hi, from the given numbers one can conclude that you are facing some kind of XFS preallocation bug: the average on-disk file size (used space divided by inode count) is four to five times smaller than the 4 MB rbd object size, yet the filesystem still hits ENOSPC. At a glance it could be avoided by specifying a relatively small allocsize= mount option, of course at some cost to overall performance; appropriate benchmarks can be found through ceph-users/ceph-devel. Also, do you plan to keep the overcommit ratio that high forever?
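For reference, the rough arithmetic behind the preallocation suspicion, using the df numbers quoted above (an illustrative back-of-envelope check, not a diagnostic):

```python
# df figures from the report, in 1 KiB blocks
used_kib = 1624460408   # space used on /dev/sdb1
files = 1985587         # inodes in use

# average space consumed per file on the OSD
avg_file_kib = used_kib / files        # ~818 KiB per file

# how much smaller that is than a full 4 MiB rbd object
ratio = (4 * 1024) / avg_file_kib      # ~5x
```

With files averaging well under 4 MiB, running out of space at 84% usage points at speculative allocation beyond what the files actually hold.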
Re: [ceph-users] XFS and nobarriers on Intel SSD
On Mon, Sep 7, 2015 at 12:54 PM, Paul Mansfieldwrote: > > > On 04/09/15 20:55, Richard Bade wrote: >> We have a Ceph pool that is entirely made up of Intel S3700/S3710 >> enterprise SSD's. >> >> We are seeing some significant I/O delays on the disks causing a “SCSI >> Task Abort” from the OS. This seems to be triggered by the drive >> receiving a “Synchronize cache command”. > > I've heard from other sources that the new Intel 3610 and 3710 have been > afflicted by a bug, possibly now fixed with new firmware, that might be > the cause of your problem. > The person who first reported it said that they upgraded from 3600 units > and never had a problem but started seeing issues with 3610 model. > > When they look at their log they see this > > Aug 9 21:50:39 cetacea kernel: [177609.957939] ata2.00: exception Emask > 0x0 SAct 0x6000 SErr 0x0 action 0x6 frozen > Aug 9 21:50:39 cetacea kernel: [177609.958480] ata2.00: failed command: > READ FPDMA QUEUED > Aug 9 21:50:39 cetacea kernel: [177609.958995] ata2.00: cmd > 60/00:68:00:08:db/04:00:0a:00:00/40 tag 13 ncq 524288 in > Aug 9 21:50:39 cetacea kernel: [177609.958995] res > 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) > Aug 9 21:50:39 cetacea kernel: [177609.960074] ata2.00: status: { DRDY } > Aug 9 21:50:39 cetacea kernel: [177609.960628] ata2.00: failed command: > READ FPDMA QUEUED > Aug 9 21:50:39 cetacea kernel: [177609.961198] ata2.00: cmd > 60/f0:70:00:0c:db/00:00:0a:00:00/40 tag 14 ncq 122880 in > Aug 9 21:50:39 cetacea kernel: [177609.961198] res > 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > Aug 9 21:50:39 cetacea kernel: [177609.962405] ata2.00: status: { DRDY } > Aug 9 21:50:39 cetacea kernel: [177609.963001] ata2: hard resetting link > Aug 9 21:50:40 cetacea kernel: [177610.281881] ata2: SATA link up 6.0 > Gbps (SStatus 133 SControl 300) > Aug 9 21:50:40 cetacea kernel: [177610.282865] ata2.00: configured for > UDMA/133 > Aug 9 21:50:40 cetacea kernel: [177610.282887] ata2.00: 
device reported > invalid CHS sector 0
> Aug 9 21:50:40 cetacea kernel: [177610.282890] ata2.00: device reported invalid CHS sector 0
> Aug 9 21:50:40 cetacea kernel: [177610.282896] ata2: EH complete

Intel had a mess with consecutive NCQ command handling [1] on the 3x10 series and recently issued a firmware fix [2]. LSI controllers apparently have a different bug, as people report bus resets which are different from the ones seen on the C602. The firmware release fixed the problem for me, as did the complete NCQ disablement mentioned in the thread below.

1. https://communities.intel.com/thread/77801
2. https://downloadcenter.intel.com/download/23931/Intel-Solid-State-Drive-Data-Center-Tool
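As a side note, the complete NCQ disablement mentioned above can be tried per device through sysfs without a reboot. A sketch (config fragment) with /dev/sda as a placeholder device; the setting does not persist across reboots:

```shell
# Check the current NCQ queue depth for the drive
cat /sys/block/sda/device/queue_depth

# Reduce it to 1, which effectively disables NCQ on that device
echo 1 | sudo tee /sys/block/sda/device/queue_depth
```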
Re: [ceph-users] Slow requests during ceph osd boot
On Wed, Jul 15, 2015 at 12:15 PM, Jan Schermer j...@schermer.cz wrote: We have the same problems, we need to start the OSDs slowly. The problem seems to be CPU congestion. A booting OSD will use all available CPU power you give it, and if it doesn’t have enough nasty stuff happens (this might actually be the manifestation of some kind of problem in our setup as well). It doesn’t do that always - I was restarting our hosts this weekend and most of them came up fine with simple “service ceph start”, some just sat there spinning the CPU and not doing any real world (and the cluster was not very happy about that). Jan On 15 Jul 2015, at 10:53, Kostis Fardelas dante1...@gmail.com wrote: Hello, after some trial and error we concluded that if we start the 6 stopped OSD daemons with a delay of 1 minute, we do not experience slow requests (threshold is set on 30 sec), althrough there are some ops that last up to 10s which is already high enough. I assume that if we spread the delay more, the slow requests will vanish. The possibility of not having tuned our setup to the most finest detail is not zeroed out but I wonder if at any way we miss some ceph tuning in terms of ceph configuration. We run firefly latest stable version. Regards, Kostis On 13 July 2015 at 13:28, Kostis Fardelas dante1...@gmail.com wrote: Hello, after rebooting a ceph node and the OSDs starting booting and joining the cluster, we experience slow requests that get resolved immediately after cluster recovers. It is improtant to note that before the node reboot, we set noout flag in order to prevent recovery - so there are only degraded PGs when OSDs shut down- and let the cluster handle the OSDs down/up in the lightest way. Is there any tunable we should consider in order to avoid service degradation for our ceph clients? 
Regards, Kostis

As far as I've seen this problem, the main issue for regular disk-backed OSDs is IOPS starvation during some interval after reading maps from the filestore and marking itself 'in': even if in-memory caches are still hot, I/O will degrade significantly for a short period. A possible workaround for an otherwise healthy cluster and a node-wide restart is to set the norecover flag; it greatly reduces the chance of hitting slow operations. Of course this is applicable only to a non-empty cluster with tens of percent of average utilization on rotating media. I first pointed out this issue a couple of years ago (it *does* break the 30s I/O SLA for a returning OSD, whereas refilling the same OSD from scratch would not violate that SLA, at the cost of a far longer completion time for the refill). From the UX side, it would be great to introduce some kind of recovery throttler for newly started OSDs, as recovery_delay_start does not prevent immediate recovery procedures.
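The norecover workaround described above could look like the following operator sequence (a sketch of standard Ceph CLI flags, to be adapted to your restart procedure):

```shell
# Before taking the node down: avoid rebalancing and hold off recovery
ceph osd set noout
ceph osd set norecover

# ... reboot the node, wait for all its OSDs to boot and rejoin ...

# Once the OSDs are back in and serving I/O, let recovery proceed
ceph osd unset norecover
ceph osd unset noout
```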
Re: [ceph-users] Unexpected issues with simulated 'rack' outage
The question is: is this behavior indeed expected? The answer may well be yes if you are using a large number of placement groups, and 16k is indeed a large number. Peering may take a long time, effectively blocking I/O requests during that period. Do you have a ceph -w log of this transition to share?
Re: [ceph-users] Unexpected issues with simulated 'rack' outage
http://pastebin.com/HfUPDTK4

Yes, you are experiencing I/O issues because of slow peering. You may put the monstores behind faster storage if they are served from rotating disks right now, or greatly decrease the number of placement groups if possible: with 100 OSDs I would try something like 4096 or 8192, though it may impact data placement flatness. I've seen a couple of off-list reports where slow peering on a large number of placement groups caused persistent problems; for example, when a user added new OSDs in the middle of a slow-going peering process, it stood still forever. If none of those suggestions helps, please feel free to report this problem to the bug tracker; possibly it would bump a very nice blueprint initiative for reducing overall peering time (https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Faster_Peering).
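A quick way to see why 16k placement groups is heavy for ~100 OSDs is the per-OSD PG load, assuming size-3 replication (the numbers are illustrative rule-of-thumb arithmetic):

```python
def pgs_per_osd(pg_num, size, osds):
    """Average number of PG instances each OSD must carry and peer."""
    return pg_num * size / osds

# roughly the situation from the thread: ~16k PGs, size 3, ~100 OSDs
current = pgs_per_osd(16384, 3, 100)   # ~491 PGs per OSD
smaller = pgs_per_osd(4096, 3, 100)    # ~123
medium = pgs_per_osd(8192, 3, 100)     # ~246
```

The commonly cited comfortable range is on the order of 100-200 PGs per OSD, which is why 4096 or 8192 look more reasonable here.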
Re: [ceph-users] clock skew detected
On Wed, Jun 10, 2015 at 4:11 PM, Pavel V. Kaygorodov pa...@inasan.ru wrote: Hi! Immediately after a reboot of the mon.3 host its clock was unsynchronized and a "clock skew detected on mon.3" warning appeared. But now (more than 1 hour of uptime) the clock is synced, yet the warning is still showing. Is this ok? Or do I have to restart the monitor after clock synchronization? Pavel.

The quorum should report OK after a five-minute interval, but there is a bug preventing the quorum from doing so, at least on the oldest supported stable versions of Ceph. I've never reported it because of its almost zero importance, but things are what they are: the theoretical behavior should be different, and the warning should disappear without a restart.
Re: [ceph-users] rbd cache + libvirt
On Tue, Jun 9, 2015 at 7:59 AM, Alexandre DERUMIER aderum...@odiso.com wrote: host conf : rbd_cache=true : guest cache=none : result : cache (wrong)

Thanks Alexandre, so you are confirming that this exact case misbehaves?
Re: [ceph-users] rbd cache + libvirt
On Tue, Jun 9, 2015 at 11:51 AM, Alexandre DERUMIER aderum...@odiso.com wrote: Thanks Alexandre, so you are confirming that this exact case misbehaves? The rbd_cache value from ceph.conf always overrides the cache value from qemu. My personal opinion is that this is wrong: the qemu value should override the ceph.conf value. I don't know what happens in a live migration, for example, if rbd_cache in ceph.conf is different on the source and target host?

Yes, you are right. The destination process in a live migration behaves as an independently launched copy; it does not inherit those kinds of parameters from the source emulator.
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 10:43 PM, Josh Durgin jdur...@redhat.com wrote: On 06/08/2015 11:19 AM, Alexandre DERUMIER wrote: Hi, looking at the latest version of QEMU, it seems that it has already been this behaviour since the addition of rbd_cache parsing in rbd.c by Josh in 2012 http://git.qemu.org/?p=qemu.git;a=blobdiff;f=block/rbd.c;h=eebc3344620058322bb53ba8376af4a82388d277;hp=1280d66d3ca73e552642d7a60743a0e2ce05f664;hb=b11f38fcdf837c6ba1d4287b1c685eb3ae5351a8;hpb=166acf546f476d3594a1c1746dc265f1984c5c85 I'll do tests on my side tomorrow to be sure. It seems like we should switch the order so ceph.conf is overridden by qemu's cache settings. I don't remember a good reason to have it the other way around. Josh

Erm, doesn't this code *already* represent the right priorities? The cache=none setting should set BDRV_O_NOCACHE, which effectively disables the cache in the mentioned snippet.
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 6:50 PM, Jason Dillaman dilla...@redhat.com wrote: Hmm ... looking at the latest version of QEMU, it appears that the RBD cache settings are changed prior to reading the configuration file instead of overriding the value after the configuration file has been read [1]. Try specifying the path to a new configuration file via the conf=/path/to/my/new/ceph.conf QEMU parameter where the RBD cache is explicitly disabled [2]. [1] http://git.qemu.org/?p=qemu.git;a=blob;f=block/rbd.c;h=fbe87e035b12aab2e96093922a83a3545738b68f;hb=HEAD#l478 [2] http://ceph.com/docs/master/rbd/qemu-rbd/#usage

Actually the mentioned snippet presumes the *expected* behavior, with cache=xxx driving the overall cache behavior. Probably the pass itself (from cache=none to the proper bitmask values in the backend properties) is broken in some way. CCing qemu-devel for a possible bug.
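Jason's workaround from [2] could look roughly like this; the file name and pool/image are placeholders, and the conf= drive parameter is the one documented on the qemu-rbd usage page he links:

```shell
# A separate ceph.conf with the RBD cache explicitly disabled
cat > /etc/ceph/ceph-nocache.conf <<'EOF'
[client]
rbd cache = false
EOF

# Point QEMU at it for the one disk that must run uncached
qemu-system-x86_64 \
  -drive file=rbd:libvirt-pool/www-pa2-webmutu:conf=/etc/ceph/ceph-nocache.conf,format=raw,cache=none,if=virtio
```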
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 1:24 PM, Arnaud Virlet avir...@easter-eggs.com wrote:

Hi, actually we use libvirt VMs with a ceph rbd pool for storage. By default we want to have disk cache=writeback for all disks in libvirt. In /etc/ceph/ceph.conf we have rbd cache = true, and in each VM's XML we set cache=writeback for all disks in the VM configuration. We want to use one ocfs2 volume on our rbd pool. For this volume we want to set cache=none. When we set cache=none in the libvirt template for these hosts, it doesn't work.

Can you please describe this more specifically? Does libvirt produce a launch string with cache=none, or are you measuring the cache (mis)presence in some other way? The rbd_cache setting and cache=xxx for qemu should show conjugate behavior.

The only way that works is when we set rbd cache = false in /etc/ceph/ceph.conf. How can I set cache=none just for one volume specifically without modifying the default settings in ceph.conf?

Regards, Arnaud
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 2:48 PM, Arnaud Virlet avir...@easter-eggs.com wrote:

Thanks for your reply,

On 06/08/2015 12:31 PM, Andrey Korolyov wrote: On Mon, Jun 8, 2015 at 1:24 PM, Arnaud Virlet avir...@easter-eggs.com wrote: Hi, actually we use libvirt VMs with a ceph rbd pool for storage. By default we want to have disk cache=writeback for all disks in libvirt. In /etc/ceph/ceph.conf we have rbd cache = true, and in each VM's XML we set cache=writeback for all disks in the VM configuration. We want to use one ocfs2 volume on our rbd pool. For this volume we want to set cache=none. When we set cache=none in the libvirt template for these hosts, it doesn't work. Can you please describe this more specifically? Does libvirt produce a launch string with cache=none, or are you measuring the cache (mis)presence in some other way? The rbd_cache setting and cache=xxx for qemu should show conjugate behavior.

With rbd_cache = true in ceph.conf and cache = none in the XML, libvirt produces a launch string with cache=none for the ocfs2 volume.

Do I understand you right that you are using a certain template engine for both OCFS- and RBD-backed volumes within a single VM config, and that it does not allow per-disk cache mode separation in the suggested way?

The only way that works is when we set rbd cache = false in /etc/ceph/ceph.conf. How can I set cache=none just for one volume specifically without modifying the default settings in ceph.conf?

Regards, Arnaud
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 3:44 PM, Arnaud Virlet avir...@easter-eggs.com wrote:

On 06/08/2015 01:59 PM, Andrey Korolyov wrote: Do I understand you right that you are using a certain template engine for both OCFS- and RBD-backed volumes within a single VM config, and that it does not allow per-disk cache mode separation in the suggested way?

My VM has 3 disks on an RBD backend. Disks 1 and 2 have cache=writeback, disk 3 (for ocfs2) has cache=none in my VM XML file. When I start the VM, libvirt produces a launch string with cache=writeback for disks 1/2, and with cache=none for disk 3. Even with cache = none for disk 3, it seems it doesn't take effect without setting rbd cache = false in ceph.conf.

It is very strange and contradicts what should happen. Could you post the resulting qemu argument string, by any chance? Also, please share the method you are using to determine whether a disk uses the emulator cache or not.
Re: [ceph-users] rbd cache + libvirt
On Mon, Jun 8, 2015 at 6:36 PM, Arnaud Virlet avir...@easter-eggs.com wrote: On 06/08/2015 03:17 PM, Andrey Korolyov wrote: On Mon, Jun 8, 2015 at 3:44 PM, Arnaud Virlet avir...@easter-eggs.com wrote: On 06/08/2015 01:59 PM, Andrey Korolyov wrote: Am I understand you right that you are using certain template engine for both OCFS- and RBD-backed volumes within a single VM` config and it does not allow per-disk cache mode separation in a suggested way? My VM has 3 disks on RBD backend. disks 1 and 2 have cache=writeback, disk 3 ( for ocfs2 ) has cache=none in my VM XML file. When I start the VM, libvirt produce a launch string with cache=wtriteback for disk 1/2, and with cache=none for disk 3. Even if cache = none for disk 3, it seems doesn't take effect without set rbd cache = false in ceph.conf. It is very strange and contradictive to what it should be. Could you post a resulting qemu argument string, by a chance? Also please share a method which you are using to determine if disk uses emulator cache or not. 
Here my qemu arguments strings for the related VM: /usr/bin/qemu-system-x86_64 -name www-pa2-01 -S -machine pc-i440fx-1.6,accel=kvm,usb=off -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 3542c57c-dd47-44cd-933f-7dae0b949012 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/www-pa2-01.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=c,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=rbd:libvirt-pool/www-pa2-01:id=libvirt:key=:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=rbd:libvirt-pool/www-pa2-01-data:id=libvirt:key=XXX:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,if=none,id=drive-virtio-disk1,format=raw,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=rbd:libvirt-pool/www-pa2-webmutu:id=libvirt:key=XXX:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,if=none,id=drive-virtio-disk2,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk2,id=virtio-disk2 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:90:53:6b,bus=pci.0,addr=0x3 -netdev tap,fd=35,id=hostnet1,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:7b:b9:85,bus=pci.0,addr=0x8 -netdev tap,fd=37,id=hostnet2,vhost=on,vhostfd=38 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:2e:ce:f6,bus=pci.0,addr=0xa -chardev 
pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:5 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

For the disk without cache: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=rbd:libvirt-pool/www-pa2-webmutu:id=libvirt:key=XXX:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,if=none,id=drive-virtio-disk2,format=raw,cache=none

I don't really have a method to determine whether a disk uses the emulator's cache or not. When I was testing whether my ocfs2 cluster works correctly, I realized that with rbd cache = true in ceph.conf and cache=none in the XML file, my ocfs2 cluster doesn't work: cluster members don't see when members join or leave the cluster. But with rbd cache = false in ceph.conf and cache = none in the XML, the OCFS2 cluster works; cluster members see the other members when they join or leave.

Thanks, can you please also add a description of how your OCFS2 cluster is set up? Honestly, there is no chance of intersecting with any kind of software bug here, because of the different entities you are using (userspace storage emulation versus a kernel filesystem).
[ceph-users] Snap operation throttling (again)
Hello, this question has been brought up many times before and has been solved in various ways (snap trimmer, scheduler priorities, and a persistent fix for a ReplicatedPG issue), but it seems that current Ceph versions may still suffer during rollback operations on large images and at large scale. Given the CFQ scheduler for rotating media and ~10 percent utilization as initial preconditions, the rollback of a quarter-terabyte image over 100 OSDs may result in a significant latency impact and, for such a configuration, break the 30s request completion barrier. Although recent improvements did very well in terms of congestion control on slow media for many kinds of non-client ops, this exact issue remains. I think it can be solved by another sleeper knob, but I am unsure where its proper place should be. Thanks for suggestions!
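For what it's worth, one existing knob in this family is the OSD snap trim sleep; a ceph.conf fragment such as the following (the value is illustrative) injects a delay between trim operations on each OSD, though it does not specifically cover the rollback path discussed here:

```ini
[osd]
; seconds to sleep between snap trimming operations;
; larger values trade trim throughput for client latency
osd snap trim sleep = 0.1
```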
Re: [ceph-users] Preliminary RDMA vs TCP numbers
On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy somnath@sandisk.com wrote: Hi, Please find the preliminary performance numbers of TCP vs RDMA (XIO) implementation (on top of SSDs) in the following link. http://www.slideshare.net/somnathroy7568/ceph-on-rdma The attachment didn't go through it seems, so I had to use slideshare. Mark, if we have time, I can present it in tomorrow's performance meeting. Thanks & Regards, Somnath

Those numbers are really impressive (for small block sizes at least)! What TCP settings are you using? For example, the difference can shrink at scale due to less intensive per-connection acceleration with CUBIC on a larger number of nodes, though I do not believe that was the main reason for the observed TCP catch-up on a relatively flat workload such as fio generates.
[ceph-users] Rebalance after empty bucket addition
Hello, after reaching a certain ceiling of the host/PG ratio, moving an empty bucket in causes a small rebalance:

ceph osd crush add-bucket 10.10.2.13
ceph osd crush move 10.10.2.13 root=default rack=unknownrack

I have two pools: one is very large and keeps a proper amount of PGs per OSD, but the other in fact contains fewer PGs than the number of active OSDs, and after insertion of the empty bucket it goes into a rebalance, even though the actual placement map has not changed. Keeping in mind that this case is very far from any sane production configuration, is this expected behavior? Thanks!
Re: [ceph-users] New Intel 750 PCIe SSD
On Thu, Apr 2, 2015 at 8:03 PM, Mark Nelson mnel...@redhat.com wrote: Thought folks might like to see this: http://hothardware.com/reviews/intel-ssd-750-series-nvme-pci-express-solid-state-drive-review

Quick summary:
- PCIe SSD based on the P3700
- 400GB for $389!
- 1.2GB/s writes and 2.4GB/s reads
- power loss protection
- 219TB write endurance

So basically looks extremely attractive on paper except for the write endurance. I suspect this is not actually using HET cells (a summary I read said it was). How far beyond Intel's endurance rating the card can go is the big question. Mark

All the characteristics are awesome except endurance: it would burn out in a couple of months, and replacement requires a node poweroff...
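A back-of-envelope check of the endurance concern, using the figures from the summary above; the sustained average load is an assumption for illustration, not a measurement:

```python
# Figures from the review summary above
rated_write_tb = 219      # rated write endurance, TB
full_speed_gbs = 1.2      # sequential write speed, GB/s

# Assumed steady journal-style average load (illustrative)
avg_load_gbs = 0.1        # 100 MB/s

# Lifetime if writing flat out at the rated sequential speed
hours_flat_out = rated_write_tb * 1024 / full_speed_gbs / 3600   # ~52 hours

# Lifetime at the assumed ~100 MB/s average
days_at_avg = rated_write_tb * 1024 / avg_load_gbs / 3600 / 24   # ~26 days
```

Even at a quarter of that assumed load the card lasts only a few months, which is where the "couple of months" estimate comes from.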
Re: [ceph-users] OSD slow requests causing disk aborts in KVM
On Fri, Feb 6, 2015 at 12:16 PM, Krzysztof Nowicki krzysztof.a.nowi...@gmail.com wrote: Hi all, I'm running a small Ceph cluster with 4 OSD nodes, which serves as a storage backend for a set of KVM virtual machines. The VMs use RBD for disk storage. On the VM side I'm using virtio-scsi instead of virtio-blk in order to gain DISCARD support. Each OSD node is running on a separate machine, using a 3TB WD Black drive + Samsung SSD for journal. The machines used for OSD nodes are not equal in spec. Three of them are small servers, while one is a desktop PC. The last node is the one causing trouble. During high loads caused by remapping due to one of the other nodes going down I've experienced some slow requests. To my surprise however these slow requests caused aborts from the block device on the VM side, which ended up corrupting files. What I wonder is whether such behaviour (aborts) is normal in case slow requests pile up. I always thought that these requests would be delayed but eventually they'd be handled. Are there any tunables that would help me avoid such situations? I would really like to avoid VM outages caused by such corruption issues. I can attach some logs if needed. Best regards, Chris

Hi, this is the inevitable payoff for using a SCSI backend on storage capable of sufficiently slow operations. There were some argonaut/bobtail-era discussions on the ceph ML; maybe those readings will be interesting for you. AFAIR the SCSI disk would abort after ~70s of not receiving an ack for a pending operation.
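If the aborts rather than the delays are the main pain point, the guest-side SCSI command timeout can be raised so that slow requests are retried instead of aborting the disk. A sketch (config fragment), with sda as a placeholder for the affected device inside the VM:

```shell
# Current SCSI command timeout in seconds (commonly 30 on Linux guests)
cat /sys/block/sda/device/timeout

# Raise it so temporary cluster slowness does not trigger task aborts;
# does not persist across reboots (a udev rule would be needed for that)
echo 120 | sudo tee /sys/block/sda/device/timeout
```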
Re: [ceph-users] Monitor Restart triggers half of our OSDs marked down
Yep, it's a silly bug and I'm surprised we haven't noticed until now! http://tracker.ceph.com/issues/10762 https://github.com/ceph/ceph/pull/3631 Thanks! sage Thanks Sage, is dumpling missing from the backport list on purpose? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd resize (shrink) taking forever and a day
On Sun, Jan 4, 2015 at 10:43 PM, Edwin Peer ep...@isoho.st wrote: Thanks Jake, however, I need to shrink the image, not delete it as it contains a live customer image. Is it possible to manually edit the RBD header to make the necessary adjustment? Technically speaking, yes - the size information is contained in the omap attributes of the header, but I'd recommend testing this approach somewhere else first. Even if it works, it will leave a lot of stray files in the filestore. If you are running a VM on top of it with a recent qemu, it's easy to tell the emulator the desired block geometry (size), then launch a drive-mirror job and eventually switch the backing image (during a power cycle, for example). Although the second approach may work, I am not absolutely sure that the drive-mirror job will respect the geometry override, so better to check this too. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
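For illustration only, a way to locate and dump the header object's omap before attempting anything (pool and image names are placeholders; this assumes a format 2 image, and actually rewriting the size key by hand is exactly the risky part being cautioned against):

```shell
# Locate an RBD format-2 header object and dump its omap (names are placeholders).
POOL=rbd
IMAGE=myimage
# block_name_prefix from 'rbd info' is rbd_data.<id>; the header is rbd_header.<id>.
HDR="rbd_header.$(rbd info ${POOL}/${IMAGE} 2>/dev/null \
    | awk -F'rbd_data.' '/block_name_prefix/{print $2}')"
echo "header object: ${HDR}"
# rados -p "$POOL" listomapvals "$HDR"   # shows the binary 'size' key, among others
```

Dumping is harmless; writing the key back with `rados setomapval` is the part that needs to be rehearsed elsewhere first, as the reply says.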
Re: [ceph-users] Is there an negative relationship between storage utilization and ceph performance?
Hi, you can reduce the reserved space for ext4 via tune2fs and gain a little more space, up to 5%. By the way, if you are using Centos7, it reserves a ridiculously high disk percentage for ext4 (at least during installation). Performance should probably be compared with a smaller allocsize mount option for XFS (32..512k), and the comparison should be made over long runs (like weeks of small writes from a bunch of clients, to reach a bad enough fragmentation ratio). When I last measured comparable results, it was the bobtail era, and XFS started to lose operation speed at about 40% real allocation. If you prefer to use rados bench, real-life fragmentation may be achieved by running multiple benches with a small block size simultaneously in different pools over the same set of OSDs. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
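The tune2fs knob mentioned above, as a sketch (the device path is a placeholder; -m sets the reserved-blocks percentage, 5% being the mkfs default):

```shell
# Drop the ext4 reserved-blocks percentage from the default 5% to 1%
# on an OSD data partition (device path is a placeholder).
DEV=/dev/sdb1
RESERVED_PCT=1
# tune2fs -m ${RESERVED_PCT} ${DEV}
# tune2fs -l ${DEV} | grep -i 'reserved block count'   # verify afterwards
echo "would set reserved blocks to ${RESERVED_PCT}% on ${DEV}"
```

Setting it to 0 is also possible, but leaving 1% keeps a little slack for root-owned recovery operations.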
Re: [ceph-users] Marking a OSD a new in the OSDMap
On Wed, Dec 31, 2014 at 8:20 PM, Wido den Hollander w...@42on.com wrote: On 12/31/2014 05:54 PM, Andrey Korolyov wrote: On Wed, Dec 31, 2014 at 7:34 PM, Wido den Hollander w...@42on.com wrote: Hi, Is there a way to set a OSD to exists,new in the OSDMap? I want to re-install a OSD and re-use the ID and it's key. I know I can remove the OSD and re-add it, but that triggers balancing and I want to prevent that by simply marking the OSD a new and booting it with a freshly formatted XFS filesystem. Yes, you can call mkfs with existing UUID: --osd-uuid xxx. Ah, but will that also mark the OSD as new in the OSDMap? Will it receive all old maps? Yes, technically I do not understand why most replacement guides are going through out-in plus optional rm-add procedure, it works perfectly with freshly formatted filestore using existing uuid and key. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
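A sketch of the replace-in-place procedure under discussion (the OSD id, device path and uuid-extraction one-liner are assumptions; verify where your release prints the uuid in `ceph osd dump` before relying on the awk):

```shell
# Re-provision an OSD on a fresh filesystem while keeping its id, uuid and key,
# so the OSDMap entry is reused and no rebalancing is triggered.
ID=12                                                # assumed OSD id being replaced
UUID=$(ceph osd dump 2>/dev/null | awk -v o="osd.${ID}" '$1==o{print $NF}')
UUID=${UUID:-00000000-0000-0000-0000-000000000000}   # placeholder when run off-cluster
# mkfs.xfs -f /dev/sdX1                              # fresh filestore (placeholder device)
# mount /dev/sdX1 /var/lib/ceph/osd/ceph-${ID}
# ceph-osd -i ${ID} --mkfs --osd-uuid "${UUID}"      # same uuid => same OSDMap identity
# Reuse the existing key (ceph auth get osd.${ID}) so the daemon can authenticate.
echo "would recreate osd.${ID} with uuid ${UUID}"
```

The OSD will then boot, receive the maps it missed, and backfill its own PGs rather than triggering a cluster-wide reshuffle.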
Re: [ceph-users] Improving Performance with more OSD's?
On Mon, Dec 29, 2014 at 12:47 PM, Tomasz Kuzemko tomasz.kuze...@ovh.net wrote: On Sun, Dec 28, 2014 at 02:49:08PM +0900, Christian Balzer wrote: You really, really want size 3 and a third node for both performance (reads) and redundancy. How does it benefit read performance? I thought all reads are made only from the active primary OSD. -- Tomasz Kuzemko tomasz.kuze...@ovh.net You`ll have chunks of primary data scattered between three devices instead of two, as each pg will have a random acting set (until you decide to pin primary). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] xfs/nobarrier
A power supply means bigger capex and less redundancy, as the emergency procedure in case of power failure is less deterministic than with a controlled battery-backed cache. A cache battery is smaller and way more predictable for health measurement than a UPS (if it passes the internal check, it will *always* be enough to keep the memory powered for a while, but a UPS requires periodic battle testing if you want to know that it will still be able to ride out a power failure - with two power lanes it should be safe enough - simply because the device itself has a more complex structure than a battery with a single voltage stabilizer). Anyway, in my experience XFS nobarrier does not bring enough of a performance boost to be worth enabling. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] xfs/nobarrier
On Sat, Dec 27, 2014 at 4:31 PM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: On Sat, 27 Dec 2014 04:59:51 PM you wrote: Power supply means bigger capex and less redundancy, as the emergency procedure in case of power failure is less deterministic than with controlled battery-backed cache. Yes, the whole auto shut-down procedure is rather more complex and fragile for a UPS than a controller cache Anyway XFS nobarrier does not bring enough performance boost to be enabled by my experience. It makes a non-trivial difference on my (admittedly slow) setup, with write bandwidth going from 35 MB/s to 51 MB/s Are you able to separate log with data in your setup and check the difference? If your devices are working strictly under their upper limits for bw/IOPS, separating meta and data bytes may help a lot, at least for synchronous clients. So, depending on type of your benchmark (sync/async/IOPS-/bandwidth-hungry) you may win something just for crossing journal and data between disks (and increase failure domain for a single disk as well :) ). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
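For reference, separating the XFS log from the data is a mkfs/mount-time option; a sketch with placeholder devices:

```shell
# Put the XFS metadata log on an SSD partition, data on the spinner.
DATA_DEV=/dev/sdb1      # placeholder: HDD data partition
LOG_DEV=/dev/sdc1       # placeholder: SSD log partition
# mkfs.xfs -f -l logdev=${LOG_DEV},size=128m ${DATA_DEV}
# mount -o logdev=${LOG_DEV} ${DATA_DEV} /var/lib/ceph/osd/ceph-0
echo "data=${DATA_DEV} log=${LOG_DEV}"
```

Note the filesystem then depends on both devices: losing the log device makes the filesystem unmountable, which is the failure-domain trade-off mentioned below.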
Re: [ceph-users] xfs/nobarrier
On Sun, Dec 28, 2014 at 1:25 AM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: On Sat, 27 Dec 2014 06:02:32 PM you wrote: Are you able to separate log with data in your setup and check the difference? Do you mean putting the OSD journal on a separate disk? I have the journals on SSD partitions, which has helped a lot, previously I was getting 13 MB/s No, I meant the XFS journal, as we are speaking about filestore fs performance. It's not a good SSD - Samsung 840 EVO :( one of my plans for the new year is to get SSDs with better seq write speed and IOPS I've been trying to figure out if adding more OSD's will improve my performance, I only have 2 OSD's (one per node) Erm, yes. Two OSDs cannot be considered even for a performance-measurement testbed setup, and neither can three or any other very small number. This explains the numbers you are getting and the impact of the nobarrier option. So, depending on type of your benchmark (sync/async/IOPS-/bandwidth-hungry) you may win something just for crossing journal and data between disks (and increase failure domain for a single disk as well ). One does tend to focus on raw sequential read/writes for benchmarking, but my actual usage is solely for hosting KVM images, so really random R/W is probably more important. Ok, then my suggestion may not help as much as it could. -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird scrub problem
On Sat, Dec 27, 2014 at 4:09 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Dec 23, 2014 at 4:17 AM, Samuel Just sam.j...@inktank.com wrote: Oh, that's a bit less interesting. The bug might be still around though. -Sam On Mon, Dec 22, 2014 at 2:50 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote: You'll have to reproduce with logs on all three nodes. I suggest you open a high priority bug and attach the logs. debug osd = 20 debug filestore = 20 debug ms = 1 I'll be out for the holidays, but I should be able to look at it when I get back. -Sam Thanks Sam, although I am not sure if it makes not only a historical interest (the mentioned cluster running cuttlefish), I`ll try to collect logs for scrub. Same stuff: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15447.html https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14918.html Looks like issue is still with us, though it requires meta or file structure corruption to show itself. I`ll check if it can be reproduced via rsync -X sec pg subdir - pri pg subdir or vice-versa. Mine case shows slightly different pathnames for same objects with same checksums, may be a root reason then. As every case mentioned, including mine, happened in oh-shit-hardware-is-broken case, I suggest that the incurable corruption happens during primary backfill from active replica at the recovery time. Recovery/backfill from corrupted primary copy results to crash (attached) of primary OSD, for example it can be triggered by purging one of secondary copies (top of cuttlefish branch for line numbers). Although as secondaries preserve same data with same checksums, it is possible to destroy both meta record and pg directory and refill primary back. 
The interesting point is that the corrupted primary was completely refilled after the hardware failure, but it looks like it survived long enough after the failure event to spread the corruption to the copies; I simply can not imagine a better explanation.

Thread 1 (Thread 0x7f193190d700 (LWP 64087)):
#0  0x7f194a47ab7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00857d59 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  signal handler called
#4  0x7f1948879405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7f194887cb5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7f194917789d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7f1949175996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7f19491759c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7f1949175bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0090436a in ceph::__ceph_assert_fail (assertion=0x9caf67 "r >= 0", file=<optimized out>, line=7115, func=0x9d1900 "void ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)") at common/assert.cc:77
#11 0x0065de69 in ReplicatedPG::scan_range (this=this@entry=0x4df6000, begin=..., min=min@entry=32, max=max@entry=64, bi=bi@entry=0x4df6d40) at osd/ReplicatedPG.cc:7115
#12 0x0066f5c6 in ReplicatedPG::recover_backfill (this=this@entry=0x4df6000, max=max@entry=1) at osd/ReplicatedPG.cc:6923
#13 0x0067c18d in ReplicatedPG::start_recovery_ops (this=0x4df6000, max=1, prctx=<optimized out>) at osd/ReplicatedPG.cc:6561
#14 0x006f2340 in OSD::do_recovery (this=0x2ba7000, pg=pg@entry=0x4df6000) at osd/OSD.cc:6104
#15 0x00735361 in OSD::RecoveryWQ::_process (this=<optimized out>, pg=0x4df6000) at osd/OSD.h:1248
#16 0x008faeba in ThreadPool::worker (this=0x2ba75e0, wt=0x7be1540) at common/WorkQueue.cc:119
#17 0x008fc160 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:316
#18 0x7f194a472e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x7f19489353dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#20 0x0000000000000000 in ?? ()
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Weird scrub problem
Hello, I am currently facing some strange problem, most probably a bug (osd.3 is the primary holder):

ceph pg scrub 4.458
2014-12-22 19:19:00.238077 osd.3 [ERR] 4.458 osd.34 missing 6f0df458/rbd_data.cbba8a759d2.0a5b/head//4
2014-12-22 19:19:00.238079 osd.3 [ERR] 4.458 osd.10 missing 6f0df458/rbd_data.cbba8a759d2.0a5b/head//4

The checksum is exactly the same on every OSD included in this complaint, and so is the file count:

find 4.458_* -name \*cbba8a759d2* -exec md5sum {} \; | sort
\106f7a2b6c0d71d52031d2bd92ea9111  4.458_head/DIR_8/DIR_5/DIR_4/DIR_2/rbd\\udata.cbba8a759d2.00db__head_73442458__4
\2db655db8c7b0a1cccbec79bbc9fc923  4.458_head/DIR_8/DIR_5/DIR_4/DIR_F/rbd\\udata.cbba8a759d2.0a5b__head_6F0DF458__4
\6505c6c8ccb3c103618c5ef0a22d3414  4.458_head/DIR_8/DIR_5/DIR_4/DIR_C/rbd\\udata.cbba8a759d2.14ec__head_B37BC458__4

Extended attributes look relatively fine, or at least not suspicious. Automatic repair fails on these - what should I try next? Unfortunately I cannot go through export+rm+import on most of those images just to delete the appropriate prefix. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird scrub problem
On Mon, Dec 22, 2014 at 11:50 PM, Samuel Just sam.j...@inktank.com wrote: So 4.458_head/DIR_8/DIR_5/DIR_4/DIR_F/rbd\\udata.cbba8a759d2.0a5b__head_6F0DF458__4 is present on osd 3, osd 34, and osd 10? -Sam Yes, exactly, and have same checksum. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird scrub problem
On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote: You'll have to reproduce with logs on all three nodes. I suggest you open a high priority bug and attach the logs. debug osd = 20 debug filestore = 20 debug ms = 1 I'll be out for the holidays, but I should be able to look at it when I get back. -Sam Thanks Sam; although I am not sure whether this is of more than historical interest (the mentioned cluster is running cuttlefish), I'll try to collect scrub logs. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD commits suicide
On Tue, Nov 18, 2014 at 10:04 PM, Craig Lewis cle...@centraldesktop.com wrote: That would probably have helped. The XFS deadlocks would only occur when there was relatively little free memory. Kernel 3.18 is supposed to have a fix for that, but I haven't tried it yet. Looking at my actual usage, I don't even need 64k inodes. 64k inodes should make things a bit faster when you have a large number of files in a directory. Ceph will automatically split directories with too many files into multiple sub-directories, so it's kinda pointless. I may try the experiment again, but probably not. It took several weeks to reformat all of the OSDS. Even on a single node, it takes 4-5 days to drain, format, and backfill. That was months ago, and I'm still dealing with the side effects. I'm not eager to try again. On Mon, Nov 17, 2014 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote: On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis cle...@centraldesktop.com wrote: I did have a problem in my secondary cluster that sounds similar to yours. I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs options xfs = -i size=64k). This showed up with a lot of XFS: possible memory allocation deadlock in kmem_alloc in the kernel logs. I was able to keep things limping along by flushing the cache frequently, but I eventually re-formatted every OSD to get rid of the 64k inodes. After I finished the reformat, I had problems because of deep-scrubbing. While reformatting, I disabled deep-scrubbing. Once I re-enabled it, Ceph wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would be doing a deep-scrub. I'm manually deep-scrubbing now, trying to spread out the schedule a bit. Once this finishes in a few day, I should be able to re-enable deep-scrubbing and keep my HEALTH_OK. Would you mind to check suggestions by following mine hints or hints from mentioned URLs from there http://marc.info/?l=linux-mmm=141607712831090w=2 with 64k again? 
As for me, I am not observing the lock loop after setting min_free_kbytes to half a gigabyte per OSD. Even if your locks have a different nature, it may be worth trying anyway. Thanks, I perfectly understand this. But if you have a low enough OSD/node ratio, it can be possible to check the problem at node scale. By the way, I do not see a real reason for using a lower allocsize except on a cluster designed for object storage. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD commits suicide
On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis cle...@centraldesktop.com wrote: I did have a problem in my secondary cluster that sounds similar to yours. I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs options xfs = -i size=64k). This showed up with a lot of XFS: possible memory allocation deadlock in kmem_alloc in the kernel logs. I was able to keep things limping along by flushing the cache frequently, but I eventually re-formatted every OSD to get rid of the 64k inodes. After I finished the reformat, I had problems because of deep-scrubbing. While reformatting, I disabled deep-scrubbing. Once I re-enabled it, Ceph wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would be doing a deep-scrub. I'm manually deep-scrubbing now, trying to spread out the schedule a bit. Once this finishes in a few days, I should be able to re-enable deep-scrubbing and keep my HEALTH_OK. Would you mind checking the suggestions, by following my hints or the hints from the URLs mentioned in http://marc.info/?l=linux-mmm=141607712831090w=2 with 64k again? As for me, I am not observing the lock loop after setting min_free_kbytes to half a gigabyte per OSD. Even if your locks have a different nature, it may be worth trying anyway. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
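The "half a gigabyte per OSD" rule of thumb, spelled out (the OSD count is an assumption for the example):

```shell
# Reserve roughly 512MB of free memory per OSD on the node.
OSDS=8                                   # assumed number of OSDs on this host
MIN_FREE_KB=$(( OSDS * 512 * 1024 ))     # 4194304 kB for 8 OSDs
echo "vm.min_free_kbytes = ${MIN_FREE_KB}"
# sysctl -w vm.min_free_kbytes=${MIN_FREE_KB}                        # apply now
# echo "vm.min_free_kbytes=${MIN_FREE_KB}" >> /etc/sysctl.d/90-ceph.conf  # persist
```

A larger min_free_kbytes keeps a bigger reserve of free pages, which gives atomic and high-order allocations headroom and reduces the chance of hitting the deadlock path described above.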
Re: [ceph-users] isolate_freepages_block and excessive CPU usage by OSD process
On Sat, Nov 15, 2014 at 9:45 PM, Vlastimil Babka vba...@suse.cz wrote: On 11/15/2014 06:10 PM, Andrey Korolyov wrote: On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka vba...@suse.cz wrote: On 11/15/2014 12:48 PM, Andrey Korolyov wrote: Hello, I had found recently that the OSD daemons under certain conditions (moderate vm pressure, moderate I/O, slightly altered vm settings) can go into loop involving isolate_freepages and effectively hit Ceph cluster performance. I found this thread Do you feel it is a regression, compared to some older kernel version or something? No, it`s just a rare but very concerning stuff. The higher pressure is, the more chance to hit this particular issue, although absolute numbers are still very large (e.g. room for cache memory). Some googling also found simular question on sf: http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph but there are no perf info unfortunately so I cannot say if the issue is the same or not. Well it would be useful to find out what's doing the high-order allocations. With 'perf -g -a' and then 'perf report -g' determine the call stack. Order and allocation flags can be captured by enabling the page_alloc tracepoint. Thanks, please give me some time to go through testing iterations, so I`ll collect appropriate perf.data. https://lkml.org/lkml/2012/6/27/545, but looks like that the significant decrease of bdi max_ratio did not helped even for a bit. Although I have approximately a half of physical memory for cache-like stuff, the problem with mm persists, so I would like to try suggestions from the other people. In current testing iteration I had decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and background ratio to 15 and 10 correspondingly (because default values are too spiky for mine workloads). The host kernel is a linux-stable 3.10. Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying it, or at least 3.17. 
Lot of patches went to reduce compaction overhead for (especially for transparent hugepages) since 3.10. Heh, I may say that I limited to pushing knobs in 3.10, because it has a well-known set of problems and any major version switch will lead to months-long QA procedures, but I may try that if none of mine knob selection will help. I am not THP user, the problem is happening with regular 4k pages and almost default VM settings. Also it worth to mean OK that's useful to know. So it might be some driver (do you also have mellanox?) or maybe SLUB (do you have it enabled?) is trying high-order allocations. Yes, I am using mellanox transport there and SLUB allocator, as SLAB had some issues with allocations with uneven node fill-up on a two-head system which I am primarily using. that kernel messages are not complaining about allocation failures, as in case in URL from above, compaction just tightens up to some limit Without the warnings, that's why we need tracing/profiling to find out what's causing it. and (after it 'locked' system for a couple of minutes, reducing actual I/O and derived amount of memory operations) it goes back to normal. Cache flush fixing this just in a moment, so should large room for That could perhaps suggest a poor coordination between reclaim and compaction, made worse by the fact that there are more parallel ongoing attempts and the watermark checking doesn't take that into account. min_free_kbytes. Over couple of days, depends on which nodes with certain settings issue will reappear, I may judge if my ideas was wrong. Non-default VM settings are: vm.swappiness = 5 vm.dirty_ratio=10 vm.dirty_background_ratio=5 bdi_max_ratio was 100%, right now 20%, at a glance it looks like the situation worsened, because unstable OSD host cause domino-like effect on other hosts, which are starting to flap too and only cache flush via drop_caches is helping. 
Unfortunately there is no slab info from the exhausted state due to the sporadic nature of this bug; I will try to catch it next time. slabtop (normal state):

 Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
 Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
 Active / Total Caches (% used)     : 86 / 132 (65.2%)
 Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
 751232  721707  96%    0.06K  11738       64     46952K kmalloc-64
 251636  226228  89%    0.55K   8987       28    143792K radix_tree_node
 121696   45710  37%    0.25K   3803       32     30424K kmalloc-256
 113022   80618  71%    0.19K   2691       42     21528K dentry
 112672   35160  31%    0.50K   3521       32     56336K kmalloc-512
  73136   72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
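The profiling that was suggested earlier in the thread would look roughly like this (durations are arbitrary; the commands are left commented since they need root and a live workload):

```shell
# Capture where the CPU time goes while the isolate_freepages stall is happening,
# then trace the order and gfp flags of every page allocation.
PROFILE_SECS=30
TRACE_SECS=10
# perf record -g -a -- sleep ${PROFILE_SECS}     # system-wide call graphs
# perf report -g                                 # find who calls isolate_freepages
# perf record -e kmem:mm_page_alloc -a -- sleep ${TRACE_SECS}
# perf script | head                             # shows order= and gfp_flags= per alloc
echo "profile ${PROFILE_SECS}s, trace allocations for ${TRACE_SECS}s"
```

The page_alloc tracepoint output is what identifies the high-order allocator (driver, SLUB, THP) that keeps compaction busy.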
Re: [ceph-users] Ceph and Compute on same hardware?
On Wed, Nov 12, 2014 at 5:30 PM, Haomai Wang haomaiw...@gmail.com wrote: Actually, our production cluster(up to ten) all are that ceph-osd ran on compute-node(KVM). The primary action is that you need to constrain the cpu and memory. For example, you can alloc a ceph cpu-set and memory group, let ceph-osd run with it within limited cores and memory. The another risk is the network. Because compute-node and ceph-osd shared the same kernel network stack, it exists some risks that VM may ran out of network resources such as conntracker in netfilter framework. On Wed, Nov 12, 2014 at 10:23 PM, Mark Nelson mark.nel...@inktank.com wrote: Technically there's no reason it shouldn't work, but it does complicate things. Probably the biggest worry would be that if something bad happens on the compute side (say it goes nuts with network or memory transfers) it could slow things down enough that OSDs start failing heartbeat checks causing ceph to go into recovery and maybe cause a vicious cycle of nastiness. You can mitigate some of this with cgroups and try to dedicate specific sockets and memory banks to Ceph/Compute, but we haven't done a lot of testing yet afaik. Mark On 11/12/2014 07:45 AM, Pieter Koorts wrote: Hi, A while back on a blog I saw mentioned that Ceph should not be run on compute nodes and in the general sense should be on dedicated hardware. Does this really still apply? An example, if you have nodes comprised of 16+ cores 256GB+ RAM Dual 10GBE Network 2+8 OSD (SSD log + HDD store) I understand that Ceph can use a lot of IO and CPU in some cases but if the nodes are powerful enough does it not make it an option to run compute and storage on the same hardware to either increase density of compute or save money on additional hardware? What are the reasons for not running Ceph on the Compute nodes. 
Thanks Pieter ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best Regards, Wheat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Yes, the essential part is resource management, which can be either dynamic or static. In Flops we implemented dynamic resource control, which allows packing VMs and OSDs more densely than static cg-based jails can allow (and it requires deep orchestration modifications for every open source cloud orchestrator, unfortunately). As long as you are able to manage strong traffic isolation for the storage and VM segments, there is absolutely no problem (it can be static limits from linux-qos or tricky flow management via OpenFlow, depending on what your orchestration allows). The possibility of putting compute and storage roles together without significant impact on performance characteristics was one of the key features which led us to select Ceph as a storage backend three years ago. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
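A minimal sketch of the static cg-based isolation mentioned above, using the cgroup v1 cpuset controller (the core list, NUMA node and mount point are assumptions):

```shell
# Pin all ceph-osd processes to a dedicated set of cores so a runaway VM
# cannot starve them into failing heartbeats.
CG=/sys/fs/cgroup/cpuset/ceph            # assumes cgroup v1 cpuset mounted here
CORES=0-3                                # assumed cores reserved for the OSDs
# mkdir -p ${CG}
# echo ${CORES} > ${CG}/cpuset.cpus
# echo 0        > ${CG}/cpuset.mems      # single NUMA node assumed
# for p in $(pidof ceph-osd); do echo $p > ${CG}/tasks; done
echo "would pin ceph-osd to cores ${CORES}"
```

A matching memory cgroup (memory.limit_in_bytes) for the compute side covers the other half of the contention; network isolation still has to be handled separately, as noted above.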
[ceph-users] dumpling to giant test transition
Hello, after doing a single-step transition, the test cluster is hanging in an unclean state, both before and after crush tunables adjustment: status: http://xdel.ru/downloads/transition-stuck/cephstatus.txt osd dump: http://xdel.ru/downloads/transition-stuck/cephosd.txt query for a single pg in active+remapped state: http://xdel.ru/downloads/transition-stuck/remappedpg.txt query for a single pg in active+undersized+degraded state: http://xdel.ru/downloads/transition-stuck/degradedpg.txt As one can see, there is an empty value for backfill_targets in both sets of pgs, which clearly indicates some problem with placement calculation (a two-node, two-OSD cluster should have enough targets for backfilling in this case). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] dumpling to giant test transition
Yes, that's the right guess - the crushmap is imagining both OSDs on a single host, although the daemon on the second host was able to act as up/in when the crushmap placed it on a different host; this looks like a weird placement miscalculation during the update. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
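One way to check what the crushmap actually computes, offline and independent of the running daemons (rule id and replica count are assumptions for this two-OSD case):

```shell
# Dump the in-use crush map and simulate placements without touching the cluster.
RULE=0                                       # assumed replicated rule id
REPLICAS=2                                   # two-OSD pool from the report above
# ceph osd getcrushmap -o /tmp/cm
# crushtool -i /tmp/cm --test --show-mappings --rule ${RULE} --num-rep ${REPLICAS} | head
# crushtool -d /tmp/cm -o /tmp/cm.txt        # decompile to inspect the host buckets
echo "testing rule ${RULE} with ${REPLICAS} replicas"
```

Mappings that repeatedly return a single OSD (or an empty set) confirm the map itself, not the daemons, is mis-placing both OSDs under one host.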
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
Hi Haomai, all. Today after unexpected power failure one of kv stores (placed on ext4 with default mount options) refused to work. I think that it may be interesting to revive it because it is almost first time among hundreds of power failures (and their simulations) when data store got broken. Strace: http://xdel.ru/downloads/osd1.strace.gz Debug output with 20-everything level: http://xdel.ru/downloads/osd1.out ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks for Andrey, The attachment OSD.1's log is only these lines? I really can't find the detail infos from it? Maybe you need to improve debug_osd to 20/20? On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Hi Haomai, all. Today after unexpected power failure one of kv stores (placed on ext4 with default mount options) refused to work. I think that it may be interesting to revive it because it is almost first time among hundreds of power failures (and their simulations) when data store got broken. Strace: http://xdel.ru/downloads/osd1.strace.gz Debug output with 20-everything level: http://xdel.ru/downloads/osd1.out -- Best Regards, Wheat Unfortunately that`s all I`ve got. Updated osd1.out to show an actual cli args and entire output - it ends abruptly without last newline and without any valuable output. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
On Wed, Oct 29, 2014 at 1:28 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks! You mean osd.1 exited abruptly without a ceph callback trace? Anyone has some ideas about this log? @sage @gregory On Wed, Oct 29, 2014 at 6:19 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks for Andrey, The attachment OSD.1's log is only these lines? I really can't find the detail infos from it? Maybe you need to improve debug_osd to 20/20? On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Hi Haomai, all. Today after unexpected power failure one of kv stores (placed on ext4 with default mount options) refused to work. I think that it may be interesting to revive it because it is almost first time among hundreds of power failures (and their simulations) when data store got broken. Strace: http://xdel.ru/downloads/osd1.strace.gz Debug output with 20-everything level: http://xdel.ru/downloads/osd1.out -- Best Regards, Wheat Unfortunately that`s all I`ve got. Updated osd1.out to show an actual cli args and entire output - it ends abruptly without last newline and without any valuable output. -- Best Regards, Wheat With a log file specified, it adds just the following line at the very end: 2014-10-29 13:29:57.437776 7ffa562c9840 -1 ** ERROR: osd init failed: (22) Invalid argument The stdout printing seems a bit broken and does not print this line at all (and the store output is definitely not detailed enough to draw any conclusions, or even to file a bug). CCing Sage/Greg. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
On Wed, Oct 29, 2014 at 1:37 PM, Haomai Wang haomaiw...@gmail.com wrote: maybe you can run it directly with debug_osd=20/20 and get ending logs ceph-osd -i 1 -c /etc/ceph/ceph.conf -f On Wed, Oct 29, 2014 at 6:34 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Oct 29, 2014 at 1:28 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks! You mean osd.1 exited abrptly without ceph callback trace? Anyone has some ideas about this log? @sage @gregory On Wed, Oct 29, 2014 at 6:19 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Oct 29, 2014 at 1:11 PM, Haomai Wang haomaiw...@gmail.com wrote: Thanks for Andrey, The attachment OSD.1's log is only these lines? I really can't find the detail infos from it? Maybe you need to improve debug_osd to 20/20? On Wed, Oct 29, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Hi Haomai, all. Today after unexpected power failure one of kv stores (placed on ext4 with default mount options) refused to work. I think that it may be interesting to revive it because it is almost first time among hundreds of power failures (and their simulations) when data store got broken. Strace: http://xdel.ru/downloads/osd1.strace.gz Debug output with 20-everything level: http://xdel.ru/downloads/osd1.out -- Best Regards, Wheat Unfortunately that`s all I`ve got. Updated osd1.out to show an actual cli args and entire output - it ends abruptly without last newline and without any valuable output. -- Best Regards, Wheat With log-file specified, it adds just following line at very end: 2014-10-29 13:29:57.437776 7ffa562c9840 -1 ** ERROR: osd init failed: (22) Invalid argument the stdout printing seems a bit broken and do not print this at all (and store output part is definitely is not detailed enough to make any conclusions, and even file a bug). CCing Sage/Greg. -- Best Regards, Wheat -f does not print the last line to stderr either. 
OK, it looks like a very minor, separate bug, but I remember it appearing long before; since the bug remains, it probably does not bother anyone - stderr output is less commonly used for debugging purposes.
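For the archives: the 20/20 verbosity Haomai asks for can also be made persistent in ceph.conf instead of being passed at each start. A sketch (stock log path assumed; adjust for your deployment and revert after debugging, as these levels are very chatty):

```ini
[osd]
; 20/20 = maximum log-file / in-memory verbosity for the OSD subsystem
debug osd = 20/20
; message-layer logging can also help when an OSD dies before init completes
debug ms = 1
log file = /var/log/ceph/$cluster-$name.log
```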
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
On Sun, Oct 26, 2014 at 7:40 AM, Haomai Wang haomaiw...@gmail.com wrote: On Sun, Oct 26, 2014 at 3:12 AM, Andrey Korolyov and...@xdel.ru wrote: Thanks Haomai. Turns out that the master` recovery is too buggy right now (recovery speed degrades over a time, OSD (non-kv) is going out of cluster with no reason, misplaced object calculation is wrong and so on), so I am sticking to giant with rocksdb now. So far no major problems are revealed. Hmm, do you mean kvstore has problem on osd recovery? I'm eager to know the operations about how to produce this situation. Could you give more detail? -- Best Regards, Wheat I'm not sure whether kv triggered any of those; it's just a side effect of deploying the master branch (and the OSDs which showed problems were not only those in the kv subset). It looks like both giant and master expose some problem with pg recalculation under tight-IO conditions for the MONs (each MON shares a disk with one of the OSDs, and post-peering recalculation may take some minutes when kv-based OSDs are involved; likewise, recalculation from active+remapped to active+degraded(+...) takes tens of minutes; the same 'non-optimal' setup worked well before, and all recalculations were done in a matter of tens of seconds, so I will investigate this a bit later). Giant crashed on non-KV daemons during nightly recovery, so there is more critical stuff to fix right now, because kv so far has not exposed any crashes by itself.
Re: [ceph-users] Continuous OSD crash with kv backend (firefly)
Thanks Haomai. It turns out that master's recovery is too buggy right now (recovery speed degrades over time, non-kv OSDs drop out of the cluster for no reason, misplaced-object calculation is wrong, and so on), so I am sticking with giant plus rocksdb for now. So far no major problems have been revealed.
[ceph-users] Continuous OSD crash with kv backend (firefly)
Hi, during recovery testing on the latest firefly with the leveldb backend we found that the OSDs on a given host may all crash at once, leaving the attached backtrace. Otherwise, recovery goes more or less smoothly for hours. The timestamps show how the issue is correlated between different processes on the same node:
core.ceph-osd.25426.node01.1414148261
core.ceph-osd.25734.node01.1414148263
core.ceph-osd.25566.node01.1414148345
The question is about the kv backend state in Firefly - is it considered stable enough to run a production test against, or should we move to giant/master for this? Thanks! GNU gdb (GDB) 7.4.1-debian Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details. This GDB was configured as x86_64-linux-gnu. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /usr/bin/ceph-osd...Reading symbols from /usr/lib/debug/usr/bin/ceph-osd...done. done.
[New LWP 10182] [New LWP 10183] [New LWP 10699] [New LWP 10184] [New LWP 10703] [New LWP 10704] [New LWP 10702] [New LWP 10708] [New LWP 10707] [New LWP 10710] [New LWP 10700] [New LWP 10717] [New LWP 10765] [New LWP 10705] [New LWP 10706] [New LWP 10701] [New LWP 10712] [New LWP 10735] [New LWP 10713] [New LWP 10750] [New LWP 10718] [New LWP 10711] [New LWP 10716] [New LWP 10715] [New LWP 10785] [New LWP 10766] [New LWP 10796] [New LWP 10720] [New LWP 10725] [New LWP 10736] [New LWP 10709] [New LWP 10730] [New LWP 11541] [New LWP 10770] [New LWP 11573] [New LWP 10778] [New LWP 10804] [New LWP 11561] [New LWP 9388] [New LWP 9398] [New LWP 11538] [New LWP 10790] [New LWP 11586] [New LWP 10798] [New LWP 9910] [New LWP 10726] [New LWP 21823] [New LWP 10815] [New LWP 9397] [New LWP 11248] [New LWP 10723] [New LWP 11253] [New LWP 10728] [New LWP 10791] [New LWP 9389] [New LWP 10724] [New LWP 10780] [New LWP 11287] [New LWP 11592] [New LWP 10816] [New LWP 10812] [New LWP 10787] [New LWP 20622] [New LWP 21822] [New LWP 10751] [New LWP 10768] [New LWP 10767] [New LWP 11874] [New LWP 10733] [New LWP 10811] [New LWP 11574] [New LWP 11873] [New LWP 10771] [New LWP 11551] [New LWP 10799] [New LWP 10729] [New LWP 18254] [New LWP 10792] [New LWP 10803] [New LWP 9912] [New LWP 11293] [New LWP 20623] [New LWP 14805] [New LWP 10773] [New LWP 11298] [New LWP 11872] [New LWP 10763] [New LWP 10783] [New LWP 10769] [New LWP 11300] [New LWP 10777] [New LWP 10764] [New LWP 10802] [New LWP 10749] [New LWP 14806] [New LWP 10806] [New LWP 10805] [New LWP 18255] [New LWP 10181] [New LWP 11277] [New LWP 9913] [New LWP 10800] [New LWP 10801] [New LWP 11555] [New LWP 11871] [New LWP 10748] [New LWP 9915] [New LWP 10779] [New LWP 11294] [New LWP 9916] [New LWP 10757] [New LWP 10734] [New LWP 10786] [New LWP 10727] [New LWP 19063] [New LWP 11279] [New LWP 9905] [New LWP 9911] [New LWP 10772] [New LWP 10722] [New LWP 9914] [New LWP 10789] [New LWP 11540] [New LWP 9917] [New LWP 11289] [New LWP 
10714] [New LWP 10721] [New LWP 10719] [New LWP 10788] [New LWP 10782] [New LWP 10784] [New LWP 10776] [New LWP 10774] [New LWP 10737] [New LWP 19064] [Thread debugging using libthread_db enabled] Using host libthread_db library /lib/x86_64-linux-gnu/libthread_db.so.1. Core was generated by `/usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.con'. Program terminated with signal 6, Aborted. #0 0x7ff9ad91eb7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) Thread 135 (Thread 0x7ff99a492700 (LWP 19064)): #0 0x7ff9ad91ad84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x00c496da in Wait (mutex=..., this=0x108cd110) at ./common/Cond.h:55 #2 Pipe::writer (this=0x108ccf00) at msg/Pipe.cc:1730 #3 0x00c5485d in Pipe::Writer::entry (this=optimized out) at msg/Pipe.h:61 #4 0x7ff9ad916e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #5 0x7ff9ac4a43dd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #6 0x in ?? () Thread 134 (Thread 0x7ff975e1d700 (LWP 10737)): #0 0x7ff9ac498a13 in poll () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00c3e73c in Pipe::tcp_read_wait (this=this@entry=0x4a53180) at msg/Pipe.cc:2282 #2 0x00c3e9d0 in Pipe::tcp_read (this=this@entry=0x4a53180, buf=optimized out, buf@entry=0x7ff975e1cccf \377, len=len@entry=1) at msg/Pipe.cc:2255 #3 0x00c5095f in Pipe::reader (this=0x4a53180) at msg/Pipe.cc:1421 #4 0x00c5497d in Pipe::Reader::entry (this=optimized out) at msg/Pipe.h:49 #5 0x7ff9ad916e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #6 0x7ff9ac4a43dd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #7 0x in ?? () Thread 133 (Thread 0x7ff972dda700 (LWP
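The interactive gdb session above can also be driven non-interactively; a command sketch for dumping all thread backtraces from each of the three cores into per-core files (core names taken from the report; the ceph debug-symbol package is assumed to be installed, and this obviously requires the actual core files):

```shell
# For every core dumped around the crash window, record the backtraces of
# all threads into a text file for side-by-side comparison.
for core in core.ceph-osd.25426.node01.1414148261 \
            core.ceph-osd.25734.node01.1414148263 \
            core.ceph-osd.25566.node01.1414148345; do
    gdb --batch -ex 'thread apply all bt' /usr/bin/ceph-osd "$core" > "$core.bt.txt"
done
```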
Re: [ceph-users] Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation
Sorry, I see the problem: osd.0 10.6.0.1:6800/32051 clashes with existing osd: different fsid (ours: d0aec02e-8513-40f1-bf34-22ec44f68466 ; theirs: 16cbb1f8-e896-42cd-863c-bcbad710b4ea). Anyway, it is clearly a bug; the fsid should be silently discarded there if the OSD itself contains no epochs.
Re: [ceph-users] Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation
Heh, it looks like the osd process is unable to reach any of the mon members. Since mkfs completes just fine (which requires the same mon set to be working), I suspect a bug there. [Attachments: osd0-monc10.log.gz, mon0-dbg.log.gz]
Re: [ceph-users] Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation
It is not so easy. When I added the fsid under the selected osd's section and reformatted the store/journal, it aborted at start in FileStore::_do_transaction (see attachment). On the next launch, the fsid in the mon store for this OSD magically changes to something else and I am kicking against the same doorstep again (if I shut down the osd process and recreate the journal with the new fsid, or recreate the entire filestore too, it will abort; otherwise it simply will not join, due to the *next* mismatch). As far as I can see, the problem lies in the behavior of legacy clusters which inherited their fsid from a filesystem created by a third party, not as a result of ceph-deploy's work, so it is not fixed at all after such an update. Any suggestions? The trace is attached if someone is interested in it. On Thu, Oct 23, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Sorry, I see the problem. osd.0 10.6.0.1:6800/32051 clashes with existing osd: different fsid (ours: d0aec02e-8513-40f1-bf34-22ec44f68466 ; theirs: 16cbb1f8-e896-42cd-863c-bcbad710b4ea). Anyway it is clearly a bug and fsid should be silently discarded there if OSD contains no epochs itself. [Attachment: abrt-at-start.txt.gz]
Re: [ceph-users] Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation
On Thu, Oct 23, 2014 at 9:18 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: Let me re-CC the list as this may be worth for the archives. On 10/23/2014 04:19 PM, Andrey Korolyov wrote: Doing off-list post again. So I was inaccurate in an initial bug description: - mkfs goes just well - on first start OSD is crashing with ABRT and trace from previous message, changing fsid before in the mon store - on next start it refuses to join due to fsid mismatch, not crashing any more. On Thu, Oct 23, 2014 at 5:56 PM, Andrey Korolyov and...@xdel.ru wrote: It is not so easy.. When I added fsid under selected osd` section and reformatted the store/journal, it aborted at start in FileStore::_do_transaction (see attach). On next launch, fsid in the mon store for this OSD magically changes to the something else and I am kicking again same doorstep (if I shut down osd process, recreate journal with new fsid inserted in fsid or recreate entire filestore too, it will abort, otherwise simply not join due to *next* mismatch). As far as I can see problem is in behavior of legacy clusters which are inherited fsid from filesystem created by third-party, not as a result of ceph-deploy work, so it is not fixed at all after such an update. Any suggestions? I'm not sure what you mean by 'changing fsid in the mon store', but I suspect you have a few misconceptions about 'fsid' and the 'osd uuid'. The error you have below, regarding the osd fsid, refers to the osd's uuid, which is passed to '--mkfs' using '--osd-uuid X'. 'X' is also the uuid you would pass when adding the osd to the monitors using 'ceph osd create uuid'. Then there's the cluster 'fsid', which refers to the cluster. This 'fsid' is kept in the monmap and is used to identify the cluster the monitors belong to and to allow clients (such as the osd) to correctly contact the monitors of the cluster they too belong to. Yes, I am referring to it. 
The problem is that I called the monmap the "mon store", which is a bit incorrect in terms of the documentation. Changing the 'fsid' option in ceph.conf results in changing the perceived value the clients and daemons have of the cluster fsid. If this value is different from the monmap's you're bound to have trouble. If you only change the 'fsid' option in the 'osd' section of ceph.conf, you're basically telling the osds that they belong to a different cluster, which will probably cause issues when they contact the monitors to obtain the monmap during mkfs. What you clearly want is to remove the contents of the osd data directory, generate a uuid 'X', run 'ceph osd create X', save the value it will return (it will be used as the OSD's id) and then run ceph-osd --mkfs with --osd-uuid X. Also, I don't believe that the 'clashing' message is a bug. IMO we should assume that it's the operator's responsibility to remove the data if it's no longer of any use, instead of just assuming what the operator may have meant when running mkfs repeatedly over a given osd store. Thanks, I see - reusing the existing UUID from 'osd dump' worked well. The problem probably stemmed from previous experience with OSD recreation, which did not require a UUID to be specified when re-formatting the OSD (and I believe there is some inconsistency anyway - if I am specifying an existing osd id in the mkfs call, why not just fetch and reuse its UUID for the filestore?). The crash with SIGABRT takes place only with debug_ms set to 10 or higher, so I am probably hitting an independent bug there. Hope this helps. -Joao Trace is attached if someone is interested in it. On Thu, Oct 23, 2014 at 5:25 PM, Andrey Korolyov and...@xdel.ru wrote: Sorry, I see the problem. osd.0 10.6.0.1:6800/32051 clashes with existing osd: different fsid (ours: d0aec02e-8513-40f1-bf34-22ec44f68466 ; theirs: 16cbb1f8-e896-42cd-863c-bcbad710b4ea). Anyway it is clearly a bug and fsid should be silently discarded there if OSD contains no epochs itself. 
-- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
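Joao's recommended workflow, condensed into a command sketch for the archives (requires a live cluster and the appropriate admin keyring; the data path is the default layout, so adjust as needed - this is illustrative, not a verified recipe):

```shell
# Generate a fresh uuid and register it with the monitors;
# 'ceph osd create <uuid>' prints the osd id it allocates.
UUID=$(uuidgen)
ID=$(ceph osd create "$UUID")

# Wipe the stale store, then mkfs with the SAME uuid so the monitors
# and the on-disk store agree on the osd fsid.
rm -rf /var/lib/ceph/osd/ceph-"$ID"/*
ceph-osd -i "$ID" --mkfs --mkkey --osd-uuid "$UUID"
```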
[ceph-users] Fwd: Latest firefly: osd not joining cluster after re-creation
Hello, on a small test cluster, the following sequence resulted in the inability of a freshly formatted OSD to join back:
- update the cluster sequentially from cuttlefish to dumpling to firefly,
- execute the tunables change, wait for recovery completion,
- shut down a single osd, reformat its filestore and journal,
- start it back (auth caps and key remained the same).
Version is 5a10b95f7968ecac1f2af4abf9fb91347a290544. Any ideas why this may happen are very welcome. I suspect some resource request starting from line 29499 of the strace (probably earlier, but this line makes a clear separation between the init stage and the loop in the log), which keeps asking for the resource all the way down, may be the root cause (something just after journal and collections initialization), but I have no idea what it may be. Thanks! Strace: http://xdel.ru/downloads/osd0.out.gz [Attachments: osd0.stdout.gz, ceph.conf.gz]
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS
On Fri, Aug 29, 2014 at 10:37 AM, Somnath Roy somnath@sandisk.com wrote: Thanks Haomai! Here is some of the data from my setup.
Setup: 32-core cpu with HT enabled, 128 GB RAM, one SSD (both journal and data) - one OSD. 5 client machines with 12-core cpus, each running two instances of ceph_smalliobench (10 clients total). Network is 10GbE.
Workload: small workload - 20K objects of 4K size, io_size also 4K random read. The intent is to serve the ios from memory so that it can uncover the performance problems within a single OSD.
Results from Firefly: single-client throughput is ~14K iops, but as the number of clients increases the aggregated throughput does not increase: 10 clients ~15K iops. ~9-10 cpu cores are used.
Results with latest master: single client is ~14K iops, and it scales as the number of clients increases: 10 clients ~107K iops. ~25 cpu cores are used.
More realistic workload - let's see how it performs while 90% of the ios are served from disks.
Setup: 40-cpu-core server as a cluster node (single-node cluster) with 64 GB RAM. 8 SSDs - 8 OSDs. One similar node for monitor and rgw. Another node for the client running fio/vdbench. 4 rbds configured with the 'noshare' option. 40GbE network.
Workload: 8 SSDs are populated, so 8 * 800GB = ~6.4 TB of data. Io_size = 4K random read.
Results from Firefly: aggregated output while 4 rbd clients stress the cluster in parallel is ~20-25K IOPS; cpu cores used ~8-10 (may be less, can't remember precisely).
Results from latest master: aggregated output while 4 rbd clients stress the cluster in parallel is ~120K IOPS; cpu is 7% idle, i.e. ~37-38 cpu cores. Hope this helps.
Thanks Roy, the results are very promising! Just two questions - do the numbers above refer to HT cores, or did you renormalize the result to real ones? And what was the percentage of I/O time/utilization in this test (if you measured it)?
Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS
On Thu, Aug 28, 2014 at 10:48 PM, Somnath Roy somnath@sandisk.com wrote: Nope, this will not be back-ported to Firefly I guess. Thanks Regards Somnath Thanks for sharing this - the first thing that came to mind when I looked at this thread was your patches :) If Giant incorporates them, both the RDMA support and those patches should give a huge performance boost for RDMA-enabled Ceph back networks.
Re: [ceph-users] Strange qemu-rbd I/O behavior when booting Windows VM
On Fri, Jun 13, 2014 at 7:09 AM, Ke-fei Lin k...@kfei.net wrote: Hi list, I deployed a Windows 7 VM with qemu-rbd disk, and got an unexpected booting phase performance. I discovered that when booting the Windows VM up, there are consecutive ~2 minutes that `ceph -w` gives me an interesting log like: ... 567 KB/s rd, 567 op/s, ... 789 KB/s rd, 789 op/s and so on. e.g. 2014-06-05 15:47:43.125441 mon.0 [INF] pgmap v18095: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 765 kB/s rd, 765 op/s 2014-06-05 15:47:44.240662 mon.0 [INF] pgmap v18096: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 568 kB/s rd, 568 op/s ... (skipped) 2014-06-05 15:50:02.441523 mon.0 [INF] pgmap v18186: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 412 kB/s rd, 412 op/s Which shows the number of rps is always the same as the number of ops, i.e. every operation is nearly 1KB, and I think this leads a very long boot time (takes 2 mins to enter desktop). But I can't understand why, is it an issue of my Ceph cluster? Or just some special I/O patterns in Windows VM booting process? In addition, I know that there are no qemu-rbd caching benefits during boot phase since the cache is not persistent (please corrects me), so is it possible to enlarge the read_ahead size in qemu-rbd driver? And does this make any sense? And finally, how can I tune up my Ceph cluster for this workload (booting Windows VM)? Any advice and suggestions will be greatly appreciated. Context: 4 OSDs (7200rpm/750GB/SATA) with replication factor 2. The system disk in Windows VM is NTFS formatted with default 4K block size. 
$ uname -a Linux ceph-consumer 3.11.0-22-generic #38~precise1-Ubuntu SMP Fri May 16 20:47:57 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux $ ceph --version ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) $ dpkg -l | grep rbd ii librbd-dev 0.80.1-1precise RADOS block device client library (development files) ii librbd1 0.80.1-1precise RADOS block device client library $ virsh version Compiled against library: libvir 0.9.8 Using library: libvir 0.9.8 Using API: QEMU 0.9.8 Running hypervisor: QEMU 1.7.1 () ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Hi, If you are able to leave only this VM in cluster scope to check, you`ll perhaps may use virsh domblkstat accumulated values to compare real number of operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Strange qemu-rbd I/O behavior when booting Windows VM
On Fri, Jun 13, 2014 at 5:50 PM, Ke-fei Lin k...@kfei.net wrote: 2014-06-13 21:23 GMT+08:00 Andrey Korolyov and...@xdel.ru: On Fri, Jun 13, 2014 at 7:09 AM, Ke-fei Lin k...@kfei.net wrote: Hi list, I deployed a Windows 7 VM with qemu-rbd disk, and got an unexpected booting phase performance. I discovered that when booting the Windows VM up, there are consecutive ~2 minutes that `ceph -w` gives me an interesting log like: ... 567 KB/s rd, 567 op/s, ... 789 KB/s rd, 789 op/s and so on. e.g. 2014-06-05 15:47:43.125441 mon.0 [INF] pgmap v18095: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 765 kB/s rd, 765 op/s 2014-06-05 15:47:44.240662 mon.0 [INF] pgmap v18096: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 568 kB/s rd, 568 op/s ... (skipped) 2014-06-05 15:50:02.441523 mon.0 [INF] pgmap v18186: 320 pgs: 320 active+clean; 86954 MB data, 190 GB used, 2603 GB / 2793 GB avail; 412 kB/s rd, 412 op/s Which shows the number of rps is always the same as the number of ops, i.e. every operation is nearly 1KB, and I think this leads a very long boot time (takes 2 mins to enter desktop). But I can't understand why, is it an issue of my Ceph cluster? Or just some special I/O patterns in Windows VM booting process? In addition, I know that there are no qemu-rbd caching benefits during boot phase since the cache is not persistent (please corrects me), so is it possible to enlarge the read_ahead size in qemu-rbd driver? And does this make any sense? And finally, how can I tune up my Ceph cluster for this workload (booting Windows VM)? Any advice and suggestions will be greatly appreciated. Context: 4 OSDs (7200rpm/750GB/SATA) with replication factor 2. The system disk in Windows VM is NTFS formatted with default 4K block size. 
$ uname -a Linux ceph-consumer 3.11.0-22-generic #38~precise1-Ubuntu SMP Fri May 16 20:47:57 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux $ ceph --version ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) $ dpkg -l | grep rbd ii librbd-dev 0.80.1-1precise RADOS block device client library (development files) ii librbd1 0.80.1-1precise RADOS block device client library $ virsh version Compiled against library: libvir 0.9.8 Using library: libvir 0.9.8 Using API: QEMU 0.9.8 Running hypervisor: QEMU 1.7.1 () ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Hi, If you are able to leave only this VM in cluster scope to check, you`ll perhaps may use virsh domblkstat accumulated values to compare real number of operations. Thanks, Andrey. I tried `virsh domblkstat vm hda` (only this VM in whole cluster) and got these values: hda rd_req 70682 hda rd_bytes 229894656 hda wr_req 1067 hda wr_bytes 12645888 hda flush_operations 0 (These values became stable after ~2 mins) While the output of `ceph -w` is attached at: http://pastebin.com/Uhdj9drV Any advices? Thanks, poor man`s analysis shows that it can be true - assuming median heartbeat value as 1.2s, overall read ops are about 40k, which is close enough to what qemu stats saying, regarding floating heartbeat interval. Because ceph -w never had such value as a precise measurement tool, I may suggest to measure block stats difference on smaller intervals, about 1s or so, and compare values then. By the way, what driver do you use in qemu for a block device? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Strange qemu-rbd I/O behavior when booting Windows VM
In my belief, a lot of sequential small reads will be aggregated after all when targeting filestore contents (of course, only if issuing the next one does not depend on the status of the previous read; otherwise they'll be separated in time in such a way that the rotating-media scheduler will not be able to combine the requests) - am I wrong? If so, this case only affects OSD CPU consumption (at very large scale). Ke-fei, is there any real reason behind staying on IDE rather than LSI SCSI emulation or virtio? On Fri, Jun 13, 2014 at 8:11 PM, Sage Weil s...@inktank.com wrote: Right now, no. We could add a minimum read size to librbd when caching is enabled... that would not be particularly difficult. sage On Fri, 13 Jun 2014, Ke-fei Lin wrote: 2014-06-13 22:04 GMT+08:00 Andrey Korolyov and...@xdel.ru: On Fri, Jun 13, 2014 at 5:50 PM, Ke-fei Lin k...@kfei.net wrote: Thanks, Andrey. I tried `virsh domblkstat vm hda` (only this VM in the whole cluster) and got these values:
hda rd_req 70682
hda rd_bytes 229894656
hda wr_req 1067
hda wr_bytes 12645888
hda flush_operations 0
(These values became stable after ~2 mins.) The output of `ceph -w` is attached at: http://pastebin.com/Uhdj9drV Any advice? Thanks, a poor man's analysis shows that it can be true - assuming a median heartbeat value of 1.2s, overall read ops are about 40k, which is close enough to what the qemu stats say, allowing for the floating heartbeat interval. Because `ceph -w` was never meant to be a precise measurement tool, I suggest measuring the block-stats difference over smaller intervals, about 1s or so, and comparing the values then. By the way, which driver do you use in qemu for the block device? OK, this time I captured the blkstat difference over a smaller interval (less than 1s), and a simple calculation gives me this result: (19531264-19209216)/(38147-37518) = 512 ... (20158976-19531264)/(39373-38147) = 512 Which means that at the beginning of the boot phase, every read request from the VM is just *512 bytes*. 
Maybe this is why `ceph -w` shows me every operation is about 1KB (in my first post)? So this seems to be an inherent problem of the Windows VM boot process, but can I do something in my Ceph cluster's configuration to improve it? By the way, the relevant part of my VM definition is:

    <emulator>/usr/bin/kvm</emulator>
    <disk type='network' device='disk'>
      <driver name='qemu' type='rbd' cache='writeback'/>
      <source protocol='rbd' name='libvirt-pool/test-rbd-1'>
        <host name='10.0.0.5' port='6789'/>
      </source>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>

Thanks.
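The arithmetic above generalizes to any pair of `virsh domblkstat` samples; a small sketch (the sample numbers are taken from the thread):

```python
def avg_read_size(sample0, sample1):
    """Average bytes per read request between two (rd_req, rd_bytes) samples."""
    reqs0, bytes0 = sample0
    reqs1, bytes1 = sample1
    return (bytes1 - bytes0) / (reqs1 - reqs0)

# (rd_req, rd_bytes) pairs captured less than a second apart during VM boot:
print(avg_read_size((37518, 19209216), (38147, 19531264)))  # 512.0
print(avg_read_size((38147, 19531264), (39373, 20158976)))  # 512.0
```

Every boot-phase read really is one 512-byte sector, which matches the one-op-per-kilobyte ratio reported by `ceph -w`.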
Re: [ceph-users] Backfilling, latency and priority
On Thu, Jun 12, 2014 at 5:02 PM, David da...@visions.se wrote: Thanks Mark! Well, our workload has more IOs and quite low throughput, perhaps 10MB/s - 100MB/s. It’s a quite mixed workload, but mostly small files (http / mail / sql). During the recovery we had ranged between 600-1000MB/s throughput. So the only way to currently ”fix” this is to have enough IO to handle both recovery and client IOs? What’s the easiest/best way to add more IOs to a current cluster if you don’t want to scale? Add more RAM to OSD servers or add a SSD backed r/w cache tier? RAM usable only as read cache, SSD holds both types of operations. Dealing with a lot of small operations is very hard because the way cluster behaves changes dramatically with scale or with involved caching methods, therefore workloads which worked very reliable on certain number of OSDs may choke on 5 times higher count, so there almost nothing I can suggest to you except try-n-check variant. Kind Regards, David Majchrzak 12 jun 2014 kl. 14:42 skrev Mark Nelson mark.nel...@inktank.com: On 06/12/2014 03:44 AM, David wrote: Hi, We have 5 OSD servers, with 10 OSDs each (journals on enterprise SSDs). We lost an OSD and the cluster started to backfill the data to the rest of the OSDs - during which the latency skyrocketed on some OSDs and connected clients experienced massive IO wait. I’m trying to rectify the situation now and from what I can tell, these are the settings that might help. osd client op priority osd recovery op priority osd max backfills osd recovery max active 1. Does a high priority value mean it has higher priority? (if the other one has lower value) Or does a priority of 1 mean highest priority? 2. I’m running with default on these settings. Does anyone else have any experience changing those? We did some investigation into this a little while back. I suspect you'll see some benefit by reducing backfill/recovery priority and max concurrent operations, but you have to be careful. 
We found that the higher the number of concurrent client IOs (past the saturation point), the greater relative proportion of throughput is used by client IO. That makes it hard to nail down specific priority and concurrency settings. If your workload requires high throughput and low latency with few client IOs (ie below the saturation point), you may need to overly favour client IO. If you are over-saturating the cluster with many concurrent IOs, you may want to give client IO less priority. If you overly favor client IO when over-saturating the cluster, recovery can take much much longer and client throughput may actually be lower in aggregate. Obviously this isn't ideal, but seems to be what's going on right now. Mark Kind Regards, David Majchrzak
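For the archives, the four knobs David lists can be set cluster-wide in ceph.conf; a conservative sketch favouring client IO (higher priority values mean higher priority; the exact numbers are illustrative and defaults differ per release, so test against your own workload before deploying):

```ini
[osd]
; fewer concurrent backfills / recovery ops per OSD lowers recovery impact
osd max backfills = 2
osd recovery max active = 2
; higher value = higher priority; keep client ops well above recovery ops
osd client op priority = 63
osd recovery op priority = 3
```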
Re: [ceph-users] v0.67.9 Dumpling released
On 06/04/2014 06:06 PM, Sage Weil wrote: On Wed, 4 Jun 2014, Dan Van Der Ster wrote: Hi Sage, all, On 21 May 2014, at 22:02, Sage Weil s...@inktank.com wrote: * osd: allow snap trim throttling with simple delay (#6278, Sage Weil) Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = ? ? This is a coarse hack to make the snap trimming slow down and let client IO run by simply sleeping between work. I would start with something smallish (.01 = 10ms) after deleting some snapshots and see what effect it has on request latency. Unfortunately it's not a very intuitive knob to adjust, but it is an interim solution until we figure out how to better prioritize this (and other) background work. In short, if you do see a performance degradation after removing snaps, adjust this up or down and see how it changes that. If you don't see a degradation, then you're lucky and don't need to do anything. :) You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'. sage Hi, we have had the same mechanism for almost half a year and it has worked nicely, except in cases when multiple background snap deletions hit their ends - latencies may spike regardless of a very large sleep gap between snap operations. Do you have any thoughts on reducing this particular impact?
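To make Sage's suggestion survive OSD restarts, the same value can also go into ceph.conf (runtime injection as shown above remains the quickest way to experiment with the knob):

```ini
[osd]
; seconds to sleep between snap-trim work units; 0 (the default) = no throttling
osd snap trim sleep = 0.01
```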
Re: [ceph-users] v0.67.9 Dumpling released
On 06/04/2014 07:22 PM, Sage Weil wrote: On Wed, 4 Jun 2014, Andrey Korolyov wrote: On 06/04/2014 06:06 PM, Sage Weil wrote: On Wed, 4 Jun 2014, Dan Van Der Ster wrote: Hi Sage, all, On 21 May 2014, at 22:02, Sage Weil s...@inktank.com wrote: * osd: allow snap trim throttling with simple delay (#6278, Sage Weil) Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = ? ? This is a coarse hack to make the snap trimming slow down and let client IO run by simply sleeping between work. I would start with something smallish (.01 = 10ms) after deleting some snapshots and see what effect it has on request latency. Unfortunately it's not a very intuitive knob to adjust, but it is an interim solution until we figure out how to better prioritize this (and other) background work. In short, if you do see a performance degradation after removing snaps, adjust this up or down and see how it changes that. If you don't see a degradation, then you're lucky and don't need to do anything. :) You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'. sage Hi, we had the same mechanism for almost a half of year and it working nice except cases when multiple background snap deletions are hitting their ends - latencies may spike not regarding very large sleep gap for snap operations. Do you have any thoughts on reducing this particular impact? This isn't ringing any bells. 
If this is something you can reproduce with osd logging enabled we should be able to tell what is causing the spike, though... sage Ok, would debug level 10 be enough there? At 20, all timings are likely to be distorted by the logging operations themselves, even on tmpfs.
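The two adjustment methods Sage mentions can be collected into a small helper. This is a dry-run sketch: it only prints the commands rather than executing them, so it can be reviewed before use on a real cluster; the sleep value and OSD ids are illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch: print (not run) the commands from Sage's mail that set
# osd_snap_trim_sleep on running OSDs. SLEEP and OSD_IDS are illustrative
# assumptions; adjust them for your cluster before running for real.
print_snap_trim_cmds() {
    SLEEP=0.01        # 10ms, the "smallish" starting point suggested above
    OSD_IDS="0 1 2"   # hypothetical OSD ids

    # Method 1: per daemon, via the admin socket on each OSD host.
    for id in $OSD_IDS; do
        echo "ceph daemon osd.$id config set osd_snap_trim_sleep $SLEEP"
    done

    # Method 2: cluster-wide, via injectargs from any admin node.
    echo "ceph tell osd.* injectargs -- --osd-snap-trim-sleep $SLEEP"
}
print_snap_trim_cmds
```

As the mail says, the knob is not intuitive: raise or lower it while watching request latency after snapshot deletions.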
Re: [ceph-users] Fatigue for XFS
On 05/06/2014 01:23 AM, Dave Chinner wrote: On Tue, May 06, 2014 at 12:59:27AM +0400, Andrey Korolyov wrote: On Tue, May 6, 2014 at 12:36 AM, Dave Chinner da...@fromorbit.com wrote: On Mon, May 05, 2014 at 11:49:05PM +0400, Andrey Korolyov wrote: Hello, We are currently exploring issue which can be related to Ceph itself or to the XFS - any help is very appreciated. First, the picture: relatively old cluster w/ two years uptime and ten months after fs recreation on every OSD, one of daemons started to flap approximately once per day for couple of weeks, with no external reason (bandwidth/IOPS/host issues). It looks almost the same every time - OSD suddenly stop serving requests for a short period, gets kicked out by peers report, then returns in a couple of seconds. Of course, small but sensitive amount of requests are delayed by 15-30 seconds twice, which is bad for us. The only thing which correlates with this kick is a peak of I/O, not too large, even not consuming all underlying disk utilization, but alone in the cluster and clearly visible. Also there are at least two occasions *without* correlated iowait peak. So, actual numbers and traces are the only thing that tell us what is happening during these events. See here: http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F If it happens at almost the same time every day, then I'd be looking at the crontabs to find what starts up about that time. output of top will also probably tell you what process is running, too. topio might be instructive, and blktrace almost certainly will be I have two versions - we`re touching some sector on disk which is about to be marked as dead but not displayed in SMART statistics or (I Doubt it - SMART doesn't cause OS visible IO dispatch spikes. 
believe so) some kind of XFS fatigue, which can be more likely in this case, since near-bad sector should be touched more frequently and related impact could leave traces in dmesg/SMART from my experience. I I doubt that, too, because XFS doesn't have anything that is triggered on a daily basis inside it. Maybe you've got xfs_fsr set up on a cron job, though... would like to ask if anyone has a simular experience before or can suggest to poke existing file system in some way. If no suggestion appear, I`ll probably reformat disk and, if problem will remain after refill, replace it, but I think less destructive actions can be done before. Yeah, monitoring and determining the process that is issuing the IO is what you need to find first. Cheers, Dave. -- Dave Chinner da...@fromorbit.com Thanks Dave, there are definitely no cron set for specific time (though most of lockups happened in a relatively small interval which correlates with the Ceph snapshot operations). OK. FWIW, Ceph snapshots on XFS may not be immediately costly in terms of IO - they can be extremely costly after one is taken when the files in the snapshot are next written to. If you are snapshotting files that are currently being written to, then that's likely to cause immediate IO issues... In at least one case no Ceph snapshot operations (including delayed removal) happened and at least two when no I/O peak was observed. We observed and eliminated weird lockups related to the openswitch behavior before - we`re combining storage and compute nodes, so quirks in the OVS datapath caused very interesting and weird system-wide lockups on (supposedly) spinlock, and we see 'pure' Ceph lockups on XFS at time with 3.4-3.7 kernels, all of them was correlated with very high context switch peak. Until we determine what is triggering the IO, the application isn't really a concern. 
Current issue is seemingly nothing to do with spinlock-like bugs or just a hardware problem, we even rebooted problematic node to check if the memory allocator may stuck at the border of specific NUMA node, with no help, but first reappearance of this bug was delayed by some days then. Disabling lazy allocation via specifying allocsize did nothing too. It may look like I am insisting that it is XFS bug, where Ceph version is more likely to appear because of way more complicated logic and operation behaviour, but persistence on specific node across relaunching of Ceph storage daemon suggests bug relation to the unlucky byte sequence more than anything else. If it finally appear as Ceph bug, it`ll ruin our expectations from two-year of close experience with this product and if it is XFS bug, we haven`t see anything like this before, thought we had a pretty collection of XFS-related lockups on the earlier kernels. Long experience with triaging storage performance issues has taught me to ignore what anyone *thinks* is the cause of the problem; I rely on the data that is gathered to tell me what the problem is. I find that hard data has a nasty habit of busting assumptions, expectations, speculations
Re: [ceph-users] Ceph and low latency kernel
On Mon, May 26, 2014 at 10:53 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, sorry it was a bit poorly defined. I'm talking about things like this: http://lwn.net/Articles/551179/ Stefan Not sure if Ceph can gain any advantage from it, as common Ceph operations barely seem to hit the performance area that patch targets, but it would be awesome if you are able to test it :) DCTCP/D2TCP and maybe some other congestion control algorithms designed for low-latency high-speed networks can definitely give you a speed bump for spiky workloads. On 25.05.2014 11:11, Andrey Korolyov wrote: Hi, which one are you talking about? The -rt patchset makes absolutely no difference for Ceph, though a very specific workload (which I was unable to imagine at the time) can benefit from it a little. The Wind River variant means much more, because it brings rt to virtualized envs - in combination with storage nodes you may achieve much better deadlines for tasks like gaming servers and so on, but I have not tried it. On Sun, May 25, 2014 at 1:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, has anybody ever tried to use a low latency kernel for ceph? Does it make any difference? Greets, Stefan
Re: [ceph-users] Ceph and low latency kernel
Hi, which one are you talking about? The -rt patchset makes absolutely no difference for Ceph, though a very specific workload (which I was unable to imagine at the time) can benefit from it a little. The Wind River variant means much more, because it brings rt to virtualized envs - in combination with storage nodes you may achieve much better deadlines for tasks like gaming servers and so on, but I have not tried it. On Sun, May 25, 2014 at 1:03 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, has anybody ever tried to use a low latency kernel for ceph? Does it make any difference? Greets, Stefan
Re: [ceph-users] sparse copy between pools
On 05/14/2014 02:13 PM, Erwin Lubbers wrote: Hi, I'm trying to copy a sparse provisioned rbd image from pool A to pool B (both are replicated three times). The image has a disk size of 8 GB and contains around 1.4 GB of data. I use: rbd cp PoolA/Image PoolB/Image After copying, ceph -s tells me that 24 GB of extra disk space is in use. Then I delete the original pool A image and only 8 GB of space is freed. Does Ceph not sparse copy the image using cp? Is there another way to do so? I'm using 0.67.7 dumpling on this cluster. I believe the fix in http://tracker.ceph.com/projects/ceph/repository/revisions/824da2029613a6f4b380b6b2f16a0bd0903f7e3c/diff/src/librbd/internal.cc should have gone into dumpling as a backport; github shows that it did not. Josh, would you mind adding your fix there too? Regards, Erwin
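Until such a backport lands, a commonly suggested workaround is export/import instead of 'rbd cp', since export writes zero runs as holes in the destination file and import skips fully-zero blocks. Whether import preserves sparseness on a given Ceph version is an assumption to verify first; the pool, image, and path names below are illustrative, and the sketch only prints the commands (dry run).

```shell
#!/bin/sh
# Dry-run sketch of a possible sparse-copy workaround for 'rbd cp'.
# ASSUMPTIONS: 'rbd import' on your version skips fully-zero blocks, and
# /tmp has room for the exported file. Names are illustrative.
print_sparse_copy_cmds() {
    echo "rbd export PoolA/Image /tmp/Image.raw"
    echo "rbd import /tmp/Image.raw PoolB/Image"
    echo "rm /tmp/Image.raw"
}
print_sparse_copy_cmds
```

Checking actual space usage with 'ceph df' (or 'ceph -s') before and after the import is the only reliable way to confirm the copy stayed sparse.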
Re: [ceph-users] Migrate whole clusters
Anyway, replacing the whole set of monitors means downtime for every client, so I doubt the phrase 'no outage' still applies there. On Fri, May 9, 2014 at 9:46 PM, Kyle Bader kyle.ba...@gmail.com wrote: Let's assume a test cluster up and running with real data on it. Which is the best way to migrate everything to a production (and larger) cluster? I'm thinking to add production MONs to the test cluster, after that, add production OSDs to the test cluster, wait for a full rebalance and then start to remove test OSDs and test mons. This should migrate everything with no outage. It's possible and I've done it; this was around the argonaut/bobtail timeframe on a pre-production cluster. If your cluster has a lot of data then it may take a long time or be disruptive, so make sure you've tested that your recovery tunables are suitable for your hardware configuration. -- Kyle
Re: [ceph-users] v0.80 Firefly released
Mike, would you mind to write your experience if you`ll manage to get this flow through first? I hope I`ll be able to conduct some tests related to 0.80 only next week, including maintenance combined with primary pointer relocation - one of most crucial things remaining in Ceph for the production performance. On Wed, May 7, 2014 at 10:18 PM, Mike Dawson mike.daw...@cloudapt.com wrote: On 5/7/2014 11:53 AM, Gregory Farnum wrote: On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi, Sage Weil wrote: * *Primary affinity*: Ceph now has the ability to skew selection of OSDs as the primary copy, which allows the read workload to be cheaply skewed away from parts of the cluster without migrating any data. Can you please elaborate a bit on this one? I found the blueprint [1] but still don't quite understand how it works. Does this only change the crush calculation for reads? i.e writes still go to the usual primary, but reads are distributed across the replicas? If so, does this change the consistency model in any way. It changes the calculation of who becomes the primary, and that primary serves both reads and writes. In slightly more depth: Previously, the primary has always been the first OSD chosen as a member of the PG. For erasure coding, we added the ability to specify a primary independent of the selection ordering. This was part of a broad set of changes to prevent moving the EC shards around between different members of the PG, and means that the primary might be the second OSD in the PG, or the fourth. Once this work existed, we realized that it might be useful in other cases, because primaries get more of the work for their PG (serving all reads, coordinating writes). So we added the ability to specify a primary affinity, which is like the CRUSH weights but only impacts whether you become the primary. So if you have 3 OSDs that each have primary affinity = 1, it will behave as normal. 
If two have primary affinity = 0, the remaining OSD will be the primary. Etc. Is it possible (and/or advisable) to set primary affinity low while backfilling / recovering an OSD in an effort to prevent unnecessary slow reads that could be directed to less busy replicas? I suppose if the cost of setting/unsetting primary affinity is low and clients are starved for reads during backfill/recovery from the osd in question, it could be a win. Perhaps the workflow for maintenance on osd.0 would be something like: - Stop osd.0, do some maintenance on osd.0 - Read primary affinity of osd.0, store it for later - Set primary affinity on osd.0 to 0 - Start osd.0 - Enjoy a better backfill/recovery experience. RBD clients happier. - Reset primary affinity on osd.0 to previous value If the cost of setting primary affinity is low enough, perhaps this strategy could be automated by the ceph daemons. Thanks, Mike Dawson -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
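Mike's proposed maintenance flow for osd.0 can be sketched as a dry-run script using the firefly 'ceph osd primary-affinity' command. The affinity values and the upstart-style start/stop commands are illustrative assumptions; in a real run the previous affinity would be read back from 'ceph osd dump' rather than assumed to be 1, and the script below only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the maintenance workflow proposed above for osd.0.
# ASSUMPTIONS: upstart-style service commands, previous affinity was 1
# (really it should be read from 'ceph osd dump' and restored).
print_maintenance_cmds() {
    echo "stop ceph-osd id=0"                  # take osd.0 down for maintenance
    echo "ceph osd primary-affinity osd.0 0"   # stop acting as primary -> replicas serve reads
    echo "start ceph-osd id=0"                 # rejoin; backfill with fewer client reads
    echo "# ... wait for backfill/recovery to finish ..."
    echo "ceph osd primary-affinity osd.0 1"   # restore the previous affinity
}
print_maintenance_cmds
```

The win depends on the cost of the resulting primary reshuffle being lower than the read latency saved during backfill, which is exactly the open question in the thread.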
[ceph-users] Explicit F2FS support (was: v0.80 Firefly released)
Hello, first of all, congratulations to Inktank and thank you for your awesome work! Although exploiting native f2fs abilities, as with btrfs, sounds awesome for a matter of performance, I wonder when kv db is able to practically give users with 'legacy' file systems ability to conduct CoW operations as fast as on the log-based fs, with small or no performance impact, what`s the primary idea behind introducing interface bounded to the specific filesystem in same time? Of course I believe that f2fs will outperform almost every competitor at its field - non-rotating media operations, but I would be grateful if someone can shed light on this development choice. On Wed, May 7, 2014 at 5:05 AM, Sage Weil s...@inktank.com wrote: We did it! Firefly v0.80 is built and pushed out to the ceph.com repositories. This release will form the basis for our long-term supported release Firefly, v0.80.x. The big new features are support for erasure coding and cache tiering, although a broad range of other features, fixes, and improvements have been made across the code base. Highlights include: * *Erasure coding*: support for a broad range of erasure codes for lower storage overhead and better data durability. * *Cache tiering*: support for creating 'cache pools' that store hot, recently accessed objects with automatic demotion of colder data to a base tier. Typically the cache pool is backed by faster storage devices like SSDs. * *Primary affinity*: Ceph now has the ability to skew selection of OSDs as the primary copy, which allows the read workload to be cheaply skewed away from parts of the cluster without migrating any data. * *Key/value OSD backend* (experimental): An alternative storage backend for Ceph OSD processes that puts all data in a key/value database like leveldb. This provides better performance for workloads dominated by key/value operations (like radosgw bucket indices). 
* *Standalone radosgw* (experimental): The radosgw process can now run in a standalone mode without an apache (or similar) web server or fastcgi. This simplifies deployment and can improve performance. We expect to maintain a series of stable releases based on v0.80 Firefly for as much as a year. In the meantime, development of Ceph continues with the next release, Giant, which will feature work on the CephFS distributed file system, more alternative storage backends (like RocksDB and f2fs), RDMA support, support for pyramid erasure codes, and additional functionality in the block device (RBD) like copy-on-read and multisite mirroring. This release is the culmination of a huge collective effort by about 100 different contributors. Thank you everyone who has helped to make this possible! Upgrade Sequencing -- * If your existing cluster is running a version older than v0.67 Dumpling, please first upgrade to the latest Dumpling release before upgrading to v0.80 Firefly. Please refer to the :ref:`Dumpling upgrade` documentation. * Upgrade daemons in the following order: 1. Monitors 2. OSDs 3. MDSs and/or radosgw If the ceph-mds daemon is restarted first, it will wait until all OSDs have been upgraded before finishing its startup sequence. If the ceph-mon daemons are not restarted prior to the ceph-osd daemons, they will not correctly register their new capabilities with the cluster and new features may not be usable until they are restarted a second time. * Upgrade radosgw daemons together. There is a subtle change in behavior for multipart uploads that prevents a multipart request that was initiated with a new radosgw from being completed by an old radosgw. 
Notable changes since v0.79 --- * ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng) * ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng) * librados: fix inconsistencies in API error values (David Zafman) * librados: fix watch operations with cache pools (Sage Weil) * librados: new snap rollback operation (David Zafman) * mds: fix respawn (John Spray) * mds: misc bugs (Yan, Zheng) * mds: misc multi-mds fixes (Yan, Zheng) * mds: use shared_ptr for requests (Greg Farnum) * mon: fix peer feature checks (Sage Weil) * mon: require 'x' mon caps for auth operations (Joao Luis) * mon: shutdown when removed from mon cluster (Joao Luis) * msgr: fix locking bug in authentication (Josh Durgin) * osd: fix bug in journal replay/restart (Sage Weil) * osd: many many many bug fixes with cache tiering (Samuel Just) * osd: track omap and hit_set objects in pg stats (Samuel Just) * osd: warn if agent cannot enable due to invalid (post-split) stats (Sage Weil) * rados bench: track metadata for multiple runs separately (Guang Yang) * rgw: fixed subuser modify (Yehuda Sadeh) * rpm: fix redhat-lsb dependency (Sage Weil, Alfredo Deza) For the complete release notes, please see:
Re: [ceph-users] RBD on Mac OS X
You can certainly do this using an iSCSI re-export; AFAIK no working native RBD implementation for OS X exists. On Tue, May 6, 2014 at 3:28 PM, Pavel V. Kaygorodov pa...@inasan.ru wrote: Hi! I want to use ceph for Time Machine backups on Mac OS X. Is it possible to map RBD or mount CephFS on a Mac directly, for example, using osxfuse? Or is the only way to do this to set up an intermediate linux server? Pavel.
[ceph-users] Fatigue for XFS
Hello, We are currently exploring an issue which may be related to Ceph itself or to XFS - any help is much appreciated. First, the picture: on a relatively old cluster with two years of uptime and ten months since fs recreation on every OSD, one of the daemons started to flap approximately once per day for a couple of weeks, with no external reason (bandwidth/IOPS/host issues). It looks almost the same every time - the OSD suddenly stops serving requests for a short period, gets kicked out by peer reports, then returns in a couple of seconds. Of course, a small but sensitive number of requests are delayed by 15-30 seconds twice, which is bad for us. The only thing which correlates with this kick is a peak of I/O, not too large - not even consuming all of the underlying disk's utilization - but alone in the cluster and clearly visible. Also there are at least two occurrences *without* a correlated iowait peak. I have two theories - either we are touching some sector on disk which is about to be marked as dead but not displayed in SMART statistics, or (I believe so) some kind of XFS fatigue, which seems more likely in this case, since a near-bad sector should be touched more frequently and the related impact would leave traces in dmesg/SMART, in my experience. I would like to ask if anyone has had a similar experience before, or can suggest poking the existing file system in some way. If no suggestions appear, I'll probably reformat the disk and, if the problem remains after the refill, replace it, but I think less destructive actions can be tried first. XFS is running on 3.10 with almost default create and mount options; the ceph version is the latest cuttlefish (this rack should be upgraded, I know).
Re: [ceph-users] Fatigue for XFS
On Tue, May 6, 2014 at 12:36 AM, Dave Chinner da...@fromorbit.com wrote: On Mon, May 05, 2014 at 11:49:05PM +0400, Andrey Korolyov wrote: Hello, We are currently exploring issue which can be related to Ceph itself or to the XFS - any help is very appreciated. First, the picture: relatively old cluster w/ two years uptime and ten months after fs recreation on every OSD, one of daemons started to flap approximately once per day for couple of weeks, with no external reason (bandwidth/IOPS/host issues). It looks almost the same every time - OSD suddenly stop serving requests for a short period, gets kicked out by peers report, then returns in a couple of seconds. Of course, small but sensitive amount of requests are delayed by 15-30 seconds twice, which is bad for us. The only thing which correlates with this kick is a peak of I/O, not too large, even not consuming all underlying disk utilization, but alone in the cluster and clearly visible. Also there are at least two occasions *without* correlated iowait peak. So, actual numbers and traces are the only thing that tell us what is happening during these events. See here: http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F If it happens at almost the same time every day, then I'd be looking at the crontabs to find what starts up about that time. output of top will also probably tell you what process is running, too. topio might be instructive, and blktrace almost certainly will be I have two versions - we`re touching some sector on disk which is about to be marked as dead but not displayed in SMART statistics or (I Doubt it - SMART doesn't cause OS visible IO dispatch spikes. believe so) some kind of XFS fatigue, which can be more likely in this case, since near-bad sector should be touched more frequently and related impact could leave traces in dmesg/SMART from my experience. 
I I doubt that, too, because XFS doesn't have anything that is triggered on a daily basis inside it. Maybe you've got xfs_fsr set up on a cron job, though... would like to ask if anyone has a simular experience before or can suggest to poke existing file system in some way. If no suggestion appear, I`ll probably reformat disk and, if problem will remain after refill, replace it, but I think less destructive actions can be done before. Yeah, monitoring and determining the process that is issuing the IO is what you need to find first. Cheers, Dave. -- Dave Chinner da...@fromorbit.com Thanks Dave, there are definitely no cron set for specific time (though most of lockups happened in a relatively small interval which correlates with the Ceph snapshot operations). In at least one case no Ceph snapshot operations (including delayed removal) happened and at least two when no I/O peak was observed. We observed and eliminated weird lockups related to the openswitch behavior before - we`re combining storage and compute nodes, so quirks in the OVS datapath caused very interesting and weird system-wide lockups on (supposedly) spinlock, and we see 'pure' Ceph lockups on XFS at time with 3.4-3.7 kernels, all of them was correlated with very high context switch peak. Current issue is seemingly nothing to do with spinlock-like bugs or just a hardware problem, we even rebooted problematic node to check if the memory allocator may stuck at the border of specific NUMA node, with no help, but first reappearance of this bug was delayed by some days then. Disabling lazy allocation via specifying allocsize did nothing too. It may look like I am insisting that it is XFS bug, where Ceph version is more likely to appear because of way more complicated logic and operation behaviour, but persistence on specific node across relaunching of Ceph storage daemon suggests bug relation to the unlucky byte sequence more than anything else. 
If it finally appear as Ceph bug, it`ll ruin our expectations from two-year of close experience with this product and if it is XFS bug, we haven`t see anything like this before, thought we had a pretty collection of XFS-related lockups on the earlier kernels. So, my understanding is that we hitting neither very rare memory allocator bug in case of XFS or age-related Ceph issue, both are very unlikely to exist - but I cannot imagine nothing else. If it helps, I may collect a series of perf events during next appearance or exact iostat output (mine graphics can say that the I/O was not choked completely when peak appeared, that`s all). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replace OSD drive without remove/re-add OSD
On Sat, May 3, 2014 at 4:01 AM, Indra Pramana in...@sg.or.id wrote: Sorry forgot to cc the list. On 3 May 2014 08:00, Indra Pramana in...@sg.or.id wrote: Hi Andrey, I actually wanted to try this (instead of remove and readd OSD) to avoid remapping of PGs to other OSDs and the unnecessary I/O load. Are you saying that doing this will also trigger remapping? I thought it will just do recovery to replace missing PGs as a result of the drive replacement? Thank you. Yes, remapping will take place, though it is a bit counterintuitive and I suspect that the roots are the same as with double data placement recalculation with out + rm procedure. Actually Inktank people may answer the question with more details I suppose. Also I think that preserving of the collections may eliminate remap during such kind of refill, though it is not trivial thing to do and I had not experimented with this. On 2 May 2014 21:02, Andrey Korolyov and...@xdel.ru wrote: On 05/02/2014 03:27 PM, Indra Pramana wrote: Hi, May I know if it's possible to replace an OSD drive without removing / re-adding back the OSD? I want to avoid the time and the excessive I/O load which will happen during the recovery process at the time when: - the OSD is removed; and - the OSD is being put back into the cluster. I read David Zafman's comment on this thread, that we can set noout, take OSD down, replace the drive, and then bring the OSD back up and unset noout. http://www.spinics.net/lists/ceph-users/msg05959.html May I know if it's possible to do this? - ceph osd set noout - sudo stop ceph-osd id=12 - Replace the drive, and once done: - sudo start ceph-osd id=12 - ceph osd unset noout The cluster was built using ceph-deploy, can we just replace a drive like that without zapping and preparing the disk using ceph-deploy? There will be absolutely no quirks except continuous remapping with peering along entire recovery process. If your cluster may meet this well, there is absolutely no problem to go through this flow. 
Otherwise, in longer out+in flow, there are only two short intensive recalculations which can be done at the scheduled time, comparing with peering during remap, which can introduce unnecessary I/O spikes. Looking forward to your reply, thank you. Cheers. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
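The noout drive-replacement flow from Indra's mail can be sketched as a dry-run script. It only prints the commands (the physical drive swap and data restoration are obviously manual), and the OSD id and upstart-style service commands are illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch of the noout drive-replacement flow discussed above:
# set noout, stop the OSD, swap the drive, start the OSD, unset noout.
# ASSUMPTIONS: osd id 12 (as in the thread), upstart-style commands.
print_replace_cmds() {
    echo "ceph osd set noout"
    echo "stop ceph-osd id=12"
    echo "# ... physically replace the drive and restore the OSD's data dir ..."
    echo "start ceph-osd id=12"
    echo "ceph osd unset noout"
}
print_replace_cmds
```

As Andrey notes, this trades two short intensive recalculations (the out+in flow) for continuous remapping with peering over the whole recovery, so the choice depends on which I/O pattern the cluster tolerates better.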
Re: [ceph-users] Unable to bring cluster up
Gandalf, regarding this one and the previous mail about memory consumption - there are too many PGs, which is why memory consumption is as high as you are observing. The dead loop of osd-never-goes-up is probably because of the suicide timeout of internal queues. It may not be good, but it is expected. OSD behaviour ultimately depends on all kinds of knobs you can change; e.g. I recently found a funny issue where collection warm-up (bringing collections into the RAM cache) actually slows down OSD rejoin (typical post-peering I/O delays), compared with the regular situation where collections are read from disk upon OSD launch. On Tue, Apr 29, 2014 at 9:22 PM, Gregory Farnum g...@inktank.com wrote: You'll need to go look at the individual OSDs to determine why they aren't on. All the cluster knows is that the OSDs aren't communicating properly. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Apr 29, 2014 at 3:06 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: After a simple service ceph restart on a server, I'm unable to get my cluster up again http://pastebin.com/raw.php?i=Wsmfik2M suddenly, some OSDs go UP and DOWN randomly. I don't see any network traffic on the cluster interface. How can I detect what ceph is doing? From the posted output there is no way to detect if ceph is recovering or not. Showing just a bunch of increasing/decreasing numbers doesn't help. I can see this: 2014-04-29 12:03:49.013808 mon.0 [INF] pgmap v1047121: 98432 pgs: 241 inactive, 33138 peering, 25 remapped, 60067 down+peering, 3489 remapped+peering, 1472 down+remapped+peering; 66261 bytes data, 1647 MB used, 5582 GB / 5583 GB avail so what, is it recovering? Is it sleeping? Why is it not recovering? http://pastebin.com/raw.php?i=2EdugwQa why are all OSDs from host osd12 and osd13 down? Both hosts are up and running.
Re: [ceph-users] OOM-Killer for ceph-osd
For the record, ``rados df'' will give an object count. Would you mind to send out your ceph.conf? I cannot imagine what parameter may raise memory consumption so dramatically, so config at a glance may reveal some detail. Also core dump should be extremely useful (though it`s better to pass the flag to Inktank there). On Mon, Apr 28, 2014 at 1:14 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: I don't know how to count objects but its a test cluster, i have not more than 50 small files 2014-04-27 22:33 GMT+02:00 Andrey Korolyov and...@xdel.ru: What # of objects do you have? After all, such large footprint can be just a bug in your build if you do not have ultimate high object count(~1e8) or any extraordinary configuration parameter. On Mon, Apr 28, 2014 at 12:26 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: So, are you suggesting to lower the pg count ? Actually i'm using the suggested number of OSD*100/Replicas and I have just 2 OSDs per server. 2014-04-24 19:34 GMT+02:00 Andrey Korolyov and...@xdel.ru: On 04/24/2014 08:14 PM, Gandalf Corvotempesta wrote: During a recovery, I'm hitting oom-killer for ceph-osd because it's using more than 90% of avaialble ram (8GB) How can I decrease the memory footprint during a recovery ? You can reduce pg count per OSD for example, it scales down well enough. OSD memory footprint (during recovery or normal operations) depends of number of objects, e.g. commited data and total count of PGs per OSD. Because deleting some data is not an option, I may suggest only one remaining way :) I had raised related question a long ago, it was about post-recovery memory footprint patterns - OSD shrinks memory usage after successful recovery in a relatively long period, up to some days and by couple of fast 'leaps'. Heap has nothing to do with this bug I had not profiled the daemon itself yet. 
Re: [ceph-users] OOM-Killer for ceph-osd
What # of objects do you have? After all, such a large footprint can be just a bug in your build if you do not have an extremely high object count (~1e8) or any extraordinary configuration parameter. On Mon, Apr 28, 2014 at 12:26 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: So, are you suggesting to lower the pg count? Actually I'm using the suggested number, OSDs*100/replicas, and I have just 2 OSDs per server. 2014-04-24 19:34 GMT+02:00 Andrey Korolyov and...@xdel.ru: On 04/24/2014 08:14 PM, Gandalf Corvotempesta wrote: During a recovery, I'm hitting the oom-killer for ceph-osd because it's using more than 90% of available RAM (8GB). How can I decrease the memory footprint during a recovery? You can reduce the PG count per OSD, for example; it scales down well enough. OSD memory footprint (during recovery or normal operations) depends on the number of objects (i.e. committed data) and the total count of PGs per OSD. Since deleting some data is not an option, I may suggest only one remaining way :) I raised a related question long ago, about post-recovery memory footprint patterns - the OSD shrinks memory usage after a successful recovery over a relatively long period, up to some days, in a couple of fast 'leaps'. The heap seems to have nothing to do with this; I have not profiled the daemon itself yet.
Re: [ceph-users] OOM-Killer for ceph-osd
On Mon, Apr 28, 2014 at 1:26 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: 2014-04-27 23:20 GMT+02:00 Andrey Korolyov and...@xdel.ru: For the record, ``rados df'' will give an object count. Would you mind sending out your ceph.conf? I cannot imagine what parameter may raise memory consumption so dramatically, so the config at a glance may reveal some detail. Also a core dump should be extremely useful (though it's better to pass that to Inktank). http://pastie.org/pastes/9118130/text?key=vsjj5g4ybetbxu7swflyvq From the config below, I've removed the single mon definition to hide IPs and hostnames from posting to the ML. [global] auth cluster required = cephx auth service required = cephx auth client required = cephx fsid = 6b9916f9-c209-4f53-98c6-581adcdf0955 osd pool default pg num = 8192 osd pool default pgp num = 8192 osd pool default size = 3 [mon] mon osd down out interval = 600 mon osd mon down reporters = 7 [osd] osd mkfs type = xfs osd journal size = 16384 osd mon heartbeat interval = 30 # Performance tuning filestore merge threshold = 40 filestore split multiple = 8 osd op threads = 8 # Recovery tuning osd recovery max active = 5 osd max backfills = 2 osd recovery op priority = 2 Nothing looks wrong, except the heartbeat interval, which probably should be smaller due to recovery considerations. Try ``ceph osd tell X heap release'' and if it does not change memory consumption, file a bug.
Re: [ceph-users] OOM-Killer for ceph-osd
On 04/24/2014 08:14 PM, Gandalf Corvotempesta wrote: During a recovery, I'm hitting the oom-killer for ceph-osd because it's using more than 90% of available RAM (8GB). How can I decrease the memory footprint during a recovery? You can reduce the PG count per OSD, for example; it scales down well enough. OSD memory footprint (during recovery or normal operations) depends on the number of objects (i.e. committed data) and the total count of PGs per OSD. Since deleting some data is not an option, I may suggest only one remaining way :) I raised a related question long ago, about post-recovery memory footprint patterns - the OSD shrinks memory usage after a successful recovery over a relatively long period, up to some days, in a couple of fast 'leaps'. The heap seems to have nothing to do with this; I have not profiled the daemon itself yet.
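The rule of thumb quoted in this thread (OSDs × 100 / replicas, rounded up to a power of two) can be sketched as a small helper. This is only an illustrative calculation of the heuristic mentioned above, not an official Ceph tool:

```python
def suggested_pg_count(num_osds, replicas, per_osd_target=100):
    """Rule-of-thumb PG count: OSDs * target / replicas,
    rounded up to the next power of two (illustrative heuristic)."""
    raw = num_osds * per_osd_target // replicas
    power = 1
    while power < raw:
        power *= 2
    return power
```

For example, 100 OSDs with 3 replicas suggests 4096 PGs; scaling the PG count down per this formula is the lever discussed above for reducing OSD memory footprint.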
Re: [ceph-users] Largest Production Ceph Cluster
On 04/01/2014 03:51 PM, Robert Sander wrote: On 01.04.2014 13:38, Karol Kozubal wrote: I am curious to know what is the largest known ceph production deployment? I would assume it is the CERN installation. Have a look at the slides from Frankfurt Ceph Day: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern Regards Just curious how the CERN folks built the network topology to prevent possible cluster splits, because a split in the middle would cause huge downtime: even a relatively short split is enough for the remaining MON majority to mark half of those 1k OSDs as down.
Re: [ceph-users] Convert version 1 RBD to version 2?
On 03/25/2014 02:08 PM, Graeme Lambert wrote: Hi Stuart, If this helps, these three lines will do it for you. I'm sure you could rustle up a script to go through all of your images and do this for you. rbd export libvirt-pool/my-server - | rbd import --image-format 2 - libvirt-pool/my-server2 rbd rm libvirt-pool/my-server rbd mv libvirt-pool/my-server2 libvirt-pool/my-server Best regards Graeme This will actually commit all the bytes. If one prefers to keep discarded regions sparse, it is necessary to stage the copy on a filesystem (or implement a chunk-reader pipe for the rbd client). On 24/03/14 07:29, Stuart Longland wrote: Hi all, This might be a dumb question, but I'll ask anyway as I don't see it answered. I have a stack of RBD images in the default v1 format. I'd like to use COW-cloning on these. How does one convert them to version 2 format? Is there a tool to do the conversion or do I need to export each one and re-import them? Regards,
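The three commands above can be wrapped in a loop over all images in a pool. A minimal dry-run sketch that just generates the command lines for one image (the pool/image names are hypothetical; it only reproduces the export | import / rm / mv sequence quoted in the message):

```python
def v1_to_v2_commands(pool, image):
    """Generate the shell commands to convert one format-1 RBD image
    to format 2, following the sequence from the message above."""
    tmp = image + "2"  # hypothetical temporary image name
    return [
        f"rbd export {pool}/{image} - | rbd import --image-format 2 - {pool}/{tmp}",
        f"rbd rm {pool}/{image}",
        f"rbd mv {pool}/{tmp} {pool}/{image}",
    ]
```

Printing these for every image from `rbd ls <pool>` (and running them only after review) gives the script Graeme describes.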
Re: [ceph-users] OSD down after PG increase
On 03/13/2014 02:08 AM, Gandalf Corvotempesta wrote: I've increased the PG number on a running cluster. After this operation, all OSDs from one node were marked as down. Now, after a while, I'm seeing that the OSDs are slowly coming up again (sequentially) after rebalancing. Is this an expected behaviour? Hello Gandalf, Yes, if you have an essentially high amount of committed data in the cluster and/or a large number of PGs (tens of thousands). If you have room to experiment with this transition from scratch, you may want to play with the numbers in the OSD queues, since they cause deadlock-like behaviour on operations like increasing the PG count or large pool deletion. If the cluster has no I/O at all at the moment, such behaviour is definitely not expected.
Re: [ceph-users] Impact of disk encryption and recommendations?
Hello, On Mon, Mar 3, 2014 at 2:41 PM, Pieter Koorts pie...@heisenbug.io wrote: Hi Does disk encryption have a major impact on performance for a busy(ish) cluster? What are the thoughts on having encryption enabled for all disks by default? Encryption means stricter requirements for handling a power failure, because container contents may be lost entirely as easily as a regular filesystem may get corrupted by the same event. Therefore an enforced sync policy, along with additional CPU resource consumption and (very possibly) a battery requirement for the disk controller, describes all the difference. - Pieter
Re: [ceph-users] Impact of disk encryption and recommendations?
On 03/03/2014 06:55 PM, Sage Weil wrote: On Mon, 3 Mar 2014, Andrey Korolyov wrote: Hello, On Mon, Mar 3, 2014 at 2:41 PM, Pieter Koorts pie...@heisenbug.io wrote: Hi Does disk encryption have a major impact on performance for a busy(ish) cluster? What are the thoughts on having encryption enabled for all disks by default? Encryption means stricter requirements for handling a power failure, because container contents may be lost entirely as easily as a regular filesystem may get corrupted by the same event. Therefore an enforced sync policy, along with additional CPU resource consumption and (very possibly) a battery requirement for the disk controller, describes all the difference. Hi Andrey, You're talking about dm-crypt, right? How does that affect safety? I assumed that it passes IOs directly up and down the stack without reordering or buffering. Hi, Yes, my point is primarily about dm-crypt containers; the HDD cache matters there. Or, in the worst case, about any FS with default mount options when the volume is laid on it as a regular file (though nobody will do this for Ceph because of the related performance impact). My experience reflects actual results from 'benchmarking' this three years ago, but I don't think much has changed there. It may be interesting to collect fault statistics over different block sizes for crypto containers versus raw storage with default device cache settings.
Re: [ceph-users] Ceph RBD, VMs, btrfs, COW, OSD journals, f2fs, SSDs
Hello, Right now, none of the filesystems whose CoW features can be used by Ceph (btrfs, and zfs in the near future) is recommended for production use, and CoW makes sense only for the filestore mount point, not for the journal. I doubt there can be any performance advantage in using f2fs for the journal over a raw device target, but it is worth comparing against ext4. F2fs should be a great advantage for pure SSD-based storage and for the upcoming kv filestore, compared to the existing best practice of XFS. Probably f2fs will be able to reduce the wear-out factor compared to a regular discard(), but it would take really long to compare properly; I'll be happy if someone is able to make such a comparison. So, if you want to get rid of the locking behaviour when operating on huge snapshots, you may try btrfs, but it barely fits in any near-production environment. XFS is far more stable, but an unfortunate (very huge) snapshot deletion may tear down pool I/O latencies for a very long period. On Mon, Mar 3, 2014 at 1:19 AM, Joshua Dotson j...@wrale.com wrote: Hello, If I'm storing large VM images on Ceph RBD, and I have OSD journals on SSD, should I _not_ be using a copy-on-write file system on the OSDs? I read that large VM images don't play well with COW (e.g. btrfs) [1]. Does Ceph improve this situation? Would btrfs outperform non-COW filesystems in this setting? Also, I'm considering placing my OSD journals on f2fs-formatted partitions on my Samsung SSDs for hardware resiliency (Samsung created both my SSDs and f2fs) [2]. F2FS uses copy on write [3]. Has anyone ever tried this? Thoughts?
[1] https://wiki.archlinux.org/index.php/Btrfs#Copy-On-Write_.28CoW.29 [2] https://www.usenix.org/legacy/event/fast12/tech/full_papers/Min.pdf [3] http://www.dslreports.com/forum/r27846667- Thanks, Joshua
Re: [ceph-users] Constant slow / blocked requests with otherwise healthy cluster
Hey, What number do you have for a replication factor? If it is three, 1.5k IOPS may be a little high for 36 disks, and your OSD ids look a bit suspicious - there should not be 60+ OSDs based on the numbers below. On 11/28/2013 12:45 AM, Oliver Schulz wrote: Dear Ceph Experts, our Ceph cluster suddenly went into a state of OSDs constantly having blocked or slow requests, rendering the cluster unusable. This happened during normal use; there were no updates, etc. All disks seem to be healthy (smartctl, iostat, etc.). A complete hardware reboot including a system update on all nodes has not helped. The network equipment also shows no trouble. We'd be glad for any advice on how to diagnose and solve this, as the cluster is basically at a standstill and we urgently need to get it back into operation. Cluster structure: 6 nodes, 6x 3TB disks plus 1x system/journal SSD per node, one OSD per disk. We're running ceph version 0.67.4-1precise on Ubuntu 12.04.3 with kernel 3.8.0-33-generic (x86_64). ceph status shows something like (it varies): cluster 899509fe-afe4-42f4-a555-bb044ca0f52d health HEALTH_WARN 77 requests are blocked 32 sec monmap e1: 3 mons at {a=134.107.24.179:6789/0,b=134.107.24.181:6789/0,c=134.107.24.183:6789/0}, election epoch 312, quorum 0,1,2 a,b,c osdmap e32600: 36 osds: 36 up, 36 in pgmap v16404527: 14304 pgs: 14304 active+clean; 20153 GB data, 60630 GB used, 39923 GB / 100553 GB avail; 1506KB/s rd, 21246B/s wr, 545op/s mdsmap e478: 1/1/1 up {0=c=up:active}, 1 up:standby-replay ceph health detail shows something like (it varies): HEALTH_WARN 363 requests are blocked 32 sec; 22 osds have slow requests 363 ops are blocked 32.768 sec 1 ops are blocked 32.768 sec on osd.0 8 ops are blocked 32.768 sec on osd.3 37 ops are blocked 32.768 sec on osd.12 [...]
11 ops are blocked 32.768 sec on osd.62 45 ops are blocked 32.768 sec on osd.65 22 osds have slow requests The number and identity of affected OSDs constantly changes (sometimes health even goes to OK for a moment). Cheers and thanks for any ideas, Oliver
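When the set of affected OSDs keeps changing like this, it helps to aggregate the `ceph health detail` output over time. A rough parsing sketch, with the line format taken from the output quoted above:

```python
import re

# matches lines like "37 ops are blocked 32.768 sec on osd.12"
BLOCKED_RE = re.compile(r"(\d+) ops are blocked [\d.]+ sec on (osd\.\d+)")

def blocked_ops_per_osd(health_detail):
    """Sum blocked-op counts per OSD from `ceph health detail` output."""
    totals = {}
    for count, osd in BLOCKED_RE.findall(health_detail):
        totals[osd] = totals.get(osd, 0) + int(count)
    return totals
```

Feeding it successive snapshots and merging the dictionaries would show whether the slowness really wanders across all OSDs or concentrates on a few.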
Re: [ceph-users] Placement groups on a 216 OSD cluster with multiple pools
Of course, but it means that in case of a failure you can no longer trust your data consistency and should recheck it against separately stored checksums or so. I'm leaving aside the fact that Ceph will probably not recover a pool properly with a replication number lower than 2 in many cases. So generally yes, one may use no replication, but it does not make sense: for small amounts of data the savings are very small, and for larger data the cost of recheck/re-upload will be higher than the cost of permanent additional storage. On 11/15/2013 02:27 AM, Nigel Williams wrote: On 15/11/2013 8:57 AM, Dane Elwell wrote: [2] - I realise the dangers/stupidity of a replica size of 0, but some of the data we wish to store just isn't /that/ important. We've been thinking of this too. The application is storing boot images, ISOs, local repository mirrors etc., where recovery is easy, with a slight inconvenience if the data has to be re-fetched. This suggests a neat additional feature for Ceph would be the ability to have metadata attached to zero-replica objects that includes a URL where a copy could be recovered/re-fetched. Then it could all happen auto-magically. We also have users trampolining data between systems in order to buffer fast data streams or handle data surges. This can be zero-replica too.
[ceph-users] Recovery took too long on cuttlefish
Hello, Using 5c65e1ee3932a021cfd900a74cdc1d43b9103f0f with a large amount of committed data and a relatively low PG count, I've observed unexplainably long recovery times for PGs even when the degraded object count is almost zero: 04:44:42.521896 mon.0 [INF] pgmap v24807947: 2048 pgs: 911 active+clean, 1131 active+recovery_wait, 6 active+recovering; 5389 GB data, 16455 GB used, 87692 GB / 101 TB avail; 5839KB/s rd, 2986KB/s wr, 567op/s; 865/4162926 degraded (0.021%); recovering 2 o/s, 9251KB/s At this moment we have a freshly restarted cluster and a large number of PGs in recovery_wait state; after a couple of minutes the picture changes a little: 2013-11-13 05:30:18.020093 mon.0 [INF] pgmap v24809483: 2048 pgs: 939 active+clean, 1105 active+recovery_wait, 4 active+recovering; 5394 GB data, 16472 GB used, 87676 GB / 101 TB avail; 1627KB/s rd, 3866KB/s wr, 1499op/s; 2456/4167201 degraded (0.059%) and after a couple of hours we reach a peak in degraded objects, with PGs still moving to active+clean: 2013-11-13 10:05:36.245917 mon.0 [INF] pgmap v24816326: 2048 pgs: 1191 active+clean, 854 active+recovery_wait, 3 active+recovering; 5467 GB data, 16690 GB used, 87457 GB / 101 TB avail; 16339KB/s rd, 18006KB/s wr, 16025op/s; 23495/4223061 degraded (0.556%) After the peak passed, the degraded object count starts to decrease, and it seems the cluster will reach a completely clean state in the next ten hours.
For comparison, with a PG count ten times higher, recovery goes much faster: 2013-11-05 03:20:56.330767 mon.0 [INF] pgmap v24143721: 27648 pgs: 26171 active+clean, 1474 active+recovery_wait, 3 active+recovering; 7855 GB data, 25609 GB used, 78538 GB / 101 TB avail; 3298KB/s rd, 7746KB/s wr, 3581op/s; 183/6554634 degraded (0.003%) 2013-11-05 04:04:55.779345 mon.0 [INF] pgmap v24145291: 27648 pgs: 27646 active+clean, 1 active+recovery_wait, 1 active+recovering; 7857 GB data, 25615 GB used, 78533 GB / 101 TB avail; 999KB/s rd, 690KB/s wr, 563op/s Recovery and backfill settings were the same during all tests: osd_max_backfills: 1, osd_backfill_full_ratio: 0.85, osd_backfill_retry_interval: 10, osd_recovery_threads: 1, osd_recover_clone_overlap: true, osd_backfill_scan_min: 64, osd_backfill_scan_max: 512, osd_recovery_thread_timeout: 30, osd_recovery_delay_start: 300, osd_recovery_max_active: 5, osd_recovery_max_chunk: 8388608, osd_recovery_forget_lost_objects: false, osd_kill_backfill_at: 0, osd_debug_skip_full_check_in_backfill_reservation: false, osd_recovery_op_priority: 10 Also during recovery some heartbeats may be missed; this is not related to the current situation but has been observed for a very long time (for now, four-second delays between heartbeats seem distributed almost randomly over the time flow): 2013-11-13 14:57:11.316459 mon.0 [INF] pgmap v24826822: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 16098KB/s rd, 4085KB/s wr, 623op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:12.328538 mon.0 [INF] pgmap v24826823: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 3806KB/s rd, 3446KB/s wr, 284op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:13.336618 mon.0 [INF] pgmap v24826824: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB
avail; 11051KB/s rd, 12171KB/s wr, 1470op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:16.317271 mon.0 [INF] pgmap v24826825: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 3610KB/s rd, 3171KB/s wr, 1820op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:17.366554 mon.0 [INF] pgmap v24826826: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 11323KB/s rd, 1759KB/s wr, 13195op/s; 15670/4227330 degraded (0.371%) 2013-11-13 14:57:18.379340 mon.0 [INF] pgmap v24826827: 2048 pgs: 1513 active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB data, 16708 GB used, 87440 GB / 101 TB avail; 38113KB/s rd, 7461KB/s wr, 46511op/s; 15670/4227330 degraded (0.371%)
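The degraded ratio printed in these pgmap lines (e.g. "15670/4227330 degraded (0.371%)") is just the degraded/total object fraction; a quick sanity check of the numbers quoted above:

```python
def degraded_percent(degraded, total):
    """Percentage of degraded objects, as printed in the pgmap status line."""
    return 100.0 * degraded / total
```

This confirms the monitor's own arithmetic, e.g. 865 of 4162926 objects is the 0.021% shown in the first pgmap line of this message.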
Re: [ceph-users] ceph-osd and btrfs results in high disk usage
I observed the same behaviour (higher disk resource consumption than xfs) before, but I wonder how it's possible to get 5K IOPS on a regular (even cache-backed) device. On Sat, Nov 9, 2013 at 7:55 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi, I've deployed two OSDs with btrfs but I'm seeing really crazy disk / fs usage. The xfs ones have a constant 10-20MB/s; the btrfs one has a constant 100MB/s with 5000 iop/s. And it's not btrfs, it's the ceph-osd doing this amount of I/O; at least iotop shows the ceph-osd writing this massive amount of data. Is this correct? Has anybody seen this before? Greets, Stefan
Re: [ceph-users] Ceph Block Storage QoS
On Thu, Nov 7, 2013 at 11:50 PM, Wido den Hollander w...@42on.com wrote: On 11/07/2013 08:42 PM, Gruher, Joseph R wrote: Is there any plan to implement some kind of QoS in Ceph? Say I want to provide service level assurance to my OpenStack VMs and I might have to throttle bandwidth to some to provide adequate bandwidth to others - is anything like that planned for Ceph? Generally with regard to block storage (RBDs), not object or filesystem. Or is there already a better way to do this elsewhere in the OpenStack cloud? I don't know if OpenStack supports it, but in CloudStack we recently implemented the I/O throttling mechanism of Qemu via libvirt. That might be a solution if OpenStack implements it as well? Just a side note - current QEMU implements gentler throttling than the rest of the versions, and it is a very useful thing for handling NBD I/O bursts. Thanks, Joe -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] Disk Density Considerations
On Wed, Nov 6, 2013 at 4:15 PM, Darren Birkett darren.birk...@gmail.com wrote: Hi, I understand from various reading and research that there are a number of things to consider when deciding how many disks one wants to put into a single chassis: 1. Higher density means a larger failure domain (more data to re-replicate if you lose a node) 2. More disks means more CPU/memory horsepower to handle the number of OSDs 3. The network becomes a bottleneck with too many OSDs per node 4. ... We are looking at building high density nodes for small scale 'starter' deployments for our customers (maybe 4 or 5 nodes). High density in this case could mean a 2U chassis with 2x external 45-disk JBOD enclosures attached. That's 90 3TB disks/OSDs to be managed by a single node. That's about 243TB of potential usable space, and so (assuming up to 75% fillage) maybe 182TB of potential data 'loss' in the event of a node failure. On an uncongested, unused, 10Gbps network, my back-of-a-beer-mat calculations say that would take about 45 hours to get the cluster back into an undegraded state - that is, the requisite number of copies of all objects. For such a large number of disks you should consider that cache amortization will not take place even if you are using 1GB controller(s) - only a tiered cache can be an option. Also recovery will take much more time even if you leave room for client I/O in the calculations, because raw disks have very limited IOPS capacity, so recovery will either take much longer than such expectations at a glance or affect regular operations. For S3/Swift it may be acceptable, but for VM images it is not. Assuming that you can shove in a pair of hex-core hyperthreaded processors, you're probably OK with number 2. If you're already considering 10GbE networking for the storage network, there's probably not much you can do about 3 unless you want to spend a lot more money (and the reason we're going so dense is to keep this as a cheap option).
So the main thing would seem to be a real fear of 'losing' so much data in the event of a node failure. Who wants to wait 45 hours (probably much longer, assuming the cluster remains live and has production traffic traversing that network) for the cluster to self-heal? But surely this fear is based on the assumption that in that time you've not identified and replaced the failed chassis. That you would sit for 2-3 days and just leave the cluster to catch up, and not actually address the broken node. Given good data centre processes and a good stock of spare parts, isn't it more likely that you'd have replaced that node and got things back up and running in a matter of hours? In all likelihood, a node crash/failure is not likely to have taken out all, or maybe any, of the disks, and a new chassis can just have the JBODs plugged back in and away you go? I'm sure I'm missing some other pieces, but if you're comfortable with your hardware replacement processes, doesn't number 1 become a non-fear really? I understand that in some ways it goes against the concept of Ceph being self-healing, and that in an ideal world you'd have lots of lower density nodes to limit your failure domain, but when being driven by cost isn't this an OK way to look at things? What other glaringly obvious considerations am I missing with this approach? Darren
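The back-of-a-beer-mat estimate in this thread (182 TB over an uncongested 10 Gbps link taking roughly 45 hours) is easy to reproduce. The numbers here are illustrative only; real recovery is slower because the link is shared with client I/O and the disks themselves limit throughput, as noted above:

```python
def rereplication_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours to re-replicate data_tb terabytes over a link_gbps gigabit/s link,
    optionally scaled by an efficiency factor (1.0 = the link is fully dedicated)."""
    bytes_total = data_tb * 1e12
    bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return bytes_total / bytes_per_sec / 3600
```

182 TB at a full 10 Gbps comes out to roughly 40 hours, in the same ballpark as the 45-hour figure; dropping the efficiency factor to account for live traffic stretches it to days.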
Re: [ceph-users] Disk Density Considerations
On Wed, Nov 6, 2013 at 6:42 PM, Darren Birkett darren.birk...@gmail.com wrote: On 6 November 2013 14:08, Andrey Korolyov and...@xdel.ru wrote: We are looking at building high density nodes for small scale 'starter' deployments for our customers (maybe 4 or 5 nodes). High density in this case could mean a 2U chassis with 2x external 45-disk JBOD enclosures attached. That's 90 3TB disks/OSDs to be managed by a single node. That's about 243TB of potential usable space, and so (assuming up to 75% fillage) maybe 182TB of potential data 'loss' in the event of a node failure. On an uncongested, unused, 10Gbps network, my back-of-a-beer-mat calculations say that would take about 45 hours to get the cluster back into an undegraded state - that is, the requisite number of copies of all objects. For such a large number of disks you should consider that cache amortization will not take place even if you are using 1GB controller(s) - only a tiered cache can be an option. Also recovery will take much more time even if you leave room for client I/O in the calculations, because raw disks have very limited IOPS capacity, so recovery will either take much longer than such expectations at a glance or affect regular operations. For S3/Swift it may be acceptable, but for VM images it is not. Sure, but my argument was that you are never likely to actually let that entire recovery operation complete - you're going to replace the hardware and plug the disks back in and let them catch up by log replay/backfill. Assuming you don't ever actually expect to really lose all data on 90 disks in one go... By tiered caching, do you mean using something like flashcache or bcache? Exactly - just another step to offload the CPU from I/O time.
[ceph-users] Cuttlefish: pool recreation results in cluster crash
Hello, I was able to reproduce the following on top of current cuttlefish: - create a pool, - delete it after all PGs are initialized, - create a new pool with the same name after, say, ten seconds. All OSDs die immediately with the attached trace. The problem exists in bobtail as well. [Attachment: pool-recreate.txt.gz - GNU Zip compressed data]
Re: [ceph-users] Excessive mon memory usage in cuttlefish 0.61.8
On Thu, Sep 19, 2013 at 8:12 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: On 09/19/2013 04:46 PM, Andrey Korolyov wrote: On Thu, Sep 19, 2013 at 1:00 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: On 09/18/2013 11:25 PM, Andrey Korolyov wrote: Hello, I just restarted one of my mons after a month of uptime; its memory commit was ten times higher than before: 13206 root 10 -10 12.8g 8.8g 107m S65 14.0 0:53.97 ceph-mon A normal one looks like: 30092 root 10 -10 4411m 790m 46m S 1 1.2 1260:28 ceph-mon Try running 'ceph heap stats', followed by 'ceph heap release', and then recheck the memory consumption for the monitor. It had shrunk to 350M RSS overnight, so it seems I need to restart this mon again, or try another one, to reproduce the problem over the next night. This monitor was the leader, so I may check against the other ones and see their peak consumption. Was that monitor attempting to join the quorum? No, it had joined long before. As we discussed in IRC, I restarted a non-leader mon; here are some stats from the freshly started mon process (which joined the quorum two minutes ago): ceph heap stats --keyfile admin -m 10.5.0.17:6789 mon.2 tcmalloc heap stats: MALLOC: 26256488 ( 25.0 MiB) Bytes in use by application MALLOC: + 11240284160 (10719.6 MiB) Bytes in page heap freelist MALLOC: + 3184848 (3.0 MiB) Bytes in central cache freelist MALLOC: + 8974848 (8.6 MiB) Bytes in transfer cache freelist MALLOC: + 15560904 ( 14.8 MiB) Bytes in thread cache freelists MALLOC: + 22114456 ( 21.1 MiB) Bytes in malloc metadata MALLOC: MALLOC: = 11316375704 (10792.1 MiB) Actual memory used (physical + swap) MALLOC: + 90226688 ( 86.0 MiB) Bytes released to OS (aka unmapped) MALLOC: MALLOC: = 11406602392 (10878.2 MiB) Virtual address space used MALLOC: MALLOC: 4140 Spans in use MALLOC: 14 Thread heaps in use MALLOC: 8192 Tcmalloc page size Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
and after calling release: # ceph heap release --keyfile admin -m 10.5.0.17:6789 mon.2 releasing free RAM back to system. # ceph heap stats --keyfile admin -m 10.5.0.17:6789 mon.2 tcmalloc heap stats: MALLOC: 38925536 ( 37.1 MiB) Bytes in use by application MALLOC: + 13508608 ( 12.9 MiB) Bytes in page heap freelist MALLOC: + 2992112 (2.9 MiB) Bytes in central cache freelist MALLOC: + 12092416 ( 11.5 MiB) Bytes in transfer cache freelist MALLOC: + 17547056 ( 16.7 MiB) Bytes in thread cache freelists MALLOC: + 22114456 ( 21.1 MiB) Bytes in malloc metadata MALLOC: MALLOC: = 107180184 ( 102.2 MiB) Actual memory used (physical + swap) MALLOC: + 11299422208 (10776.0 MiB) Bytes released to OS (aka unmapped) MALLOC: MALLOC: = 11406602392 (10878.2 MiB) Virtual address space used MALLOC: MALLOC: 4678 Spans in use MALLOC: 14 Thread heaps in use MALLOC: 8192 Tcmalloc page size Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). -Joao -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
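In the tcmalloc output above, the telling number is "Bytes in page heap freelist" - memory tcmalloc holds but the application no longer uses, which is exactly what `heap release` hands back to the OS. A small parser sketch for that stats format (format copied from the output quoted above):

```python
import re

def page_heap_freelist_bytes(heap_stats):
    """Extract the 'Bytes in page heap freelist' value from
    tcmalloc heap stats output, or None if the line is absent."""
    m = re.search(
        r"MALLOC:\s*\+?\s*(\d+)\s*\([^)]*\)\s*Bytes in page heap freelist",
        heap_stats,
    )
    return int(m.group(1)) if m else None
```

Comparing this value before and after `heap release` (~10.7 GiB vs ~13 MiB in the quoted run) shows the "leak" was freelist retention rather than live allocations.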
Re: [ceph-users] Excessive mon memory usage in cuttlefish 0.61.8
On Thu, Sep 19, 2013 at 1:00 PM, Joao Eduardo Luis joao.l...@inktank.com wrote: On 09/18/2013 11:25 PM, Andrey Korolyov wrote: Hello, I just restarted one of my mons after a month of uptime, and its memory commit rose to ten times higher than before: 13206 root 10 -10 12.8g 8.8g 107m S 65 14.0 0:53.97 ceph-mon. A normal one looks like: 30092 root 10 -10 4411m 790m 46m S 1 1.2 1260:28 ceph-mon. Try running 'ceph heap stats', followed by 'ceph heap release', and then recheck the memory consumption for the monitor. It had shrunk to 350M RSS overnight, so it seems I need to restart this mon again, or try another one, to reproduce the problem over the next night. This monitor was the leader, so I may check against the other ones and see their peak consumption. The monstore has a similar size, about 15G per monitor, so the only real problem is the very unusual and terrifying memory consumption. It is also possible that the remaining mons are running 0.61.7; the binary was updated long ago, so it's hard to tell which version is running without doing a dump and disrupting the quorum for a little while. Anyway, I should tame the current one. How big is the cluster? 15GB for the monitor store may not be that surprising if you have a bunch of OSDs and they're not completely healthy, as that will prevent the removal of old maps on the monitor side. ~8.5T committed and 2.5M objects, but the cluster is completely healthy at the moment, though recently I ran a couple of recovery procedures and that may affect mondir data allocation on day-long intervals. This could also be an issue with store compaction. Also, you should check whether the monitors are running 0.61.7 and, if so, update them to 0.61.8. You should be able to get that version using the admin socket if you want to. Just checked, the same 0.61.8.
-Joao -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
[ceph-users] Excessive mon memory usage in cuttlefish 0.61.8
Hello, I just restarted one of my mons after a month of uptime, and its memory commit rose to ten times higher than before: 13206 root 10 -10 12.8g 8.8g 107m S 65 14.0 0:53.97 ceph-mon. A normal one looks like: 30092 root 10 -10 4411m 790m 46m S 1 1.2 1260:28 ceph-mon. The monstore has a similar size, about 15G per monitor, so the only real problem is the very unusual and terrifying memory consumption. It is also possible that the remaining mons are running 0.61.7; the binary was updated long ago, so it's hard to tell which version is running without doing a dump and disrupting the quorum for a little while. Anyway, I should tame the current one.
Re: [ceph-users] Hit suicide timeout on osd start
A little follow-up: one of the cluster nodes (from the not-yet-restarted set) went into some kind of flapping state, exposing CPU consumption peaks and latency spikes every 50 seconds. Even more interesting, when we injected a non-zero debug_ms the latency spikes went away, but the CPU peaks remained. In the picture[0] below, we injected debug_ms 1 (with the log file set to /dev/null) at 19:03 and set it back to 0 at 19:13.
0. http://i.imgur.com/8BBWM7o.png
On Wed, Sep 11, 2013 at 5:05 AM, Andrey Korolyov and...@xdel.ru wrote: Hello, I got the so-famous error on 0.61.8, just from a little disk overload on OSD daemon start. I currently have very large metadata per osd (about 20G); this may be an issue.
#0  0x7f2f46adeb7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00860469 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  signal handler called
#4  0x7f2f44b45405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7f2f44b48b5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7f2f4544389d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7f2f45441996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7f2f454419c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7f2f45441bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0090d2fa in ceph::__ceph_assert_fail (assertion=0xa38ab1 "0 == \"hit suicide timeout\"", file=<optimized out>, line=79, func=0xa38c60 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:77
#11 0x0087914b in ceph::HeartbeatMap::_check (this=this@entry=0x26560e0, h=<optimized out>, who=who@entry=0xa38b40 "is_healthy", now=now@entry=1378860192) at common/HeartbeatMap.cc:79
#12 0x00879956 in ceph::HeartbeatMap::is_healthy (this=this@entry=0x26560e0) at common/HeartbeatMap.cc:130
#13 0x00879f08 in ceph::HeartbeatMap::check_touch_file (this=0x26560e0) at common/HeartbeatMap.cc:141
#14 0x009189f5 in CephContextServiceThread::entry (this=0x2652200) at common/ceph_context.cc:68
#15 0x7f2f46ad6e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x7f2f44c013dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x in ?? ()
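For context on the crash: the assert in frame #10 fires when a worker thread has not touched its heartbeat handle within the suicide grace period, which is exactly what sustained disk overload on OSD start can cause. A rough Python sketch of the check performed in HeartbeatMap::_check (heavily simplified; the class shape and the grace values here are illustrative, not Ceph's actual code or defaults):

```python
import time

class HeartbeatHandle:
    def __init__(self, name, grace, suicide_grace):
        self.name = name
        self.grace = grace                  # warn threshold (seconds)
        self.suicide_grace = suicide_grace  # abort threshold (seconds)
        self.last_touch = time.time()

    def touch(self):
        """Worker thread calls this to prove it is still making progress."""
        self.last_touch = time.time()

def check(handle, now):
    """Return 'healthy'/'unhealthy', or raise past the suicide grace."""
    age = now - handle.last_touch
    if age > handle.suicide_grace:
        # Mirrors ceph's assert(0 == "hit suicide timeout"), which aborts
        raise AssertionError("hit suicide timeout")
    return "unhealthy" if age > handle.grace else "healthy"

h = HeartbeatHandle("osd_op_tp", grace=15, suicide_grace=150)
h.last_touch = time.time() - 20      # thread stalled for 20 seconds
print(check(h, time.time()))         # past grace, below suicide: unhealthy
```

The practical takeaway from the thread is that a slow store (20G of metadata per osd) can keep threads busy long enough to blow past the suicide grace during startup.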
Re: [ceph-users] rbd cp copies of sparse files become fully allocated
On Tue, Sep 10, 2013 at 3:03 AM, Josh Durgin josh.dur...@inktank.com wrote: On 09/09/2013 04:57 AM, Andrey Korolyov wrote: May I also suggest the same for the export/import mechanism? Say, if an image was created by fallocate we may also want to leave holes upon upload, and vice-versa for export. Import and export already omit runs of zeroes. They could detect smaller runs (currently they look at object-size chunks), and export might be more efficient if it used diff_iterate() instead of read_iterate(). Have you observed them misbehaving with sparse images? Did you mean dumpling? When I checked some months ago, cuttlefish did not have such a feature. On Mon, Sep 9, 2013 at 8:45 AM, Sage Weil s...@inktank.com wrote: On Sat, 7 Sep 2013, Oliver Daudey wrote: Hey all, This topic has been partly discussed here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/000799.html Tested on Ceph version 0.67.2. If you create a fresh empty image of, say, 100GB in size on RBD and then use rbd cp to make a copy of it, even though the image is sparse, the command will attempt to read every part of it and take far more time than expected. After reading the above thread, I understand why the copy of an essentially empty sparse image on RBD would take so long, but it doesn't explain why the copy won't be sparse itself. If I use rbd cp to copy an image, the copy will take its full allocated size on disk, even if the original was empty. If I use the QEMU qemu-img tool's convert option to convert the original image to the copy without changing the format, essentially only making a copy, it takes its time as well, but it will be faster than rbd cp and the resulting copy will be sparse. Example commands:
rbd create --size 102400 test1
rbd cp test1 test2
qemu-img convert -p -f rbd -O rbd rbd:rbd/test1 rbd:rbd/test3
Shouldn't rbd cp at least have an option to attempt to sparsify the copy, or copy the sparse parts as sparse? Same goes for rbd clone, BTW. Yep, this is in fact a bug.
Opened http://tracker.ceph.com/issues/6257. Thanks! sage
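Josh's point about import/export omitting runs of zeroes comes down to scanning each chunk before writing it and skipping the all-zero ones, which is what a sparsifying rbd cp would need to do too. A minimal sketch of the idea over a generic chunked copy loop (illustrative only, not rbd's actual implementation):

```python
def copy_sparse(read_chunk, write_chunk, size, chunk_size=4 * 1024 * 1024):
    """Copy 'size' bytes chunk by chunk, skipping all-zero chunks.

    read_chunk(offset, length) -> bytes; write_chunk(offset, data) writes.
    Skipped chunks leave holes in the destination, keeping it sparse.
    Returns the number of bytes actually written.
    """
    zero = bytes(chunk_size)
    written = 0
    for off in range(0, size, chunk_size):
        length = min(chunk_size, size - off)
        data = read_chunk(off, length)
        if data == zero[:length]:
            continue  # hole: allocate nothing in the destination
        write_chunk(off, data)
        written += length
    return written

# Toy backing store: 12 bytes in 4-byte chunks, only the middle chunk has data.
src = bytearray(12)
src[5] = 0xFF
dst = {}
n = copy_sparse(lambda o, l: bytes(src[o:o + l]),
                lambda o, d: dst.__setitem__(o, d),
                len(src), chunk_size=4)
print(n, sorted(dst))  # 4 [4] -> only the one non-zero chunk was written
```

For a fresh empty 100GB image, every chunk compares equal to zeroes, so nothing is written and the copy stays sparse; this is the behavior the thread asks rbd cp to adopt.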
[ceph-users] Hit suicide timeout on osd start
Hello, I got the so-famous error on 0.61.8, just from a little disk overload on OSD daemon start. I currently have very large metadata per osd (about 20G); this may be an issue.
#0  0x7f2f46adeb7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00860469 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  signal handler called
#4  0x7f2f44b45405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7f2f44b48b5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7f2f4544389d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7f2f45441996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7f2f454419c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7f2f45441bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0090d2fa in ceph::__ceph_assert_fail (assertion=0xa38ab1 "0 == \"hit suicide timeout\"", file=<optimized out>, line=79, func=0xa38c60 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:77
#11 0x0087914b in ceph::HeartbeatMap::_check (this=this@entry=0x26560e0, h=<optimized out>, who=who@entry=0xa38b40 "is_healthy", now=now@entry=1378860192) at common/HeartbeatMap.cc:79
#12 0x00879956 in ceph::HeartbeatMap::is_healthy (this=this@entry=0x26560e0) at common/HeartbeatMap.cc:130
#13 0x00879f08 in ceph::HeartbeatMap::check_touch_file (this=0x26560e0) at common/HeartbeatMap.cc:141
#14 0x009189f5 in CephContextServiceThread::entry (this=0x2652200) at common/ceph_context.cc:68
#15 0x7f2f46ad6e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x7f2f44c013dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x in ?? ()
Re: [ceph-users] rbd cp copies of sparse files become fully allocated
May I also suggest the same for the export/import mechanism? Say, if an image was created by fallocate we may also want to leave holes upon upload, and vice-versa for export. On Mon, Sep 9, 2013 at 8:45 AM, Sage Weil s...@inktank.com wrote: On Sat, 7 Sep 2013, Oliver Daudey wrote: Hey all, This topic has been partly discussed here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/000799.html Tested on Ceph version 0.67.2. If you create a fresh empty image of, say, 100GB in size on RBD and then use rbd cp to make a copy of it, even though the image is sparse, the command will attempt to read every part of it and take far more time than expected. After reading the above thread, I understand why the copy of an essentially empty sparse image on RBD would take so long, but it doesn't explain why the copy won't be sparse itself. If I use rbd cp to copy an image, the copy will take its full allocated size on disk, even if the original was empty. If I use the QEMU qemu-img tool's convert option to convert the original image to the copy without changing the format, essentially only making a copy, it takes its time as well, but it will be faster than rbd cp and the resulting copy will be sparse. Example commands:
rbd create --size 102400 test1
rbd cp test1 test2
qemu-img convert -p -f rbd -O rbd rbd:rbd/test1 rbd:rbd/test3
Shouldn't rbd cp at least have an option to attempt to sparsify the copy, or copy the sparse parts as sparse? Same goes for rbd clone, BTW. Yep, this is in fact a bug. Opened http://tracker.ceph.com/issues/6257. Thanks! sage
[ceph-users] Removing osd with zero data causes placement shift
Hello, I had a couple of osds in down+out state and a completely clean cluster, but after issuing ``osd crush remove'' there was some data redistribution: a shift proportional to the osd weight in the crushmap, though roughly half the amount of data movement that 'osd out' causes for an osd of the same weight. This is some sort of non-idempotency kept at least since the bobtail series.
[ceph-users] OSD crash upon pool creation
Hello, Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5 I observed some disaster-like behavior after the ``pool create'' command: every osd daemon in the cluster will die at least once (some will crash several times in a row after being brought back). Please take a look at the backtraces (almost identical) below. Issue #5637 has been created in the tracker. Thanks!
http://xdel.ru/downloads/poolcreate.txt.gz
http://xdel.ru/downloads/poolcreate2.txt.gz
Re: [ceph-users] journal size suggestions
On Wed, Jul 10, 2013 at 3:28 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: Thank you for the response. You are talking of median expected writes, but should I consider the single-disk write speed or the network speed? A single disk is 100MB/s, so 100*30 = 3000MB of journal for each osd? Or should I consider the network speed, which is 1.25GB/s? Why 30 seconds? The default flush frequency is 5 seconds. What do you mean by fine-tuning spinning storage media? Which tuning are you referring to? Since the journal is created on a per-osd basis, you should calculate it with only the disk speed in mind. As I remember, no one referred directly to the flush interval when recommending tens of seconds for this calculation, and neither do I; it's just a safe road to have some capacity over that value. By fine-tuning I meant such things as readahead values, the number of internal XFS partitions, the size of XFS chunks, the hardware controller cache policy (if you have one) and so on. Being honest, filesystem tuning does not affect performance much on general workload types, but it may matter greatly for some specific things, like digits in a benchmark :) . On 09 Jul 2013 at 23:45, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jul 10, 2013 at 1:16 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: Hi, I'm planning a new cluster on a 10GbE network. Each storage node will have a maximum of 12 SATA disks and 2 SSDs as journals. What do you suggest as the journal size for each OSD? Is 5GB enough? Should I just consider the SATA write speed when calculating the journal size, or also the network speed? Hello, As many recommendations have suggested before, you may set the journal size proportional to the amount of median (or peak, if expected) writes multiplied by, say, thirty seconds; that's the safe area, and you should not suffer because of journal size if you follow this calculation. Twelve SATA disks in theory may have enough output to thrash a 10G network, but you'll run out of IOPS long before that almost for sure, and OSD daemons do not work very close to the physical limits when transferring data from/to disk, so fine-tuning of the spinning storage media is still the primary thing to play with in such a configuration.
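The sizing rule discussed above is plain arithmetic; a tiny sketch (an illustrative helper, not an official Ceph formula) using the numbers from this thread, which also shows why sizing for the network instead of the disk inflates the journal pointlessly:

```python
def journal_size_mb(write_rate_mb_s, window_s=30):
    """Journal capacity = sustained write rate x safety window (seconds).

    Per the advice in the thread, write_rate_mb_s should be the disk's
    rate, since the journal is created on a per-osd basis.
    """
    return write_rate_mb_s * window_s

print(journal_size_mb(100))   # 3000 MB for a 100 MB/s SATA disk
print(journal_size_mb(1250))  # 37500 MB if (wrongly) sized for 10GbE
```

With the 5-second default flush interval the 30-second window leaves a generous margin, which matches the "safe road" framing in the reply.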
Re: [ceph-users] journal size suggestions
On Wed, Jul 10, 2013 at 1:16 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: Hi, I'm planning a new cluster on a 10GbE network. Each storage node will have a maximum of 12 SATA disks and 2 SSDs as journals. What do you suggest as the journal size for each OSD? Is 5GB enough? Should I just consider the SATA write speed when calculating the journal size, or also the network speed? Hello, As many recommendations have suggested before, you may set the journal size proportional to the amount of median (or peak, if expected) writes multiplied by, say, thirty seconds; that's the safe area, and you should not suffer because of journal size if you follow this calculation. Twelve SATA disks in theory may have enough output to thrash a 10G network, but you'll run out of IOPS long before that almost for sure, and OSD daemons do not work very close to the physical limits when transferring data from/to disk, so fine-tuning of the spinning storage media is still the primary thing to play with in such a configuration.