Re: [ceph-users] xfs corruption

2016-03-06 Thread Ferhat Ozkasgarli
Ric; you mean a RAID 0 environment, right?

If you use RAID 5 or RAID 10 or some other more complex RAID configuration,
most of the physical disks' abilities (TRIM, discard, etc.) vanish.

Only a handful of hardware RAID cards are able to pass TRIM and discard
commands through to the physical disks, and only if the configuration is RAID 0 or RAID 1.

On Mon, Mar 7, 2016 at 9:21 AM, Ric Wheeler  wrote:

>
>
> It is perfectly reasonable and common to use hardware RAID cards in
> writeback mode under XFS (and under Ceph) if you configure them properly.
>
> The key thing is that with the writeback cache enabled, you need to make sure
> that the S-ATA drives' own write cache is disabled. Also make sure that
> your file system is mounted with "barrier" enabled.
>
> To check the backend write cache state on drives, you often need to use
> RAID card specific tools to query and set them.
>
> Regards,
>
> Ric
>
>
>
>
> On 02/27/2016 07:20 AM, fangchen sun wrote:
>
>>
>> Thank you for your response!
>>
>> All my hosts have RAID cards. Some RAID cards are in pass-through
>> mode, and the others are in write-back mode. I will set all RAID cards to
>> pass-through mode and observe for a period of time.
>>
>>
>> Best Regards
>> sunspot
>>
>>
>> 2016-02-25 20:07 GMT+08:00 Ferhat Ozkasgarli:
>>
>> This has happened to me before, but in a virtual machine environment.
>>
>> The VM was KVM and the storage was RBD. My problem was a bad network
>> cable.
>>
>> You should check the following details:
>>
>> 1-) Do you use any kind of hardware RAID configuration? (RAID 0, 5 or
>> 10)
>>
>> Ceph does not work well on hardware RAID systems. You should use RAID
>> cards in HBA (non-RAID) mode and let the RAID card pass the disks
>> through.
>>
>> 2-) Check your network connections
>>
>> It may seem an obvious solution, but believe me, the network is one of
>> the top-rated culprits in Ceph environments.
>>
>> 3-) If you are using SSD disks, make sure you use a non-RAID
>> configuration.
>>
>>
>>
>> On Tue, Feb 23, 2016 at 10:55 PM, fangchen sun wrote:
>>
>> Dear all:
>>
>> I have a Ceph object storage cluster with 143 OSDs and 7 radosgw
>> instances, and chose XFS as the underlying file system.
>> I recently ran into a problem where sometimes an OSD is marked down
>> when the function "chain_setxattr()" returns -117. I can only
>> unmount the disk and repair it with "xfs_repair".
>>
>> os: centos 6.5
>> kernel version: 2.6.32
>>
>> the log for dmesg command:
>> [41796028.532225] Pid: 1438740, comm: ceph-osd Not tainted
>> 2.6.32-925.431.23.3.letv.el6.x86_64 #1
>> [41796028.532227] Call Trace:
>> [41796028.532255]  [] ?
>> xfs_error_report+0x3f/0x50 [xfs]
>> [41796028.532276]  [] ?
>> xfs_da_read_buf+0x2a/0x30 [xfs]
>> [41796028.532296]  [] ?
>> xfs_corruption_error+0x5e/0x90 [xfs]
>> [41796028.532316]  [] ?
>> xfs_da_do_buf+0x6cc/0x770 [xfs]
>> [41796028.532335]  [] ?
>> xfs_da_read_buf+0x2a/0x30 [xfs]
>> [41796028.532359]  [] ?
>> kmem_zone_alloc+0x77/0xf0 [xfs]
>> [41796028.532380]  [] ?
>> xfs_da_read_buf+0x2a/0x30 [xfs]
>> [41796028.532399]  [] ?
>> xfs_attr_leaf_addname+0x61/0x3d0 [xfs]
>> [41796028.532426]  [] ?
>> xfs_attr_leaf_addname+0x61/0x3d0 [xfs]
>> [41796028.532455]  [] ?
>> xfs_trans_add_item+0x57/0x70
>> [xfs]
>> [41796028.532476]  [] ?
>> xfs_bmbt_get_all+0x18/0x20 [xfs]
>> [41796028.532495]  [] ?
>> xfs_attr_set_int+0x3c4/0x510
>> [xfs]
>> [41796028.532517]  [] ?
>> xfs_da_do_buf+0x6db/0x770 [xfs]
>> [41796028.532536]  [] ? xfs_attr_set+0x81/0x90
>> [xfs]
>> [41796028.532560]  [] ?
>> __xfs_xattr_set+0x43/0x60 [xfs]
>> [41796028.532584]  [] ?
>> xfs_xattr_user_set+0x11/0x20
>> [xfs]
>> [41796028.532592]  [] ?
>> generic_setxattr+0xa2/0xb0
>> [41796028.532596]  [] ?
>> __vfs_setxattr_noperm+0x4e/0x160
>> [41796028.532600]  [] ?
>> inode_permission+0xa7/0x100
>> [41796028.532604]  [] ? vfs_setxattr+0xbc/0xc0
>> [41796028.532607]  [] ? setxattr+0xd0/0x150
>> [41796028.532612]  [] ?
>> __dequeue_entity+0x30/0x50
>> [41796028.532617]  [] ? __switch_to+0x26e/0x320
>> [41796028.532621]  [] ?
>> __sb_start_write+0x80/0x120
>> [41796028.532626]  [] ? thread_return+0x4e/0x760
>> [41796028.532630]  [] ? sys_fsetxattr+0xad/0xd0
>> [41796028.532633]  [] ?
>> system_call_fastpath+0x16/0x1b
>> [41796028.532636] XFS (sdi1): Corruption detected. Unmount and run
>> xfs_repair
>>
>> Any comments will be much appreciated!
>>
>> Best Regards!
>> sunspot
>>
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com

Re: [ceph-users] xfs corruption

2016-03-06 Thread Ric Wheeler



It is perfectly reasonable and common to use hardware RAID cards in writeback 
mode under XFS (and under Ceph) if you configure them properly.


The key thing is that with the writeback cache enabled, you need to make sure that 
the S-ATA drives' own write cache is disabled. Also make sure that your file 
system is mounted with "barrier" enabled.


To check the backend write cache state on drives, you often need to use RAID 
card specific tools to query and set them.
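
For example, on an LSI/MegaRAID based card (assuming the MegaCli utility is
installed; command names differ per vendor), something along these lines can be
used to check and disable the drives' own cache, and to confirm barriers are on:

  MegaCli64 -LDGetProp -DskCache -LAll -aAll     # query the disk cache policy
  MegaCli64 -LDSetProp -DisDskCache -LAll -aAll  # disable the drives' write cache

  # For a directly attached S-ATA drive (no RAID card in the path):
  hdparm -W /dev/sdX        # show the write-cache state
  hdparm -W0 /dev/sdX       # turn the drive write cache off

  # XFS barriers are on by default; just make sure "nobarrier" is not used:
  mount | grep xfs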


Regards,

Ric




On 02/27/2016 07:20 AM, fangchen sun wrote:


Thank you for your response!

All my hosts have RAID cards. Some RAID cards are in pass-through mode, and 
the others are in write-back mode. I will set all RAID cards to pass-through 
mode and observe for a period of time.



Best Regards
sunspot


2016-02-25 20:07 GMT+08:00 Ferhat Ozkasgarli:


This has happened to me before, but in a virtual machine environment.

The VM was KVM and the storage was RBD. My problem was a bad network cable.

You should check the following details:

1-) Do you use any kind of hardware RAID configuration? (RAID 0, 5 or 10)

Ceph does not work well on hardware RAID systems. You should use RAID
cards in HBA (non-RAID) mode and let the RAID card pass the disks through.

2-) Check your network connections

It may seem an obvious solution, but believe me, the network is one of the
top-rated culprits in Ceph environments.

3-) If you are using SSD disks, make sure you use a non-RAID configuration.



On Tue, Feb 23, 2016 at 10:55 PM, fangchen sun <sunspot0...@gmail.com> wrote:

Dear all:

I have a Ceph object storage cluster with 143 OSDs and 7 radosgw instances, and
chose XFS as the underlying file system.
I recently ran into a problem where sometimes an OSD is marked down when
the function "chain_setxattr()" returns -117. I can only
unmount the disk and repair it with "xfs_repair".

os: centos 6.5
kernel version: 2.6.32

the log for dmesg command:
[41796028.532225] Pid: 1438740, comm: ceph-osd Not tainted
2.6.32-925.431.23.3.letv.el6.x86_64 #1
[41796028.532227] Call Trace:
[41796028.532255]  [] ? xfs_error_report+0x3f/0x50 
[xfs]
[41796028.532276]  [] ? xfs_da_read_buf+0x2a/0x30 
[xfs]
[41796028.532296]  [] ?
xfs_corruption_error+0x5e/0x90 [xfs]
[41796028.532316]  [] ? xfs_da_do_buf+0x6cc/0x770 
[xfs]
[41796028.532335]  [] ? xfs_da_read_buf+0x2a/0x30 
[xfs]
[41796028.532359]  [] ? kmem_zone_alloc+0x77/0xf0 
[xfs]
[41796028.532380]  [] ? xfs_da_read_buf+0x2a/0x30 
[xfs]
[41796028.532399]  [] ?
xfs_attr_leaf_addname+0x61/0x3d0 [xfs]
[41796028.532426]  [] ?
xfs_attr_leaf_addname+0x61/0x3d0 [xfs]
[41796028.532455]  [] ? xfs_trans_add_item+0x57/0x70
[xfs]
[41796028.532476]  [] ? xfs_bmbt_get_all+0x18/0x20 
[xfs]
[41796028.532495]  [] ? xfs_attr_set_int+0x3c4/0x510
[xfs]
[41796028.532517]  [] ? xfs_da_do_buf+0x6db/0x770 
[xfs]
[41796028.532536]  [] ? xfs_attr_set+0x81/0x90 [xfs]
[41796028.532560]  [] ? __xfs_xattr_set+0x43/0x60 
[xfs]
[41796028.532584]  [] ? xfs_xattr_user_set+0x11/0x20
[xfs]
[41796028.532592]  [] ? generic_setxattr+0xa2/0xb0
[41796028.532596]  [] ? 
__vfs_setxattr_noperm+0x4e/0x160
[41796028.532600]  [] ? inode_permission+0xa7/0x100
[41796028.532604]  [] ? vfs_setxattr+0xbc/0xc0
[41796028.532607]  [] ? setxattr+0xd0/0x150
[41796028.532612]  [] ? __dequeue_entity+0x30/0x50
[41796028.532617]  [] ? __switch_to+0x26e/0x320
[41796028.532621]  [] ? __sb_start_write+0x80/0x120
[41796028.532626]  [] ? thread_return+0x4e/0x760
[41796028.532630]  [] ? sys_fsetxattr+0xad/0xd0
[41796028.532633]  [] ? system_call_fastpath+0x16/0x1b
[41796028.532636] XFS (sdi1): Corruption detected. Unmount and run
xfs_repair

Any comments will be much appreciated!

Best Regards!
sunspot




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd up_from, up_thru

2016-03-06 Thread min fang
Dear all, I used "osd dump" to extract the osd map, and found up_from and up_thru
in the listing; what is the difference between up_from and up_thru?

osd.0 up   in  weight 1 up_from 673 up_thru 673 down_at 670
last_clean_interval [637,669)

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD latencies

2016-03-06 Thread Christian Balzer

Hello,

On Mon, 7 Mar 2016 00:38:46 + Adrian Saul wrote:

> > >The Samsungs are the 850 2TB
> > > (MZ-75E2T0BW).  Chosen primarily on price.
> >
> > These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5
> > years). Unless you have a read-only cluster, you will wind up spending
> > MORE on replacing them (and/or loosing data when 2 fail at the same
> > time) than going with something more sensible like Samsung's DC models
> > or the Intel DC ones (S3610s come to mind for "normal" use).
> > See also the current "List of SSDs" thread in this ML.
> 
> This was a metric I struggled to find and would have been useful in
> comparison.  I am sourcing prices on the SM863s anyway.  That SSD thread
> has been good to follow as well.
> 
Yeah, they are most likely a better fit, and if they are doing OK with sync
writes you could most likely get away with having their journals on the
same SSD.

> > Fast, reliable, cheap. Pick any 2.
> 
> Yup - unfortunately cheap is fixed, reliable is the reason we are doing
> this; however, fast is now a must-have. The normal
> engineering/management dilemma.
> 
Indeed.

> > On your test setup or even better the Solaris one, have a look at
> > their media wearout, or  Wear_Leveling_Count as Samsung calls it.
> > I bet that makes for some scary reading.
> 
> For the Evos we found no tools we could use on Solaris - also because we
> have cheap nasty SAS interposers in that setup most tools don't work
> anyway.  Until we pull a disk and put it into a windows box we can't do
> any sort of diagnostics on it.  It would be useful to see because we
> have those disks taking a fair brunt of our performance workload now.
> 
smartmontools aka smartctl not working for you, presumably because of the
intermediate SAS shenanigans?
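
(On Linux, assuming smartmontools is available, smartctl can often still reach
drives behind a controller when given an explicit device type, e.g.:

  smartctl -a -d sat /dev/sdX          # SATA drive behind a SAS HBA/interposer
  smartctl -a -d megaraid,5 /dev/sdX   # physical disk 5 behind a MegaRAID card

whether any of that helps on Solaris with those interposers is another question.)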

> > Note that Ceph (RBD/RADOS to be precise) isn't particularly suited for
> > "long" distance replication due to the incurred latencies.
> >
> > That's unless your replication is happening "above" Ceph in the iSCSI
> > bits with something that's more optimized for this.
> >
> > Something along the lines of the DRBD proxy has been suggested for
> > Ceph, but if at all it is a backburner project at best from what I
> > gather.
> 
> We can fairly easily do low latency links (telco) but are looking at the
> architecture to try and limit that sort of long replication - doing
> replication at application and database levels instead.  The site to
> site replication would be limited to some clusters or applications that
> need sync replication for availability.
> 
Yeah, I figured the Telco part, but for our planned DC move I ran
some numbers and definitely want to stay below 10km between them
(Infiniband here).

Note that you can of course create CRUSH rules that will give you either
location replicated or only locally replicated OSDs and thus pools, but it
may be a bit daunting at first.
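
As a very rough sketch (bucket names made up, assuming a CRUSH hierarchy with a
datacenter bucket per site), a locally-replicated rule in the decompiled CRUSH
map could look something like this, and a pool can then be pointed at it:

  # ceph osd getcrushmap -o crush.bin && crushtool -d crush.bin -o crush.txt
  rule dc1-only {
          ruleset 2
          type replicated
          min_size 1
          max_size 10
          step take dc1
          step chooseleaf firstn 0 type host
          step emit
  }
  # crushtool -c crush.txt -o crush.new && ceph osd setcrushmap -i crush.new
  # ceph osd pool set <pool> crush_ruleset 2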

> > There are some ways around this, which may or may not be suitable for
> > your use case.
> > EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
> > Of course this comes at a performance penalty, which you can offset
> > again with for example fast RAID controllers with HW cache to some
> > extent. But it may well turn out to be a zero sum game.
> 
> I modelled an EC setup but that was at a multi site level with local
> cache tiers in front, and it was going to be too big a challenge to do
> as a new untested platform with too many latency questions.  Within a
> site EC was not going to be cost effective, as to do it properly I would need
> to up the number of hosts and that pushed the pricing up too far, even
> if I went with smaller less configured hosts.
> 
Yes, the per-node basic cost can be an issue, but Ceph really likes many
smallish nodes over a few largish ones for the same total size.
 
> I thought about hardware RAID as well, but as I would need to do host
> level redundancy anyway it was not gaining any efficiency - less risk
> but I would still need to replicate anyway so why not just go disk to
> disk.  More than likely I would quietly work in higher protection as we
> go live and deal with it later as a capacity expansion.
> 
The latter sounds like a plan.
For the former consider this simple example:
4 storage nodes, each with 4 RAID6 OSDs, Ceph size=2 and min_size=1,
mon_osd_down_out_subtree_limit = host.

In this scenario you can lose any 2 disks w/o an OSD going down, up to 4
disks w/o data loss, and a whole node as well w/o the cluster stopping.
The mon_osd_down_out_subtree_limit will also stop things from rebalancing
in case of a node crash/reboot, until you decide otherwise manually.
The idea here is that it's likely a lot quicker to get a node back up than
to reshuffle all that data.
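
For reference, a minimal sketch of the settings involved (pool name is just an
example):

  # ceph.conf on the monitors
  [mon]
  mon osd down out subtree limit = host

  # replication settings on the pool
  ceph osd pool set rbd size 2
  ceph osd pool set rbd min_size 1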

With normal, size 2 replication and single disk OSDs, any
simultaneous/overlapping loss of 2 disks is going to lose you data,
potentially affecting many if not ALL of your VM images.

There have been a lot of discussions about reliability with various
replication levels

[ceph-users] how to downgrade when upgrade from firefly to hammer fail

2016-03-06 Thread Dong Wu
hi, cephers
I want to upgrade my ceph cluster from firefly (0.80.11) to hammer.
After I successfully installed the hammer deb packages on all my hosts, I
updated the monitors first, and that succeeded.
But when I restarted the OSDs on one host to upgrade them, it failed; the OSDs
cannot start up. So I want to downgrade to firefly again to keep my
cluster going, but after reinstalling the firefly deb packages I fail to
start the OSDs on that host. Here is the log:

2016-03-07 09:47:14.704242 7f2f11ba87c0  0 ceph version 0.80.11
(8424145d49264624a3b0a204aedb127835161070), process ceph-osd, pid
37459
2016-03-07 09:47:14.709159 7f2f11ba87c0 -1
filestore(/var/lib/ceph/osd/ceph-0) FileStore::mount : stale version
stamp 4. Please run the FileStore update script before starting the
OSD, or set filestore_update_to to 3
2016-03-07 09:47:14.709176 7f2f11ba87c0 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-0: (22) Invalid argument
2016-03-07 09:47:18.385399 7f98478187c0  0 ceph version 0.80.11
(8424145d49264624a3b0a204aedb127835161070), process ceph-osd, pid
39041
2016-03-07 09:47:18.390320 7f98478187c0 -1
filestore(/var/lib/ceph/osd/ceph-0) FileStore::mount : stale version
stamp 4. Please run the FileStore update script before starting the
OSD, or set filestore_update_to to 3
2016-03-07 09:47:18.390337 7f98478187c0 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-0: (22) Invalid argument

  How can I downgrade to firefly successfully?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier operation clarifications

2016-03-06 Thread Christian Balzer

Hello,

I'd like to get some insights, confirmations from people here who are
either familiar with the code or have this tested more empirically than me
(the VM/client node of my test cluster is currently pinning for the
fjords).

When it comes to flushing/evicting, we already established that this
triggers based on per-PG utilization, not a pool-wide one.
So for example in a pool with 1024GB capacity (set via target_max_bytes)
and 1024 PGs and a cache_target_dirty_ratio of 0.5, flushing will start
when the first PG reaches 512MB utilization.
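
For concreteness, a sketch of the knobs referenced above (pool name and values
made up):

  ceph osd pool set cache target_max_bytes 1099511627776   # 1024GB cache capacity
  ceph osd pool set cache cache_target_dirty_ratio 0.5     # start flushing at 50% dirty
  ceph osd pool set cache cache_target_full_ratio 0.8      # start evicting at 80% full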

However, while the documentation states that the least recently used objects are
evicted when things hit the cache_target_full_ratio, it is less than clear
(understatement of the year) where flushing is concerned.
To quote:
"When the cache pool consists of a certain percentage of modified (or
dirty) objects, the cache tiering agent will flush them to the storage pool."

How do we read this?
When hitting 50% (as in the example above) all of the dirty objects will
get flushed? 
That doesn't match what I'm seeing nor would it be a sensible course of
action to unleash such a potentially huge torrent of writes.

If we interpret this as "get the dirty objects below the threshold" (which
is what seems to happen) there are 2 possible courses of action here:

1. Flush dirty object(s) from the PG that has reached the threshold. 
A sensible course of action in terms of reducing I/Os, but it may keep
flushing the same objects over and over again if they happen to be on the
"full" PG.

2. Flush dirty objects from all PGs (most likely in a least recently used
fashion) and stop when we're eventually under the threshold by having
finally hit the "full" PG. 
Results in a lot more IO but will of course create more clean objects
available for eviction if needed.
This is what I think is happening.

So, is there any "least recently used" consideration in effect here,
or is the only way to avoid (pointless) flushes by setting
"cache_min_flush_age" accordingly?

Unlike for flushes above, eviction clearly states that it's going by
"least recently used".
Which in the case of per-PG operation would violate that promise, as
people of course expect this to be pool-wide.
And if it is indeed pool-wide, the same effect as above will happen:
evictions will happen until the "full" PG gets hit, evicting far more than
would have been needed.


Something to maybe consider would be a target value, for example with
"cache_target_full_ratio" at 0.80 and "cache_target_full_ratio_target" at
0.78, evicting things until it reaches the target ratio. 

Lastly, while we have perf counters like "tier_dirty", a gauge for dirty
and clean objects/bytes would be quite useful to me at least.
And clearly the cache tier agent already has those numbers. 
Right now I'm guesstimating that most of my cache objects are actually
clean (from VM reboots, only read, never written to), but I have no way
to tell for sure.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier operation clarifications

2016-03-06 Thread Christian Balzer
On Sat, 5 Mar 2016 06:08:49 +0100 Francois Lafont wrote:

> Hello,
> 
> On 04/03/2016 09:17, Christian Balzer wrote:
> 
> > Unlike the subject may suggest, I'm mostly going to try and explain how
> > things work with cache tiers, as far as I understand them.
> > Something of a reference to point to. [...]
> 
> I'm currently unqualified concerning cache tiering but I'm pretty
> sure that your post is very relevant and I think you should make
> a pull-request on the Ceph documentation where you could bring all
> these lights. Here, your explanations will be lost in the depths
> of the mailing list. ;)
> 
Thanks, I'm working on version 2, as I already found some omissions, the
typical problem when you're trying to explain something that you're
familiar with or did spend a lot of time with recently.

Once I'm happy with the result and nobody has pointed out factual errors for a
week or two, I'll consider submitting it.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD latencies

2016-03-06 Thread Adrian Saul
> >The Samsungs are the 850 2TB
> > (MZ-75E2T0BW).  Chosen primarily on price.
>
> These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5 years).
> Unless you have a read-only cluster, you will wind up spending MORE on
> replacing them (and/or losing data when 2 fail at the same time) than going
> with something more sensible like Samsung's DC models or the Intel DC ones
> (S3610s come to mind for "normal" use).
> See also the current "List of SSDs" thread in this ML.

This was a metric I struggled to find and would have been useful in comparison. 
 I am sourcing prices on the SM863s anyway.  That SSD thread has been good to 
follow as well.

> Fast, reliable, cheap. Pick any 2.

Yup - unfortunately cheap is fixed, reliable is the reason we are doing this; 
however, fast is now a must-have. The normal engineering/management dilemma.

> On your test setup or even better the Solaris one, have a look at their media
> wearout, or  Wear_Leveling_Count as Samsung calls it.
> I bet that makes for some scary reading.

For the Evos we found no tools we could use on Solaris - also because we have 
cheap nasty SAS interposers in that setup most tools don't work anyway.  Until 
we pull a disk and put it into a windows box we can't do any sort of 
diagnostics on it.  It would be useful to see because we have those disks 
taking a fair brunt of our performance workload now.

> Note that Ceph (RBD/RADOS to be precise) isn't particularly suited for "long"
> distance replication due to the incurred latencies.
>
> That's unless your replication is happening "above" Ceph in the iSCSI bits 
> with
> something that's more optimized for this.
>
> Something along the lines of the DRBD proxy has been suggested for Ceph,
> but if at all it is a backburner project at best from what I gather.

We can fairly easily do low latency links (telco) but are looking at the 
architecture to try and limit that sort of long replication - doing replication 
at application and database levels instead.  The site to site replication would 
be limited to some clusters or applications that need sync replication for 
availability.

> There are some ways around this, which may or may not be suitable for your
> use case.
> EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
> Of course this comes at a performance penalty, which you can offset again
> with for example fast RAID controllers with HW cache to some extent.
> But it may well turn out to be a zero sum game.

I modelled an EC setup but that was at a multi site level with local cache 
tiers in front, and it was going to be too big a challenge to do as a new 
untested platform with too many latency questions.  Within a site EC was not 
going to be cost effective, as to do it properly I would need to up the number of 
hosts and that pushed the pricing up too far, even if I went with smaller less 
configured hosts.

I thought about hardware RAID as well, but as I would need to do host level 
redundancy anyway it was not gaining any efficiency - less risk but I would 
still need to replicate anyway so why not just go disk to disk.  More than 
likely I would quietly work in higher protection as we go live and deal with it 
later as a capacity expansion.

> Another thing is to use a cache pool (with top of the line SSDs), this is of
> course only a sensible course of action if your hot objects will fit in there.
> In my case they do (about 10-20% of the 2.4TB raw pool capacity) and
> everything is as fast as can be expected and the VMs (their time
> critical/sensitive application to be precise) are happy campers.

This is the model I am working to - our "fast" workloads using SSD caches  in 
front of bulk SATA, sizing the SSDs at around 25% of the capacity we require 
for "fast" storage.

For the "bulk" storage I would still use the SSD cache but sized to 10% of the 
SATA usable capacity.   I figure once we get live we can adjust numbers as 
required - expand with more cache hosts if needed.

> There's a counter in Ceph (counter-filestore_journal_bytes) that you can
> graph for journal usage.
> The highest I have ever seen is about 100MB for HDD based OSDs, less than
> 8MB for SSD based ones with default(ish) Ceph parameters.
>
> Since you seem to have experience with ZFS (I don't really, but I read alot
> ^o^), consider the Ceph journal equivalent to the ZIL.
> It is a write only journal, it never gets read from unless there is a crash.
> That is why sequential, sync write speed is the utmost criteria for Ceph
> journal device.
>
> If I recall correctly you were testing with 4MB block streams, thus pretty
> much filling the pipe to capacity, atop on your storage nodes will give a good
> insight.
>
> The journal is great to cover some bursts, but the Ceph OSD is flushing things
> from RAM to the backing storage on configurable time limits and once these
> are exceeded and/or you run out of RAM (pagecache), you are limited to what
> your backing storage can sustain.
>
> Now in real lif

Re: [ceph-users] Cache Pool and EC: objects didn't flush to a cold EC storage

2016-03-06 Thread Christian Balzer
On Sun, 6 Mar 2016 12:17:48 +0300 Mike Almateia wrote:

> Hello Cephers!
> 
> When my cluster hit the "full ratio" settings, objects from the cache pool
> didn't flush to the cold storage.
> 
As always, versions of everything, Ceph foremost.

> 1. Hit the 'full ratio':
> 
> 2016-03-06 11:35:23.838401 osd.64 10.22.11.21:6824/31423 4327 : cluster 
> [WRN] OSD near full (90%)
> 2016-03-06 11:35:55.447205 osd.64 10.22.11.21:6824/31423 4329 : cluster 
> [WRN] OSD near full (90%)
> 2016-03-06 11:36:29.255815 osd.64 10.22.11.21:6824/31423 4332 : cluster 
> [WRN] OSD near full (90%)
> 2016-03-06 11:37:04.769765 osd.64 10.22.11.21:6824/31423 4333 : cluster 
> [WRN] OSD near full (90%)
> ...
> 
You want to:
a) read the latest (master) documentation for cache tiering
b) read this ML and its archives, in particular the current thread titled 
"Cache tier operation clarifications"

In short, target_max_bytes or target_max_objects NEEDS to be set.
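
For example (values made up, pool name taken from your output):

  ceph osd pool set hotec target_max_bytes 1000000000000   # cap the cache at ~1TB
  ceph osd pool set hotec target_max_objects 1000000       # and/or cap the object count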

> 2. Well, ok. Set the option 'ceph osd pool set hotec 
> cache_target_full_ratio 0.8'.
> But none of the objects flushed at all
>
Flush and evict are 2 different things.

cache_target_dirty_ratio needs to be set as well (below full) for this to
work, aside from the issue above.

 
> 3. Ok. Try flush all object manually:
> [root@c1 ~]# rados -p hotec cache-flush-evict-all 
> 
>  rbd_data.34d1f5746d773.00016ba9
> 
> 4. After a full day the objects are still in the cache pool, not flushed at all:
> [root@c1 ~]# rados df
> pool name KB  objects   clones degraded 
>   unfound   rdrd KB   wrwr KB
> data   0000 
> 064   158212215700473
> hotec  797656118 2503075500 
> 0   370599163045649 69947951  17786794779
> rbd0000 
> 00000
>total used  2080570792 25030755
> 
> Is it a bug or expected behavior?
> 
If you didn't set the cache to forward mode first, it will fill up again
immediately.
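
That is (assuming you still want to drain it):

  ceph osd tier cache-mode hotec forward
  rados -p hotec cache-flush-evict-all
  ceph osd tier cache-mode hotec writeback   # switch back once drained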
 
Christian

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache Pool and EC: objects didn't flush to a cold EC storage

2016-03-06 Thread Mike Almateia

Hello Cephers!

When my cluster hit the "full ratio" settings, objects from the cache pool
didn't flush to the cold storage.


1. Hit the 'full ratio':

2016-03-06 11:35:23.838401 osd.64 10.22.11.21:6824/31423 4327 : cluster 
[WRN] OSD near full (90%)
2016-03-06 11:35:55.447205 osd.64 10.22.11.21:6824/31423 4329 : cluster 
[WRN] OSD near full (90%)
2016-03-06 11:36:29.255815 osd.64 10.22.11.21:6824/31423 4332 : cluster 
[WRN] OSD near full (90%)
2016-03-06 11:37:04.769765 osd.64 10.22.11.21:6824/31423 4333 : cluster 
[WRN] OSD near full (90%)

...

2. Well, ok. Set the option 'ceph osd pool set hotec 
cache_target_full_ratio 0.8'.

But none of the objects flushed at all

3. Ok. Try flush all object manually:
[root@c1 ~]# rados -p hotec cache-flush-evict-all 


rbd_data.34d1f5746d773.00016ba9

4. After a full day the objects are still in the cache pool, not flushed at all:
[root@c1 ~]# rados df
pool name KB  objects   clones degraded 
 unfound   rdrd KB   wrwr KB
data   0000 
   064   158212215700473
hotec  797656118 2503075500 
   0   370599163045649 69947951  17786794779
rbd0000 
   00000

  total used  2080570792 25030755

Is it a bug or expected behavior?

--
Mike. runs!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com