Re: [ceph-users] Question on cephfs recovery tools

2015-09-14 Thread Shinobu Kinjo
Thank you for getting back to me with more detail.
My understanding is that what you would like to know is:


  1. How to recover broken metadata / data.
  2. How to avoid the same condition in the future.


Regarding No. 2, the developers should take on this responsibility,
because you cannot do anything once the system hangs.

You then have no choice except to reboot the system, or do something like what you did -;
That's not good.

At the very least, the system should say something like:


 Oh, wait wait, we are doing something right now... -;


and then resolve the issue as a background process.

Shinobu

- Original Message -
From: "Goncalo Borges" 
To: "Shinobu Kinjo" , "John Spray" 
Cc: ceph-users@lists.ceph.com
Sent: Tuesday, September 15, 2015 12:39:57 PM
Subject: Re: [ceph-users] Question on cephfs recovery tools

Hi Shinobu


>>> c./ After recovering the cluster, I thought I was in a cephfs situation where
>>> I had
>>>  c.1 files with holes (because of lost PGs and objects in the data pool)
>>>  c.2 files without metadata (because of lost PGs and objects in the
>>> metadata pool)
>> What does "files without metadata" mean?  Do you mean their objects
>> were in the data pool but they didn't appear in your filesystem mount?
>>
>>>  c.3 metadata without associated files (because of lost PGs and objects
>>> in the data pool)
>> So you mean you had files with the expected size but zero data, right?
>>
>>> I've tried to run the recovery tools, but I have several doubts which I did
>>> not find described in the documentation
>>>  - Is there a specific order / a way to run the tools for the c.1, c.2
>>> and c.3 cases I mentioned?
> I'm still trying to understand what you were trying to say in your
> original message, but I have not been able to follow you yet.
>
> Can you summarize like:
>
>   1. What the current status is,
>  e.g. working but not as expected.
>
>   2. What your thoughts (or guesses) are about your cluster,
>  e.g. broken metadata, broken data, or whatever you suspect now.
>
>   3. What exactly you did (briefly, not bla bla bla...).
>
>   4. What you really want to do (briefly)?

I was trying to give the full context of my tests so that all the 
information is available.

After John's response, and some further thinking, I now partially 
understand what actions have to be taken in a scenario like the one I've 
created.

The whole idea is: given a scenario where there is loss of data and 
metadata, what can be done from the admin side to recover the CephFS?

Nevertheless, since this email thread is already long, I'll try to send 
a new, more focused email.

Cheers, and thanks for the replies
Goncalo



-- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-14 Thread Goncalo Borges

Hi Shinobu



c./ After recovering the cluster, I thought I was in a cephfs situation where
I had
 c.1 files with holes (because of lost PGs and objects in the data pool)
 c.2 files without metadata (because of lost PGs and objects in the
metadata pool)

What does "files without metadata" mean?  Do you mean their objects
were in the data pool but they didn't appear in your filesystem mount?


 c.3 metadata without associated files (because of lost PGs and objects
in the data pool)

So you mean you had files with the expected size but zero data, right?


I've tried to run the recovery tools, but I have several doubts which I did
not find described in the documentation
 - Is there a specific order / a way to run the tools for the c.1, c.2
and c.3 cases I mentioned?

I'm still trying to understand what you were trying to say in your
original message, but I have not been able to follow you yet.

Can you summarize like:

  1. What the current status is,
     e.g. working but not as expected.

  2. What your thoughts (or guesses) are about your cluster,
     e.g. broken metadata, broken data, or whatever you suspect now.

  3. What exactly you did (briefly, not bla bla bla...).

  4. What you really want to do (briefly)?


I was trying to give the full context of my tests so that all the 
information is available.


After John's response, and some further thinking, I now partially 
understand what actions have to be taken in a scenario like the one I've 
created.


The whole idea is: given a scenario where there is loss of data and 
metadata, what can be done from the admin side to recover the CephFS?


Nevertheless, since this email thread is already long, I'll try to send 
a new, more focused email.


Cheers, and thanks for the replies
Goncalo



--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-14 Thread Goncalo Borges

Hello John...

Thank you for the replies. I do have some comments in line.




Bear with me a bit while I give you some context; the questions appear
at the end.

1) I am currently running ceph 9.0.3, which I installed to test the
cephfs recovery tools.

2) I've created a situation where I've deliberately lost some
data and metadata (check annex 1 after the main email).

You're only *maybe* losing metadata here, as your procedure is
targeting OSDs that contain data, and just hoping that those OSDs also
contain some metadata.


My procedure was aiming to hit servers with both data and metadata. This 
is actually why I got the PG / OSD mapping (using the file inode) in both 
the data and metadata pools, and then destroyed those OSDs.


   5) Get the file / PG / OSD mapping

   # ceph osd map cephfs_dt 124.
   osdmap e479 pool 'cephfs_dt' (1) object '124.' -> pg 1.c18fbb6f 
(1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)

   # ceph osd map cephfs_mt 124.
   osdmap e479 pool 'cephfs_mt' (2) object '124.' -> pg 2.c18fbb6f 
(2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)


Please note that I've destroyed OSDs 6, 13, 15, 19, 23 and 27. If this 
procedure is not hitting the file metadata, then the problem may be that 
I am not understanding how the metadata is being stored and mapped to OSDs.
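
As a side note, a way to sanity-check that would be to map the journal 
objects themselves. A sketch, on the assumption that the rank-0 MDS journal 
is inode 0x200, so its objects in the metadata pool are named 200.00000000, 
200.00000001, and so on:

   # for i in $(seq 0 9); do ceph osd map cephfs_mt "200.$(printf '%08x' "$i")"; done

If none of those land on OSDs 6, 13, 15, 19, 23 or 27, that would explain 
why the journal survived.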





Finally the questions:

a./ In a situation like the one described above, how can we safely terminate
cephfs on the clients? I have had situations where umount simply hangs and
there is no real way to unblock the situation unless I reboot the client. If
we have hundreds of clients, I would like to avoid that.

In your procedure, the umount problems have nothing to do with
corruption.  It's (sometimes) hanging because the MDS is offline.  If
the client has dirty metadata, it may not be able to flush it until
the MDS is online -- there's no general way to "abort" this without
breaking userspace semantics.  Similar case:
http://tracker.ceph.com/issues/9477

Rebooting the machine is actually correct, as it ensures that we can
kill the filesystem mount at the same time as any application
processes using it, and therefore not break the filesystem semantics
from the point of view of those applications.

All that said, from a practical point of view we probably do need some
slightly nicer abort hooks that allow admins to "break the rules" in
crazy situations.



From my experience, I do think that this will eventually be needed by 
any admin at some point.







b./ I was expecting to have lost metadata information, since I've cleaned OSDs
where metadata information was stored for the
/cephfs/goncalo/5Gbytes_029.txt file. I was a bit surprised that
'/cephfs/goncalo/5Gbytes_029.txt' was still properly referenced, without me
having to run any recovery tool. What am I missing?

I would guess that when you deleted 6/21 of your OSDs, you just
happened not to hit any metadata journal objects.  The journal
replayed, the MDS came back online, and your metadata was back in
cache.


I do understand your explanation about the journal. So, on the 
assumption that no objects related to the journal have been destroyed, 
the system was able to replay the operations logged in the journal 
and reconstruct the lost metadata info. Can we actually tune the size of 
the journal?
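
(My understanding, to be verified against the 9.0.3 defaults, is that the 
journal is not sized directly: it is trimmed in segments, so its length is 
bounded by roughly the segment size, 4 MB by default, times the maximum 
segment count. Something like the following in ceph.conf should bound it; 
the value is illustrative:

   [mds]
   # journal length ~= segment size (4 MB by default) * max segments
   mds log max segments = 30
)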


Just as a side comment, the journal seems corrupted anyway:

   # cephfs-journal-tool journal inspect
   2015-09-14 17:20:15.708622 7f7e2d2ec8c0 -1 Bad entry start ptr
   (0x1c0) at 0x196dae0
   Overall journal integrity: DAMAGED
   Corrupt regions:
  0x196dae0-

   # cephfs-journal-tool event get summary
   2015-09-14 17:22:54.235848 7f8cf306b8c0 -1 Bad entry start ptr
   (0x1c0) at 0x196dae0
   Events by type:
  OPEN: 46
  SESSION: 13
  SUBTREEMAP: 16
  UPDATE: 13157
   Errors: 0

Nevertheless, at this point I just decided to reset the journal:

   # cephfs-journal-tool journal reset
   old journal was 4194304~22470246
   new journal start will be 29360128 (2695578 bytes past old end)
   writing journal head
   writing EResetJournal entry
   done

   #  cephfs-journal-tool journal inspect
   Overall journal integrity: OK



I've tried to run the recovery tools, but I have several doubts which I did
not find described in the documentation
 - Is there a specific order / a way to run the tools for the c.1, c.2
and c.3 cases I mentioned?

Right now your best reference might be the test code (linked above).
These tools are not finished yet, and I doubt we will write user
documentation until they're more complete (probably in Jewel).  Even
then, the tools are designed to enable expert support intervention in
disasters, not to provide a general "wizard" for fixing filesystems
(yet) -- ideally we would always specifically identify what was broken
in a filesystem before starting to use the (potentially dangerous)
tools that modify metadata.
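
For readers of the archive: the rough sequence those tests exercise looks 
like the sketch below. This is an illustration of the cephfs-journal-tool / 
cephfs-table-tool / cephfs-data-scan workflow as it stood around 9.0.x, not 
official guidance; run it only on a scrap filesystem, with the MDS stopped, 
and substitute your own data pool name:

   # cephfs-journal-tool journal reset
   # cephfs-table-tool all reset session
   # cephfs-data-scan init
   # cephfs-data-scan scan_extents cephfs_dt
   # cephfs-data-scan scan_inodes cephfs_dt

The two scan passes walk the data pool: scan_extents recovers file sizes 
and layouts, and scan_inodes re-links the recovered inodes into the 
metadata tree.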

Re: [ceph-users] CephFS and caching

2015-09-14 Thread Gregory Farnum
On Thu, Sep 10, 2015 at 1:07 PM, Kyle Hutson  wrote:
> A 'rados -p cachepool ls' takes about 3 hours - not exactly useful.
>
> I'm intrigued that you say a single read may not promote it into the cache.
> My understanding is that if you have an EC-backed pool the clients can't
> talk to them directly, which means they would necessarily be promoted to the
> cache pool so the client could read it. Is my understanding wrong?

You're not wrong (although you may be in the future!), but for
instance there is code somewhere (in Infernalis-to-be, I think?) that
will simply proxy the read through the cache OSD, without actually
reading up the data.
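
(If you want to experiment with keeping single reads from forcing 
promotions on a current release, the knobs should be the HitSet/recency 
pool options; a sketch with illustrative values:

   # ceph osd pool set cachepool hit_set_count 4
   # ceph osd pool set cachepool hit_set_period 600
   # ceph osd pool set cachepool min_read_recency_for_promote 2

min_read_recency_for_promote requires an object to appear in that many 
recent HitSets before a read promotes it.)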

>
> I'm also wondering if it's possible to use RAM as a read-cache layer.
> Obviously, we don't want this for write-cache because of power outages,
> motherboard failures, etc., but it seems to make sense for a read-cache. Is
> that something that's being done, can be done, is going to be done, or has
> even been considered?

We don't do anything special to make use of RAM because the Linux page
cache is generally more intelligent about it than we could be.
Sufficiently hot objects will be in RAM without us needing to do
anything special. ;)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-14 Thread Gregory Farnum
It's been a while since I looked at this, but my recollection is that
the FileStore will check if it should split on every object create,
and will check if it should merge on every delete. It's conceivable it
checks for both whenever the number of objects changes, though, which
would make things easier.

I don't think scrub or anything else will do the work, though. :/
-Greg
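
For reference, the commonly cited FileStore arithmetic (worth verifying 
against your release) is that a leaf directory is split once it holds more 
than filestore_split_multiple * abs(filestore_merge_threshold) * 16 
objects. In ceph.conf terms:

   [osd]
   # with these (default) values a directory splits above 2 * 10 * 16 = 320
   # objects; merging is only evaluated as deletes shrink a directory
   filestore split multiple = 2
   filestore merge threshold = 10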

On Tue, Sep 8, 2015 at 2:26 AM, Nick Fisk  wrote:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Nick Fisk
>> Sent: 06 September 2015 15:11
>> To: 'Shinobu Kinjo' ; 'GuangYang'
>> 
>> Cc: 'ceph-users' ; 'Nick Fisk' 
>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>
>> Just a quick update after up'ing the thresholds, not much happened. This is
>> probably because the merge threshold is several times less than the trigger
>> for the split. So I have now bumped the merge threshold up to 1000
>> temporarily to hopefully force some DIR's to merge.
>>
>> I believe this has started to happen, but it only seems to merge right at the
>> bottom of the tree.
>>
>> Eg
>>
>> /var/lib/ceph/osd/ceph-1/current/0.106_head/DIR_6/DIR_0/DIR_1/
>>
>> All the directories have only 1 directory in them; DIR_1 is the only one in
>> the path that has any objects in it. Is this the correct behaviour? Is there any
>> impact from having these deeper paths compared to when the objects are
>> just in the root directory?
>>
>> I guess the only real way to get the objects back into the root would be to
>> out->drain->in the OSD?
>>
>>
>> > -Original Message-
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> > Of Shinobu Kinjo
>> > Sent: 05 September 2015 01:42
>> > To: GuangYang 
>> > Cc: ceph-users ; Nick Fisk
>> > 
>> > Subject: Re: [ceph-users] Ceph performance, empty vs part full
>> >
>> > Very nice.
>> > You're my hero!
>> >
>> >  Shinobu
>> >
>> > - Original Message -
>> > From: "GuangYang" 
>> > To: "Shinobu Kinjo" 
>> > Cc: "Ben Hines" , "Nick Fisk" ,
>> > "ceph- users" 
>> > Sent: Saturday, September 5, 2015 9:40:06 AM
>> > Subject: RE: [ceph-users] Ceph performance, empty vs part full
>> >
>> > 
>> > > Date: Fri, 4 Sep 2015 20:31:59 -0400
>> > > From: ski...@redhat.com
>> > > To: yguan...@outlook.com
>> > > CC: bhi...@gmail.com; n...@fisk.me.uk; ceph-users@lists.ceph.com
>> > > Subject: Re: [ceph-users] Ceph performance, empty vs part full
>> > >
>> > >> IIRC, it only triggers the move (merge or split) when that folder
>> > >> is hit by a
>> > request, so most likely it happens gradually.
>> > >
>> > > Do you know what causes this?
>> > A request (read/write/setxattr, etc.) hitting objects in that folder.
>> > > I would like to be more clear "gradually".
>
>
> Does anyone know if a scrub is included in this? I have kicked off a deep 
> scrub of an OSD and yet I still don't see merging happening, even with a 
> merge threshold of 1000.
>
> Example
> /var/lib/ceph/osd/ceph-0/current/0.108_head : 0 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8 : 0 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0 : 0 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1 : 15 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_4 : 85 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_B : 63 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_D : 88 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_8 : 73 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_0 : 77 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_6 : 79 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_3 : 67 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_E : 94 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_C : 91 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_A : 88 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_5 : 96 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_2 : 88 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_9 : 70 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_1 : 95 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_7 : 87 files
> /var/lib/ceph/osd/ceph-0/current/0.108_head/DIR_8/DIR_0/DIR_1/DIR_F : 88 files
>
>
>
>> > >
>> > > Shinobu
>> > >
>> > > - Original Message -
>> > > From: "GuangYang" 
>> > > To: "Ben Hines" , "Nick Fisk" 
>> > > Cc: "ceph-users" 
>> > > Sent: Saturday, September 5, 2015 9:27:31 AM
>> > > Subject: Re: [ceph-users] Ceph performance, empty vs part full
>> > >
>> > > IIRC, it only triggers the move (merge or split) when that folder is
>> > > hit by a
>> > request, so most likely it happens gradually.

Re: [ceph-users] rados bench seq throttling

2015-09-14 Thread Gregory Farnum
On Thu, Sep 10, 2015 at 1:02 PM, Deneau, Tom  wrote:
> Running 9.0.3 rados bench on a 9.0.3 cluster...
> In the following experiments this cluster is only 2 osd nodes, 6 osds each
> and a separate mon node (and a separate client running rados bench).
>
> I have two pools populated with 4M objects.  The pools are replicated x2
> with identical parameters.  The objects appear to be spread evenly across the 
> 12 osds.
>
> In all cases I drop caches on all nodes before doing a rados bench seq test.
> In all cases I run rados bench seq for identical times (30 seconds) and in 
> that time
> we do not run out of objects to read from the pool.
>
> I am seeing significant bandwidth differences between the following:
>
>* running a single instance of rados bench reading from one pool with 32 
> threads
>  (bandwidth approx 300)
>
>* running two instances rados bench each reading from one of the two pools
>  with 16 threads per instance (combined bandwidth approx. 450)
>
> I have already increased the following:
>   objecter_inflight_op_bytes = 10485760
>   objecter_inflight_ops = 8192
>   ms_dispatch_throttle_bytes = 1048576000  #didn't seem to have any effect
>
> The disks and network are not reaching anywhere near 100% utilization
>
> What is the best way to diagnose what is throttling things in the 
> one-instance case?

Pretty sure the rados bench main threads are just running into their
limits. There's some work that Piotr (I think?) has been doing to make
it more efficient if you want to browse the PRs, but I don't think
they're even in a dev release yet.
-Greg
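
For the archive, the two-instance workaround is easy to script. A sketch 
(pool names illustrative), assuming both pools were pre-populated with 
'rados bench ... write --no-cleanup' and caches were dropped first:

   rados -p pool1 bench 30 seq -t 16 &
   rados -p pool2 bench 30 seq -t 16 &
   wait

The combined bandwidth is then the sum of the two reports.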
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Query about contribution regarding monitoring of Ceph Object Storage

2015-09-14 Thread Gregory Farnum
On Sat, Sep 12, 2015 at 6:13 AM, pragya jain  wrote:
> Hello all
>
> I am carrying out research in the area of cloud computing under the Department
> of CS, University of Delhi. I would like to contribute my research work
> regarding monitoring of Ceph Object Storage to the Ceph community.
>
> Please help me by providing the appropriate link with whom I can connect to
> know if my work is relevant for contribution.

If you'd like to discuss it, the ceph-devel list
(ceph-de...@vger.kernel.org) is the place to suggest ideas. If you
have completed work you can send in a pull request at Github
(github.com/ceph/ceph) as well to show off the code.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SOLVED] Cache tier full not evicting

2015-09-14 Thread deeepdish
Thanks Nick.   That did it!   Cache cleans itself up now.

> On Sep 14, 2015, at 11:49 , Nick Fisk  wrote:
> 
> Have you set the target_max_bytes? Otherwise those ratios are not relative to 
> anything, they use the target_max_bytes as a max, not the pool size.
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> ] On Behalf Of deeepdish
> Sent: 14 September 2015 16:27
> To: ceph-users@lists.ceph.com 
> Subject: [ceph-users] Cache tier full not evicting
>  
> Hi Everyone,
>  
> Getting close to cracking my understanding of cache tiering and EC pools.
> Stuck on one anomaly which I do not understand — spent hours reviewing docs
> online, can’t seem to pinpoint what I’m doing wrong.   Referencing
> http://ceph.com/docs/master/rados/operations/cache-tiering/
> 
>  
> Setup:
>  
> Test / PoC Lab environment (not production)
>  
> 1x [26x OSD/MON host]
> 1x MON VM
>  
> Erasure coded pool consisting of 10 spinning OSDs  (journals on SSDs - 5:1 
> spinner:SSD ratio)
> Cache tier consisting of 2 SSD OSDs
>  
> Issue:
>  
> Cache tier is not honoring configured thresholds.   In my particular case, I 
> have 2 OSDs in pool ‘cache’ (140G each == 280G total pool capacity).   
>  
> Pool cache is configured with replica factor of 2 (size = 2, min size = 1)
>  
> Initially I tried the following settings:
>  
> ceph osd pool set cache cache_target_dirty_ratio 0.3
> ceph osd pool set cache cache_target_full_ratio 0.7
> ceph osd pool set cache cache_min_flush_age 1
> ceph osd pool set cache cache_min_evict_age 1
>  
> My cache tier’s utilization hit 96%+, causing the pool to run out of capacity.
>  
> I realized that in a replicated pool, only 1/2 the capacity is available and 
> made the following adjustments:
>  
> ceph osd pool set cache cache_target_dirty_ratio 0.1
> ceph osd pool set cache cache_target_full_ratio 0.3
> ceph osd pool set cache cache_min_flush_age 1
> ceph osd pool set cache cache_min_evict_age 1
>  
> The above implies that 0.3 = 60% of the replicated (2x) pool size and 0.1 = 20% 
> of the replicated (2x) pool size.
> 
> Even with the above revised values, I still see the cache tier getting full.
>  
> The cache tier can only be flushed / evicted by manually running the 
> following:
>  
> rados -p cache cache-flush-evict-all
>  
> Thank you.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Starting a Non-default Cluster at Machine Startup

2015-09-14 Thread John Cobley

Hi All,

Hope someone can point me in the right direction here.  I've been 
following the instructions on the manual deployment of a Ceph cluster 
here - http://docs.ceph.com/docs/master/install/manual-deployment/ .  
All is going OK; however, we are setting up our cluster with a 
non-default cluster name. It's basically a proof-of-concept cluster 
that will probably be kept running after we set up our final cluster.


Now, after a lot of fiddling around, I finally managed to get Ceph and 
its monitors running using:

/etc/init.d/ceph -a --cluster testcluster start

Is it possible to get the server to start my cluster on boot, that 
is, without my intervention?
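
(In case it helps anyone searching the archives later: one low-tech 
approach on a sysvinit-era system is to run that same command from 
/etc/rc.local; a sketch:

   # /etc/rc.local
   /etc/init.d/ceph -a --cluster testcluster start
   exit 0

A proper init integration would be cleaner, but this at least removes the 
manual step.)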


Regards,

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD refuses to start after problem with adding monitors

2015-09-14 Thread Chang, Fangzhe (Fangzhe)
Hi,

I started a new Ceph cluster with a single instance, and later added two new 
OSDs on different machines using ceph-deploy. The OSD data directories reside on 
a separate disk from the conventional /var/local/ceph/osd- directory.  
Correspondingly, I changed the replication factor (size) to 3, though the min_size 
parameter stays at 1.

As a next step, I tried to expand the number of monitors. However, the effort 
of adding two new monitors using ceph-deploy failed. The 'ceph status' command 
only reveals the original monitor, whereas the two new monitors are visible when 
retrieving the monmap. To resolve the problem, I looked around and found the 
'ceph mon add' command. The moment I tried this command, everything got stuck: 
'ceph status' simply hangs, and the Ceph daemons can no longer be started --- it seems 
that the osd sub-command times out.

Any clue on where to look for the problems, or how to fix them?
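
(One avenue worth trying, sketched from the standard procedure for removing 
monitors from an unhealthy cluster; the monitor names are illustrative, and 
you should stop the mon and back up its data directory first:

   # ceph-mon -i mon0 --extract-monmap /tmp/monmap
   # monmaptool --rm mon1 /tmp/monmap
   # monmaptool --rm mon2 /tmp/monmap
   # ceph-mon -i mon0 --inject-monmap /tmp/monmap

After injecting the cleaned map, restart the surviving monitor and re-check 
'ceph status'.)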

Another small problem: Since the OSD data directory is not provided in 
/etc/ceph/ceph.conf, I'm wondering how Ceph knows where to find it.

Thanks

Fangzhe

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier full not evicting

2015-09-14 Thread Nick Fisk
Have you set the target_max_bytes? Otherwise those ratios are not relative to 
anything, they use the target_max_bytes as a max, not the pool size.
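
A concrete sketch (the numbers are illustrative; pick what your cache pool 
can actually hold):

ceph osd pool set cache target_max_bytes 140000000000
ceph osd pool set cache target_max_objects 1000000

With target_max_bytes set, cache_target_dirty_ratio and 
cache_target_full_ratio become fractions of that figure, and the tiering 
agent starts flushing/evicting on its own.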

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
deeepdish
Sent: 14 September 2015 16:27
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Cache tier full not evicting

 

Hi Everyone,

 

Getting close to cracking my understanding of cache tiering and EC pools.
Stuck on one anomaly which I do not understand — spent hours reviewing docs
online, can’t seem to pinpoint what I’m doing wrong.   Referencing
http://ceph.com/docs/master/rados/operations/cache-tiering/

 

Setup:

 

Test / PoC Lab environment (not production)

 

1x [26x OSD/MON host]

1x MON VM

 

Erasure coded pool consisting of 10 spinning OSDs  (journals on SSDs - 5:1 
spinner:SSD ratio)

Cache tier consisting of 2 SSD OSDs

 

Issue:

 

Cache tier is not honoring configured thresholds.   In my particular case, I 
have 2 OSDs in pool ‘cache’ (140G each == 280G total pool capacity).   

 

Pool cache is configured with replica factor of 2 (size = 2, min size = 1)

 

Initially I tried the following settings:

 

ceph osd pool set cache cache_target_dirty_ratio 0.3

ceph osd pool set cache cache_target_full_ratio 0.7

ceph osd pool set cache cache_min_flush_age 1

ceph osd pool set cache cache_min_evict_age 1

 

My cache tier’s utilization hit 96%+, causing the pool to run out of capacity.

 

I realized that in a replicated pool, only 1/2 the capacity is available and 
made the following adjustments:

 

ceph osd pool set cache cache_target_dirty_ratio 0.1

ceph osd pool set cache cache_target_full_ratio 0.3

ceph osd pool set cache cache_min_flush_age 1

ceph osd pool set cache cache_min_evict_age 1

 

The above implies that 0.3 = 60% of the replicated (2x) pool size and 0.1 = 20% of 
the replicated (2x) pool size.

Even with the above revised values, I still see the cache tier getting full.

 

The cache tier can only be flushed / evicted by manually running the following:

 

rados -p cache cache-flush-evict-all

 

Thank you.

 

 

 

 

 

 

 

 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache tier full not evicting

2015-09-14 Thread deeepdish
Hi Everyone,

Getting close to cracking my understanding of cache tiering and EC pools.
Stuck on one anomaly which I do not understand — spent hours reviewing docs
online, can’t seem to pinpoint what I’m doing wrong.   Referencing
http://ceph.com/docs/master/rados/operations/cache-tiering/ 


Setup:

Test / PoC Lab environment (not production)

1x [26x OSD/MON host]
1x MON VM

Erasure coded pool consisting of 10 spinning OSDs  (journals on SSDs - 5:1 
spinner:SSD ratio)
Cache tier consisting of 2 SSD OSDs

Issue:

Cache tier is not honoring configured thresholds.   In my particular case, I 
have 2 OSDs in pool ‘cache’ (140G each == 280G total pool capacity).   

Pool cache is configured with replica factor of 2 (size = 2, min size = 1)

Initially I tried the following settings:

ceph osd pool set cache cache_target_dirty_ratio 0.3
ceph osd pool set cache cache_target_full_ratio 0.7
ceph osd pool set cache cache_min_flush_age 1
ceph osd pool set cache cache_min_evict_age 1

My cache tier’s utilization hit 96%+, causing the pool to run out of capacity.

I realized that in a replicated pool, only 1/2 the capacity is available and 
made the following adjustments:

ceph osd pool set cache cache_target_dirty_ratio 0.1
ceph osd pool set cache cache_target_full_ratio 0.3
ceph osd pool set cache cache_min_flush_age 1
ceph osd pool set cache cache_min_evict_age 1

The above implies that 0.3 = 60% of the replicated (2x) pool size and 0.1 = 20% of 
the replicated (2x) pool size.

Even with the above revised values, I still see the cache tier getting full.

The cache tier can only be flushed / evicted by manually running the following:

rados -p cache cache-flush-evict-all

Thank you.







 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Adam Heczko
Hi folks, OTOH BTRFS warns users about the 'nobarrier' mount option [1]:
'Do not use device barriers. NOTE: Using this option greatly increases the
chances of you experiencing data corruption during a power failure
situation. This means full file-system corruption, and not just losing or
corrupting data that was being written during a power cut or kernel panic '

[1] https://btrfs.wiki.kernel.org/index.php/Mount_options

It looks like a kernel panic situation (which could be caused by an error
totally unrelated to the file system or the disk) could also kill the file
system. IMO more tests are required to determine the probability of trashing
the filesystem when using the 'nobarrier' mount option.


On Mon, Sep 14, 2015 at 11:23 AM, Jan Schermer  wrote:

> I looked into this just last week.
>
> Everybody seems to think it's safe to disable barriers if you have a
> non-volatile cache on the block device (be it controller, drive or SAN
> array), all documentation for major databases and distributions indicate
> you can disable them safely in this case.
>
> Someone would have to dig through the source code, but the only
> difference with barriers disabled should be the lack of a "flush" command sent to
> the drive.
> However, if the "flush" is one level up, the requests could in fact be
> reordered.
>
> Let's just hope someone didn't screw up...
>
> Jan
>
> > On 14 Sep 2015, at 11:15, Nick Fisk  wrote:
> >
> >
> >
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of
> >> Christian Balzer
> >> Sent: 14 September 2015 09:43
> >> To: ceph-us...@ceph.com
> >> Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD
> >>
> >>
> >> Hello,
> >>
> >> Firstly thanks to Richard on getting back to us about this.
> >>
> >> On Mon, 14 Sep 2015 09:31:01 +0100 Nick Fisk wrote:
> >>
> >>> Are we sure nobarriers is safe? From what I understand barriers are
> >>> there to ensure correct ordering of writes, not just to make sure data
> >>> is flushed down to a non-volatile medium. Although the Intel SSD’s
> >>> have power loss protection, is there not a risk that the Linux
> >>> scheduler might be writing data out of order to the SSD’s, meaning
> >>> that in the case of power loss, essential FS data might be lost in the
> OS
> >> buffers?
> >>>
> >> The way I understand it barriers ensure order and thus consistency in
> face of
> >> non-volatile caches.
> >> So DC Intel SSDs are on the same page as BBU backed cached RAID
> >> controllers with HW cache (and the HDD caches turned OFF!).
> >> That is, completely safe with no-barriers.
> >>
> >> To quote from the mount man page:
> >> ---
> >> This enables/disables barriers.  barrier=0 disables it, barrier=1
> enables  it.
> >> Write  barriers  enforce proper on-disk ordering of journal commits,
> making
> >> volatile disk write caches safe to use, at some performance penalty.
> The
> >> ext3 filesystem does not enable write barriers by default.  Be sure to
> enable
> >> barriers unless your disks are battery-backed one way or another.
> Otherwise
> >> you risk filesystem corruption in case of power failure.
> >> ---
> >>
> >> Unflushed (dirty) data in the page cache is _always_ lost when the power
> >> fails.
> >
> > But that was my point, barriers should make sure that the data left in
> page cache is not in an order that would cause corruption. Ie data written
> but journal hasn't.
> >
> > This guy seems to think the same
> >
> >
> http://symcbean.blogspot.co.uk/2014/03/warning-bbwc-may-be-bad-for-your-health.html
> >
> > But then a Redhat bug was closed suggesting that FS journal operations
> are always done as NOOP regardless of the scheduler.so I guess that
> means it's safe???
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1104380
> >
> >
> >>
> >> That said, having to disable barriers to make Avago/LSI happy is not
> >> something that gives me the warm fuzzies.
> >>
> >> Christian
> >>>
> >>>
> >>> Maybe running with the NOOP scheduler and nobarriers maybe safe, but
> >>> unless someone with more knowledge on the subject can confirm, I would
> >>> be wary about using nobarriers with CFQ or Deadline.
> >>>
> >>>
> >>>
> >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >>> Of Richard Bade Sent: 14 September 2015 01:31
> >>> Cc: ceph-us...@ceph.com
> >>> Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD
> >>>
> >>>
> >>>
> >>> Hi Everyone,
> >>>
> >>> I updated the firmware on 3 S3710 drives (one host) last Tuesday and
> >>> have not seen any ATA resets or Task Aborts on that host in the 5 days
> >>> since.
> >>>
> >>> I also set nobarriers on another host on Wednesday and have only seen
> >>> one Task Abort, and that was on an S3710.
> >>>
> >>> I have seen 18 ATA resets or Task Aborts on the two hosts that I made
> >>> no changes on.
> >>>
> >>> It looks like this firmware has fixed my issues, but it looks like
> >>> nobarriers also improves the situation significantly. Which seems to
>>> correlate with your experience, Christian.

Re: [ceph-users] Monitor segfault

2015-09-14 Thread Eino Tuominen
Hello,

I'm pretty sure I did it just like you were trying to do. The cluster has since 
been upgraded a couple of times. Unfortunately I can't remember when I created 
that particular faulty rule.
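
(For anyone else chasing a crash like this: the map can be round-tripped 
and exercised offline, without touching the monitors; a sketch, assuming a 
monitor is still reachable to fetch it:

   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt
   crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-bad-mappings

A faulty rule should then misbehave in userspace crushtool rather than 
segfaulting a mon.)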

-- 
  Eino Tuominen

> Kefu Chai  kirjoitti 14.9.2015 kello 11.57:
> 
> Eino,
> 
> - Original Message -
>> From: "Gregory Farnum" 
>> To: "Eino Tuominen" 
>> Cc: ceph-us...@ceph.com, "Kefu Chai" , j...@suse.de
>> Sent: Monday, August 31, 2015 4:45:40 PM
>> Subject: Re: [ceph-users] Monitor segfault
>> 
>>> On Mon, Aug 31, 2015 at 9:33 AM, Eino Tuominen  wrote:
>>> Hello,
>>> 
>>> I'm getting a segmentation fault error from the monitor of our test
>>> cluster. The cluster was in a bad state because I have recently removed
>>> three hosts from it. Now I started cleaning it up and first marked the
>>> removed osd's as lost (ceph osd lost), and then I tried to remove the
>>> osd's from the crush map (ceph osd crush remove). After a few successful
>>> commands the cluster ceased to respond. On monitor seemed to stay up (it
> 
> Eino, i was looking at your issue at http://tracker.ceph.com/issues/12876.
> seems it is due to a faulty crush rule; see 
> http://tracker.ceph.com/issues/12876#note-5.
> may i know how you managed to inject it into the monitor? i tried using
> 
> $ ceph osd setcrushmap -i new-crush-map
> Error EINVAL: Failed to parse crushmap: *** Caught signal (Segmentation 
> fault) **
> 
> but no luck.
> 
>>> was responding through the admin socket), so I stopped it and used
>>> monmaptool to remove the failed monitor from the monmap. But, now also the
>>> second monitor segfaults when I try to start it.
>>> 
>>> The cluster does not have any important data, but I'd like to get the
>>> monitors up as a practice. How do I debug this further?
>>> 
>>> Linux cephmon-test-02 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00
>>> UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>>> 
>>> The output:
>>> 
>>> -2> 2015-08-31 10:28:52.606894 7f8ab493c8c0  0 log_channel(cluster) log
>>> [INF] : pgmap v1845959: 6288 pgs: 55 inactive, 153 active, 473
>>> active+clean, 1 stale+active+undersized+degraded+remapped, 455
>>> stale+incomplete, 272 peering, 145 stale+down+peering, 6
>>> degraded+remapped, 1 active+recovery_wait+degraded, 70
>>> undersized+degraded+remapped, 504 incomplete, 206
>>> active+undersized+degraded+remapped, 2 stale+active+clean+inconsistent,
>>> 101 down+peering, 59 active+undersized+degraded+remapped+backfilling, 294
>>> remapped, 11 active+undersized+degraded+remapped+wait_backfill, 1264
>>> active+remapped, 5 stale+undersized+degraded, 1
>>> active+undersized+remapped, 1 stale+active+undersized+degraded, 23
>>> stale+remapped+incomplete, 297 remapped+peering, 1
>>> active+remapped+wait_backfill, 1 degraded, 32 undersized+degraded, 454
>>> active+undersized+degraded, 7 active+recovery_wait+degraded+remapped,
>>> 1134 stale+active+clean, 142 remapped+incomplete, 115 stale+peering, 3
>>> active+recovering+degraded+remapped;
>>>  10014 GB data, 5508 GB used, 41981 GB / 47489 GB avail; 33343/19990223
>>>  objects degraded (0.167%); 45721/19990223 objects misplaced (0.229%)
>>>-1> 2015-08-31 10:28:52.606969 7f8ab493c8c0  0 log_channel(cluster) log
>>>[INF] : mdsmap e1: 0/0/1 up
>>> 0> 2015-08-31 10:28:52.617974 7f8ab493c8c0 -1 *** Caught signal
>>> (Segmentation fault) **
>>> in thread 7f8ab493c8c0
>>> 
>>> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>> 1: /usr/bin/ceph-mon() [0x9a98aa]
>>> 2: (()+0x10340) [0x7f8ab3a3d340]
>>> 3: (crush_do_rule()+0x292) [0x85ada2]
>>> 4: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int,
>>> std::allocator<int> >*, int*, unsigned int*) const+0xeb) [0x7a85cb]
>>> 5: (OSDMap::pg_to_raw_up(pg_t, std::vector<int, std::allocator<int> >*,
>>> int*) const+0x94) [0x7a8a64]
>>> 6: (OSDMap::remove_redundant_temporaries(CephContext*, OSDMap const&,
>>> OSDMap::Incremental*)+0x317) [0x7ab8f7]
>>> 7: (OSDMonitor::create_pending()+0xf69) [0x60fdb9]
>>> 8: (PaxosService::_active()+0x709) [0x6047b9]
>>> 9: (PaxosService::election_finished()+0x67) [0x604ad7]
>>> 10: (Monitor::win_election(unsigned int, std::set,
>>> std::allocator >&, unsigned long, MonCommand const*, int,
>>> std::set, std::allocator > const*)
>>> +0x236) [0x5c34a6]
>>> 11: (Monitor::win_standalone_election()+0x1cc) [0x5c388c]
>>> 12: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
>>> 13: (Monitor::init()+0xd5) [0x5c4645]
>>> 14: (main()+0x2470) [0x5769c0]
>>> 15: (__libc_start_main()+0xf5) [0x7f8ab1ec7ec5]
>>> 16: /usr/bin/ceph-mon() [0x5984f7]
>>> NOTE: a copy of the executable, or `objdump -rdS ` is needed
>>> to interpret this.
>> 
>> Can you get a core dump, open it in gdb, and provide the output of the
>> "backtrace" command?
>> 
>> The cluster is for some reason trying to create new PGs and something
>> is going wrong; I suspect the monitors aren't handling the loss of PGs
>> properly. :/
>> -Greg
>> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Jan Schermer
I looked into this just last week.

Everybody seems to think it's safe to disable barriers if you have a 
non-volatile cache on the block device (be it controller, drive or SAN array), 
all documentation for major databases and distributions indicate you can 
disable them safely in this case.

Someone would have to dig through the source code, but the only difference 
with barriers disabled should be the lack of a "flush" command sent to the drive.
However, if the "flush" is one level up, the requests could in fact be reordered.

Let's just hope someone didn't screw up...

Jan

> On 14 Sep 2015, at 11:15, Nick Fisk  wrote:
> 
> 
> 
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Christian Balzer
>> Sent: 14 September 2015 09:43
>> To: ceph-us...@ceph.com
>> Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD
>> 
>> 
>> Hello,
>> 
>> Firstly thanks to Richard on getting back to us about this.
>> 
>> On Mon, 14 Sep 2015 09:31:01 +0100 Nick Fisk wrote:
>> 
>>> Are we sure nobarriers is safe? From what I understand barriers are
>>> there to ensure correct ordering of writes, not just to make sure data
>>> is flushed down to a non-volatile medium. Although the Intel SSD’s
>>> have power loss protection, is there not a risk that the Linux
>>> scheduler might be writing data out of order to the SSD’s, meaning
>>> that in the case of power loss, essential FS data might be lost in the OS
>> buffers?
>>> 
>> The way I understand it barriers ensure order and thus consistency in face of
>> non-volatile caches.
>> So DC Intel SSDs are on the same page as BBU backed cached RAID
>> controllers with HW cache (and the HDD caches turned OFF!).
>> That is, completely safe with no-barriers.
>> 
>> To quote from the mount man page:
>> ---
>> This enables/disables barriers.  barrier=0 disables it, barrier=1  enables  
>> it.
>> Write  barriers  enforce proper on-disk ordering of journal commits, making
>> volatile disk write caches safe to use, at some performance penalty.  The
>> ext3 filesystem does not enable write barriers by default.  Be sure to enable
>> barriers unless your disks are battery-backed one way or another.  Otherwise
>> you risk filesystem corruption in case of power failure.
>> ---
>> 
>> Unflushed (dirty) data in the page cache is _always_ lost when the power
>> fails.
> 
> But that was my point, barriers should make sure that the data left in page 
> cache is not in an order that would cause corruption. Ie data written but 
> journal hasn't.
> 
> This guy seems to think the same
> 
> http://symcbean.blogspot.co.uk/2014/03/warning-bbwc-may-be-bad-for-your-health.html
> 
> But then a Redhat bug was closed suggesting that FS journal operations are 
> always done as NOOP regardless of the scheduler.so I guess that means 
> it's safe???
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1104380
> 
> 
>> 
>> That said, having to disable barriers to make Avago/LSI happy is not
>> something that gives me the warm fuzzies.
>> 
>> Christian
>>> 
>>> 
>>> Maybe running with the NOOP scheduler and nobarriers maybe safe, but
>>> unless someone with more knowledge on the subject can confirm, I would
>>> be wary about using nobarriers with CFQ or Deadline.
>>> 
>>> 
>>> 
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>> Of Richard Bade Sent: 14 September 2015 01:31
>>> Cc: ceph-us...@ceph.com
>>> Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD
>>> 
>>> 
>>> 
>>> Hi Everyone,
>>> 
>>> I updated the firmware on 3 S3710 drives (one host) last Tuesday and
>>> have not seen any ATA resets or Task Aborts on that host in the 5 days
>>> since.
>>> 
>>> I also set nobarriers on another host on Wednesday and have only seen
>>> one Task Abort, and that was on an S3710.
>>> 
>>> I have seen 18 ATA resets or Task Aborts on the two hosts that I made
>>> no changes on.
>>> 
>>> It looks like this firmware has fixed my issues, but it looks like
>>> nobarriers also improves the situation significantly. Which seems to
>>> Correlate with your experience Christian.
>>> 
>>> Thanks everyone for the info in this thread, I plan to update the
>>> firmware on the remainder of the S3710 drives this week and also set
>>> nobarriers.
>>> 
>>> Regards,
>>> 
>>> Richard
>>> 
>>> 
>>> 
>>> On 8 September 2015 at 14:27, Richard Bade >>  > wrote:
>>> 
>>> Hi Christian,
>>> 
>>> 
>>> 
>>> On 8 September 2015 at 14:02, Christian Balzer >>  > wrote:
>>> 
>>> Indeed. But first a word about the setup where I'm seeing this.
>>> These are 2 mailbox server clusters (2 nodes each), replicating via
>>> DRBD over Infiniband (IPoIB at this time), LSI 3008 controller. One
>>> cluster with the Samsung DC SSDs, one with the Intel S3610.
>>> 2 of these chassis to be precise:
>>> https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-
>> DC0FR.cf
>>> m
>>> 
>>> 
>>> 
> > We are using the same box, but DC0R (no infiniband) so I guess not surprising we're seeing the same thing happening.

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 14 September 2015 09:43
> To: ceph-us...@ceph.com
> Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD
> 
> 
> Hello,
> 
> Firstly thanks to Richard on getting back to us about this.
> 
> On Mon, 14 Sep 2015 09:31:01 +0100 Nick Fisk wrote:
> 
> > Are we sure nobarriers is safe? From what I understand barriers are
> > there to ensure correct ordering of writes, not just to make sure data
> > is flushed down to a non-volatile medium. Although the Intel SSD’s
> > have power loss protection, is there not a risk that the Linux
> > scheduler might be writing data out of order to the SSD’s, meaning
> > that in the case of power loss, essential FS data might be lost in the OS
> buffers?
> >
> The way I understand it barriers ensure order and thus consistency in face of
> non-volatile caches.
> So DC Intel SSDs are on the same page as BBU backed cached RAID
> controllers with HW cache (and the HDD caches turned OFF!).
> That is, completely safe with no-barriers.
> 
> To quote from the mount man page:
> ---
> This enables/disables barriers.  barrier=0 disables it, barrier=1  enables  
> it.
> Write  barriers  enforce proper on-disk ordering of journal commits, making
> volatile disk write caches safe to use, at some performance penalty.  The
> ext3 filesystem does not enable write barriers by default.  Be sure to enable
> barriers unless your disks are battery-backed one way or another.  Otherwise
> you risk filesystem corruption in case of power failure.
> ---
> 
> Unflushed (dirty) data in the page cache is _always_ lost when the power
> fails.

But that was my point: barriers should make sure that the data left in the page 
cache is not flushed in an order that would cause corruption, i.e. the data 
written before its journal entry.

This guy seems to think the same

http://symcbean.blogspot.co.uk/2014/03/warning-bbwc-may-be-bad-for-your-health.html

But then a Red Hat bug was closed suggesting that FS journal operations are 
always done as NOOP regardless of the scheduler... so I guess that means it's 
safe???

https://bugzilla.redhat.com/show_bug.cgi?id=1104380


> 
> That said, having to disable barriers to make Avago/LSI happy is not
> something that gives me the warm fuzzies.
> 
> Christian
> >
> >
> > Maybe running with the NOOP scheduler and nobarriers maybe safe, but
> > unless someone with more knowledge on the subject can confirm, I would
> > be wary about using nobarriers with CFQ or Deadline.
> >
> >
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Richard Bade Sent: 14 September 2015 01:31
> > Cc: ceph-us...@ceph.com
> > Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD
> >
> >
> >
> > Hi Everyone,
> >
> > I updated the firmware on 3 S3710 drives (one host) last Tuesday and
> > have not seen any ATA resets or Task Aborts on that host in the 5 days
> > since.
> >
> > I also set nobarriers on another host on Wednesday and have only seen
> > one Task Abort, and that was on an S3710.
> >
> > I have seen 18 ATA resets or Task Aborts on the two hosts that I made
> > no changes on.
> >
> > It looks like this firmware has fixed my issues, but it looks like
> > nobarriers also improves the situation significantly. Which seems to
> > Correlate with your experience Christian.
> >
> > Thanks everyone for the info in this thread, I plan to update the
> > firmware on the remainder of the S3710 drives this week and also set
> > nobarriers.
> >
> > Regards,
> >
> > Richard
> >
> >
> >
> > On 8 September 2015 at 14:27, Richard Bade  >  > wrote:
> >
> > Hi Christian,
> >
> >
> >
> > On 8 September 2015 at 14:02, Christian Balzer  >  > wrote:
> >
> > Indeed. But first a word about the setup where I'm seeing this.
> > These are 2 mailbox server clusters (2 nodes each), replicating via
> > DRBD over Infiniband (IPoIB at this time), LSI 3008 controller. One
> > cluster with the Samsung DC SSDs, one with the Intel S3610.
> > 2 of these chassis to be precise:
> > https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-
> DC0FR.cf
> > m
> >
> >
> >
> > We are using the same box, but DC0R (no infiniband) so I guess not
> > surprising we're seeing the same thing happening.
> >
> >
> >
> >
> >
> > Of course latest firmware and I tried this with any kernel from Debian
> > 3.16 to stock 4.1.6.
> >
> > With nobarrier I managed to trigger the error only once yesterday on
> > the DRBD replication target, not the machine that actual has the FS
> mounted.
> > Usually I'd be able to trigger quite a bit more often during those tests.
> >
> > So this morning I updated the firmware of all S3610s on one node and
> > removed the nobarrier flag. It took a lot of punishment, but
> > eventually this happened:
> > ---
> > Sep  8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting
> > task abort! scmd(880fdc85b680) Sep  

Re: [ceph-users] SOLVED: CRUSH odd bucket affinity / persistence

2015-09-14 Thread Christian Balzer

Hello,

looking at your example HW configuration below, I'd suggest you scour the
ML archives for some (rather recent) discussions about node size, mixing
SSD and HDD pools on the same node and the performance (or lack of it when
it comes to cache tiers). 

Christian
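
For the archives, the fix discussed below can be pinned in ceph.conf; a 
minimal sketch (the location string is illustrative):

   [osd]
   # stop OSDs from re-homing themselves under the default root at boot
   osd crush update on start = false
   # or instead declare the location so the automatic move lands correctly:
   # osd crush location = "root=ssd host=node01-ssd"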

On Sun, 13 Sep 2015 16:58:42 -0400 deeepdish wrote:

> Thanks Nick.   Looking at the script, it's something along the lines of what I
> was after.
> 
> I just realized that I could create multiple availability group “hosts”;
> however, your statement that the failure domain is an entire host is
> valid.
> 
> Thanks for all your help everyone.
> 
> > On Sep 13, 2015, at 11:47 , Nick Fisk  wrote:
> > 
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of deeepdish
> >> Sent: 13 September 2015 02:47
> >> To: Johannes Formann 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] CRUSH odd bucket affinity / persistence
> >> 
> >> Johannes,
> >> 
> >> Thank you — "osd crush update on start = false” did the trick.   I
> >> wasn’t aware that ceph has automatic placement logic for OSDs
> >> (http://permalink.gmane.org/gmane.comp.file-
> >> systems.ceph.user/9035).   This brings up a best practice question..
> >> 
> >> How is the configuration of OSD hosts with multiple storage types
> >> (e.g. spinners + flash/ssd), typically implemented in the field from
> >> a crush map / device location perspective?   Preference is for a
> >> scale out design.
> > 
> > I use something based on this script:
> > 
> > https://gist.github.com/wido/5d26d88366e28e25e23d
> > 
> > With the crush hook location config value in ceph.conf. You can pretty
> > much place OSD's wherever you like with it.
> > 
> >> 
> >> In addition to the SSDs which are used for a EC cache tier, I’m also
> >> planning a 5:1 ratio of spinners to SSD for journals.   In this case
> >> I want to implement an availability groups within the OSD host itself.
> >> 
> >> e.g. in a 26-drive chassis, there will be 6 SSDs + 20 spinners.   [2
> >> SSDs for replicated cache tier, 4 SSDs will create 5 availability
> >> groups of 5 spinners each]   The idea is to have CRUSH take into
> >> account SSD journal failure (affecting 5 spinners).
> > 
> > By default Ceph will make the host the smallest failure domain, so I'm
> > not sure if there is any benefit to identifying to crush that several
> > OSD's share one journal. Whether you lose 1 OSD or all OSD's from a
> > server, there shouldn't be any difference to the possibility of data
> > loss. Or have I misunderstood your question?
> > 
> >> 
> >> Thanks.
> >> 
> >> 
> >> 
> >> On Sep 12, 2015, at 19:11 , Johannes Formann 
> >> wrote:
> >> 
> >> Hi,
> >> 
> >> 
> >> I’m having a (strange) issue with OSD bucket persistence / affinity
> >> on my test cluster..
> >> 
> >> The cluster is PoC / test, by no means production.   Consists of a
> >> single OSD / MON host + another MON running on a KVM VM.
> >> 
> >> Out of 12 OSDs I’m trying to get osd.10 and osd.11 to be part of the
> >> ssd bucket in my CRUSH map.   This works fine when either editing the
> >> CRUSH map by hand (exporting, decompile, edit, compile, import), or
> >> via the ceph osd crush set command:
> >> 
> >> "ceph osd crush set osd.11 0.140 root=ssd”
> >> 
> >> I’m able to verify that the OSD / MON host and another MON I have
> >> running see the same CRUSH map.
> >> 
> >> After rebooting OSD / MON host, both osd.10 and osd.11 become part of
> >> the default bucket.   How can I ensure that ODSs persist in their
> >> configured buckets?
> >> 
> >> I guess you have set "osd crush update on start = true"
> >> (http://ceph.com/docs/master/rados/operations/crush-map/ ) and only
> >> the default „root“-entry.
> >> 
> >> Either fix the „root“-Entry in the ceph.conf or set osd crush update
> >> on start = false.
> >> 
> >> greetings
> >> 
> >> Johannes
> > 
> > 
> > 
> > 
> > 
> > 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Christian Balzer

Hello,

Firstly thanks to Richard on getting back to us about this.

On Mon, 14 Sep 2015 09:31:01 +0100 Nick Fisk wrote:

> Are we sure nobarriers is safe? From what I understand barriers are
> there to ensure correct ordering of writes, not just to make sure data
> is flushed down to a non-volatile medium. Although the Intel SSD’s have
> power loss protection, is there not a risk that the Linux scheduler
> might be writing data out of order to the SSD’s, meaning that in the
> case of power loss, essential FS data might be lost in the OS buffers?
>
The way I understand it barriers ensure order and thus consistency in face
of non-volatile caches. 
So DC Intel SSDs are on the same page as BBU backed cached RAID
controllers with HW cache (and the HDD caches turned OFF!). 
That is, completely safe with no-barriers.

To quote from the mount man page:
---
This enables/disables barriers.  barrier=0 disables it, barrier=1  enables  it. 
  Write  barriers  enforce proper on-disk ordering of journal commits, making 
volatile disk write caches safe to use, at some performance penalty.  The ext3 
filesystem does not enable write barriers by default.  Be sure to enable 
barriers unless your disks are battery-backed one way or another.  Otherwise 
you risk filesystem corruption in case of power failure.
---

Unflushed (dirty) data in the page cache is _always_ lost when the power
fails.

That said, having to disable barriers to make Avago/LSI happy is not
something that gives me the warm fuzzies.
 
Christian
>  
> 
> Maybe running with the NOOP scheduler and nobarriers maybe safe, but
> unless someone with more knowledge on the subject can confirm, I would
> be wary about using nobarriers with CFQ or Deadline.
> 
>  
> 
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Richard Bade Sent: 14 September 2015 01:31
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD
> 
>  
> 
> Hi Everyone,
> 
> I updated the firmware on 3 S3710 drives (one host) last Tuesday and
> have not seen any ATA resets or Task Aborts on that host in the 5 days
> since.
> 
> I also set nobarriers on another host on Wednesday and have only seen
> one Task Abort, and that was on an S3710.
> 
> I have seen 18 ATA resets or Task Aborts on the two hosts that I made no
> changes on.
> 
> It looks like this firmware has fixed my issues, but it looks like
> nobarriers also improves the situation significantly. Which seems to
> Correlate with your experience Christian.
> 
> Thanks everyone for the info in this thread, I plan to update the
> firmware on the remainder of the S3710 drives this week and also set
> nobarriers.
> 
> Regards,
> 
> Richard
> 
>  
> 
> On 8 September 2015 at 14:27, Richard Bade   > wrote:
> 
> Hi Christian,
> 
>  
> 
> On 8 September 2015 at 14:02, Christian Balzer   > wrote:
> 
> Indeed. But first a word about the setup where I'm seeing this.
> These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
> over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
> with the Samsung DC SSDs, one with the Intel S3610.
> 2 of these chassis to be precise:
> https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm
> 
>  
> 
> We are using the same box, but DC0R (no infiniband) so I guess not
> surprising we're seeing the same thing happening.
> 
>  
> 
> 
> 
> Of course latest firmware and I tried this with any kernel from Debian
> 3.16 to stock 4.1.6.
> 
> With nobarrier I managed to trigger the error only once yesterday on the
> DRBD replication target, not the machine that actual has the FS mounted.
> Usually I'd be able to trigger quite a bit more often during those tests.
> 
> So this morning I updated the firmware of all S3610s on one node and
> removed the nobarrier flag. It took a lot of punishment, but eventually
> this happened:
> ---
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting task
> abort! scmd(880fdc85b680) Sep  8 10:43:47 mbx09 kernel:
> [ 1743.358339] sd 0:0:1:0: [sdb] CDB: Write(10) 2a 00 0e 9a fb b8 00 00
> 08 00 Sep  8 10:43:47 mbx09 kernel: [ 1743.358345] scsi target0:0:1:
> handle(0x000a), sas_address(0x443322110100), phy(1) Sep  8 10:43:47
> mbx09 kernel: [ 1743.358348] scsi target0:0:1:
> enclosure_logical_id(0x5003048019e98d00), slot(1) Sep  8 10:43:47 mbx09
> kernel: [ 1743.387951] sd 0:0:1:0: task abort: SUCCESS
> scmd(880fdc85b680) --- Note that on the un-patched node (DRBD
> replication target) I managed to trigger this bug 3 times in the same
> period.
> 
> So unless Intel has something to say (and given that this happens with
> Samsungs as well), I'd still look beady eyed at LSI/Avago...
> 
>  
> 
> Yes, I think there may be more than one issue here. The reduction in
> occurrences seems to prove there is an issue fixed by the Intel
> firmware, but something is still happening.
> 
> Once I have updated the firmware on the drives on one of our hosts tonight, 
> hopefully I can get some more statistics and pinpoint if there is another 
> issue specifically with the LSI3008.

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Nick Fisk
Are we sure nobarriers is safe? From what I understand barriers are there to 
ensure correct ordering of writes, not just to make sure data is flushed down 
to a non-volatile medium. Although the Intel SSD’s have power loss protection, 
is there not a risk that the Linux scheduler might be writing data out of order 
to the SSD’s, meaning that in the case of power loss, essential FS data might 
be lost in the OS buffers?

 

Running with the NOOP scheduler and nobarriers may be safe, but unless 
someone with more knowledge on the subject can confirm it, I would be wary about 
using nobarriers with CFQ or Deadline.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Richard Bade
Sent: 14 September 2015 01:31
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD

 

Hi Everyone,

I updated the firmware on 3 S3710 drives (one host) last Tuesday and have not 
seen any ATA resets or Task Aborts on that host in the 5 days since.

I also set nobarriers on another host on Wednesday and have only seen one Task 
Abort, and that was on an S3710.

I have seen 18 ATA resets or Task Aborts on the two hosts that I made no 
changes on.

It looks like this firmware has fixed my issues, but it looks like nobarriers 
also improves the situation significantly. Which seems to Correlate with your 
experience Christian.

Thanks everyone for the info in this thread, I plan to update the firmware on 
the remainder of the S3710 drives this week and also set nobarriers.
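
(For reference, a sketch of what setting nobarriers looks like on an XFS OSD 
mount, with an illustrative device and mountpoint:

   mount -o remount,nobarrier /var/lib/ceph/osd/ceph-12

and, to persist across reboots, the matching /etc/fstab line:

   /dev/sdb1  /var/lib/ceph/osd/ceph-12  xfs  noatime,nobarrier  0 0

Remember to remove it again if you ever move to drives without power-loss 
protection.)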

Regards,

Richard

 

On 8 September 2015 at 14:27, Richard Bade mailto:hitr...@gmail.com> > wrote:

Hi Christian,

 

On 8 September 2015 at 14:02, Christian Balzer mailto:ch...@gol.com> > wrote:

Indeed. But first a word about the setup where I'm seeing this.
These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
with the Samsung DC SSDs, one with the Intel S3610.
2 of these chassis to be precise:
https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm

 

We are using the same box, but DC0R (no infiniband) so I guess not surprising 
we're seeing the same thing happening.

 



Of course latest firmware and I tried this with any kernel from Debian
3.16 to stock 4.1.6.

With nobarrier I managed to trigger the error only once yesterday on the
DRBD replication target, not the machine that actual has the FS mounted.
Usually I'd be able to trigger quite a bit more often during those tests.

So this morning I updated the firmware of all S3610s on one node and
removed the nobarrier flag. It took a lot of punishment, but eventually
this happened:
---
Sep  8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting task abort! 
scmd(880fdc85b680)
Sep  8 10:43:47 mbx09 kernel: [ 1743.358339] sd 0:0:1:0: [sdb] CDB: Write(10) 
2a 00 0e 9a fb b8 00 00 08 00
Sep  8 10:43:47 mbx09 kernel: [ 1743.358345] scsi target0:0:1: handle(0x000a), 
sas_address(0x443322110100), phy(1)
Sep  8 10:43:47 mbx09 kernel: [ 1743.358348] scsi target0:0:1: 
enclosure_logical_id(0x5003048019e98d00), slot(1)
Sep  8 10:43:47 mbx09 kernel: [ 1743.387951] sd 0:0:1:0: task abort: SUCCESS 
scmd(880fdc85b680)
---
Note that on the un-patched node (DRBD replication target) I managed to
trigger this bug 3 times in the same period.

So unless Intel has something to say (and given that this happens with
Samsungs as well), I'd still look beady eyed at LSI/Avago...

 

Yes, I think there may be more than one issue here. The reduction in 
occurrences seems to prove there is an issue fixed by the Intel firmware, but 
something is still happening.

Once I have updated the firmware on the drives on one of our hosts tonight, 
hopefully I can get some more statistics and pinpoint if there is another issue 
specifically with the LSI3008.

I'd be interested to know if the combination of nobarriers and the updated 
firmware fixes the issue.

 

Regards,

Richard

 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com