Re: Serious problem after increase pg_num in pool

2012-02-20 Thread Sławomir Skowron
If there is no chance to stabilize this cluster, I will try something like this:

- stop one machine in the cluster
- check if it's still ok and the data is available
- make a new fs on that machine
- migrate the rados data via obsync
- expand the new cluster with the second and third machines
- change the keys for radosgw, etc.
- the new cluster is up with the old data

Can objects in the .rgw.buckets pool be migrated via obsync?
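
A rough sketch of what that per-bucket copy could look like (purely
hypothetical: the endpoint URLs, bucket names and credential handling below
are assumptions, not something tested against this obsync build):

  SRC=s3://old-rgw.example.com
  DST=s3://new-rgw.example.com
  for bucket in bucket1 bucket2 bucket3; do
      # obsync copies the objects in a source bucket to a destination bucket;
      # credentials come from its environment variables / command-line options
      obsync "$SRC/$bucket" "$DST/$bucket"
  done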

On 21 Feb 2012, at 07:46, "Sławomir Skowron"  wrote:

> 40 GB in 3 copies in rgw bucket, and some data in RBD, but they can be
> destroyed.
>
> Ceph -s reports 224 GB in normal state.
>
> Regards
>
> iSS
>
> On 20 Feb 2012, at 21:19, Sage Weil  wrote:
>
>> Ooh, the pg split functionality is currently broken, and we weren't
>> planning on fixing it for a while longer.  I didn't realize it was still
>> possible to trigger from the monitor.
>>
>> I'm looking at how difficult it is to make it work (even inefficiently).
>>
>> How much data do you have in the cluster?
>>
>> sage
>>
>>
>>
>>
>> On Mon, 20 Feb 2012, Sławomir Skowron wrote:
>>
>>> and this in ceph -w
>>>
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
>>> 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
>>> 10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
>>> 10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
>>> 10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
>>> 10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
>>> 10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
>>> 10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
>>> 10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
>>> 10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
>>> 10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
>>> 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
>>> 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
>>> 10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
>>> 10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
>>> 10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
>>> 10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
>>> 10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
>>> 10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
>>> 10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
>>> 10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
>>> 10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
>>> 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
>>> [20,51,64]
>>>
>>> 2012/2/20 Sławomir Skowron :
 After increasing pg_num from 8 to 100 in .rgw.buckets I have some
 serious problems.

 pool name       category                 KB      objects       clones
   degraded      unfound           rd        rd KB           wr
 wr KB
 .intent-log     -                       4662           19            0
           0           0            0            0        26502
 26501
 .log            -                          0            0            0
           0           0            0            0         9137

Re: Serious problem after increase pg_num in pool

2012-02-20 Thread Sławomir Skowron
40 GB in 3 copies in rgw bucket, and some data in RBD, but they can be
destroyed.

Ceph -s reports 224 GB in normal state.

Regards

iSS

On 20 Feb 2012, at 21:19, Sage Weil  wrote:

> Ooh, the pg split functionality is currently broken, and we weren't
> planning on fixing it for a while longer.  I didn't realize it was still
> possible to trigger from the monitor.
>
> I'm looking at how difficult it is to make it work (even inefficiently).
>
> How much data do you have in the cluster?
>
> sage
>
>
>
>
> On Mon, 20 Feb 2012, Sławomir Skowron wrote:
>
>> and this in ceph -w
>>
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
>> 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
>> 10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
>> 10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
>> 10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
>> 10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
>> 10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
>> 10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
>> 10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
>> 10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
>> 10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
>> 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
>> 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
>> 10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
>> 10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
>> 10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
>> 10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
>> 10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
>> 10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
>> 10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
>> 10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
>> 10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
>> 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
>> [20,51,64]
>>
>> 2012/2/20 Sławomir Skowron :
>>> After increasing pg_num from 8 to 100 in .rgw.buckets I have some
>>> serious problems.
>>>
>>> pool name       category                 KB      objects       clones
>>>   degraded      unfound           rd        rd KB           wr
>>> wr KB
>>> .intent-log     -                       4662           19            0
>>>           0           0            0            0        26502
>>> 26501
>>> .log            -                          0            0            0
>>>           0           0            0            0       913732
>>> 913342
>>> .rgw            -                          1           10            0
>>>           0           0            1            0            9
>>>    7
>>> .rgw.buckets    -                   39582566        73707            0
>>>        8061           0        86594            0       610896
>>> 36050541
>>> .rgw.control    -                          0            1            0
>>>           0           0            0            0            0
>>>    0
>>> .users          -                          1            1            0
>>>           0  

Re: v0.42 released

2012-02-20 Thread Sage Weil
On Mon, 20 Feb 2012, Diego Woitasen wrote:
> Production ready, right?

We are very close, at least with RADOS and RBD.

This is what we are focused on:

 - improving qa coverage.  it's grown by leaps and bounds over the last 
   several months, and is getting better.
 - osd stability.  we are cleaning up a few bits of problematic code, and 
   ensuring that the core functionality is well tested.  that includes 
   api coverage, load/stress testing, and failure testing.  most of the 
   recent work is in the 'thrashing' tests, which continuously mark osds 
   in/out and/or restart daemons (see the sketch after this list).  these 
   tests are all passing.
 - administrator tools.  we've identified the top 10 failure scenarios 
   people are most likely to hit, and are going down the list to 
   ensure that they are detected, reported, diagnosable, and ideally 
   fixable.
 - key/value objects.  this is one of the few recent bits of new 
   functionality, but it's needed to make radosgw perform well for large 
   bucket sizes.
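
Concretely, the thrashing runs automate ordinary admin operations along 
these lines (illustrative commands only, not the actual test harness; the 
osd id and init-script syntax will vary with your setup):

  # mark an osd out and back in, forcing pgs to remap and recover
  ceph osd out 12
  ceph osd in 12

  # restart a single osd daemon on its host
  /etc/init.d/ceph restart osd.12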

And very soon (next sprint?) we will be returning some attention to RBD
caching and layering.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v0.42 released

2012-02-20 Thread Diego Woitasen
On Mon, Feb 20, 2012 at 3:01 PM, Sage Weil  wrote:
> v0.42 is ready!  This has mostly been a stabilization release, with a few
> critical bugs fixed.  There is also an across-the-board change in data
> structure encoding that is not backwards-compatible, but is designed to
> allow future changes to be (both forwards- and backwards-).
>
> Notable changes include:
>
>  * osd: new (non-backwards compatible!) encoding for all structures
>  * osd: fixed bug with transactions being non-atomic (leaking across
>   commit boundaries)
>  * osd: track in-progress requests, log slow ones
>  * osd: randomly choose pull target during recovery (better load balance)
>  * osd: fixed recovery stall
>  * mon: a few recovery bug fixes
>  * mon: trim old auth files
>  * mon: better detection/warning about down pgs
>  * objecter: expose in-process requests via admin socket
>  * new infrastructure for testing data structure encoding changes (forward
>   and backward compatibility)
>
> Aside from the data structure encoding change, there is relatively little
> new code since v0.41.  This should be a pretty solid release.
>
> For v0.43, we are working on merging a few big changes.  The main one is a
> new key/value interface for objects: each object, instead of storing a
> blob of bytes, would consist of a (potentially large) set of key/value
> pairs that can be set/queried efficiently.  This is going to make a huge
> difference for radosgw performance with large buckets, and will help with
> large directories as well.  There is also ongoing stabilization work with
> the OSD and new interfaces for administrators to query the state of the
> cluster and diagnose common problems.
>
> v0.42 can be found from the usual locations:
>
>  * Git at git://github.com/NewDreamNetwork/ceph.git
>  * Tarball at http://ceph.newdream.net/download/ceph-0.42.tar.gz
>  * For Debian/Ubuntu packages, see 
> http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/#installing-the-packages
>
> sage

Production ready, right?

:)

-- 
Diego Woitasen
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Which SSD method is better for performance?

2012-02-20 Thread Paul Pettigrew
G'day Greg, thanks for the fast response.

Yes, I forgot to explicitly state that in CASE1 the journals would go on the 
SATA disks, and it is easy to appreciate the performance impact of that case, 
as you documented nicely in your response.

Re your second point: 
> The other big advantage an SSD provides is in write latency; if you're 
> journaling on an SSD you can send things to disk and get a commit back 
> without having to wait on rotating media. How big an impact that will make 
> will depend on your other config options and use case, though.

Are you able to detail which config options tune this, and give an example 
use case to illustrate?

Many thanks

Paul
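
For reference, the knobs most directly involved seem to be roughly the 
following (a sketch only: option names and values should be double-checked 
against the release in use, and the stanzas are illustrative rather than 
anyone's actual config):

  [osd]
      # how long data may sit in the journal before the filestore is forced
      # to sync; smaller means less data in flight but more sync overhead
      filestore max sync interval = 5
      filestore min sync interval = 0.01
      # use direct IO when writing to the journal device
      journal dio = true

  [osd.0]
      host = ceph1
      # each osd needs its own journal partition on the SSD
      osd journal = /dev/sdb5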


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 21 February 2012 10:50 AM
To: Paul Pettigrew
Cc: Sage Weil; Wido den Hollander; ceph-devel@vger.kernel.org
Subject: Re: Which SSD method is better for performance?

On Mon, Feb 20, 2012 at 4:44 PM, Paul Pettigrew  
wrote:
> Thanks Sage
>
> So following through by two examples, to confirm my understanding
>
> HDD SPECS:
> 8x 2TB SATA HDD's able to do sustained read/write speed of 138MB/s 
> each 1x SSD able to do sustained read/write speed of 475MB/s
>
> CASE1
> (not using SSD)
> 8x OSD's each for the SATA HDD's
> Therefore able to parallelise IO operations Sustained write sent to 
> Ceph of very large file say 500GB (therefore caches all used up and 
> bottleneck becomes SATA IO speed) Gives 8x 138MB/s = 1,104 MB/s
>
> CASE 2
> (using 1x SSD)
> SSD partitioned into 8x separate partitions, 1x for each OSD Sustained 
> write (with OSD-Journal to SSD) sent to Ceph of very large file (say 
> 500GB) Write split across 8x OSD-Journal partitions on the single SSD 
> = limited to aggregate of 475MB/s
>
> ANALYSIS:
> If my examples are how Ceph operates, then it is necessary to not exceed a 
> ratio of 3 SATA : 1 SSD; if 4 or more SATA's are used, then the SSD becomes the 
> bottleneck.
>
> Is this analysis accurate? Are there other benefits that SSD provide 
> (including in non-sustained peak write performance use case) that would 
> otherwise justify their usage? What ratios are other users sticking to when 
> deciding for their design?

Well, you seem to be leaving out the journals entirely in the first case. You 
could put them on a separate partition on the SATA disks if you wanted, which 
(on a modern drive) would net you half the single-stream throughput, or 
~552MB/s aggregate.

The other big advantage an SSD provides is in write latency; if you're 
journaling on an SSD you can send things to disk and get a commit back without 
having to wait on rotating media. How big an impact that will make will depend 
on your other config options and use case, though.
-Greg

>
> Many thanks all - this is all being rolled up into a new "Ceph SSD" wiki page 
> I will be offering to Sage to include in the main Ceph wiki site.
>
> Paul
>
>
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, 20 February 2012 1:16 PM
> To: Paul Pettigrew
> Cc: Wido den Hollander; ceph-devel@vger.kernel.org
> Subject: RE: Which SSD method is better for performance?
>
> On Mon, 20 Feb 2012, Paul Pettigrew wrote:
>> And secondly, should the SSD Journal sizes be large or small?  Ie, is 
>> say 1G partition per paired 2-3TB SATA disk OK? Or as large an SSD as 
>> possible? There are many forum posts that say 100-200MB will suffice.
>> A quick piece of advice will hopefully save us several days of 
>> reconfiguring and benchmarking the Cluster :-)
>
> ceph-osd will periodically do a 'commit' to ensure that stuff in the journal 
> is written safely to the file system.  On btrfs that's a snapshot, on 
> anything else it's a sync(2).  When the journal hits 50% we trigger a 
> commit, or when a timer expires (I think 30 seconds by default).  There is 
> some overhead associated with the sync/snapshot, so less is generally better.
>
> A decent rule of thumb is probably to make the journal big enough to consume 
> sustained writes for 10-30 seconds.  On modern disks, that's probably 1-3GB?  
> If the journal is on the same spindle as the fs, it'll probably be half 
> that...
> 
>
> sage
>
>
>
>>
>> Thanks
>>
>> Paul
>>
>>
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org 
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den 
>> Hollander
>> Sent: Tuesday, 14 February 2012 10:46 PM
>> To: Paul Pettigrew
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: Which SSD method is better for performance?
>>
>> Hi,
>>
>> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
>> > G'day all
>> >
>> > About to commence an R&D eval of the Ceph platform having been impressed 
>> > with the momentum achieved over the past 12mths.
>> >
>> > I have one question re design before rolling out to metal
>> >
>> > I will be using 1x SS

RE: Which SSD method is better for performance?

2012-02-20 Thread Sage Weil
On Tue, 21 Feb 2012, Paul Pettigrew wrote:
> Thanks Sage
> 
> So following through by two examples, to confirm my understanding
> 
> HDD SPECS:
> 8x 2TB SATA HDD's able to do sustained read/write speed of 138MB/s each
> 1x SSD able to do sustained read/write speed of 475MB/s
> 
> CASE1
> (not using SSD)
> 8x OSD's each for the SATA HDD's
> Therefore able to parallelise IO operations
> Sustained write sent to Ceph of very large file say 500GB (therefore caches 
> all used up and bottleneck becomes SATA IO speed) 
> Gives 8x 138MB/s = 1,104 MB/s
> 
> CASE 2
> (using 1x SSD)
> SSD partitioned into 8x separate partitions, 1x for each OSD
> Sustained write (with OSD-Journal to SSD) sent to Ceph of very large file 
> (say 500GB)
> Write split across 8x OSD-Journal partitions on the single SSD = limited to 
> aggregate of 475MB/s
> 
> ANALYSIS:
> If my examples are how Ceph operates, then it is necessary to not exceed a 
> ratio of 3 SATA : 1 SSD; if 4 or more SATA's are used, then the SSD becomes the 
> bottleneck.
> 
> Is this analysis accurate? Are there other benefits that SSD provide 
> (including in non-sustained peak write performance use case) that would 
> otherwise justify their usage? What ratios are other users sticking to when 
> deciding for their design?

Modulo the missing journals in case 1, I think so.  For most people, 
though, it is pretty rare to try to saturate every disk... there is 
usually some small write and/or read activity going on, and maxing out the 
SSD isn't a problem.  It sounds like you have bonded 10gige interfaces to 
drive this?

It may be possible for ceph-osd to skip the journal when it isn't able to 
keep up with the file system.  That will give you crummy latency (since 
writes won't commit until the fs does a sync/commit), but the latency is 
already bad if the journal is behind.  We already do something similar if 
the journal fills up.  (This would only work with btrfs; for other file 
systems we also need the journal to preserve transaction atomicity.)

sage


> 
> Many thanks all - this is all being rolled up into a new "Ceph SSD" wiki page 
> I will be offering to Sage to include in the main Ceph wiki site.
> 
> Paul
> 
> 
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, 20 February 2012 1:16 PM
> To: Paul Pettigrew
> Cc: Wido den Hollander; ceph-devel@vger.kernel.org
> Subject: RE: Which SSD method is better for performance?
> 
> On Mon, 20 Feb 2012, Paul Pettigrew wrote:
> > And secondly, should the SSD Journal sizes be large or small?  Ie, is 
> > say 1G partition per paired 2-3TB SATA disk OK? Or as large an SSD as 
> > possible? There are many forum posts that say 100-200MB will suffice.
> > A quick piece of advice will hopefully save us several days of 
> > reconfiguring and benchmarking the Cluster :-)
> 
> ceph-osd will periodically do a 'commit' to ensure that stuff in the journal 
> is written safely to the file system.  On btrfs that's a snapshot, on 
> anything else it's a sync(2).  When the journal hits 50% we trigger a 
> commit, or when a timer expires (I think 30 seconds by default).  There is 
> some overhead associated with the sync/snapshot, so less is generally better.
> 
> A decent rule of thumb is probably to make the journal big enough to consume 
> sustained writes for 10-30 seconds.  On modern disks, that's probably 1-3GB?  
> If the journal is on the same spindle as the fs, it'll probably be half 
> that...
> 
> 
> sage
> 
> 
> 
> > 
> > Thanks
> > 
> > Paul
> > 
> > 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org 
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den 
> > Hollander
> > Sent: Tuesday, 14 February 2012 10:46 PM
> > To: Paul Pettigrew
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: Which SSD method is better for performance?
> > 
> > Hi,
> > 
> > On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
> > > G'day all
> > >
> > > About to commence an R&D eval of the Ceph platform having been impressed 
> > > with the momentum achieved over the past 12mths.
> > >
> > > I have one question re design before rolling out to metal
> > >
> > > I will be using 1x SSD drive per storage server node (assume it is 
> > > /dev/sdb for this discussion), and cannot readily determine the pro/con's 
> > > for the two methods of using it for OSD-Journal, being:
> > > #1. place it in the main [osd] stanza and reference the whole drive 
> > > as a single partition; or
> > 
> > That won't work. If you do that all OSD's will try to open the journal. 
> > The journal for each OSD has to be unique.
> > 
> > > #2. partition up the disk, so 1x partition per SATA HDD, and place 
> > > each partition in the [osd.N] portion
> > 
> > That would be your best option.
> > 
> > I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
> > 
> > the VG "data" is placed on a SSD (Intel X25-M).
> > 
> > >
> > > So 

Re: Which SSD method is better for performance?

2012-02-20 Thread Gregory Farnum
On Mon, Feb 20, 2012 at 4:44 PM, Paul Pettigrew
 wrote:
> Thanks Sage
>
> So following through by two examples, to confirm my understanding
>
> HDD SPECS:
> 8x 2TB SATA HDD's able to do sustained read/write speed of 138MB/s each
> 1x SSD able to do sustained read/write speed of 475MB/s
>
> CASE1
> (not using SSD)
> 8x OSD's each for the SATA HDD's
> Therefore able to parallelise IO operations
> Sustained write sent to Ceph of very large file say 500GB (therefore caches 
> all used up and bottleneck becomes SATA IO speed)
> Gives 8x 138MB/s = 1,104 MB/s
>
> CASE 2
> (using 1x SSD)
> SSD partitioned into 8x separate partitions, 1x for each OSD
> Sustained write (with OSD-Journal to SSD) sent to Ceph of very large file 
> (say 500GB)
> Write split across 8x OSD-Journal partitions on the single SSD = limited to 
> aggregate of 475MB/s
>
> ANALYSIS:
> If my examples are how Ceph operates, then it is necessary to not exceed a 
> ratio of 3 SATA : 1 SSD; if 4 or more SATA's are used, then the SSD becomes the 
> bottleneck.
>
> Is this analysis accurate? Are there other benefits that SSD provide 
> (including in non-sustained peak write performance use case) that would 
> otherwise justify their usage? What ratios are other users sticking to when 
> deciding for their design?

Well, you seem to be leaving out the journals entirely in the first
case. You could put them on a separate partition on the SATA disks if
you wanted, which (on a modern drive) would net you half the
single-stream throughput, or ~552MB/s aggregate.

The other big advantage an SSD provides is in write latency; if you're
journaling on an SSD you can send things to disk and get a commit back
without having to wait on rotating media. How big an impact that will
make will depend on your other config options and use case, though.
-Greg

>
> Many thanks all - this is all being rolled up into a new "Ceph SSD" wiki page 
> I will be offering to Sage to include in the main Ceph wiki site.
>
> Paul
>
>
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, 20 February 2012 1:16 PM
> To: Paul Pettigrew
> Cc: Wido den Hollander; ceph-devel@vger.kernel.org
> Subject: RE: Which SSD method is better for performance?
>
> On Mon, 20 Feb 2012, Paul Pettigrew wrote:
>> And secondly, should the SSD Journal sizes be large or small?  Ie, is
>> say 1G partition per paired 2-3TB SATA disk OK? Or as large an SSD as
>> possible? There are many forum posts that say 100-200MB will suffice.
>> A quick piece of advice will hopefully save us several days of
>> reconfiguring and benchmarking the Cluster :-)
>
> ceph-osd will periodically do a 'commit' to ensure that stuff in the journal 
> is written safely to the file system.  On btrfs that's a snapshot, on 
> anything else it's a sync(2).  When the journal hits 50% we trigger a 
> commit, or when a timer expires (I think 30 seconds by default).  There is 
> some overhead associated with the sync/snapshot, so less is generally better.
>
> A decent rule of thumb is probably to make the journal big enough to consume 
> sustained writes for 10-30 seconds.  On modern disks, that's probably 1-3GB?  
> If the journal is on the same spindle as the fs, it'll probably be half 
> that...
> 
>
> sage
>
>
>
>>
>> Thanks
>>
>> Paul
>>
>>
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den
>> Hollander
>> Sent: Tuesday, 14 February 2012 10:46 PM
>> To: Paul Pettigrew
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: Which SSD method is better for performance?
>>
>> Hi,
>>
>> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
>> > G'day all
>> >
>> > About to commence an R&D eval of the Ceph platform having been impressed 
>> > with the momentum achieved over the past 12mths.
>> >
>> > I have one question re design before rolling out to metal
>> >
>> > I will be using 1x SSD drive per storage server node (assume it is 
>> > /dev/sdb for this discussion), and cannot readily determine the pro/con's 
>> > for the two methods of using it for OSD-Journal, being:
>> > #1. place it in the main [osd] stanza and reference the whole drive
>> > as a single partition; or
>>
>> That won't work. If you do that all OSD's will try to open the journal.
>> The journal for each OSD has to be unique.
>>
>> > #2. partition up the disk, so 1x partition per SATA HDD, and place
>> > each partition in the [osd.N] portion
>>
>> That would be your best option.
>>
>> I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
>>
>> the VG "data" is placed on a SSD (Intel X25-M).
>>
>> >
>> > So if I were to code #1 in the ceph.conf file, it would be:
>> > [osd]
>> > osd journal = /dev/sdb
>> >
>> > Or, #2 would be like:
>> > [osd.0]
>> >          host = ceph1
>> >          btrfs devs = /dev/sdc
>> >          osd journal = /dev/sdb5
>> > [osd.1]
>> >          host = ceph1
>> >    

RE: Which SSD method is better for performance?

2012-02-20 Thread Paul Pettigrew
Thanks Sage

So following through by two examples, to confirm my understanding

HDD SPECS:
8x 2TB SATA HDD's able to do sustained read/write speed of 138MB/s each
1x SSD able to do sustained read/write speed of 475MB/s

CASE1
(not using SSD)
8x OSD's each for the SATA HDD's
Therefore able to parallelise IO operations
Sustained write sent to Ceph of very large file say 500GB (therefore caches all 
used up and bottleneck becomes SATA IO speed) 
Gives 8x 138MB/s = 1,104 MB/s

CASE 2
(using 1x SSD)
SSD partitioned into 8x separate partitions, 1x for each OSD
Sustained write (with OSD-Journal to SSD) sent to Ceph of very large file (say 
500GB)
Write split across 8x OSD-Journal partitions on the single SSD = limited to 
aggregate of 475MB/s

ANALYSIS:
If my examples are how Ceph operates, then it is necessary to not exceed a 
ratio of 3 SATA : 1 SSD; if 4 or more SATA's are used, then the SSD becomes the 
bottleneck.

Is this analysis accurate? Are there other benefits that SSD provide (including 
in non-sustained peak write performance use case) that would otherwise justify 
their usage? What ratios are other users sticking to when deciding for their 
design?

Many thanks all - this is all being rolled up into a new "Ceph SSD" wiki page I 
will be offering to Sage to include in the main Ceph wiki site.

Paul
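
A quick back-of-the-envelope check of the numbers above (a sketch: the 
halved figure assumes the journal shares each data spindle, and the journal 
sizing applies the 10-30 second rule from Sage's mail quoted below):

  echo $((8 * 138))                  # case 1, journals ignored: 1104 MB/s
  echo $((8 * 138 / 2))              # case 1, journal on each spindle: ~552 MB/s
  echo $((475 / 138))                # case 2: one SSD covers roughly 3 disks
  echo $((138 * 10)) $((138 * 30))   # per-disk journal for 10-30s: ~1380-4140 MB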



-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, 20 February 2012 1:16 PM
To: Paul Pettigrew
Cc: Wido den Hollander; ceph-devel@vger.kernel.org
Subject: RE: Which SSD method is better for performance?

On Mon, 20 Feb 2012, Paul Pettigrew wrote:
> And secondly, should the SSD Journal sizes be large or small?  Ie, is 
> say 1G partition per paired 2-3TB SATA disk OK? Or as large an SSD as 
> possible? There are many forum posts that say 100-200MB will suffice.
> A quick piece of advice will hopefully save us several days of 
> reconfiguring and benchmarking the Cluster :-)

ceph-osd will periodically do a 'commit' to ensure that stuff in the journal is 
written safely to the file system.  On btrfs that's a snapshot, on anything 
else it's a sync(2).  When the journal hits 50% we trigger a commit, or when a 
timer expires (I think 30 seconds by default).  There is some overhead 
associated with the sync/snapshot, so less is generally better.

A decent rule of thumb is probably to make the journal big enough to consume 
sustained writes for 10-30 seconds.  On modern disks, that's probably 1-3GB?  
If the journal is on the same spindle as the fs, it'll probably be half that...


sage



> 
> Thanks
> 
> Paul
> 
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den 
> Hollander
> Sent: Tuesday, 14 February 2012 10:46 PM
> To: Paul Pettigrew
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Which SSD method is better for performance?
> 
> Hi,
> 
> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
> > G'day all
> >
> > About to commence an R&D eval of the Ceph platform having been impressed 
> > with the momentum achieved over the past 12mths.
> >
> > I have one question re design before rolling out to metal
> >
> > I will be using 1x SSD drive per storage server node (assume it is /dev/sdb 
> > for this discussion), and cannot readily determine the pro/con's for the 
> > two methods of using it for OSD-Journal, being:
> > #1. place it in the main [osd] stanza and reference the whole drive 
> > as a single partition; or
> 
> That won't work. If you do that all OSD's will try to open the journal. 
> The journal for each OSD has to be unique.
> 
> > #2. partition up the disk, so 1x partition per SATA HDD, and place 
> > each partition in the [osd.N] portion
> 
> That would be your best option.
> 
> I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
> 
> the VG "data" is placed on a SSD (Intel X25-M).
> 
> >
> > So if I were to code #1 in the ceph.conf file, it would be:
> > [osd]
> > osd journal = /dev/sdb
> >
> > Or, #2 would be like:
> > [osd.0]
> >  host = ceph1
> >  btrfs devs = /dev/sdc
> >  osd journal = /dev/sdb5
> > [osd.1]
> >  host = ceph1
> >  btrfs devs = /dev/sdd
> >  osd journal = /dev/sdb6
> > [osd.2]
> >  host = ceph1
> >  btrfs devs = /dev/sde
> >  osd journal = /dev/sdb7
> > [osd.3]
> >  host = ceph1
> >  btrfs devs = /dev/sdf
> >  osd journal = /dev/sdb8
> >
> > I am asking therefore, is the added work (and constraints) of specifying 
> > down to individual partitions per #2 worth it in performance gains? Does it 
> > not also have a constraint, in that if I wanted to add more HDD's into the 
> > server (we buy 45 bay units, and typically provision HDD's "on demand" i.e. 
> > 15x at a time as usage grows), I would have to additionally partition the 
> > SSD (taking it offline) - but if it were #1 option, I woul

Re: [PATCH 04/11] ceph: Push file_update_time() into ceph_page_mkwrite()

2012-02-20 Thread Sage Weil
On Mon, 20 Feb 2012, Jan Kara wrote:
> On Thu 16-02-12 11:13:53, Sage Weil wrote:
> > On Thu, 16 Feb 2012, Alex Elder wrote:
> > > On Thu, 2012-02-16 at 14:46 +0100, Jan Kara wrote:
> > > > CC: Sage Weil 
> > > > CC: ceph-devel@vger.kernel.org
> > > > Signed-off-by: Jan Kara 
> > > 
> > > 
> > > This will update the timestamp even if a write
> > > fault fails, which is different from before.
> > > 
> > > Hard to avoid though.
> > > 
> > > Looks good to me.
> > 
> > Yeah.  Let's put something in the tracker to take a look later (I think we 
> > can do better), but this is okay for now.
> > 
> > Signed-off-by: Sage Weil 
>   Thanks! Just an administrative note - the tag above should rather be
> Acked-by or Reviewed-by. You'd use Signed-off-by only if you took the patch
> and merged it via your tree... So can I add Acked-by?

Oh, right.  Acked-by!

sage

> 
>   Honza
> 
> > > Signed-off-by: Alex Elder 
> > > 
> > > >  fs/ceph/addr.c |3 +++
> > > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > index 173b1d2..12b139f 100644
> > > > --- a/fs/ceph/addr.c
> > > > +++ b/fs/ceph/addr.c
> > > > @@ -1181,6 +1181,9 @@ static int ceph_page_mkwrite(struct 
> > > > vm_area_struct *vma, struct vm_fault *vmf)
> > > > loff_t size, len;
> > > > int ret;
> > > >  
> > > > +   /* Update time before taking page lock */
> > > > +   file_update_time(vma->vm_file);
> > > > +
> > > > size = i_size_read(inode);
> > > > if (off + PAGE_CACHE_SIZE <= size)
> > > > len = PAGE_CACHE_SIZE;
> > > 
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> -- 
> Jan Kara 
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/11] ceph: Push file_update_time() into ceph_page_mkwrite()

2012-02-20 Thread Jan Kara
On Thu 16-02-12 11:13:53, Sage Weil wrote:
> On Thu, 16 Feb 2012, Alex Elder wrote:
> > On Thu, 2012-02-16 at 14:46 +0100, Jan Kara wrote:
> > > CC: Sage Weil 
> > > CC: ceph-devel@vger.kernel.org
> > > Signed-off-by: Jan Kara 
> > 
> > 
> > This will update the timestamp even if a write
> > fault fails, which is different from before.
> > 
> > Hard to avoid though.
> > 
> > Looks good to me.
> 
> Yeah.  Let's put something in the tracker to take a look later (I think we 
> can do better), but this is okay for now.
> 
> Signed-off-by: Sage Weil 
  Thanks! Just an administrative note - the tag above should rather be
Acked-by or Reviewed-by. You'd use Signed-off-by only if you took the patch
and merged it via your tree... So can I add Acked-by?

Honza

> > Signed-off-by: Alex Elder 
> > 
> > >  fs/ceph/addr.c |3 +++
> > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 173b1d2..12b139f 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -1181,6 +1181,9 @@ static int ceph_page_mkwrite(struct vm_area_struct 
> > > *vma, struct vm_fault *vmf)
> > >   loff_t size, len;
> > >   int ret;
> > >  
> > > + /* Update time before taking page lock */
> > > + file_update_time(vma->vm_file);
> > > +
> > >   size = i_size_read(inode);
> > >   if (off + PAGE_CACHE_SIZE <= size)
> > >   len = PAGE_CACHE_SIZE;
> > 
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Serious problem after increase pg_num in pool

2012-02-20 Thread Sage Weil
Ooh, the pg split functionality is currently broken, and we weren't 
planning on fixing it for a while longer.  I didn't realize it was still 
possible to trigger from the monitor.

I'm looking at how difficult it is to make it work (even inefficiently).  

How much data do you have in the cluster?

sage




On Mon, 20 Feb 2012, Sławomir Skowron wrote:

> and this in ceph -w
> 
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
> 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
> 10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
> 10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
> 10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
> 10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
> 10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
> 10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
> 10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
> 10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
> 10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
> 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
> 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
> 10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
> 10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
> 10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
> 10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
> 10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
> 10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
> 10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
> 10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
> 10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
> 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
> [20,51,64]
> 
> 2012/2/20 Sławomir Skowron :
> > After increasing pg_num from 8 to 100 in .rgw.buckets I have some
> > serious problems.
> >
> > pool name       category                 KB      objects       clones
> >   degraded      unfound           rd        rd KB           wr
> > wr KB
> > .intent-log     -                       4662           19            0
> >           0           0            0            0        26502
> > 26501
> > .log            -                          0            0            0
> >           0           0            0            0       913732
> > 913342
> > .rgw            -                          1           10            0
> >           0           0            1            0            9
> >    7
> > .rgw.buckets    -                   39582566        73707            0
> >        8061           0        86594            0       610896
> > 36050541
> > .rgw.control    -                          0            1            0
> >           0           0            0            0            0
> >    0
> > .users          -                          1            1            0
> >           0           0            0            0            1
> >    1
> > .users.uid      -                          1            2            0
> >           0           0            2            1            3
> >    3
> > data            -                          0            0        

Re: Serious problem after increase pg_num in pool

2012-02-20 Thread Sławomir Skowron
and this in ceph -w

2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
[20,51,64]

2012/2/20 Sławomir Skowron :
> After increasing pg_num from 8 to 100 in .rgw.buckets I have some
> serious problems.
>
> pool name       category                 KB      objects       clones
>   degraded      unfound           rd        rd KB           wr
> wr KB
> .intent-log     -                       4662           19            0
>           0           0            0            0        26502
> 26501
> .log            -                          0            0            0
>           0           0            0            0       913732
> 913342
> .rgw            -                          1           10            0
>           0           0            1            0            9
>    7
> .rgw.buckets    -                   39582566        73707            0
>        8061           0        86594            0       610896
> 36050541
> .rgw.control    -                          0            1            0
>           0           0            0            0            0
>    0
> .users          -                          1            1            0
>           0           0            0            0            1
>    1
> .users.uid      -                          1            2            0
>           0           0            2            1            3
>    3
> data            -                          0            0            0
>           0           0            0            0            0
>    0
> metadata        -                          0            0            0
>           0           0            0            0            0
>    0
> rbd             -                   21590723         5328            0
>           1           0           77           75      3013595
> 378345507
>  total used       229514252        79068
>  total avail    19685615164
>  total space    20980898464
>
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:

Serious problem after increase pg_num in pool

2012-02-20 Thread Sławomir Skowron
After increasing pg_num from 8 to 100 in .rgw.buckets I have some
serious problems.
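
For reference, the change was made from the monitor with a command of this 
form (syntax as assumed for this release, please verify against your version):

  ceph osd pool set .rgw.buckets pg_num 100

  # watch the cluster state while the new pgs are created
  ceph -s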

pool name       category                 KB      objects       clones
  degraded      unfound           rd        rd KB           wr
wr KB
.intent-log     -                       4662           19            0
          0           0            0            0        26502
26501
.log            -                          0            0            0
          0           0            0            0       913732
913342
.rgw            -                          1           10            0
          0           0            1            0            9
   7
.rgw.buckets    -                   39582566        73707            0
       8061           0        86594            0       610896
36050541
.rgw.control    -                          0            1            0
          0           0            0            0            0
   0
.users          -                          1            1            0
          0           0            0            0            1
   1
.users.uid      -                          1            2            0
          0           0            2            1            3
   3
data            -                          0            0            0
          0           0            0            0            0
   0
metadata        -                          0            0            0
          0           0            0            0            0
   0
rbd             -                   21590723         5328            0
          1           0           77           75      3013595
378345507
 total used       229514252        79068
 total avail    19685615164
 total space    20980898464

2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384251 mon.0
10.177.64.4:6789/0 36135 : [INF] osd.28 10.177.64.6:6806/824 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384275 mon.0
10.177.64.4:6789/0 36136 : [INF] osd.37 10.177.64.6:6841/29133 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384301 mon.0
10.177.64.4:6789/0 36137 : [INF] osd.7 10.177.64.4:6813/8223 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384327 mon.0
10.177.64.4:6789/0 36138 : [INF] osd.44 10.177.64.6:6859/2370 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384353 mon.0
10.177.64.4:6789/0 36139 : [INF] osd.49 10.177.64.6:6865/29878 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384384 mon.0
10.177.64.4:6789/0 36140 : [INF] osd.17 10.177.64.4:6827/5909 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384410 mon.0
10.177.64.4:6789/0 36141 : [INF] osd.12 10.177.64.4:6810/5410 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384435 mon.0
10.177.64.4:6789/0 36142 : [INF] osd.39 10.177.64.6:6843/12733 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384461 mon.0
10.177.64.4:6789/0 36143 : [INF] osd.42 10.177.64.6:6848/13067 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384485 mon.0
10.177.64.4:6789/0 36144 : [INF] osd.31 10.177.64.6:6840/1233 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384513 mon.0
10.177.64.4:6789/0 36145 : [INF] osd.36 10.177.64.6:6830/12573 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384537 mon.0
10.177.64.4:6789/0 36146 : [INF] osd.38 10.177.64.6:6833/32587 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384567 mon.0
10.177.64.4:6789/0 36147 : [INF] osd.5 10.177.64.4:6873/7842 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384596 mon.0
10.177.64.4:6789/0 36148 : [INF] osd.21 10.177.64.4:6844/11607 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384622 mon.0
10.177.64.4:6789/0 36149 : [INF] osd.23 10.177.64.4:6853/6826 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384661 mon.0
10.177.64.4:6789/0 36150 : [INF] osd.51 10.177.64.6:6858/15894 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384693 mon.0
10.177.64.4:6789/0 36151 : [INF] osd.48 10.177.64.6:6862/13476 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384723 mon.0
10.177.64.4:6789/0 36152 : [INF] osd.32 10.177.64.6:6815/3701 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384759 mon.0
10.177.64.4:6789/0 36153 : [INF] osd.41 10.177.64.6:6847/1861 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:

Re: Problem after ceph-osd crash

2012-02-20 Thread Sage Weil
On Mon, 20 Feb 2012, Oliver Francke wrote:
> Hi Sage,
> 
> On 02/20/2012 06:41 PM, Sage Weil wrote:
> > On Mon, 20 Feb 2012, Oliver Francke wrote:
> > > Hi,
> > > 
> > > we are just in trouble after some mess with trying to include a new
> > > OSD-node
> > > into our cluster.
> > > 
> > > We get some weird "libceph: corrupt inc osdmap epoch 880 off 102
> > > (c9001db8990a of c9001db898a4-c9001db89dae)"

I just retested the kernel client against the new server code and I don't 
see this.  If you can pull the osdmap/880 file from the monitor data 
directory (soon, please, the monitor will delete it once things fully 
recover and move on) I can see what the data looks like.
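
Grabbing it is just a copy out of the mon data directory before it gets 
trimmed, something like this (the path is whatever "mon data" points to in 
ceph.conf; /srv/mon.a below is only an assumed example):

  MON_DATA=/srv/mon.a
  cp "$MON_DATA/osdmap/880" /tmp/osdmap.880    # then attach this file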

> > > 
> > > on the console.
> > > The whole system is in a state ala:
> > > 
> > > 012-02-20 17:56:27.585295pg v942504: 2046 pgs: 1348 active+clean, 43
> > > active+recovering+degraded+remapped+backfill, 218 active+recovering, 437
> > > active+recovering+remapped+backfill; 1950 GB data, 3734 GB used, 26059 GB
> > > /
> > > 29794 GB avail; 272914/1349073 degraded (20.230%)
> > > 
> > > and sometimes the ceph-osd on node0 is crashing. At the moment of writing,
> > > the
> > > degrading continues to shrink down below 20%.
> > How did ceph-osd crash?  Is there a dump in the log?
> 
> 'course I will provide all logs, uhm, a bit later; we are busy starting all
> the VMs and handling the first customer tickets right now ;-)
>
> To be most complete for the collection, would you be so kind as to give a 
> list of all the necessary logs (kern.log, osdX.log, etc.)?

I think just the crashed osd log will be enough.  It looks like the rest 
of the cluster is recovering ok...

Are the VMs running on top of the kernel rbd client, or KVM+librbd?

sage


> 
> Thnx for the fast reaction,
> 
> Oliver.
> 
> > sage
> > 
> > > Any clues?
> > > 
> > > Thnx in @vance,
> > > 
> > > Oliver.
> > > 
> > > -- 
> > > 
> > > Oliver Francke
> > > 
> > > filoo GmbH
> > > Moltkestraße 25a
> > > 0 Gütersloh
> > > HRB4355 AG Gütersloh
> > > 
> > > Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
> > > 
> > > Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> 
> 
> -- 
> 
> Oliver Francke
> 
> filoo GmbH
> Moltkestraße 25a
> 0 Gütersloh
> HRB4355 AG Gütersloh
> 
> Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
> 
> Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

v0.42 released

2012-02-20 Thread Sage Weil
v0.42 is ready!  This has mostly been a stabilization release, with a few 
critical bugs fixed.  There is also an across-the-board change in data 
structure encoding that is not backwards-compatible, but is designed to 
allow future changes to be (both forwards- and backwards-).

Notable changes include:

 * osd: new (non-backwards compatible!) encoding for all structures
 * osd: fixed bug with transactions being non-atomic (leaking across 
   commit boundaries)
 * osd: track in-progress requests, log slow ones
 * osd: randomly choose pull target during recovery (better load balance)
 * osd: fixed recovery stall
 * mon: a few recovery bug fixes
 * mon: trim old auth files
 * mon: better detection/warning about down pgs
 * objecter: expose in-process requests via admin socket
 * new infrastructure for testing data structure encoding changes (forward 
   and backward compatibility)

Aside from the data structure encoding change, there is relatively little 
new code since v0.41.  This should be a pretty solid release.

For v0.43, we are working on merging a few big changes.  The main one is a 
new key/value interface for objects: each object, instead of storing a 
blob of bytes, would consist of a (potentially large) set of key/value 
pairs that can be set/queried efficiently.  This is going to make a huge 
difference for radosgw performance with large buckets, and will help with 
large directories as well.  There is also ongoing stabilization work with 
the OSD and new interfaces for administrators to query the state of the 
cluster and diagnose common problems.
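
As a rough illustration of what that interface looks like from the command 
line once it lands (the omap subcommands below are the ones that later 
shipped in the rados tool; treat the exact names and syntax as an assumption 
for this release):

  # attach, read back and list key/value pairs on an object without
  # rewriting the object's byte payload
  rados -p mypool setomapval bucket-index obj1 '{"size": 4096}'
  rados -p mypool getomapval bucket-index obj1
  rados -p mypool listomapkeys bucket-index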

v0.42 can be found from the usual locations:

 * Git at git://github.com/NewDreamNetwork/ceph.git
 * Tarball at http://ceph.newdream.net/download/ceph-0.42.tar.gz
 * For Debian/Ubuntu packages, see 
http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/#installing-the-packages

sage

Re: Problem after ceph-osd crash

2012-02-20 Thread Oliver Francke

Hi Sage,

On 02/20/2012 06:41 PM, Sage Weil wrote:

On Mon, 20 Feb 2012, Oliver Francke wrote:

Hi,

we are just in trouble after some mess with trying to include a new OSD-node
into our cluster.

We get some weird "libceph: corrupt inc osdmap epoch 880 off 102
(c9001db8990a of c9001db898a4-c9001db89dae)"

on the console.
The whole system is in a state ala:

012-02-20 17:56:27.585295pg v942504: 2046 pgs: 1348 active+clean, 43
active+recovering+degraded+remapped+backfill, 218 active+recovering, 437
active+recovering+remapped+backfill; 1950 GB data, 3734 GB used, 26059 GB /
29794 GB avail; 272914/1349073 degraded (20.230%)

and sometimes the ceph-osd on node0 is crashing. At the moment of writing, the
degrading continues to shrink down below 20%.

How did ceph-osd crash?  Is there a dump in the log?


'course I will provide all logs, uhm, a bit later; we are busy starting 
all the VM's and handling the first customer-tickets right now ;-)


To make the collection as complete as possible, would you be so kind as to give a 
list of all the necessary logs (kern.log, osdX.log, etc.)?


Thnx for the fast reaction,

Oliver.


sage


Any clues?

Thnx in @vance,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem after ceph-osd crash

2012-02-20 Thread Sage Weil
On Mon, 20 Feb 2012, Oliver Francke wrote:
> Hi,
> 
> we are just in trouble after some mess with trying to include a new OSD-node
> into our cluster.
> 
> We get some weird "libceph: corrupt inc osdmap epoch 880 off 102
> (c9001db8990a of c9001db898a4-c9001db89dae)"
> 
> on the console.
> The whole system is in a state ala:
> 
> 2012-02-20 17:56:27.585295    pg v942504: 2046 pgs: 1348 active+clean, 43
> active+recovering+degraded+remapped+backfill, 218 active+recovering, 437
> active+recovering+remapped+backfill; 1950 GB data, 3734 GB used, 26059 GB /
> 29794 GB avail; 272914/1349073 degraded (20.230%)
> 
> and sometimes the ceph-osd on node0 is crashing. At the moment of writing, the
> degrading continues to shrink down below 20%.

How did ceph-osd crash?  Is there a dump in the log?

sage

> 
> Any clues?
> 
> Thnx in @vance,
> 
> Oliver.
> 
> -- 
> 
> Oliver Francke
> 
> filoo GmbH
> Moltkestraße 25a
> 0 Gütersloh
> HRB4355 AG Gütersloh
> 
> Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
> 
> Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

Problem after ceph-osd crash

2012-02-20 Thread Oliver Francke

Hi,

we are just in trouble after some mess with trying to include a new 
OSD-node into our cluster.


We get some weird "libceph: corrupt inc osdmap epoch 880 off 102 
(c9001db8990a of c9001db898a4-c9001db89dae)"


on the console.
The whole system is in a state ala:

2012-02-20 17:56:27.585295    pg v942504: 2046 pgs: 1348 active+clean, 43 
active+recovering+degraded+remapped+backfill, 218 active+recovering, 437 
active+recovering+remapped+backfill; 1950 GB data, 3734 GB used, 26059 
GB / 29794 GB avail; 272914/1349073 degraded (20.230%)


and sometimes the ceph-osd on node0 is crashing. At the moment of 
writing, the degrading continues to shrink down below 20%.


Any clues?

Thnx in @vance,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Which SSD method is better for performance?

2012-02-20 Thread Wido den Hollander

Hi,

On 02/20/2012 03:36 AM, Paul Pettigrew wrote:

G'day Wido

Great advice, thanks! We settled on 1x LVM partition on SSD for OSD-Journal.

A quick follow up if I may please?


"A last note, if you use a SSD for your journaling, make sure that you align your 
partitions which the page size of the SSD, otherwise you'd run into the write 
amplification of the SSD, resulting in a performance loss."

Do you have any technical doco on how to achieve this?  I am happy to value-add 
and write it up in a format that can go back into the wiki for others to follow.

And secondly, should the SSD journal sizes be large or small?  I.e., is, say, a 1G 
partition per paired 2-3TB SATA disk OK? Or as large an SSD partition as possible? 
There are many forum posts that say 100-200MB will suffice.  A quick piece of advice 
will hopefully save us several days of reconfiguring and benchmarking the cluster 
:-)



As Sage pointed out, a journal of something like 2-4 GB should be 
sufficient in most cases.
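
As a concrete illustration (not from the thread itself), a journal in that 
range can be declared in ceph.conf; this assumes the "osd journal size" 
option, which is specified in megabytes, is available in your release:

# Append a 4 GB journal size to the [osd] section (value is in MB).
# For a journal on a raw partition the partition size itself is the
# effective limit, so this mainly matters for file-backed journals.
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
        osd journal size = 4096
EOF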


If you search the web for partition alignment on SSD's you'll find 
multiple topics, like this one: 
http://www.ocztechnologyforum.com/forum/showthread.php?54379-Linux-Tips-tweaks-and-alignment&p=472998&viewfull=1#post472998


I ended up doing the following (with an Intel X25-M 80GB, in parted):

unit s
mklabel gpt
mkpart primary 1024 137363455

That gave me one partition on which I placed a PV + VG.
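
To make the next step explicit, here is a minimal sketch (not Wido's exact 
commands; the device /dev/sdb1, the VG name "data" and the LV names/sizes are 
assumptions) of carving that aligned partition into per-OSD journal logical 
volumes:

# Put LVM on the aligned SSD partition
pvcreate /dev/sdb1
vgcreate data /dev/sdb1

# One small journal LV per OSD on this host
lvcreate -L 4G -n journal-osd0 data
lvcreate -L 4G -n journal-osd1 data

# Each OSD then gets its own journal device in ceph.conf, e.g.
#   [osd.0]
#           osd journal = /dev/data/journal-osd0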

You should however know that a 4k write to the SSD will result in 
re-programming a 256k page inside the SSD.


I'm not sure how OSD's do their journal writes (which size), because 
with ext4 you can do:


mkfs.ext4 -b 4096 -E stride=32,stripe-width=32 /dev/sdb1

That would align ext4 writes to 256k resulting in less page 
reprogramming inside the SSD.
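
As a quick sanity check of those numbers (not from the original mail; it 
assumes mkfs.ext4 counts stride and stripe-width in filesystem blocks, as the 
mke2fs man page describes): with 4 KiB blocks, stride=32 gives a 128 KiB 
chunk, so matching a 256 KiB SSD page would call for stride=64:

# chunk size = stride (in filesystem blocks) * block size
echo $((32 * 4096))   # 131072 bytes = 128 KiB
echo $((64 * 4096))   # 262144 bytes = 256 KiB

# Illustrative (untested) variant targeting a 256 KiB page:
# mkfs.ext4 -b 4096 -E stride=64,stripe-width=64 /dev/sdb1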


I haven't done thorough testing of that yet, but it could be that a lot of 
small writes could trigger a big write amplification inside the SSD 
because the OSD commits such small blocks.


Wido


Thanks

Paul


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den Hollander
Sent: Tuesday, 14 February 2012 10:46 PM
To: Paul Pettigrew
Cc: ceph-devel@vger.kernel.org
Subject: Re: Which SSD method is better for performance?

Hi,

On 02/14/2012 01:39 AM, Paul Pettigrew wrote:

G'day all

About to commence an R&D eval of the Ceph platform, having been impressed with 
the momentum achieved over the past 12 months.

I have one question re design before rolling out to metal.

I will be using 1x SSD drive per storage server node (assume it is /dev/sdb for 
this discussion), and cannot readily determine the pros/cons of the two 
methods of using it for the OSD-Journal, being:
#1. place it in the main [osd] stanza and reference the whole drive as
a single partition; or


That won't work. If you do that, all OSD's will try to open the same journal.
The journal for each OSD has to be unique.


#2. partition up the disk, so 1x partition per SATA HDD, and place
each partition in the [osd.N] portion


That would be your best option.

I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf

the VG "data" is placed on a SSD (Intel X25-M).



So if I were to code #1 in the ceph.conf file, it would be:
[osd]
osd journal = /dev/sdb

Or, #2 would be like:
[osd.0]
  host = ceph1
  btrfs devs = /dev/sdc
  osd journal = /dev/sdb5
[osd.1]
  host = ceph1
  btrfs devs = /dev/sdd
  osd journal = /dev/sdb6
[osd.2]
  host = ceph1
  btrfs devs = /dev/sde
  osd journal = /dev/sdb7
[osd.3]
  host = ceph1
  btrfs devs = /dev/sdf
  osd journal = /dev/sdb8

I am asking therefore, is the added work (and constraints) of specifying down to 
individual partitions per #2 worth it in performance gains? Does it not also have a 
constraint, in that if I wanted to add more HDD's into the server (we buy 45 bay units, 
and typically provision HDD's "on demand" i.e. 15x at a time as usage grows), I 
would have to additionally partition the SSD (taking it offline) - but if it were #1 
option, I would only have to add more [osd.N] sections (and not have to worry about 
getting the SSD with 45x partitions)?



You'd still have to go for #2. However, running 45 OSD's on a single machine is 
a bit tricky imho.

If that machine fails you would lose 45 OSD's at once, which will put a lot of 
stress on the recovery of your cluster.

You'd also need a lot of RAM to accommodate those 45 OSD's, at least 48GB of 
RAM I guess.
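
The 48GB figure is consistent with a rough rule of thumb (an assumption here, 
not something stated in the thread) of about 1GB of RAM per ceph-osd daemon, 
plus headroom for the OS and page cache:

# Back-of-the-envelope RAM estimate for 45 OSDs on one host
echo $((45 * 1))   # ~45 GB for the daemons at ~1 GB each; add a few GB
                   # for the OS and page cache => roughly 48 GB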

A last note, if you use an SSD for your journaling, make sure that you align 
your partitions with the page size of the SSD, otherwise you'd run into the 
write amplification of the SSD, resulting in a performance loss.

Wido


One final related question, if I were to use #1 method (which I would prefer if there is no material performance or 
other reason to use #2), then that specification (i.e. the "osd journal = /dev/sdb") SSD disk reference would 
have to be identical on all o

Re: [PATCH 04/11] ceph: Push file_update_time() into ceph_page_mkwrite()

2012-02-20 Thread Jan Kara
On Thu 16-02-12 13:04:37, Alex Elder wrote:
> On Thu, 2012-02-16 at 14:46 +0100, Jan Kara wrote:
> > CC: Sage Weil 
> > CC: ceph-devel@vger.kernel.org
> > Signed-off-by: Jan Kara 
> 
> 
> This will update the timestamp even if a write
> fault fails, which is different from before.
>
> Hard to avoid though.
  Yes. A relatively easy solution for this (at least for some filesystems)
is to include the time update in the other operations the filesystem is doing.
Usually the filesystem can do some preparations, then take the page lock, and
then update the timestamps together with the other things it wants to do. It
will usually even be faster than the current scheme. But I decided to leave
that to the fs maintainers, because page_mkwrite() code tends to be tricky wrt
locking etc.


> Looks good to me.
> 
> Signed-off-by: Alex Elder 
  Thanks.

Honza

> >  fs/ceph/addr.c |3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 173b1d2..12b139f 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1181,6 +1181,9 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >     loff_t size, len;
> >     int ret;
> >  
> > +   /* Update time before taking page lock */
> > +   file_update_time(vma->vm_file);
> > +
> >     size = i_size_read(inode);
> >     if (off + PAGE_CACHE_SIZE <= size)
> >             len = PAGE_CACHE_SIZE;
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html