Re: monitor start up question

2012-07-25 Thread Sage Weil
On Wed, 25 Jul 2012, Mandell Degerness wrote:
> When a cluster has been shut down and then re-started, how do the
> monitors know what the cluster fsid is?  Is it stored somewhere?

It's embedded in the monmap, currently found at $mon_data/monmap/.  Not terribly convenient, sorry!
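
For example, a minimal sketch of how to read the fsid out of the stored
monmap before starting the daemon (the mon data path and the on-disk
layout are assumptions and vary between versions):

# assumed mon data directory for mon.a
mon_data=/var/lib/ceph/mon/ceph-a
# pick the newest monmap epoch file and print it; the fsid is in the output
latest=$(ls "$mon_data"/monmap | grep -E '^[0-9]+$' | sort -n | tail -1)
monmaptool --print "$mon_data/monmap/$latest" | grep fsid

# for comparison, the OSD side already keeps it as plain text:
cat /var/lib/ceph/osd/ceph-0/cluster_fsid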

> I would like to be able to verify, before starting a monitor on a
> given server, if an existing monitor directory belongs to the current
> cluster or to a previous cluster incarnation.  With the OSDs, I can
> just check the cluster_fsid file.

We can add a similar file in the $mon_data directory in a future version.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


monitor start up question

2012-07-25 Thread Mandell Degerness
When a cluster has been shut down and then re-started, how do the
monitors know what the cluster fsid is?  Is it stored somewhere?

I would like to be able to verify, before starting a monitor on a
given server, if an existing monitor directory belongs to the current
cluster or to a previous cluster incarnation.  With the OSDs, I can
just check the cluster_fsid file.

Regards,
Mandell Degerness
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph Benchmark HowTo

2012-07-25 Thread Mehdi Abaakouk
Hi Florian,

On Wed, Jul 25, 2012 at 10:06:04PM +0200, Florian Haas wrote:
> Hi Mehdi,
> For the OSD tests, which OSD filesystem are you testing on? Are you
> using a separate journal device? If yes, what type?

Actually, I use XFS and the journal is on the same disk, in another
partition. After reading the documentation, it seems that using a
dedicated disk is better and that an SSD is a good choice.
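
For example, the kind of ceph.conf change I have in mind (the device
name and OSD id are placeholders, not my current setup):

[osd.0]
    host = node1
    # dedicated SSD partition used as a raw journal device
    osd journal = /dev/sdb1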

> seekwatcher -t rbd-latency-write.trace -o rbd-latency-write.png -p 'dd
> if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct' -d /dev/rbd0
> 
> Just making sure: are you getting the same numbers just with dd,
> rather than dd invoked by seekwatcher?

Yes, the numbers are the same with plain dd.

> 
> Also, for your dd latency test of 4M direct I/O reads and writes, you seem
> to be getting 39 and 300 ms average latency, yet further down it says
> "RBD latency read/write: 28ms and 114.5ms". Any explanation for the
> write latency being cut in half on what was apparently a different
> test run?

Yes, this is a different run; the one at the bottom was with fewer
servers but better hardware.

> 
> Also, were read and write caches cleared between tests? (echo 3 >
> /proc/sys/vm/drop_caches)

No, I will add it.
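
For example, between runs (as root, on the client and ideally on the
OSD nodes as well):

# flush dirty data, then drop the page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches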

> Cheers,
> Florian

I know that my setup is not really optimal.
Writing these tests helps me understand how Ceph works, and
I'm sure that with your advice I will build a better cluster :)

Thanks for your help.

Cheers,
-- 
Mehdi Abaakouk
mail: sil...@sileht.net
irc: sileht




Re: Ceph Benchmark HowTo

2012-07-25 Thread Tommi Virtanen
On Wed, Jul 25, 2012 at 1:25 PM, Gregory Farnum  wrote:
> Yeah, an average isn't necessarily very useful here — it's what you
> get because that's easy to implement (with a sum and a counter
> variable, instead of binning). The inclusion of max and min latencies
> is an attempt to cheaply compensate for that...but if somebody wants
> to find/write an appropriately-licensed statistical counting library
> and integrate it with rados bench, then (say it with me) contributions
> are welcome! ;)

How about "output results in a good machine-readable format and here's
the pandas script to crunch it to a useful summary".

http://pandas.pydata.org/
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph Benchmark HowTo

2012-07-25 Thread Gregory Farnum
On Wed, Jul 25, 2012 at 1:06 PM, Florian Haas  wrote:
> Hi Mehdi,
>
> great work! A few questions (for you, Mark, and anyone else watching
> this thread) regarding the content of that wiki page:
>
> For the OSD tests, which OSD filesystem are you testing on? Are you
> using a separate journal device? If yes, what type?
>
> For the RADOS benchmarks:
>
> # rados bench -p pbench 900 seq
> ...
>    611      16     17010     16994   111.241       104   1.05852  0.574897
>    612      16     17037     17021   111.236       108   1.17321  0.574932
>    613      16     17056     17040   111.178        76   1.01611  0.574903
> Total time run:        613.339616
> Total reads made:      17056
> Read size:             4194304
> Bandwidth (MB/sec):    111.234
>
> Average Latency:   0.575252
> Max latency:   1.65182
> Min latency:   0.07418
>
> How meaningful is it to use an (arithmetic) average here, considering
> the min and max differ by a factor of 22? Aren't we being bitten by
> outliers pretty severely here, and wouldn't, say, a median be much
> more useful? (Actually, would the "max latency" include the initial
> hunt for a mon and the mon/osdmap exchange?)

Yeah, an average isn't necessarily very useful here — it's what you
get because that's easy to implement (with a sum and a counter
variable, instead of binning). The inclusion of max and min latencies
is an attempt to cheaply compensate for that...but if somebody wants
to find/write an appropriately-licensed statistical counting library
and integrate it with rados bench, then (say it with me) contributions
are welcome! ;)


> seekwatcher -t rbd-latency-write.trace -o rbd-latency-write.png -p 'dd
> if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct' -d /dev/rbd0
>
> Just making sure: are you getting the same numbers just with dd,
> rather than dd invoked by seekwatcher?
>
> Also, for your dd latency test of 4M direct I/O reads and writes, you seem
> to be getting 39 and 300 ms average latency, yet further down it says
> "RBD latency read/write: 28ms and 114.5ms". Any explanation for the
> write latency being cut in half on what was apparently a different
> test run?
>
> Also, were read and write caches cleared between tests? (echo 3 >
> /proc/sys/vm/drop_caches)
>
> Cheers,
> Florian
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph Benchmark HowTo

2012-07-25 Thread Florian Haas
Hi Mehdi,

great work! A few questions (for you, Mark, and anyone else watching
this thread) regarding the content of that wiki page:

For the OSD tests, which OSD filesystem are you testing on? Are you
using a separate journal device? If yes, what type?

For the RADOS benchmarks:

# rados bench -p pbench 900 seq
...
   611      16     17010     16994   111.241       104   1.05852  0.574897
   612      16     17037     17021   111.236       108   1.17321  0.574932
   613      16     17056     17040   111.178        76   1.01611  0.574903
Total time run:        613.339616
Total reads made:      17056
Read size:             4194304
Bandwidth (MB/sec):    111.234

Average Latency:   0.575252
Max latency:   1.65182
Min latency:   0.07418

How meaningful is it to use an (arithmetic) average here, considering
the min and max differ by a factor of 22? Aren't we being bitten by
outliers pretty severely here, and wouldn't, say, a median be much
more useful? (Actually, would the "max latency" include the initial
hunt for a mon and the mon/osdmap exchange?)



seekwatcher -t rbd-latency-write.trace -o rbd-latency-write.png -p 'dd
if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct' -d /dev/rbd0

Just making sure: are you getting the same numbers just with dd,
rather than dd invoked by seekwatcher?

Also, for your dd latency test of 4M direct I/O reads and writes, you seem
to be getting 39 and 300 ms average latency, yet further down it says
"RBD latency read/write: 28ms and 114.5ms". Any explanation for the
write latency being cut in half on what was apparently a different
test run?

Also, were read and write caches cleared between tests? (echo 3 >
/proc/sys/vm/drop_caches)

Cheers,
Florian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph Benchmark HowTo

2012-07-25 Thread Florian Haas
On Tue, Jul 24, 2012 at 6:19 PM, Tommi Virtanen  wrote:
> On Tue, Jul 24, 2012 at 8:55 AM, Mark Nelson  wrote:
>> personally I think it's fine to have it on the wiki.  I do want to stress
>> that performance is going to be (hopefully!) improving over the next couple
>> of months so we will probably want to have updated results (or at least
>> remove old results!) as things improve.  Also, I'm not sure if we will be
>> keeping the wiki around in its current form. There was some talk about
>> migrating to something else, but I don't really remember the details.
>
> Sounds like a job for doc/dev/benchmark/index.rst!  (It, or parts of
> it, can move out from under "Internal" if/when it gets user friendly
> enough to not need as much skill to use.)

If John is currently busy (which I assume he always is :) ), I should
be able to take care of that. In that case, would someone please open
a documentation bug and assign that to me?

Cheers,
Florian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph Benchmark HowTo

2012-07-25 Thread Mehdi Abaakouk
On Tue, Jul 24, 2012 at 10:55:37AM -0500, Mark Nelson wrote:
> On 07/24/2012 09:43 AM, Mehdi Abaakouk wrote:
> 
> Thanks for taking the time to put all of your benchmarking
> procedures into writing!  Having this kind of community
>
> ...
>

Thanks for your comments and these tools; they will help me for sure.


-- 
Mehdi Abaakouk
mail: sil...@sileht.net
irc: sileht




Re: Clusters and pools

2012-07-25 Thread Wido den Hollander
On 07/25/2012 06:34 AM, Ryan Nicholson wrote:

> I'm running a cluster based on 4 hosts that each have 3 fast SCSI OSDs
> and 1 very large SATA OSD, meaning 12 fast OSDs and 4 slow OSDs in
> total. I wish to segregate these into 2 pools that operate
> independently. The goal is to use the faster disks as an area to hold
> RBD-based VMs, and the larger area to host RBD-based large volumes (to
> start), and possibly have that become just a big CephFS area once the
> fs side of things is considered more stable.
>
> Now, I've been thrown a couple of options and am still unsettled.
> Which is best?
> - Create 2 independent clusters: one with the 12 SCSI OSDs and the
> other with just the 4 large OSDs on the same hosts. This seems to be
> more complex from a scripting and boot-time standpoint, but easier for
> my head.
> - Create a single cluster and use CRUSH rules to separate the two.
> This one STILL has me lost, as I'm having trouble understanding the
> crushmap syntax, the crushmap import/export commands, and the mkpool
> and related commands from the docs, in order to say "make RBDs come
> from this faster pool" while "cephfs, you come from this slower pool".
> I really would like to entertain this path, however, as it lets Ceph
> handle the entire situation, and it would seem more elegant.
>
> I'm open to other options as well.


The "easiest" way to approach this:

Set up the cluster with the 12 fast OSD's first and leave the other 4
out of the configuration.


Get everything up and running and play with it.

Then, add the 4 remaining OSD's to the cluster:
1. Add them to ceph.conf
2. Increment max_osd
3. Add them to the keyring
4. Format the OSD's
5. Start the OSD's (a command sketch follows below)
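
A minimal sketch of what steps 1-5 can look like for one extra OSD (the
OSD id, hostname and paths are assumptions; exact commands differ a bit
between Ceph versions):

# 1. add a section for the new OSD to ceph.conf, e.g.:
#      [osd.12]
#          host = hostA
# 2. raise the maximum OSD id in the cluster
$ ceph osd setmaxosd 16
# 3./4. create the data dir, format the OSD and generate its key,
#       then register that key with the cluster
$ mkdir -p /var/lib/ceph/osd/ceph-12
$ ceph-osd -i 12 --mkfs --mkkey
$ ceph auth add osd.12 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-12/keyring
# 5. start it
$ service ceph start osd.12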

Now they should show up in your "ceph -s" output, but no data will go to 
them.


The next step is to export your current crushmap:

$ ceph osd getcrushmap -o crushmap
$ crushtool -d crushmap -o crushmap.txt

You should now add 4 new hosts to the crushmap, something like 
"hostA-slow" and add one OSD under each of them.


Now you can add a new rack called "slowrbd", for example, and then add
a new pool and a new rule afterwards.
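
To illustrate the syntax, a rough sketch of what the additions to
crushmap.txt could look like (bucket names, ids and weights are made
up, and the exact syntax differs a bit between versions):

host hostA-slow {
        id -10                  # any unused negative bucket id
        alg straw
        hash 0                  # rjenkins1
        item osd.12 weight 1.000
}
# ... hostB-slow, hostC-slow and hostD-slow defined the same way ...

rack slowrbd {
        id -20
        alg straw
        hash 0
        item hostA-slow weight 1.000
        item hostB-slow weight 1.000
        item hostC-slow weight 1.000
        item hostD-slow weight 1.000
}

rule slowrbd {
        ruleset 3               # referenced later when creating the pool
        type replicated
        min_size 1
        max_size 10
        step take slowrbd
        step chooseleaf firstn 0 type host
        step emit
}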


Compile crushmap.txt back again to "crushmap" and load it into the cluster.

You can now create a new pool with a specific CRUSH rule.
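
Putting the last two steps together, roughly (the pool name, PG count
and ruleset number are just examples):

$ crushtool -c crushmap.txt -o crushmap.new
$ ceph osd setcrushmap -i crushmap.new
$ ceph osd pool create slowpool 256
$ ceph osd pool set slowpool crush_ruleset 3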

All the data in that pool will go onto those 4 slower OSD's.

Wido



> Thanks!
>
> Ryan Nicholson
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


EU Ceph.com mirror for Debian/Ubuntu packages

2012-07-25 Thread Wido den Hollander

Hi,

On a couple of systems I'm using the Debian packages provided on
Ceph.com, but these packages are hosted on a CA-based server.


In the EU that's rather slow, especially when updating multiple servers 
and when downloading the debug packages.


As I'm lazy I don't want to maintain my own mirror with my own built
packages; I'd rather use the ones built by Ceph.com.


Could we set up an EU mirror like eu.ceph.com or nl.ceph.com?

deb http://eu.ceph.com/debian/ precise main

That could update from the main mirror with rsync every hour.
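
For example, something along these lines on the mirror host (this
assumes the master mirror exposes an rsync module; the module name and
local path are made up):

# hourly cron entry on eu.ceph.com
0 * * * * rsync -a --delete rsync://ceph.com/debian/ /var/www/eu.ceph.com/debian/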

I could offer some space if needed?

Thanks,

Wido
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html