Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-16 Thread JM
Hi Roland,

You should tune your Ceph crushmap with a custom rule in order to do that
(write first on s3 and then to the others). This custom rule will then be
applied to your Proxmox pool.
(What you want to do is only worthwhile if you run the VMs from host s3.)

Can you give us your crushmap?



2015-01-13 22:03 GMT+01:00 Roland Giesler rol...@giesler.za.net:

 I have a 4 node ceph cluster, but the disks are not equally distributed
 across all machines (they are substantially different from each other).

 One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
 and two machines have only two 1TB drives each (s2 & s1).

 Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
 mostly from there, but I want to make sure that writes to the ceph cluster
 land on the local OSDs on s3 first, with the additional writes/copies then
 going out over the network.

 Is this possible with Ceph?  The VMs are KVM in Proxmox, in case it's
 relevant.

 regards

 Roland



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-16 Thread Roland Giesler
On 14 January 2015 at 12:08, JM jmaxi...@gmail.com wrote:

 Hi Roland,

 You should tune your Ceph crushmap with a custom rule in order to do that
 (write first on s3 and then to the others). This custom rule will then be
 applied to your Proxmox pool.
 (What you want to do is only worthwhile if you run the VMs from host s3.)

 Can you give us your crushmap?


Please note that I made a mistake in my email: the machine that I run the
VMs on, and want writes to go to first, is S1, not S3.

For the life of me I cannot find how to extract the crush map.  I found:

ceph osd getcrushmap -o crushfilename

Where can I find the crush file?  I've never needed this.
This is my first installation, so please bear with me while I learn!

Lionel: I read what you're saying.  However, the strange thing is that last
year I had this Windows 2008 VM running on the same cluster without changes,
and coming back from leave in the new year I found it has crawled to a
painfully slow state.  I don't quite know where to start tracing this.  The
Windows machine is not the problem, since the boot process of the VM is very
slow even before Windows starts up.

thanks

Roland







 2015-01-13 22:03 GMT+01:00 Roland Giesler rol...@giesler.za.net:

 I have a 4 node ceph cluster, but the disks are not equally distributed
 across all machines (they are substantially different from each other).

 One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
 and two machines have only two 1TB drives each (s2 & s1).

 Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
 mostly from there, but I want to make sure that writes to the ceph cluster
 land on the local OSDs on s3 first, with the additional writes/copies then
 going out over the network.

 Is this possible with Ceph?  The VMs are KVM in Proxmox, in case it's
 relevant.

 regards

 Roland




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-16 Thread JM
# Get the compiled crushmap
root@server01:~# ceph osd getcrushmap -o /tmp/myfirstcrushmap

# Decompile the compiled crushmap above
root@server01:~# crushtool -d /tmp/myfirstcrushmap -o /tmp/myfirstcrushmap.txt

Then send us your /tmp/myfirstcrushmap.txt file. :)
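
For completeness, here is the reverse path — recompiling an edited map and
loading it back into the cluster.  This is only a sketch of the standard
crushtool/ceph workflow, using the same example paths as above; worth
verifying on a test setup before touching production:

# Recompile the edited text crushmap
root@server01:~# crushtool -c /tmp/myfirstcrushmap.txt -o /tmp/mynewcrushmap

# Optionally check what placements the new map would produce
root@server01:~# crushtool --test -i /tmp/mynewcrushmap --rule 0 --num-rep 2 --show-mappings

# Load the new crushmap into the running cluster
root@server01:~# ceph osd setcrushmap -i /tmp/mynewcrushmap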


2015-01-14 17:36 GMT+01:00 Roland Giesler rol...@giesler.za.net:

 On 14 January 2015 at 12:08, JM jmaxi...@gmail.com wrote:

 Hi Roland,

 You should tune your Ceph crushmap with a custom rule in order to do that
 (write first on s3 and then to the others). This custom rule will then be
 applied to your Proxmox pool.
 (What you want to do is only worthwhile if you run the VMs from host s3.)

 Can you give us your crushmap?


 Please note that I made a mistake in my email: the machine that I run the
 VMs on, and want writes to go to first, is S1, not S3.

 For the life of me I cannot find how to extract the crush map.  I found:

 ceph osd getcrushmap -o crushfilename

 Where can I find the crush file?  I've never needed this.
 This is my first installation, so please bear with me while I learn!

 Lionel: I read what you're saying.  However, the strange thing is that last
 year I had this Windows 2008 VM running on the same cluster without changes,
 and coming back from leave in the new year I found it has crawled to a
 painfully slow state.  I don't quite know where to start tracing this.  The
 Windows machine is not the problem, since the boot process of the VM is very
 slow even before Windows starts up.

 thanks

 Roland







 2015-01-13 22:03 GMT+01:00 Roland Giesler rol...@giesler.za.net:

 I have a 4 node ceph cluster, but the disks are not equally distributed
 across all machines (they are substantially different from each other).

 One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
 and two machines have only two 1TB drives each (s2 & s1).

 Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
 mostly from there, but I want to make sure that writes to the ceph cluster
 land on the local OSDs on s3 first, with the additional writes/copies then
 going out over the network.

 Is this possible with Ceph?  The VMs are KVM in Proxmox, in case it's
 relevant.

 regards

 Roland





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-16 Thread Roland Giesler
So you can see my server names and their OSDs too...

# id    weight  type name       up/down reweight
-1      11.13   root default
-2      8.14            host h1
1       0.9                     osd.1   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.68                    osd.5   up      1
6       0.68                    osd.6   up      1
7       0.68                    osd.7   up      1
8       0.68                    osd.8   up      1
9       0.68                    osd.9   up      1
10      0.68                    osd.10  up      1
11      0.68                    osd.11  up      1
12      0.68                    osd.12  up      1
-3      0.45            host s3
2       0.45                    osd.2   up      1
-4      0.9             host s2
13      0.9                     osd.13  up      1
-5      1.64            host s1
14      0.29                    osd.14  up      1
0       0.27                    osd.0   down    0
15      0.27                    osd.15  up      1
16      0.27                    osd.16  up      1
17      0.27                    osd.17  up      1
18      0.27                    osd.18  up      1

regards

Roland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-16 Thread Roland Giesler
On 14 January 2015 at 21:46, Gregory Farnum g...@gregs42.com wrote:

 On Tue, Jan 13, 2015 at 1:03 PM, Roland Giesler rol...@giesler.za.net
 wrote:
  I have a 4 node ceph cluster, but the disks are not equally distributed
  across all machines (they are substantially different from each other).

  One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
  and two machines have only two 1TB drives each (s2 & s1).

  Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
  mostly from there, but I want to make sure that writes to the ceph cluster
  land on the local OSDs on s3 first, with the additional writes/copies then
  going out over the network.

  Is this possible with Ceph?  The VMs are KVM in Proxmox, in case it's
  relevant.

 In general you can't set up Ceph to write to the local node first. In
 some specific cases you can if you're willing to do a lot more work
 around placement, and this *might* be one of those cases.

 To do this, you'd need to change the CRUSH rules pretty extensively,
 so that instead of selecting OSDs at random, they have two steps:
 1) starting from bucket s3, select a random OSD and put it at the
 front of the OSD list for the PG.
 2) Starting from a bucket which contains all the other OSDs, select
 N-1 more at random (where N is the number of desired replicas).


I understand in principle what you're saying.  Let me go back a step and
ask the question somewhat differently, then:

I have set up 4 machines in a cluster.  I created the Windows 2008 server
VM on S1 (I corrected my first email: I have three Sunfire X series servers,
S1, S2, S3), since S1 has 36GB of RAM and 8 x 300GB SAS drives, and it was
running normally, pretty close to what I had on the bare metal.  About a
month later (after being on leave for 2 weeks), I found a machine crawling
at a snail's pace and I cannot figure out why.

So instead of suggesting something from my side (without in-depth knowledge
yet), what should I do to get this machine to run at speed again?

Further to my hardware and network:

S1: 2 x Quad Core Xeon, 36GB RAM, 8 x 300GB HDDs
S2: 1 x Opteron Dual Core, 8GB RAM, 2 x 750GB HDDs
S3: 1 x Opteron Dual Core, 8GB RAM, 2 x 750GB HDDs
H1: 1 x Xeon Dual Core, 5GB RAM, 12 x 1TB HDDs
(All these machines are at full drive capacity, that is, all their slots
are being utilised.)

All the servers are linked to the switch with dual Gigabit Ethernet
connections, bonded on each server, with LACP enabled on the switch.  While
this doesn't raise the speed of any single connection, it does provide more
aggregate bandwidth between the servers.
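
For reference, a bond of this kind is typically declared on a Proxmox
(Debian) host in /etc/network/interfaces along the following lines; this is
only an illustrative sketch, and the interface names and address are
placeholders rather than values taken from this thread:

auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.11
        netmask 255.255.255.0
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0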

The H1 machine is only running ceph and thus acts only as storage.  The
other machines (S1, S2 & S3) are for web servers (development and
production), the Windows 2008 server and a few other functions, all managed
from Proxmox.

The hardware is what my client has been using, but there were lots of
inefficiencies and little redundancy in the setup before we embarked on
this project.  However, the hardware is sufficient for their needs.

I hope that gives you a reasonable picture of the setup so that you're able
to give me some advice on how to troubleshoot this.

regards

Roland




 You can look at the documentation on CRUSH or search the list archives
 for more on this subject.

 Note that doing this has a bunch of down sides: you'll have balance
 issues because every piece of data will be on the s3 node (that's a
 TERRIBLE name for a project which has API support for Amazon S3, btw
 :p), if you add new VMs on a different node they'll all be going to
 the s3 node for all their writes (unless you set them up on a
 different pool with different CRUSH rules), s3 will be satisfying all
 the read requests so the other nodes are just backups in case of disk
 failure, etc.
 -Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-16 Thread Gregory Farnum
On Fri, Jan 16, 2015 at 2:52 AM, Roland Giesler rol...@giesler.za.net wrote:
 On 14 January 2015 at 21:46, Gregory Farnum g...@gregs42.com wrote:

 On Tue, Jan 13, 2015 at 1:03 PM, Roland Giesler rol...@giesler.za.net
 wrote:
  I have a 4 node ceph cluster, but the disks are not equally distributed
  across all machines (they are substantially different from each other).

  One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
  and two machines have only two 1TB drives each (s2 & s1).

  Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
  mostly from there, but I want to make sure that writes to the ceph cluster
  land on the local OSDs on s3 first, with the additional writes/copies then
  going out over the network.

  Is this possible with Ceph?  The VMs are KVM in Proxmox, in case it's
  relevant.

 In general you can't set up Ceph to write to the local node first. In
 some specific cases you can if you're willing to do a lot more work
 around placement, and this *might* be one of those cases.

 To do this, you'd need to change the CRUSH rules pretty extensively,
 so that instead of selecting OSDs at random, they have two steps:
 1) starting from bucket s3, select a random OSD and put it at the
 front of the OSD list for the PG.
 2) Starting from a bucket which contains all the other OSDs, select
 N-1 more at random (where N is the number of desired replicas).


 I understand in principle what you're saying.  Let me go back a step and
 ask the question somewhat differently, then:

 I have set up 4 machines in a cluster.  I created the Windows 2008 server
 VM on S1 (I corrected my first email: I have three Sunfire X series servers,
 S1, S2, S3), since S1 has 36GB of RAM and 8 x 300GB SAS drives, and it was
 running normally, pretty close to what I had on the bare metal.  About a
 month later (after being on leave for 2 weeks), I found a machine crawling
 at a snail's pace and I cannot figure out why.

You mean one of the VMs has very slow disk access? Or one of the hosts
is very slow?

In any case, you'd need to look at what about that system is different
from the others and poke at that difference until it exposes an issue,
I suppose.
-Greg
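
A few standard Ceph commands are commonly used to narrow this kind of
problem down; none of them were spelled out in the thread, and the pool name
below is just an example:

# Overall cluster state, including any stuck or slow requests
ceph -s
ceph health detail

# Per-OSD commit/apply latency, to spot a single slow disk
ceph osd perf

# Which OSDs are down or out (the tree earlier shows osd.0 on s1 as down)
ceph osd tree

# Raw write throughput of a pool, to separate Ceph from the VM layer
rados bench -p rbd 30 write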
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-16 Thread Roland Giesler
On 16 January 2015 at 17:15, Gregory Farnum g...@gregs42.com wrote:

  I have set up 4 machines in a cluster.  I created the Windows 2008 server
  VM on S1 (I corrected my first email: I have three Sunfire X series servers,
  S1, S2, S3), since S1 has 36GB of RAM and 8 x 300GB SAS drives, and it was
  running normally, pretty close to what I had on the bare metal.  About a
  month later (after being on leave for 2 weeks), I found a machine crawling
  at a snail's pace and I cannot figure out why.

 You mean one of the VMs has very slow disk access? Or one of the hosts
 is very slow?


The Windows 2008 VM is very slow.  Inside Windows all seems normal: the
CPUs are never more than 20% used, yet even navigating the menus takes a
long time.  The host (S1) is not slow.


 In any case, you'd need to look at what about that system is different
 from the others and poke at that difference until it exposes an issue,
 I suppose.


I'll move the machine to one of the smaller hosts (S2 or S3).  I'll just
have to lower the spec of the VM, since I've set RAM at 10GB, which is much
more than S2 or S3 have.  Let's see what happens.



 -Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-16 Thread Roland Giesler
On 14 January 2015 at 12:08, JM jmaxi...@gmail.com wrote:

 Hi Roland,

 You should tune your Ceph crushmap with a custom rule in order to do that
 (write first on s3 and then to the others). This custom rule will then be
 applied to your Proxmox pool.
 (What you want to do is only worthwhile if you run the VMs from host s3.)

 Can you give us your crushmap?



# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host h1 {
        id -2           # do not change unnecessarily
        # weight 8.140
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 0.900
        item osd.3 weight 0.900
        item osd.4 weight 0.900
        item osd.5 weight 0.680
        item osd.6 weight 0.680
        item osd.7 weight 0.680
        item osd.8 weight 0.680
        item osd.9 weight 0.680
        item osd.10 weight 0.680
        item osd.11 weight 0.680
        item osd.12 weight 0.680
}
host s3 {
        id -3           # do not change unnecessarily
        # weight 0.450
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 0.450
}
host s2 {
        id -4           # do not change unnecessarily
        # weight 0.900
        alg straw
        hash 0  # rjenkins1
        item osd.13 weight 0.900
}
host s1 {
        id -5           # do not change unnecessarily
        # weight 1.640
        alg straw
        hash 0  # rjenkins1
        item osd.14 weight 0.290
        item osd.0 weight 0.270
        item osd.15 weight 0.270
        item osd.16 weight 0.270
        item osd.17 weight 0.270
        item osd.18 weight 0.270
}
root default {
        id -1           # do not change unnecessarily
        # weight 11.130
        alg straw
        hash 0  # rjenkins1
        item h1 weight 8.140
        item s3 weight 0.450
        item s2 weight 0.900
        item s1 weight 1.640
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map

thanks so far!

regards

Roland






 2015-01-13 22:03 GMT+01:00 Roland Giesler rol...@giesler.za.net:

 I have a 4 node ceph cluster, but the disks are not equally distributed
 across all machines (they are substantially different from each other).

 One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
 and two machines have only two 1TB drives each (s2 & s1).

 Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
 mostly from there, but I want to make sure that writes to the ceph cluster
 land on the local OSDs on s3 first, with the additional writes/copies then
 going out over the network.

 Is this possible with Ceph?  The VMs are KVM in Proxmox, in case it's
 relevant.

 regards

 Roland




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-14 Thread Lionel Bouton
On 01/13/15 22:03, Roland Giesler wrote:
 I have a 4 node ceph cluster, but the disks are not equally distributed
 across all machines (they are substantially different from each other).

 One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
 and two machines have only two 1TB drives each (s2 & s1).

 Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
 mostly from there, but I want to make sure that writes to the ceph cluster
 land on the local OSDs on s3 first, with the additional writes/copies then
 going out over the network.

 Is this possible with Ceph?  The VMs are KVM in Proxmox, in case it's
 relevant.

I don't think it is possible, because I believe it would break Ceph's
durability guarantees: IIRC each write is only considered done when all
replicas have been written (with default settings, 3 replicas on 3
different servers), so you have to wait for 3 servers to acknowledge the
write for it to complete in the VM.

You could maybe achieve what you want by using tiering, with a cache pool
configured to use only disks on the s3 server in write-back mode, but given
the user experience reports on the list it may actually perform worse than
your current setup.
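
For what it's worth, the cache-tier setup described above would look roughly
like this — a sketch only, assuming a base pool named rbd and a cache pool
named s3cache whose CRUSH rule selects only the OSDs on s3 (both pool names
and the sizing values are placeholders):

# Attach the cache pool to the base pool and switch it to write-back mode
ceph osd tier add rbd s3cache
ceph osd tier cache-mode s3cache writeback
ceph osd tier set-overlay rbd s3cache

# The cache pool needs limits so it knows when to flush/evict to the base pool
ceph osd pool set s3cache target_max_bytes 200000000000
ceph osd pool set s3cache cache_target_dirty_ratio 0.4
ceph osd pool set s3cache cache_target_full_ratio 0.8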

Best regards,

Lionel Bouton
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tell a VM to write more local ceph nodes than to the network.

2015-01-14 Thread Gregory Farnum
On Tue, Jan 13, 2015 at 1:03 PM, Roland Giesler rol...@giesler.za.net wrote:
 I have a 4 node ceph cluster, but the disks are not equally distributed
 across all machines (they are substantially different from each other).

 One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
 and two machines have only two 1TB drives each (s2 & s1).

 Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
 mostly from there, but I want to make sure that writes to the ceph cluster
 land on the local OSDs on s3 first, with the additional writes/copies then
 going out over the network.

 Is this possible with Ceph?  The VMs are KVM in Proxmox, in case it's
 relevant.

In general you can't set up Ceph to write to the local node first. In
some specific cases you can if you're willing to do a lot more work
around placement, and this *might* be one of those cases.

To do this, you'd need to change the CRUSH rules pretty extensively,
so that instead of selecting OSDs at random, they have two steps:
1) starting from bucket s3, select a random OSD and put it at the
front of the OSD list for the PG.
2) Starting from a bucket which contains all the other OSDs, select
N-1 more at random (where N is the number of desired replicas).
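
As a rough illustration only (this exact rule is not from the thread), such
a rule could look like the following, assuming an extra root bucket named
"others" has been added to the map containing every host except s3 — reusing
the existing default root would risk placing a second replica back on s3.
(Per Roland's later correction, s3 would be s1 in his setup.)

rule primary_on_s3 {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        # first pick the primary OSD from host s3
        step take s3
        step choose firstn 1 type osd
        step emit
        # then pick the remaining N-1 replicas from the other hosts
        step take others
        step chooseleaf firstn -1 type host
        step emit
}

A pool would then be pointed at this rule with something like
"ceph osd pool set <poolname> crush_ruleset 1".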

You can look at the documentation on CRUSH or search the list archives
for more on this subject.

Note that doing this has a bunch of down sides: you'll have balance
issues because every piece of data will be on the s3 node (that's a
TERRIBLE name for a project which has API support for Amazon S3, btw
:p), if you add new VMs on a different node they'll all be going to
the s3 node for all their writes (unless you set them up on a
different pool with different CRUSH rules), s3 will be satisfying all
the read requests so the other nodes are just backups in case of disk
failure, etc.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com