Re: [ceph-users] Help with SSDs
Hi Mark, On 18.12.2014 07:15, Mark Kirkwood wrote: While you can't do much about the endurance lifetime being a bit low, you could possibly improve performance using a journal *file* that is located on the 840's (you'll need to symlink it - disclaimer - have not tried this myself, but will experiment if you are interested). Slightly different open() options are used in this case and these cheaper consumer SSDs seem to work better with them. I had the symlink-file method before (with different SSDs), but the performance was much better after changing to partitions. I first tried some different consumer SSDs with the journal as a file and ended up with DC S3700s on partitions. Udo
[ceph-users] Any tuning of LVM-Storage inside an VM related to ceph?
Hi all, I have some fileservers with insufficient read speed. Enabling read ahead inside the VM improves the read speed, but it looks like this has a drawback during LVM operations like pvmove. For test purposes, I moved the LVM storage inside a VM from vdb to vdc1. It takes days, because it's 3TB of data. After enabling read ahead (echo 4096 > /sys/block/vdb/queue/read_ahead_kb; echo 4096 > /sys/block/vdc/queue/read_ahead_kb) the move speed dropped noticeably! Are there any tunings to improve speed related to LVM on RBD storage? Perhaps, if using partitions, align the partition on 4MB? Any hints? Udo
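Regarding the 4MB alignment idea: RBD stripes images into 4MB objects by default, so starting a partition on a 4MiB boundary avoids writes straddling two objects. A minimal sketch (the device name vdc is from the mail above; the rest is an assumption, not something tested in this thread):
---
# create a GPT partition starting on a 4MiB boundary
parted /dev/vdc mklabel gpt
parted /dev/vdc mkpart primary 4MiB 100%
# verify: the start sector should be a multiple of 8192 (8192 * 512B = 4MiB)
cat /sys/block/vdc/vdc1/start
---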
Re: [ceph-users] Help with SSDs
The effect of this is *highly* dependent on the SSD make/model. My m550s work vastly better if the journal is a file on a filesystem as opposed to a partition. Obviously the Intel S3700/S3500 are a better choice - but the OP has already purchased Sammy 840's, so I'm trying to suggest options to try that don't require him to buy new SSDs! Cheers Mark On 18/12/14 21:28, Udo Lembke wrote: On 18.12.2014 07:15, Mark Kirkwood wrote: While you can't do much about the endurance lifetime being a bit low, you could possibly improve performance using a journal *file* that is located on the 840's (you'll need to symlink it - disclaimer - have not tried this myself, but will experiment if you are interested). Slightly different open() options are used in this case and these cheaper consumer SSDs seem to work better with them. I had the symlink-file method before (with different SSDs), but the performance was much better after changing to partitions. I first tried some different consumer SSDs with the journal as a file and ended up with DC S3700s on partitions.
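For reference, a sketch of the symlinked journal file Mark describes, assuming osd.0, the default /var/lib/ceph/osd layout, and an SSD filesystem already mounted at /mnt/ssd (paths are illustrative, not tested in this thread; stop the OSD and flush the journal first):
---
service ceph stop osd.0
ceph-osd -i 0 --flush-journal
rm /var/lib/ceph/osd/ceph-0/journal
ln -s /mnt/ssd/osd-0-journal /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
service ceph start osd.0
---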
Re: [ceph-users] Is cache tiering production ready?
Gregory Farnum greg@... writes: Cache tiering is a stable, functioning system. Those particular commands are for testing and development purposes, not something you should run (although they ought to be safe). -Greg Thanks for your reply! I'll put cache tiering into my production cluster!
Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver
I too find Ceph fuse more stable. However, you really should do your tests with a much more recent kernel! 3.10 is old. I think there have been CephFS improvements in nearly every kernel version for a long time. -- Thomas Lemarchand Cloud Solutions SAS - Responsable des systèmes d'information On jeu., 2014-12-18 at 14:52 +1000, Lindsay Mathieson wrote: I've been experimenting with CephFS for running KVM images (proxmox). cephfs fuse version - 0.87 cephfs kernel module - kernel version 3.10 Part of my testing involves running a Windows 7 VM up and running CrystalDiskMark to check the I/O in the VM. It's surprisingly good with both the fuse and the kernel driver; seq reads/writes are actually faster than the underlying disk, so I presume the FS is aggressively caching. With the fuse driver I have no problems. With the kernel driver, the benchmark runs fine, but when I reboot the VM the drive is corrupted and unreadable, every time. Rolling back to a snapshot fixes the disk. This does not happen unless I run the benchmark, which I presume is writing a lot of data. No problems with the same test for Ceph rbd, or NFS. -- Lindsay
Re: [ceph-users] Double-mounting of RBD
Hello, I have a somewhat interesting scenario. I have an RBD of 17TB formatted using XFS. I would like it accessible from two different hosts, one mapped/mounted read-only, and one mapped/mounted as read-write. Both are shared using Samba 4.x. One Samba server gives read-only access to the world for the data. The other gives read-write access to a very limited set of users who occasionally need to add data. However, when testing this, when changes are made to the read-write Samba server the changes don't seem to be seen by the read-only Samba server. Is there some file system caching going on that will eventually be flushed? I think that this is normal behaviour, as your read-only filesystem is not aware that some writes occurred. To achieve your goal I think that you should use some clustered filesystem [1] so that the read-only server knows that some writes occurred in the filesystem. [1] https://en.wikipedia.org/wiki/Clustered_file_system Regards, Olivier DELHOMME.
Re: [ceph-users] File System striping data
Kevin, Yes, that is just too old for vxattrs (the earliest tag with vxattr support in fuse is v0.57~84^2~6). In Ceph FS terms, 0.56 is pretty ancient. Because the filesystem is under active development, you should use a much more recent version for clusters with Ceph FS enabled -- at least firefly, and perhaps giant if you can tolerate a non-LTS release. John On Thu, Dec 18, 2014 at 12:08 AM, Kevin Shiah agan...@gmail.com wrote: Hi John, I am using 0.56.1. Could it be because data striping is not supported in this version? Kevin On Wed Dec 17 2014 at 4:00:15 AM PST Wido den Hollander w...@42on.com wrote: On 12/17/2014 12:35 PM, John Spray wrote: On Wed, Dec 17, 2014 at 10:25 AM, Wido den Hollander w...@42on.com wrote: I just tried something similar on Giant (0.87) and I saw this in the logs: parse_layout_vxattr name layout.pool value 'cephfs_svo' invalid data pool 3 reply request -22 It resolves the pool to an ID, but then it's unable to set it? Was the 'cephfs_svo' pool already added as a data pool with ceph mds add_data_pool? Ah, indeed. Working fine right now. Same goes for any other layout settings. There are paths where if a pool was added very recently, MDSs/clients might not know about the pool yet and can generate errors like this. John -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
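For the archives, the working sequence discussed above would look roughly like this (the pool name cephfs_svo is from the thread; the PG count and mount point are assumptions):
---
ceph osd pool create cephfs_svo 128
ceph mds add_data_pool cephfs_svo
setfattr -n ceph.dir.layout.pool -v cephfs_svo /mnt/cephfs/somedir
getfattr -n ceph.dir.layout /mnt/cephfs/somedir
---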
Re: [ceph-users] Help with SSDs
On Thu, 18 Dec 2014 10:05:20 PM Mark Kirkwood wrote: My m550s work vastly better if the journal is a file on a filesystem as opposed to a partition. Any particular filesystem? ext4? xfs? or doesn't matter? -- Lindsay
Re: [ceph-users] Double-mounting of RBD
On Wed, Dec 17, 2014 at 10:31 PM, McNamara, Bradley bradley.mcnam...@seattle.gov wrote: However, when testing this, when changes are made to the read-write Samba server the changes don't seem to be seen by the read-only Samba server. Is there some file system caching going on that will eventually be flushed? As others have said, the read-only mount doesn't know how to poll the block device to see updates from the read-write mount, so you won't see updates to the data, and in general this is not a safe thing to do. One alternative would be taking a clone of a snapshot of the image, and mounting that read-only -- obviously that data will only be as up-to-date as whenever you did your last snapshot. If the read-only mounts are serving rarely updated files, the administrative overhead of doing the snapshot/remount on data updates might be acceptable. John
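A sketch of the snapshot/clone cycle John suggests, assuming a format-2 image rbd/myimage and that the clone maps to /dev/rbd1 (names and device are illustrative):
---
rbd snap create rbd/myimage@backup1
rbd snap protect rbd/myimage@backup1
rbd clone rbd/myimage@backup1 rbd/myimage-ro
rbd map rbd/myimage-ro
mount -o ro,norecovery /dev/rbd1 /mnt/readonly
---
On each data update, repeat with a fresh snapshot and remount. Note that cloning requires a format-2 image, and mounting XFS read-only may still attempt log replay, hence the norecovery option.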
[ceph-users] Content-length error uploading big files to radosgw
Hello, I have been trying to upload multi-gigabyte files to Ceph via the object gateway, using both the swift and s3 APIs. With files up to about 2GB everything works as expected. With files bigger than that I get back a 400 Bad Request error, both with S3 (boto) and Swift clients. Enabling debug I can see this: 2014-12-18 12:38:28.947499 7f5419ffb700 20 CONTENT_LENGTH=307200 ... 2014-12-18 12:38:28.947539 7f5419ffb700 1 == starting new request req=0x7f541000fee0 = 2014-12-18 12:38:28.947556 7f5419ffb700 2 req 2:0.17::PUT /test/test::initializing 2014-12-18 12:38:28.947581 7f5419ffb700 10 bad content length, aborting 2014-12-18 12:38:28.947641 7f5419ffb700 2 req 2:0.000102::PUT /test/test::http status=400 2014-12-18 12:38:28.947644 7f5419ffb700 1 == req done req=0x7f541000fee0 http_status=400 == The content length is the right one (I created a test file with dd). With a file 207200 bytes long, I get no error. The gateway is running on debian, with the packages available on the ceph repo, version 0.87-1~bpo70+1. I am using standard apache (no 100-continue). Is there a limit on the object size? Or is there an error in my configuration somewhere? Thank you, Daniele -- Daniele Venzano http://www.brownhat.org
[ceph-users] New Cluster (0.87), Missing Default Pools?
Hi All, Just set up the monitor for a new cluster based on Giant (0.87) and I find that only the 'rbd' pool was created automatically. I don't see the 'data' or 'metadata' pools in 'ceph osd lspools' or the log files. I haven't set up any OSDs or MDSs yet. I'm following the manual deployment guide. Would you mind looking over the setup details/logs below and letting me know my mistake please? Here's my /etc/ceph/ceph.conf file: --- [global] fsid = xx public network = xx.xx.xx.xx/xx cluster network = xx.xx.xx.xx/xx auth cluster required = cephx auth service required = cephx auth client required = cephx osd pool default size = 2 osd pool default min size = 1 osd pool default pg num = 100 osd pool default pgp num = 100 [mon] mon initial members = a [mon.a] host = xx mon addr = xx.xx.xx.xx --- Here are the commands used to set up the monitor: --- ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow' ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap mkdir /var/lib/ceph/mon/ceph-a ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring /etc/init.d/ceph-mon.a start --- Here's the ceph-mon.a logfile: --- 2014-12-18 12:35:45.768752 7fb00df94780 0 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 3225 2014-12-18 12:35:45.856851 7fb00df94780 0 mon.a does not exist in monmap, will attempt to join an existing cluster 2014-12-18 12:35:45.857069 7fb00df94780 0 using public_addr xx.xx.xx.xx:0/0 -> xx.xx.xx.xx:6789/0 2014-12-18 12:35:45.857126 7fb00df94780 0 starting mon.a rank -1 at xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx 2014-12-18 12:35:45.857330 7fb00df94780 1 mon.a@-1(probing) e0 preinit fsid xx 2014-12-18 12:35:45.857402 7fb00df94780 1 mon.a@-1(probing) e0 initial_members a, filtering seed monmap 2014-12-18 12:35:45.858322 7fb00df94780 0 mon.a@-1(probing) e0 my rank is now 0 (was -1) 2014-12-18 12:35:45.858360 7fb00df94780 1 mon.a@0(probing) e0 win_standalone_election 2014-12-18 12:35:45.859803 7fb00df94780 0 log_channel(cluster) log [INF] : mon.a@0 won leader election with quorum 0 2014-12-18 12:35:45.863846 7fb008d4b700 1 mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 -> 0 2014-12-18 12:35:45.863867 7fb008d4b700 1 mon.a@0(leader).pg v0 on_upgrade discarding in-core PGMap 2014-12-18 12:35:45.865662 7fb008d4b700 1 mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 -> 0 2014-12-18 12:35:45.865719 7fb008d4b700 1 mon.a@0(probing) e1 win_standalone_election 2014-12-18 12:35:45.867394 7fb008d4b700 0 log_channel(cluster) log [INF] : mon.a@0 won leader election with quorum 0 2014-12-18 12:35:46.003223 7fb008d4b700 0 log_channel(cluster) log [INF] : monmap e1: 1 mons at {a=xx.xx.xx.xx:6789/0} 2014-12-18 12:35:46.040555 7fb008d4b700 1 mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 -> 0 2014-12-18 12:35:46.087081 7fb008d4b700 0 log_channel(cluster) log [INF] : pgmap v1: 0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail 2014-12-18 12:35:46.141415 7fb008d4b700 0 mon.a@0(leader).mds e1 print_map epoch 1 flags 0 created 0.00 modified 2014-12-18 12:35:46.038418 tableserver 0 root 0 session_timeout 0 session_autoclose 0 max_file_size 0 last_failure 0 last_failure_osd_epoch 0 compat compat={},rocompat={},incompat={} max_mds 0 in up {} failed stopped data_pools metadata_pool 0 inline_data disabled 2014-12-18 12:35:46.151117 7fb008d4b700 0 log_channel(cluster) log [INF] : mdsmap e1: 0/0/0 up 2014-12-18 12:35:46.152873 7fb008d4b700 1 mon.a@0(leader).osd e1 e1: 0 osds: 0 up, 0 in 2014-12-18 12:35:46.154551 7fb008d4b700 0 mon.a@0(leader).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-12-18 12:35:46.154580 7fb008d4b700 0 mon.a@0(leader).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-12-18 12:35:46.154588 7fb008d4b700 0 mon.a@0(leader).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-12-18 12:35:46.154592 7fb008d4b700 0 mon.a@0(leader).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-12-18 12:35:46.157078 7fb008d4b700 0 log_channel(cluster) log [INF] : osdmap e1: 0 osds: 0 up, 0 in 2014-12-18 12:35:46.220701 7fb008d4b700 1 mon.a@0(leader).paxosservice(auth 1..1) refresh upgraded, format 0 -> 1 2014-12-18 12:35:46.334457 7fb008d4b700 0 log_channel(cluster) log [INF] : pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
[ceph-users] Need help from Ceph experts
Hi Guys, I am very new to Ceph and have a couple of questions - 1. Can we install Ceph on a single node (both Monitor and OSD)? 2. What should be the minimum hardware requirements of the server (CPU, Memory, NIC etc)? 3. Any webpage where I can find the installation guide to install Ceph on one node? I will be eagerly waiting for your response. Please note that performance and redundancy are not an issue for me, but I want to integrate it with OpenStack in the end. Kind Regards Debashish Das
Re: [ceph-users] New Cluster (0.87), Missing Default Pools?
No mistake -- the Ceph FS pools are no longer created by default, as not everybody needs them. Ceph FS users now create these pools explicitly: http://ceph.com/docs/master/cephfs/createfs/ John On Thu, Dec 18, 2014 at 12:52 PM, Dyweni - Ceph-Users 6exbab4fy...@dyweni.com wrote: Hi All, Just set up the monitor for a new cluster based on Giant (0.87) and I find that only the 'rbd' pool was created automatically. [original config and logs snipped]
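Per the linked createfs doc, the explicit creation looks roughly like this (pool names and PG counts follow the documented examples and are not from this thread):
---
ceph osd pool create cephfs_data 100
ceph osd pool create cephfs_metadata 100
ceph fs new cephfs cephfs_metadata cephfs_data
---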
Re: [ceph-users] New Cluster (0.87), Missing Default Pools?
I remember reading somewhere (maybe in the changelogs) that default pools are not created automatically anymore. You can create the pools you need yourself. -- Thomas Lemarchand Cloud Solutions SAS - Responsable des systèmes d'information On jeu., 2014-12-18 at 06:52 -0600, Dyweni - Ceph-Users wrote: Hi All, Just set up the monitor for a new cluster based on Giant (0.87) and I find that only the 'rbd' pool was created automatically. [original config and logs snipped]
Re: [ceph-users] Need help from Ceph experts
Hey Debashish, On Thu, Dec 18, 2014 at 6:21 AM, Debashish Das deba@gmail.com wrote: Hi Guys, I am very new to Ceph and have a couple of questions - 1. Can we install Ceph on a single node (both Monitor and OSD)? You can, but I would only recommend it for testing/experimentation. No production (or even pre-production) cluster with any meaningful amount of use should be a single node. 2. What should be the minimum hardware requirements of the server (CPU, Memory, NIC etc)? There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. I would suggest taking a look at some of the hardware guides and reference architectures available though. A few examples would be: http://ceph.com/docs/master/start/hardware-recommendations/ https://engage.redhat.com/inktank-hardware-selection-guide-s-201409080912 https://engage.redhat.com/inktank-ceph-storage-for-dell-s-201409081132 http://karan-mj.blogspot.com/2014/01/zero-to-hero-guide-for-ceph-cluster.html http://www.supermicro.com/solutions/datasheet_Ceph.pdf 3. Any webpage where I can find the installation guide to install Ceph on one node? Since a single node isn't really demonstrating a realistic Ceph install, we decided that a multi-node install was more effective, which is what you'll see in our docs. However, if you'd like a contained Ceph install to experiment with, you can try the latest qemu advent calendar image ( http://www.qemu-advent-calendar.org/#day-18 ) or try the installation instructions from an older version of the doc: http://ceph.com/docs/v0.67.9/start/quick-start/ While that doc is quite outdated I'm sure you can see how to adapt the more recent install guide to that procedure if you're really set on doing a single node install. I would probably just use the qemu image for experimentation and then move to the multi node install. Hope that helps! Best Regards, Patrick McGarry Director Ceph Community || Red Hat http://ceph.com || http://community.redhat.com @scuttlemonkey || @ceph
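One config note for single-node experiments (an assumption based on common practice, not from Patrick's mail): with only one host, the default CRUSH rule cannot place replicas across hosts, so replication must be allowed within a host:
---
[global]
osd pool default size = 2
osd crush chooseleaf type = 0   # pick OSDs, not hosts, as failure domains
---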
[ceph-users] Happy Holidays with Ceph QEMU Advent
Howdy Ceph rangers, Just wanted to kick off a bit of holiday cheer from our Ceph family to yours. As a part of the QEMU advent calendar [0], we have finally built a quick-and-dirty Ceph image for the purposes of trying and experimenting with Ceph. Feel free to download it [1] and try it out, or send it on to your friends who have yet to experience the fun of Ceph! Hope this works for those who had requested a simple Ceph image. Happy holidays and happy tinkering! [0] http://www.qemu-advent-calendar.org/#day-18 [1] http://www.qemu-advent-calendar.org/download/ceph.tar.xz Best Regards, Patrick McGarry Director Ceph Community || Red Hat http://ceph.com || http://community.redhat.com @scuttlemonkey || @ceph
[ceph-users] When is the rctime updated in CephFS?
Hi, I've been playing around a bit with the recursive statistics for CephFS today and I'm seeing some behavior with the rstats that I don't understand. I have /A/B/C in my CephFS. I changed a file in 'C' and the ceph.dir.rctime xattr changed immediately. I've been waiting for 60 minutes now, but /A and /A/B still have their old rctime. A: 1418905422 (18-12-2014 13:23:42) B: 1418905422 (18-12-2014 13:23:42) C: 1418909134 (18-12-2014 14:25:34) It's 15:21:34 right now, so after 1 hour the rctime of A and B still hasn't updated. How long does this take? I know the MDS is lazy in updating the rstats, but one hour is quite long, isn't it? Ceph version 0.89 Linux 3.18 kernel client Ceph fuse client 0.89 -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] When is the rctime updated in CephFS?
On Thu, 18 Dec 2014, Wido den Hollander wrote: Hi, I've been playing around a bit with the recursive statistics for CephFS today and I'm seeing some behavior with the rstats that I don't understand. I have /A/B/C in my CephFS. I changed a file in 'C' and the ceph.dir.rctime xattr changed immediately. I've been waiting for 60 minutes now, but /A and /A/B still have their old rctime. A: 1418905422 (18-12-2014 13:23:42) B: 1418905422 (18-12-2014 13:23:42) C: 1418909134 (18-12-2014 14:25:34) It's 15:21:34 right now, so after 1 hour the rctime of A and B still hasn't updated. How long does this take? I know the MDS is lazy in updating the rstats, but one hour is quite long, isn't it? This is a bit of a loose end at the moment. The client doesn't have any refresh value for these stats. Right now an 'ls' in the parent dir will get you a fresh value, but repeatedly calling 'stat' will keep giving you the cached value. I'm not sure what the right fix is. The normal inode fields are all perfectly accurate, and the protocol is built around making sure that's the case.. not giving reasonably timely values to the new stuff. :/ sage
Re: [ceph-users] When is the rctime updated in CephFS?
On 12/18/2014 03:37 PM, Sage Weil wrote: On Thu, 18 Dec 2014, Wido den Hollander wrote: Hi, I've been playing around a bit with the recursive statistics for CephFS today and I'm seeing some behavior with the rstats that I don't understand. I have /A/B/C in my CephFS. I changed a file in 'C' and the ceph.dir.rctime xattr changed immediately. I've been waiting for 60 minutes now, but /A and /A/B still have their old rctime. A: 1418905422 (18-12-2014 13:23:42) B: 1418905422 (18-12-2014 13:23:42) C: 1418909134 (18-12-2014 14:25:34) It's 15:21:34 right now, so after 1 hour the rctime of A and B still hasn't updated. How long does this take? I know the MDS is lazy in updating the rstats, but one hour is quite long, isn't it? This is a bit of a loose end at the moment. The client doesn't have any refresh value for these stats. Right now an 'ls' in the parent dir will get you a fresh value, but repeatedly calling 'stat' will keep giving you the cached value. The ls didn't really trigger it for me. I'm using getfattr btw: $ getfattr -n ceph.dir.rctime /mnt/cephfs/A I unmounted and mounted and it worked right away. So this is probably not a real issue on an active filesystem where lots of I/O on that client is happening, right? I'm building a PoC backup script which uses the rctimes to back up CephFS in a reasonable way, not having rsync scan the whole tree. I'm not sure what the right fix is. The normal inode fields are all perfectly accurate, and the protocol is built around making sure that's the case.. not giving reasonably timely values to the new stuff. :/ sage -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] When is the rctime updated in CephFS?
Hi Wido, I'm really interested in your script. Will you release it? I'm sure I'm not the only one interested ;) If you need some help (testing or something else), don't hesitate to ask me. -- Thomas Lemarchand Cloud Solutions SAS - Responsable des systèmes d'information On jeu., 2014-12-18 at 15:47 +0100, Wido den Hollander wrote: [quoted thread snipped]
Re: [ceph-users] Help with SSDs
On Thu, 18 Dec 2014 10:05:20 PM Mark Kirkwood wrote: The effect of this is *highly* dependent on the SSD make/model. My m550s work vastly better if the journal is a file on a filesystem as opposed to a partition. Obviously the Intel S3700/S3500 are a better choice - but the OP has already purchased Sammy 840's, so I'm trying to suggest options to try that don't require him to buy new SSDs! I have 120GB Samsung 840 EVOs with 10GB journal partitions and just gave this a go. No real change unfortunately :( using rados bench. However, it does make experimenting with different journal sizes easier. -- Lindsay
Re: [ceph-users] When is the rctime updated in CephFS?
On 12/18/2014 03:52 PM, Thomas Lemarchand wrote: Hi Wido, I'm really interested in your script. Will you release it? I'm sure I'm not the only one interested ;) Well, it's not a general script to back up CephFS with. It's a fairly simple Bash script I'm writing for a specific situation where the directory layout is known: /year/month/project The script checks which year, month or project have changed since it last ran. If a project changed, it fires off rsync to back up that project to an NFS mount. This saves us scanning 2,000 projects with rsync when we know that only about 25 change every day. The coolest thing would be if rsync could use these xattrs and become more clever, but some code which uses libcephfs would also be nice. So sorry, it's not something you can use on any CephFS deployment. If you need some help (testing or something else), don't hesitate to ask me. -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
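For the archives, a minimal sketch of the approach Wido describes (layout /year/month/project; the state file, mount points and rsync flags are hypothetical - this is not Wido's actual script):
---
#!/bin/bash
# Back up only projects whose recursive ctime changed since the last run.
STATE=/var/lib/cephfs-backup/last_run    # hypothetical state file
SRC=/mnt/cephfs                          # CephFS mount
DST=/mnt/nfs-backup                      # NFS backup target
last=$(cat "$STATE" 2>/dev/null || echo 0)
now=$(date +%s)
for project in "$SRC"/*/*/*; do
    # ceph.dir.rctime looks like "1418909134.09...", keep the seconds part
    rctime=$(getfattr --only-values -n ceph.dir.rctime "$project" 2>/dev/null | cut -d. -f1)
    if [ "${rctime:-0}" -gt "$last" ]; then
        rsync -a --delete "$project/" "$DST${project#$SRC}/"
    fi
done
echo "$now" > "$STATE"
---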
Re: [ceph-users] New Cluster (0.87), Missing Default Pools?
Thanks!! Looks like the manual installation instructions should be updated, to eliminate future confusion. Dyweni On 2014-12-18 07:11, John Spray wrote: No mistake -- the Ceph FS pools are no longer created by default, as not everybody needs them. Ceph FS users now create these pools explicitly: http://ceph.com/docs/master/cephfs/createfs/ John On Thu, Dec 18, 2014 at 12:52 PM, Dyweni - Ceph-Users 6exbab4fy...@dyweni.com wrote: Hi All, Just set up the monitor for a new cluster based on Giant (0.87) and I find that only the 'rbd' pool was created automatically. [original config and logs snipped]
Re: [ceph-users] New Cluster (0.87), Missing Default Pools?
Can you point out the specific page that's out of date so that we can update it? Thanks, John On Thu, Dec 18, 2014 at 5:52 PM, Dyweni - Ceph-Users 6exbab4fy...@dyweni.com wrote: Thanks!! Looks like the manual installation instructions should be updated, to eliminate future confusion. Dyweni [rest of quoted thread snipped]
Re: [ceph-users] New Cluster (0.87), Missing Default Pools?
No! That would have been a really bad idea. I upgraded without losing my default pools, fortunately ;) -- Thomas Lemarchand Cloud Solutions SAS - Responsable des systèmes d'information On jeu., 2014-12-18 at 10:10 -0800, JIten Shah wrote: So what happens if we upgrade from Firefly to Giant? Do we lose the pools? —Jiten On Dec 18, 2014, at 5:12 AM, Thomas Lemarchand thomas.lemarch...@cloud-solutions.fr wrote: I remember reading somewhere (maybe in the changelogs) that default pools are not created automatically anymore. You can create the pools you need yourself. [rest of quoted thread snipped]
Re: [ceph-users] New Cluster (0.87), Missing Default Pools?
On Thu, Dec 18, 2014 at 6:10 PM, JIten Shah jshah2...@me.com wrote: So what happens if we upgrade from Firefly to Giant? Do we lose the pools? Sure, you didn't have any data you wanted to keep, right? :-D Seriously though, no, we don't delete anything during an upgrade. It's just newly installed clusters that would never have those pools created. John
Re: [ceph-users] Ceph Block device and Trim/Discard
One question re: discard support for kRBD -- does it matter which format the RBD is? Are Format 1 and Format 2 both okay, or just Format 2? - Travis On Mon, Dec 15, 2014 at 8:58 AM, Max Power mailli...@ferienwohnung-altenbeken.de wrote: Ilya Dryomov ilya.dryo...@inktank.com wrote on 12 December 2014 at 18:00: Just a note, discard support went into 3.18, which was released a few days ago. I recently compiled 3.18 on Debian 7 and, what can I say... it works perfectly well. The used memory goes up and down again. So I think this will be my choice. Thank you!
Re: [ceph-users] Ceph Block device and Trim/Discard
On 12/18/2014 10:49 AM, Travis Rhoden wrote: One question re: discard support for kRBD -- does it matter which format the RBD is? Are Format 1 and Format 2 both okay, or just Format 2? It shouldn't matter which format you use. Josh
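To exercise discard on a mapped RBD with a 3.18+ kernel, something like the following should work (image name and device are assumptions):
---
rbd map rbd/myimage                    # appears as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mount -o discard /dev/rbd0 /mnt/img    # online discard on unlink
fstrim -v /mnt/img                     # or trim in batches instead
---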
Re: [ceph-users] Content-length error uploading big files to radosgw
On Thu, Dec 18, 2014 at 4:04 AM, Daniele Venzano li...@brownhat.org wrote: Hello, I have been trying to upload multi-gigabyte files to Ceph via the object gateway, using both the swift and s3 APIs. With files up to about 2GB everything works as expected. With files bigger than that I get back a 400 Bad Request error, both with S3 (boto) and Swift clients. Enabling debug I can see this: 2014-12-18 12:38:28.947499 7f5419ffb700 20 CONTENT_LENGTH=307200 ... 2014-12-18 12:38:28.947539 7f5419ffb700 1 == starting new request req=0x7f541000fee0 = 2014-12-18 12:38:28.947556 7f5419ffb700 2 req 2:0.17::PUT /test/test::initializing 2014-12-18 12:38:28.947581 7f5419ffb700 10 bad content length, aborting 2014-12-18 12:38:28.947641 7f5419ffb700 2 req 2:0.000102::PUT /test/test::http status=400 2014-12-18 12:38:28.947644 7f5419ffb700 1 == req done req=0x7f541000fee0 http_status=400 == The content length is the right one (I created a test file with dd). With a file 207200 bytes long, I get no error. The gateway is running on debian, with the packages available on the ceph repo, version 0.87-1~bpo70+1. I am using standard apache (no 100-continue). Is there a limit on the object size? Or is there an error in my configuration somewhere? You just stated it: you need 100-continue to upload parts larger than 2GB. -Greg
Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver
On Wed, Dec 17, 2014 at 8:52 PM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: I've been experimenting with CephFS for running KVM images (proxmox). cephfs fuse version - 0.87 cephfs kernel module - kernel version 3.10 Part of my testing involves running a Windows 7 VM up and running CrystalDiskMark to check the I/O in the VM. It's surprisingly good with both the fuse and the kernel driver; seq reads/writes are actually faster than the underlying disk, so I presume the FS is aggressively caching. With the fuse driver I have no problems. With the kernel driver, the benchmark runs fine, but when I reboot the VM the drive is corrupted and unreadable, every time. Rolling back to a snapshot fixes the disk. This does not happen unless I run the benchmark, which I presume is writing a lot of data. No problems with the same test for Ceph rbd, or NFS. Do you have any information about *how* the drive is corrupted; what part Win7 is unhappy with? I don't know how Proxmox configures it, but I assume you're storing the disk images as single files on the FS? I'm really not sure what the kernel client could even do here, since if you're not rebooting the host as well as the VM then it can't be losing any of the data it's given. :/ -Greg
Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver
Hi Lindsay, have you tried the different cache options (no cache, write through, ...) which Proxmox offers for the drive? Udo On 18.12.2014 05:52, Lindsay Mathieson wrote: I've been experimenting with CephFS for running KVM images (proxmox). cephfs fuse version - 0.87 cephfs kernel module - kernel version 3.10 Part of my testing involves running a Windows 7 VM up and running CrystalDiskMark to check the I/O in the VM. It's surprisingly good with both the fuse and the kernel driver; seq reads/writes are actually faster than the underlying disk, so I presume the FS is aggressively caching. With the fuse driver I have no problems. With the kernel driver, the benchmark runs fine, but when I reboot the VM the drive is corrupted and unreadable, every time. Rolling back to a snapshot fixes the disk. This does not happen unless I run the benchmark, which I presume is writing a lot of data. No problems with the same test for Ceph rbd, or NFS.
Re: [ceph-users] Ceph Block device and Trim/Discard
Discard is supported in kernel 3.18-rc1 or greater, as per https://lkml.org/lkml/2014/10/14/450 -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert Sander Sent: Friday, December 12, 2014 7:01 AM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph Block device and Trim/Discard On 12.12.2014 12:48, Max Power wrote: It would be great to shrink the used space. Is there a way to achieve this? Or have I done something wrong? In a professional environment you may be able to live with filesystems that only grow. But on my small home cluster this really is a problem. As Wido already mentioned, the kernel RBD does not support discard. When using qemu+rbd you cannot use the virtio driver, as it also does not support discard. My best experience is with the virtual SATA driver and the options cache=writeback and discard=on. Regards -- Robert Sander Heinlein Support GmbH Schwedter Str. 8/9b, 10119 Berlin http://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Mandatory information per §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Managing director: Peer Heinlein -- Registered office: Berlin
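For raw qemu (Proxmox exposes this as cache=writeback,discard=on on the drive), the SATA setup Robert describes might look like this (image name and IDs are illustrative):
---
qemu-system-x86_64 ... \
  -device ahci,id=ahci0 \
  -drive file=rbd:rbd/myimage,format=raw,if=none,id=drive0,cache=writeback,discard=unmap \
  -device ide-hd,drive=drive0,bus=ahci0.0
---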
Re: [ceph-users] Content-length error uploading big files to radosgw
On Thu, Dec 18, 2014 at 11:24 AM, Gregory Farnum g...@gregs42.com wrote: On Thu, Dec 18, 2014 at 4:04 AM, Daniele Venzano li...@brownhat.org wrote: Hello, I have been trying to upload multi-gigabyte files to Ceph via the object gateway, using both the swift and s3 APIs. With files up to about 2GB everything works as expected. With files bigger than that I get back a 400 Bad Request error, both with S3 (boto) and Swift clients. Enabling debug I can see this: 2014-12-18 12:38:28.947499 7f5419ffb700 20 CONTENT_LENGTH=307200 ... 2014-12-18 12:38:28.947539 7f5419ffb700 1 == starting new request req=0x7f541000fee0 = 2014-12-18 12:38:28.947556 7f5419ffb700 2 req 2:0.17::PUT /test/test::initializing 2014-12-18 12:38:28.947581 7f5419ffb700 10 bad content length, aborting 2014-12-18 12:38:28.947641 7f5419ffb700 2 req 2:0.000102::PUT /test/test::http status=400 2014-12-18 12:38:28.947644 7f5419ffb700 1 == req done req=0x7f541000fee0 http_status=400 == The content length is the right one (I created a test file with dd). With a file 207200 bytes long, I get no error. The gateway is running on debian, with the packages available on the ceph repo, version 0.87-1~bpo70+1. I am using standard apache (no 100-continue). Is there a limit on the object size? Or is there an error in my configuration somewhere? You just stated it: you need 100-continue to upload parts larger than 2GB. Just a small clarification: the 100-continue is needed to get an early error message. We should be able to support up to 5GB in a single part, so it could well be a bug. Usually for large size uploads you should be using the multipart upload api. Yehuda ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
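Since the failure is on single-part PUTs, the practical workaround is the multipart/segmented upload path Yehuda mentions. Hedged sketches, assuming s3cmd (1.5+) and python-swiftclient are already configured against the gateway; bucket, container, and file names are placeholders:

    # S3: multipart upload in 100 MB parts
    s3cmd put --multipart-chunk-size-mb=100 bigfile.bin s3://test/bigfile.bin

    # Swift: segmented upload in 1 GB segments
    swift upload --segment-size 1073741824 test bigfile.bin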
Re: [ceph-users] Reproducible Data Corruption with cephfs kernel driver
On Thu, 18 Dec 2014 08:41:21 PM Udo Lembke wrote: have you tried the different cache options (no cache, write through, ...) which proxmox offers for the drive? I tried with writeback and it didn't corrupt. -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Reproducible Data Corruption with cephfs kernel driver
On Thu, 18 Dec 2014 11:23:42 AM Gregory Farnum wrote: Do you have any information about *how* the drive is corrupted; what part Win7 is unhappy with? Failure to find the boot sector, I think - I'll run it again and take a screenshot. I don't know how Proxmox configures it, but I assume you're storing the disk images as single files on the FS? It's a single KVM qcow2 file. -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Reproducible Data Corruption with cephfs kernel driver
On Thu, Dec 18, 2014 at 8:40 PM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: I don't know how Proxmox configures it, but I assume you're storing the disk images as single files on the FS? It's a single KVM qcow2 file. Like the cache mode, the image format might be an interesting thing to experiment with. There are bugs out there in all layers of the IO stack; it's entirely possible that you're seeing a bug elsewhere in the stack that is only being triggered when using Ceph. This probably goes without saying, but make sure you're using the latest/greatest versions of everything, including kvm/qemu/proxmox/kernel/guest drivers. John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] What to do when a parent RBD clone becomes corrupted
Before we base thousands of VM image clones off of one or more snapshots, I want to test what happens when the snapshot becomes corrupted. I don't believe the snapshot will become corrupted through client access to the snapshot, but rather through some weird issue with PGs being lost or forced to be lost, solar flares, or alien invasions. My initial thought was to export a snapshot image and import it over the top of the existing snapshot so that children would be preserved. No such luck. I was hoping there would be an i-really-really-want-to-do-this option that would let me restore the snapshot. Am I going about this the wrong way? I can see having to restore a number of VMs because of a corrupted clone, but I'd hate to lose all the clones because of corruption in the snapshot. I would be happy if the restored snapshot would be flattened if it was a clone of another image previously. Thanks, Robert LeBlanc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
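One hedged way to limit the blast radius up front is to flatten clones so they stop depending on the parent snapshot at all, trading space for independence (pool and image names below are hypothetical):

    # list what depends on the snapshot
    rbd children images/base@snap
    # copy the parent's blocks into a clone, breaking the dependency
    rbd flatten compute/vm1
    # once no children remain, the snapshot can be unprotected and removed
    rbd snap unprotect images/base@snap

Flattening thousands of clones defeats the space savings of cloning, of course, so this is only a sketch of the trade-off, not a way to restore a corrupted parent in place.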
Re: [ceph-users] Help with SSDs
On 19/12/14 03:01, Lindsay Mathieson wrote: On Thu, 18 Dec 2014 10:05:20 PM Mark Kirkwood wrote: The effect of this is *highly* dependent on the SSD make/model. My m550s work vastly better if the journal is a file on a filesystem as opposed to a partition. Obviously the Intel S3700/S3500 are a better choice - but the OP has already purchased Sammy 840's, so I'm trying to suggest options to try that don't require him to buy new SSDs! I have 120GB Samsung 840 EVO's with 10GB journal partitions and just gave this a go. No real change unfortunately :( using rados bench. However it does make experimenting with different journal sizes easier. Pity. If you're using XFS you can try tweaking some of the mkfs options... but I doubt they will make much difference. Looking at the data specs for the 840, it does not seem to have any on-board capacitors. If it did you could risk switching off xfs write barriers... which would probably make a big difference. You could try switching *off* the write cache (just in case the 840 behaves like my m550's and gets - oddly - 2x faster for sync writes in that case)! However disabling the write cache may *considerably* decrease disk lifetime, so if the setting helps in your case, you'll need to conduct some experiments to measure by how much the lifetime is going to be impacted. Cheers Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
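For the write cache experiment, something like the following (device name is hypothetical; note the hdparm setting may not survive a power cycle, so re-check after reboots):

    # show the current volatile write cache state
    hdparm -W /dev/sdb
    # disable it
    hdparm -W0 /dev/sdb
    # re-enable it if the results (or the wear figures) argue against it
    hdparm -W1 /dev/sdb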
Re: [ceph-users] Need help from Ceph experts
On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. Technically, the smallest cluster is a single node with a 10 GiB disk. Anything smaller won't work. That said, Ceph was envisioned to run on large clusters. IIRC, the reference architecture has 7 rows, each row having 10 racks, all full. Those of us running small clusters (fewer than 10 nodes) are noticing that it doesn't work quite as well. We have to significantly scale back the amount of backfilling and recovery that is allowed. I try to keep all backfill/recovery operations touching less than 20% of my OSDs. In the reference architecture, it could lose a whole row, and still keep under that limit. My 5-node cluster is noticeably better than the 3-node cluster. It's faster, has lower latency, and latency doesn't increase as much during recovery operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
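The usual way to scale back backfill/recovery on a small cluster is via the OSD throttles, either in ceph.conf or injected at runtime. A minimal sketch with commonly used conservative values (tune to taste):

    [osd]
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery op priority = 1

    # or at runtime, without restarting the OSDs:
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'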
Re: [ceph-users] Need help from Ceph experts
I'm interested to know if there is a reference to this reference architecture. It would help alleviate some of the fears we have about scaling this thing to a massive scale (10,000s of OSDs). Thanks, Robert LeBlanc On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com wrote: On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. Technically, the smallest cluster is a single node with a 10 GiB disk. Anything smaller won't work. That said, Ceph was envisioned to run on large clusters. IIRC, the reference architecture has 7 rows, each row having 10 racks, all full. Those of us running small clusters (fewer than 10 nodes) are noticing that it doesn't work quite as well. We have to significantly scale back the amount of backfilling and recovery that is allowed. I try to keep all backfill/recovery operations touching less than 20% of my OSDs. In the reference architecture, it could lose a whole row, and still keep under that limit. My 5-node cluster is noticeably better than the 3-node cluster. It's faster, has lower latency, and latency doesn't increase as much during recovery operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Need help from Ceph experts
I think this is it: https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939 You can also check out a presentation on CERN's Ceph cluster: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern At large scale, the biggest problem will likely be network I/O on the inter-switch links. On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us wrote: I'm interested to know if there is a reference to this reference architecture. It would help alleviate some of the fears we have about scaling this thing to a massive scale (10,000s of OSDs). Thanks, Robert LeBlanc On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com wrote: On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. Technically, the smallest cluster is a single node with a 10 GiB disk. Anything smaller won't work. That said, Ceph was envisioned to run on large clusters. IIRC, the reference architecture has 7 rows, each row having 10 racks, all full. Those of us running small clusters (fewer than 10 nodes) are noticing that it doesn't work quite as well. We have to significantly scale back the amount of backfilling and recovery that is allowed. I try to keep all backfill/recovery operations touching less than 20% of my OSDs. In the reference architecture, it could lose a whole row, and still keep under that limit. My 5-node cluster is noticeably better than the 3-node cluster. It's faster, has lower latency, and latency doesn't increase as much during recovery operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Have 2 different public networks
Hi, Is it possible to have 2 different public networks in a Ceph cluster? I'll explain my question below. Currently, I have 3 identical nodes in my Ceph cluster. Each node has: - only 1 monitor; - n osds (we don't care about the value n here); - and 3 interfaces. One interface for the cluster network (10.0.0.0/24): - node1 - 10.0.0.1 - node2 - 10.0.0.2 - node3 - 10.0.0.3 One interface for the public network (10.0.1.0/24): - node1 - [mon.1] mon addr = 10.0.1.1 - node2 - [mon.2] mon addr = 10.0.1.2 - node3 - [mon.3] mon addr = 10.0.1.3 And one interface not used yet (see below). With this configuration, if I have a Ceph client in the public network, I can use rbd images etc. No problem, it works. But now I would like to use the third interface of the nodes for a *different* public network - 10.0.2.0/24. The Ceph clients in this network will not really use the storage but will create and delete rbd images in a pool. In fact it's just a network for *Ceph management*. So, I want to have 2 different public networks: - 10.0.1.0/24 (already exists) - *and* 10.0.2.0/24 Am I wrong if I say that mon.1, mon.2 and mon.3 must have one more IP address? Is it possible to have a monitor that listens on 2 addresses? Something like this: - node1 - [mon.1] mon addr = 10.0.1.1 *and* 10.0.2.1 - node2 - [mon.2] mon addr = 10.0.1.2 *and* 10.0.2.2 - node3 - [mon.3] mon addr = 10.0.1.3 *and* 10.0.2.3 My environment is not a production environment, just a lab. So, if necessary I can reinstall everything, no problem. Thanks for your help. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Need help from Ceph experts
Thanks, I'll look into these. On Thu, Dec 18, 2014 at 5:12 PM, Craig Lewis cle...@centraldesktop.com wrote: I think this is it: https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939 You can also check out a presentation on CERN's Ceph cluster: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern At large scale, the biggest problem will likely be network I/O on the inter-switch links. On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us wrote: I'm interested to know if there is a reference to this reference architecture. It would help alleviate some of the fears we have about scaling this thing to a massive scale (10,000s of OSDs). Thanks, Robert LeBlanc On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com wrote: On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. Technically, the smallest cluster is a single node with a 10 GiB disk. Anything smaller won't work. That said, Ceph was envisioned to run on large clusters. IIRC, the reference architecture has 7 rows, each row having 10 racks, all full. Those of us running small clusters (fewer than 10 nodes) are noticing that it doesn't work quite as well. We have to significantly scale back the amount of backfilling and recovery that is allowed. I try to keep all backfill/recovery operations touching less than 20% of my OSDs. In the reference architecture, it could lose a whole row, and still keep under that limit. My 5-node cluster is noticeably better than the 3-node cluster. It's faster, has lower latency, and latency doesn't increase as much during recovery operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Need help from Ceph experts
Hello, On Thu, 18 Dec 2014 16:12:09 -0800 Craig Lewis wrote: Firstly I'd like to confirm what Craig said about small clusters. I just changed my four storage node test cluster from 1 OSD per node to 4 and it can now saturate a 1GbE link (110MB/s) where before it peaked at 50-60MB/s. Of course now it is CPU bound and a bit tight on memory (those nodes have 4GB RAM and 2 have just 1 CPU/core). ^o^ I think this is it: https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939 Ah, the joys of corporate address packratting. You can also check out a presentation on CERN's Ceph cluster: http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern At large scale, the biggest problem will likely be network I/O on the inter-switch links. While true, I think it will hit an equilibrium of sorts; if you actually have enough client traffic to saturate those, it's time for an upgrade. Aside from mere technical questions and challenges of scaling Ceph to those sizes (tuning all sorts of parameters, etc) I think clusters of that scale can become an administrative nightmare first and foremost. Let's take a look at a classic Ceph cluster with 10,000 OSDs: It will have somewhere between 500 and 1000 nodes. That number should give you pause already; there are bound to be dead nodes frequently. And with 10,000 disks, you're pretty much guaranteed to have a dead OSD or more (see the various threads about how resilient Ceph is) at any given time. So you'll need a team of people to swap disks on a constant/regular basis. And unless you also have a very nice inventory and tracking system, you will want to replace these OSDs in order, so that OSD 10 isn't on node 50 all of a sudden, etc. There's probably a point of diminishing returns when increasing OSDs stops making sense for various reasons. In fact once you reach a few hundred OSDs, to ease maintenance consider RAIDed OSDs (no more failed OSDs, yeah! ^o^). For me, the life cycle of a steadily growing cluster would be something like this: 1. Start with as many nodes/OSDs as you can afford for performance, even if you don't need the space yet. 2. Keep adding OSDs to satisfy space and performance requirements as needed. 3. While performance is still good (or can't improve because of network limitations) but space requirements increase, grow the size of your OSDs, not the number. Regards, Christian On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us wrote: I'm interested to know if there is a reference to this reference architecture. It would help alleviate some of the fears we have about scaling this thing to a massive scale (10,000s of OSDs). Thanks, Robert LeBlanc On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com wrote: On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com wrote: 2. What should be the minimum hardware requirement of the server (CPU, Memory, NIC etc) There is no real minimum to run Ceph, it's all about what your workload will look like and what kind of performance you need. We have seen Ceph run on Raspberry Pis. Technically, the smallest cluster is a single node with a 10 GiB disk. Anything smaller won't work. That said, Ceph was envisioned to run on large clusters. IIRC, the reference architecture has 7 rows, each row having 10 racks, all full. Those of us running small clusters (fewer than 10 nodes) are noticing that it doesn't work quite as well. We have to significantly scale back the amount of backfilling and recovery that is allowed. 
I try to keep all backfill/recovery operations touching less than 20% of my OSDs. In the reference architecture, it could lose a whole row, and still keep under that limit. My 5-node cluster is noticeably better than the 3-node cluster. It's faster, has lower latency, and latency doesn't increase as much during recovery operations. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Balzer - Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Have 2 different public networks
The daemons bind to *, so adding the 3rd interface to the machine will allow you to talk to the daemons on that IP. I'm not really sure how you'd set up the management network though. I'd start by setting the ceph.conf public network on the management nodes to 10.0.2.0/24, and creating an /etc/hosts file with the monitors' names on the 10.0.2.0/24 network. Make sure the management nodes can't route to the 10.0.1.0/24 network, and see what happens. Do you really plan on having enough traffic creating and deleting RBD images that you need a dedicated network? It seems like setting up link aggregation on 10.0.1.0/24 would be simpler and less error prone. On Thu, Dec 18, 2014 at 4:19 PM, Francois Lafont flafdiv...@free.fr wrote: Hi, Is it possible to have 2 different public networks in a Ceph cluster? I'll explain my question below. Currently, I have 3 identical nodes in my Ceph cluster. Each node has: - only 1 monitor; - n osds (we don't care about the value n here); - and 3 interfaces. One interface for the cluster network (10.0.0.0/24): - node1 - 10.0.0.1 - node2 - 10.0.0.2 - node3 - 10.0.0.3 One interface for the public network (10.0.1.0/24): - node1 - [mon.1] mon addr = 10.0.1.1 - node2 - [mon.2] mon addr = 10.0.1.2 - node3 - [mon.3] mon addr = 10.0.1.3 And one interface not used yet (see below). With this configuration, if I have a Ceph client in the public network, I can use rbd images etc. No problem, it works. But now I would like to use the third interface of the nodes for a *different* public network - 10.0.2.0/24. The Ceph clients in this network will not really use the storage but will create and delete rbd images in a pool. In fact it's just a network for *Ceph management*. So, I want to have 2 different public networks: - 10.0.1.0/24 (already exists) - *and* 10.0.2.0/24 Am I wrong if I say that mon.1, mon.2 and mon.3 must have one more IP address? Is it possible to have a monitor that listens on 2 addresses? Something like this: - node1 - [mon.1] mon addr = 10.0.1.1 *and* 10.0.2.1 - node2 - [mon.2] mon addr = 10.0.1.2 *and* 10.0.2.2 - node3 - [mon.3] mon addr = 10.0.1.3 *and* 10.0.2.3 My environment is not a production environment, just a lab. So, if necessary I can reinstall everything, no problem. Thanks for your help. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
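A sketch of the client-side experiment described above, with hypothetical addresses; whether it actually works depends on what the monitors bound to, as discussed in this thread:

    # ceph.conf on the management nodes only
    [global]
        public network = 10.0.2.0/24
        mon host = 10.0.2.1, 10.0.2.2, 10.0.2.3

    # /etc/hosts on the same nodes
    10.0.2.1 node1
    10.0.2.2 node2
    10.0.2.3 node3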
[ceph-users] High CPU/Delay when Removing Layered Child RBD Image
Hey All, On a new CentOS 7 deployment with Firefly I'm noticing strange behavior when deleting RBD child disks. It appears that upon deletion, CPU usage on each OSD process rises to about 75% for 30+ seconds. On my previous deployments with CentOS 6.x and Ubuntu 12/14 this was never a problem. Each RBD disk is 4GB, created with 'rbd clone images/136dd921-f6a2-432f-b4d6-e9902f71baa6@snap compute/test' ## Ubuntu12 3.11.0-18-generic with Ceph 0.80.7 root@node-1:~# date; rbd rm compute/test123; date Fri Dec 19 01:09:31 UTC 2014 Removing image: 100% complete...done. Fri Dec 19 01:09:31 UTC 2014 ## Cent7 3.18.1-1.el7.elrepo.x86_64 with Ceph 0.80.7 [root@hvm003 ~]# date; rbd rm compute/test; date Fri Dec 19 01:08:32 UTC 2014 Removing image: 100% complete...done. Fri Dec 19 01:09:00 UTC 2014 root@cpl001 ~]# ceph -s cluster d033718a-2cb9-409e-b968-34370bd67bd0 health HEALTH_OK monmap e1: 3 mons at {cpl001= 10.0.0.1:6789/0,mng001=10.0.0.3:6789/0,net001=10.0.0.2:6789/0}, election epoch 10, quorum 0,1,2 cpl001,net001,mng001 osdmap e84: 9 osds: 9 up, 9 in pgmap v618: 1792 pgs, 12 pools, 4148 MB data, 518 kobjects 15106 MB used, 4257 GB / 4272 GB avail 1792 active+clean Any assistance would be appreciated. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] High CPU/Delay when Removing Layered Child RBD Image
Okay, this is rather unrelated to Ceph but I might as well mention how this was fixed. With the Juno-release OpenStack packages, 'rbd_store_chunk_size = 8' is now interpreted as 8192 bytes rather than 8192 kB (8 MB), causing quite a few more objects to be stored and deleted. Setting it to 8192 got me the expected object size of 8 MB. On Thu, Dec 18, 2014 at 6:22 PM, Tyler Wilson k...@linuxdigital.net wrote: Hey All, On a new CentOS 7 deployment with Firefly I'm noticing strange behavior when deleting RBD child disks. It appears that upon deletion, CPU usage on each OSD process rises to about 75% for 30+ seconds. On my previous deployments with CentOS 6.x and Ubuntu 12/14 this was never a problem. Each RBD disk is 4GB, created with 'rbd clone images/136dd921-f6a2-432f-b4d6-e9902f71baa6@snap compute/test' ## Ubuntu12 3.11.0-18-generic with Ceph 0.80.7 root@node-1:~# date; rbd rm compute/test123; date Fri Dec 19 01:09:31 UTC 2014 Removing image: 100% complete...done. Fri Dec 19 01:09:31 UTC 2014 ## Cent7 3.18.1-1.el7.elrepo.x86_64 with Ceph 0.80.7 [root@hvm003 ~]# date; rbd rm compute/test; date Fri Dec 19 01:08:32 UTC 2014 Removing image: 100% complete...done. Fri Dec 19 01:09:00 UTC 2014 root@cpl001 ~]# ceph -s cluster d033718a-2cb9-409e-b968-34370bd67bd0 health HEALTH_OK monmap e1: 3 mons at {cpl001= 10.0.0.1:6789/0,mng001=10.0.0.3:6789/0,net001=10.0.0.2:6789/0}, election epoch 10, quorum 0,1,2 cpl001,net001,mng001 osdmap e84: 9 osds: 9 up, 9 in pgmap v618: 1792 pgs, 12 pools, 4148 MB data, 518 kobjects 15106 MB used, 4257 GB / 4272 GB avail 1792 active+clean Any assistance would be appreciated. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
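For reference, the setting lives in glance-api.conf (under [glance_store] on Juno). A sketch reflecting the behaviour described above, where the value was being taken as bytes on this particular install:

    [glance_store]
    # nominally a chunk size in MB; on the Juno packages described above it
    # behaved as bytes, so 8192 here yielded the expected 8 MB RADOS objects
    rbd_store_chunk_size = 8192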
Re: [ceph-users] Need help from Ceph experts
On 19 December 2014 at 11:14, Christian Balzer ch...@gol.com wrote: Hello, On Thu, 18 Dec 2014 16:12:09 -0800 Craig Lewis wrote: Firstly I'd like to confirm what Craig said about small clusters. I just changed my four storage node test cluster from 1 OSD per node to 4 and it can now saturate a 1GbE link (110MB/s) where before it peaked at 50-60MB/s. What min/max sizes do you have set? Anything special in your crush map? Did it improve your write speed and latency? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Need help from Ceph experts
On Fri, 19 Dec 2014 12:28:48 +1000 Lindsay Mathieson wrote: On 19 December 2014 at 11:14, Christian Balzer ch...@gol.com wrote: Hello, On Thu, 18 Dec 2014 16:12:09 -0800 Craig Lewis wrote: Firstly I'd like to confirm what Craig said about small clusters. I just changed my four storage node test cluster from 1 OSD per node to 4 and it can now saturate a 1GbE link (110MB/s) where before it peaked at 50-60MB/s. What min/max sizes do you have set? Anything special in your crush map? What specific values are you thinking about? But no, nothing special, no tuning at all with that cluster. The gain is simply from having more spindles to distribute the load (remember rados bench runs 16 threads by default and I use 64) amongst. Did it improve your write speed and latency? I was referring to write speed (bandwidth); for sequential reads a single HDD can saturate a 1GbE link, let alone 4. As for latency, somewhat. But this cluster isn't pure testing, no SSDs, so it is slow no matter what. Christian -- Christian Balzer - Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
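For anyone reproducing these numbers, a sketch of the rados bench invocations being discussed (the pool name is a placeholder; -t sets the concurrency mentioned above):

    # 60-second write test with 64 concurrent ops, keeping objects for the read test
    rados bench -p testpool 60 write -t 64 --no-cleanup
    # sequential read test over the objects just written
    rados bench -p testpool 60 seq -t 64
    # remove the benchmark objects afterwards
    rados -p testpool cleanup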
Re: [ceph-users] New Cluster (0.87), Missing Default Pools?
Hi John, Yes, no problem! I have a few items that I noticed. They are: 1. The missing 'data' and 'metadata' pools http://ceph.com/docs/master/install/manual-deployment/ Monitor Bootstrapping - Steps #17 and #18 2. The setting 'mon initial members' On page 'http://ceph.com/docs/master/rados/configuration/mon-config-ref/', 'mon initial members' are the IDs of the initial monitors in the cluster. On page 'http://ceph.com/docs/master/install/manual-deployment/' (Monitor Bootstrapping - Steps #6 and #14) it lists the members as the hostnames. 3. Creating the default data directory for the monitors: On page 'http://ceph.com/docs/master/rados/configuration/mon-config-ref/', 'mon data' defaults to '/var/lib/ceph/mon/$cluster-$id'. On page 'http://ceph.com/docs/master/install/manual-deployment/' (Monitor Bootstrapping - Step #12) it uses the hostname instead. 4. Populating the monitor daemons The man page for 'ceph-mon' shows that '-i' is the monitor ID. On page 'http://ceph.com/docs/master/install/manual-deployment/' (Monitor Bootstrapping - Step #13) it uses the hostname instead. Thanks, Dyweni On 2014-12-18 11:55, John Spray wrote: Can you point out the specific page that's out of date so that we can update it? Thanks, John On Thu, Dec 18, 2014 at 5:52 PM, Dyweni - Ceph-Users 6exbab4fy...@dyweni.com wrote: Thanks!! Looks like the manual installation instructions should be updated, to eliminate future confusion. Dyweni On 2014-12-18 07:11, John Spray wrote: No mistake -- the Ceph FS pools are no longer created by default, as not everybody needs them. Ceph FS users now create these pools explicitly: http://ceph.com/docs/master/cephfs/createfs/ John On Thu, Dec 18, 2014 at 12:52 PM, Dyweni - Ceph-Users 6exbab4fy...@dyweni.com wrote: Hi All, Just set up the monitor for a new cluster based on Giant (0.87) and I find that only the 'rbd' pool was created automatically. I don't see the 'data' or 'metadata' pools in 'ceph osd lspools' or the log files. I haven't set up any OSDs or MDSs yet. I'm following the manual deployment guide. Would you mind looking over the setup details/logs below and letting me know my mistake please? Here's my /etc/ceph/ceph.conf file: --- [global] fsid = xx public network = xx.xx.xx.xx/xx cluster network = xx.xx.xx.xx/xx auth cluster required = cephx auth service required = cephx auth client required = cephx osd pool default size = 2 osd pool default min size = 1 osd pool default pg num = 100 osd pool default pgp num = 100 [mon] mon initial members = a [mon.a] host = xx mon addr = xx.xx.xx.xx --- Here are the commands used to set up the monitor: --- ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon.
--cap mon 'allow *' ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow' ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap mkdir /var/lib/ceph/mon/ceph-a ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring /etc/init.d/ceph-mon.a start --- Here's the ceph-mon.a logfile: --- 2014-12-18 12:35:45.768752 7fb00df94780 0 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 3225 2014-12-18 12:35:45.856851 7fb00df94780 0 mon.a does not exist in monmap, will attempt to join an existing cluster 2014-12-18 12:35:45.857069 7fb00df94780 0 using public_addr xx.xx.xx.xx:0/0 - xx.xx.xx.xx:6789/0 2014-12-18 12:35:45.857126 7fb00df94780 0 starting mon.a rank -1 at xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx 2014-12-18 12:35:45.857330 7fb00df94780 1 mon.a@-1(probing) e0 preinit fsid xx 2014-12-18 12:35:45.857402 7fb00df94780 1 mon.a@-1(probing) e0 initial_members a, filtering seed monmap 2014-12-18 12:35:45.858322 7fb00df94780 0 mon.a@-1(probing) e0 my rank is now 0 (was -1) 2014-12-18 12:35:45.858360 7fb00df94780 1 mon.a@0(probing) e0 win_standalone_election 2014-12-18 12:35:45.859803 7fb00df94780 0 log_channel(cluster) log [INF] : mon.a@0 won leader election with quorum 0 2014-12-18 12:35:45.863846 7fb008d4b700 1 mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 - 0 2014-12-18 12:35:45.863867 7fb008d4b700 1 mon.a@0(leader).pg v0 on_upgrade discarding in-core PGMap 2014-12-18 12:35:45.865662 7fb008d4b700 1 mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 - 0 2014-12-18 12:35:45.865719 7fb008d4b700 1 mon.a@0(probing) e1 win_standalone_election 2014-12-18 12:35:45.867394 7fb008d4b700 0 log_channel(cluster) log [INF] : mon.a@0 won leader election with quorum 0
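Following the createfs link John posted, the Giant-era steps to create the CephFS pools explicitly look roughly like this (pool names and PG counts are only examples):

    ceph osd pool create cephfs_data 100
    ceph osd pool create cephfs_metadata 100
    ceph fs new cephfs cephfs_metadata cephfs_data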
[ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Hello Yall! I can't figure out why my gateways are performing so poorly and I am not sure where to start looking. My RBD mounts seem to be performing fine (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s (32 MBps, I believe). If we try a 1G file it's closer to 8 MBps. Testing with nuttcp shows that I can transfer from a client with a 10G interface to any node on the ceph cluster at the full 10G, and ceph can transfer close to 20G between itself. I am not really sure where to start looking, as outside of another issue which I will mention below I am clueless. I have a weird setup [osd nodes] 60 x 4TB 7200 RPM SATA Drives 12 x 400GB s3700 SSD drives 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across the 3 cards) 512 GB of RAM 2 x CPU E5-2670 v2 @ 2.50GHz 2 x 10G interfaces LACP bonded for cluster traffic 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G ports) [monitor nodes and gateway nodes] 4 x 300G 15K RPM SAS drives in raid 10 1 x SAS 2208 64G of RAM 2 x CPU E5-2630 v2 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports) Here is a pastebin dump of my details, I am running ceph giant 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic across the entire cluster. http://pastebin.com/XQ7USGUz -- ceph health detail http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf http://pastebin.com/BC3gzWhT -- ceph osd tree http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me) We ran into a few issues with density (conntrack limits, pid limit, and number of open files), all of which I adjusted by bumping the ulimits in /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any signs of these limits being hit so I have not included my limits or sysctl conf. If you'd like these as well let me know and I can include them. One of the issues I am seeing is that OSDs have started to flop/be marked as slow. The cluster was HEALTH_OK with all of the disks added for over 3 weeks before this behaviour started. RBD transfers seem to be fine for the most part, which makes me think that this has little bearing on the gateway issue, but it may be related. Rebooting the OSD seems to fix this issue. I would like to figure out the root cause of both of these issues and post the results back here if possible (perhaps it can help other people). I am really looking for a place to start, as the gateway just outputs that it is posting data and all of the logs (outside of the monitors reporting down osds) seem to show a fully functioning cluster. Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as well. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
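For comparison, the kind of limit bumps described above look roughly like this (values here are illustrative, not the poster's actual settings):

    # /etc/security/limits.d/ceph.conf
    *  soft  nofile  131072
    *  hard  nofile  131072

    # sysctl tweaks for dense nodes
    kernel.pid_max = 4194303
    net.netfilter.nf_conntrack_max = 1048576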
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
What kind of uploads are you performing? How are you testing? Have you looked at the admin sockets on any daemons yet? Examining the OSDs to see if they're behaving differently on the different requests is one angle of attack. The other is to look into whether the RGW daemons are hitting throttler limits or something that the RBD clients aren't. -Greg On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan seapasu...@uchicago.edu wrote: Hello Yall! I can't figure out why my gateways are performing so poorly and I am not sure where to start looking. My RBD mounts seem to be performing fine (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s (32 MBps, I believe). If we try a 1G file it's closer to 8 MBps. Testing with nuttcp shows that I can transfer from a client with a 10G interface to any node on the ceph cluster at the full 10G, and ceph can transfer close to 20G between itself. I am not really sure where to start looking, as outside of another issue which I will mention below I am clueless. I have a weird setup [osd nodes] 60 x 4TB 7200 RPM SATA Drives 12 x 400GB s3700 SSD drives 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across the 3 cards) 512 GB of RAM 2 x CPU E5-2670 v2 @ 2.50GHz 2 x 10G interfaces LACP bonded for cluster traffic 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G ports) [monitor nodes and gateway nodes] 4 x 300G 15K RPM SAS drives in raid 10 1 x SAS 2208 64G of RAM 2 x CPU E5-2630 v2 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports) Here is a pastebin dump of my details, I am running ceph giant 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic across the entire cluster. http://pastebin.com/XQ7USGUz -- ceph health detail http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf http://pastebin.com/BC3gzWhT -- ceph osd tree http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me) We ran into a few issues with density (conntrack limits, pid limit, and number of open files), all of which I adjusted by bumping the ulimits in /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any signs of these limits being hit so I have not included my limits or sysctl conf. If you'd like these as well let me know and I can include them. One of the issues I am seeing is that OSDs have started to flop/be marked as slow. The cluster was HEALTH_OK with all of the disks added for over 3 weeks before this behaviour started. RBD transfers seem to be fine for the most part, which makes me think that this has little bearing on the gateway issue, but it may be related. Rebooting the OSD seems to fix this issue. I would like to figure out the root cause of both of these issues and post the results back here if possible (perhaps it can help other people). I am really looking for a place to start, as the gateway just outputs that it is posting data and all of the logs (outside of the monitors reporting down osds) seem to show a fully functioning cluster. Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as well. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
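A sketch of the admin socket checks Greg is suggesting (socket paths and IDs are hypothetical; the radosgw socket name depends on the 'admin socket' setting in ceph.conf):

    # on an OSD node: counters and the slowest recent ops
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok perf dump
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops

    # on the gateway host: throttle and request counters for the rgw process
    ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.rgw03.asok perf dump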
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Thanks for the reply Gregory, Sorry if this is in the wrong direction or something. Maybe I do not understand. To test uploads I use bash time with either python-swiftclient or boto key.set_contents_from_filename to the radosgw. I was unaware that radosgw had any type of throttle settings in the configuration (I can't seem to find any either). As for rbd mounts, I test by creating a 1TB mount and writing a file to it through time+cp or dd. Not the most accurate test, but I think it should be good enough as a quick functionality test. So for writes, it's more for functionality than performance. I would think a basic functionality test should yield more than 8 MB/s though. As for checking admin sockets: I have actually, I set the 3rd gateway's debug_civetweb to 10 as well as debug_rgw to 5 but I still do not see anything that stands out. The snippet of the log I pasted has these values set. I did the same for an osd that is marked as slow (1112). All I can see in the log for the osd are ticks and heartbeat responses though, nothing that shows any issues. Finally I did it for the primary monitor node to see if I would see anything there with debug_mon set to 5 (http://pastebin.com/hhnaFac1). I do not really see anything that would stand out as a failure (like a fault or timeout error). What kind of throttler limits do you mean? I didn't/don't see any mention of rgw throttler limits in the ceph.com docs or the admin socket, just osd/filesystem throttles like inode/flusher limits - do you mean these? I have not messed with these limits yet on this cluster, do you think it would help? On 12/18/2014 10:24 PM, Gregory Farnum wrote: What kind of uploads are you performing? How are you testing? Have you looked at the admin sockets on any daemons yet? Examining the OSDs to see if they're behaving differently on the different requests is one angle of attack. The other is to look into whether the RGW daemons are hitting throttler limits or something that the RBD clients aren't. -Greg On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan seapasu...@uchicago.edu wrote: Hello Yall! I can't figure out why my gateways are performing so poorly and I am not sure where to start looking. My RBD mounts seem to be performing fine (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s (32 MBps, I believe). If we try a 1G file it's closer to 8 MBps. Testing with nuttcp shows that I can transfer from a client with a 10G interface to any node on the ceph cluster at the full 10G, and ceph can transfer close to 20G between itself. I am not really sure where to start looking, as outside of another issue which I will mention below I am clueless. I have a weird setup [osd nodes] 60 x 4TB 7200 RPM SATA Drives 12 x 400GB s3700 SSD drives 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across the 3 cards) 512 GB of RAM 2 x CPU E5-2670 v2 @ 2.50GHz 2 x 10G interfaces LACP bonded for cluster traffic 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G ports) [monitor nodes and gateway nodes] 4 x 300G 15K RPM SAS drives in raid 10 1 x SAS 2208 64G of RAM 2 x CPU E5-2630 v2 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports) Here is a pastebin dump of my details, I am running ceph giant 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic across the entire cluster. 
http://pastebin.com/XQ7USGUz -- ceph health detail http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf http://pastebin.com/BC3gzWhT -- ceph osd tree http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me) We ran into a few issues with density (conntrack limits, pid limit, and number of open files), all of which I adjusted by bumping the ulimits in /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any signs of these limits being hit so I have not included my limits or sysctl conf. If you'd like these as well let me know and I can include them. One of the issues I am seeing is that OSDs have started to flop/be marked as slow. The cluster was HEALTH_OK with all of the disks added for over 3 weeks before this behaviour started. RBD transfers seem to be fine for the most part, which makes me think that this has little bearing on the gateway issue, but it may be related. Rebooting the OSD seems to fix this issue. I would like to figure out the root cause of both of these issues and post the results back here if possible (perhaps it can help other people). I am really looking for a place to start, as the gateway just outputs that it is posting data and all of the
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Hello, Nice cluster, I wouldn't mind getting my hands on her ample nacelles, er, wrong movie. ^o^ On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote: Hello Yall! I can't figure out why my gateways are performing so poorly and I am not sure where to start looking. My RBD mounts seem to be performing fine (over 300 MB/s) I wouldn't call 300MB/s writes fine with a cluster of this size. How are you testing this (which tool, settings, from where)? while uploading a 5G file to Swift/S3 takes 2m32s (32 MBps, I believe). If we try a 1G file it's closer to 8 MBps. Testing with nuttcp shows that I can transfer from a client with a 10G interface to any node on the ceph cluster at the full 10G, and ceph can transfer close to 20G between itself. I am not really sure where to start looking, as outside of another issue which I will mention below I am clueless. I know nuttin about radosgw, but I wouldn't be surprised if the difference you see here is based on how that is eventually written to the storage (smaller chunks than what you're using to test RBD performance). I have a weird setup I'm always interested in monster storage nodes, care to share what case this is? [osd nodes] 60 x 4TB 7200 RPM SATA Drives What maker/model? 12 x 400GB s3700 SSD drives Journals, one assumes. 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across the 3 cards) I smell a port-expander or 3 on your backplane. And while making sure that your SSDs get undivided 6Gb/s love would probably help, you still have plenty of bandwidth here (4.5Gb/s per drive), so no real issue. 512 GB of RAM Sufficient. 2 x CPU E5-2670 v2 @ 2.50GHz Vastly, and I mean VASTLY insufficient. It would still be 10GHz short of the (optimistic IMHO) recommendation of 1GHz per OSD w/o SSD journals. With SSD journals my experience shows that with certain write patterns even 3.5GHz per OSD isn't sufficient. (there are several threads about this here) 2 x 10G interfaces LACP bonded for cluster traffic 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G ports) Your journals could handle 5.5GB/s, so you're limiting yourself here a bit, but not too horribly. If I had been given this hardware, I would have RAIDed things (different controller) to keep the number of OSDs per node to something the CPUs (any CPU really!) can handle. Something like 16 x 4HDD RAID10 + SSDs +spares (if possible) for performance and 8 x 8HDD RAID6 + SSDs +spares for capacity. That still gives you 336 or 168 OSDs, allows for a replication size of 2 and as a bonus you'll probably never have to deal with a failed OSD. ^o^ [monitor nodes and gateway nodes] 4 x 300G 15K RPM SAS drives in raid 10 I would have used Intel DC S3700s here as well, mons love their leveldb to be fast but 1 x SAS 2208 combined with this it should be fine. 64G of RAM 2 x CPU E5-2630 v2 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports) Here is a pastebin dump of my details, I am running ceph giant 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic across the entire cluster. http://pastebin.com/XQ7USGUz -- ceph health detail That looks positively scary, blocked requests for hours... http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf http://pastebin.com/BC3gzWhT -- ceph osd tree scroll, scroll, woah! 
^o^ http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me) We ran into a few issues with density (conntrack limits, pid limit, and number of open files), all of which I adjusted by bumping the ulimits in /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any signs of these limits being hit so I have not included my limits or sysctl conf. If you'd like these as well let me know and I can include them. One of the issues I am seeing is that OSDs have started to flop/be marked as slow. The cluster was HEALTH_OK with all of the disks added for over 3 weeks before this behaviour started. Anything changed? In particular I assume this is a new cluster, has much data been added? A ceph -s output would be nice and educational. Can you correlate the time when you start seeing slow, blocked requests with scrubs or deep-scrubs? If so try setting your cluster temporarily to noscrub and nodeep-scrub and see if that helps. In case it does, setting osd_scrub_sleep (start with something high like 1.0 or 0.5 and lower until it hurts again) should help permanently. I have a cluster that could scrub things in minutes until the amount of objects/data and steady load reached a threshold and now it's hours. In this context, check the fragmentation of your OSDs. How busy (ceph.log ops/s) is your cluster at these times? RBD transfers seem to be fine for the most part, which makes me think that this has little bearing on the gateway issue, but it may be related. Rebooting the OSD seems to fix this
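A sketch of the scrub experiment Christian describes (flag names as in Giant; the sleep value is injected at runtime and tuned by trial):

    # rule scrubbing in or out as the cause
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # if the blocked requests stop, re-enable scrubbing but pace it
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'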
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
thanks! It would be really great in the right hands. Through some stroke of luck it's in mine. The flapping OSD is becoming a real issue at this point as it is the only possible lead I have to why the gateways are transferring so slowly. The weird issue is that I can have 8 or 60 transfers going to the radosgw and they are all at roughly 8 MBps. To work around this right now I am starting 60+ clients across 10 boxes to get roughly 1 Gbps per gateway across gw1 and gw2. I have been staring at logs for hours trying to get a handle on what the issue may be, with no luck. The third gateway was made last minute to test and rule out the hardware. On December 18, 2014 10:57:41 PM Christian Balzer ch...@gol.com wrote: Hello, Nice cluster, I wouldn't mind getting my hands on her ample nacelles, er, wrong movie. ^o^ On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote: Hello Yall! I can't figure out why my gateways are performing so poorly and I am not sure where to start looking. My RBD mounts seem to be performing fine (over 300 MB/s) I wouldn't call 300MB/s writes fine with a cluster of this size. How are you testing this (which tool, settings, from where)? while uploading a 5G file to Swift/S3 takes 2m32s (32 MBps, I believe). If we try a 1G file it's closer to 8 MBps. Testing with nuttcp shows that I can transfer from a client with a 10G interface to any node on the ceph cluster at the full 10G, and ceph can transfer close to 20G between itself. I am not really sure where to start looking, as outside of another issue which I will mention below I am clueless. I know nuttin about radosgw, but I wouldn't be surprised if the difference you see here is based on how that is eventually written to the storage (smaller chunks than what you're using to test RBD performance). I have a weird setup I'm always interested in monster storage nodes, care to share what case this is? [osd nodes] 60 x 4TB 7200 RPM SATA Drives What maker/model? 12 x 400GB s3700 SSD drives Journals, one assumes. 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across the 3 cards) I smell a port-expander or 3 on your backplane. And while making sure that your SSDs get undivided 6Gb/s love would probably help, you still have plenty of bandwidth here (4.5Gb/s per drive), so no real issue. 512 GB of RAM Sufficient. 2 x CPU E5-2670 v2 @ 2.50GHz Vastly, and I mean VASTLY insufficient. It would still be 10GHz short of the (optimistic IMHO) recommendation of 1GHz per OSD w/o SSD journals. With SSD journals my experience shows that with certain write patterns even 3.5GHz per OSD isn't sufficient. (there are several threads about this here) 2 x 10G interfaces LACP bonded for cluster traffic 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G ports) Your journals could handle 5.5GB/s, so you're limiting yourself here a bit, but not too horribly. If I had been given this hardware, I would have RAIDed things (different controller) to keep the number of OSDs per node to something the CPUs (any CPU really!) can handle. Something like 16 x 4HDD RAID10 + SSDs +spares (if possible) for performance and 8 x 8HDD RAID6 + SSDs +spares for capacity. That still gives you 336 or 168 OSDs, allows for a replication size of 2 and as a bonus you'll probably never have to deal with a failed OSD. ^o^ [monitor nodes and gateway nodes] 4 x 300G 15K RPM SAS drives in raid 10 I would have used Intel DC S3700s here as well, mons love their leveldb to be fast but 1 x SAS 2208 combined with this it should be fine. 
64G of RAM 2 x CPU E5-2630 v2 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports) Here is a pastebin dump of my details, I am running ceph giant 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic across the entire cluster. http://pastebin.com/XQ7USGUz -- ceph health detail That looks positively scary, blocked requests for hours... http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf http://pastebin.com/BC3gzWhT -- ceph osd tree scroll, scroll, woah! ^o^ http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me) We ran into a few issues with density (conntrack limits, pid limit, and number of open files), all of which I adjusted by bumping the ulimits in /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any signs of these limits being hit so I have not included my limits or sysctl conf. If you'd like these as well let me know and I can include them. One of the issues I am seeing is that OSDs have started to flop/be marked as slow. The cluster was HEALTH_OK with all of the disks added for over 3 weeks before this behaviour started. Anything changed? In particular I assume this is a new cluster, has much data been added? A ceph -s output would be nice and educational. Can you correlate the time when you start seeing slow,
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Wow Christian, Sorry I missed these inline replies. Give me a minute to gather some data. Thanks a million for the in-depth responses! I thought about raiding it but I needed the space unfortunately. I had a 3x60 OSD node test cluster that we tried before this and it didn't have this flopping issue or rgw issue I am seeing. I can quickly answer the case/make questions, the model will need to wait till I get home :) Case is a 72-disk Supermicro chassis, I'll grab the exact model in my next reply. Drives are HGST 4TB drives, I'll grab the model once I get home as well. The 300 was completely incorrect and it can push more, it was just meant for a quick comparison but I agree it should be higher. Thank you so much. Please hold up and I'll grab the extra info ^~^ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.
Hello, On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote: Wow Christian, Sorry I missed these inline replies. Give me a minute to gather some data. Thanks a million for the in-depth responses! No worries. I thought about raiding it but I needed the space unfortunately. I had a 3x60 OSD node test cluster that we tried before this and it didn't have this flopping issue or rgw issue I am seeing. I think I remember that... You do realize that the RAID6 configuration option I mentioned would actually give you MORE space (replication of 2 is sufficient with reliable OSDs) than what you have now? Albeit probably at reduced performance, how much would also depend on the controllers used, but at worst the RAID6 OSD performance would be equivalent to that of a single disk. So, performance-wise, a cluster with 21 nodes and 8 disks each. I can quickly answer the case/make questions, the model will need to wait till I get home :) Case is a 72-disk Supermicro chassis, I'll grab the exact model in my next reply. No need, now that strange monitor configuration makes sense, you (or whoever spec'ed this) went for the Supermicro Ceph solution, right? In my not so humble opinion, this is the worst storage chassis ever designed by a long shot and totally unsuitable for Ceph. I told the Supermicro GM for Japan as much. ^o^ Every time an HDD dies, you will have to go and shut down the other OSD that resides on the same tray (and set the cluster to noout). Even worse of course if an SSD should fail. And if somebody should just go and hotswap things w/o that step first, hello data movement storm (2 or 10 OSDs instead of 1 or 5 respectively). Christian Drives are HGST 4TB drives, I'll grab the model once I get home as well. The 300 was completely incorrect and it can push more, it was just meant for a quick comparison but I agree it should be higher. Thank you so much. Please hold up and I'll grab the extra info ^~^ -- Christian Balzer - Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Recovering from PG in down+incomplete state
Hi all, I had 12 OSDs in my cluster across 2 OSD nodes. One of the OSDs was in the down state, and I removed that OSD from the cluster by removing its CRUSH rule. The cluster, now with 11 OSDs, started rebalancing. After some time, the cluster status was ems@rack6-client-5:~$ sudo ceph -s cluster eb5452f4-5ce9-4b97-9bfd-2a34716855f1 health HEALTH_WARN 1 pgs down; 252 pgs incomplete; 10 pgs peering; 73 pgs stale; 262 pgs stuck inactive; 73 pgs stuck stale; 262 pgs stuck unclean; clock skew detected on mon.rack6-client-5, mon.rack6-client-6 monmap e1: 3 mons at {rack6-client-4= 10.242.43.105:6789/0,rack6-client-5=10.242.43.106:6789/0,rack6-client-6=10.242.43.107:6789/0}, election epoch 12, quorum 0,1,2 rack6-client-4,rack6-client-5,rack6-client-6 osdmap e2648: 11 osds: 11 up, 11 in pgmap v554251: 846 pgs, 3 pools, 4383 GB data, 1095 kobjects 11668 GB used, 26048 GB / 37717 GB avail 63 stale+active+clean 1 down+incomplete 521 active+clean 251 incomplete 10 stale+peering ems@rack6-client-5:~$ To fix this, I can't run 'ceph osd lost <osd.id>' to remove the PG which is in the down state, as the OSD is already removed from the cluster. ems@rack6-client-4:~$ sudo ceph pg dump all | grep down dumped all in format plain 1.38 15480 0 0 0 6492782592 3001 3001 down+incomplete 2014-12-18 15:58:29.681708 1118'508438 2648:1073892 [6,3,1] 6 [6,3,1] 6 76'437184 2014-12-16 12:38:35.322835 76'437184 2014-12-16 12:38:35.322835 ems@rack6-client-4:~$ ems@rack6-client-4:~$ sudo ceph pg 1.38 query . recovery_state: [ { name: Started\/Primary\/Peering, enter_time: 2014-12-18 15:58:29.681666, past_intervals: [ { first: 1109, last: 1118, maybe_went_rw: 1, ... ... down_osds_we_would_probe: [ 7], peering_blocked_by: []}, ... ... ems@rack6-client-4:~$ sudo ceph osd tree # id weight type name up/down reweight -1 36.85 root default -2 20.1 host rack2-storage-1 0 3.35 osd.0 up 1 1 3.35 osd.1 up 1 2 3.35 osd.2 up 1 3 3.35 osd.3 up 1 4 3.35 osd.4 up 1 5 3.35 osd.5 up 1 -3 16.75 host rack2-storage-5 6 3.35 osd.6 up 1 8 3.35 osd.8 up 1 9 3.35 osd.9 up 1 10 3.35 osd.10 up 1 11 3.35 osd.11 up 1 ems@rack6-client-4:~$ sudo ceph osd lost 7 --yes-i-really-mean-it osd.7 is not down or doesn't exist ems@rack6-client-4:~$ Can somebody suggest any other recovery step to come out of this? -Thanks Regards, Mallikarjun Biradar ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
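For completeness, one commonly suggested last resort - only a sketch, and only acceptable if the data in that PG is written off, since the PG comes back empty. The idea is to re-create an OSD entry so the 'lost' declaration has something to act on; that 'ceph osd create' hands back the lowest free id (7 here) is an assumption:

    ceph osd create                          # ideally returns id 7 again
    ceph osd lost 7 --yes-i-really-mean-it
    ceph pg force_create_pg 1.38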
Re: [ceph-users] Have 2 different public networks
On 19/12/2014 02:18, Craig Lewis wrote: The daemons bind to *, Yes, but *only* for the OSD daemons. Am I wrong? Personally I must provide IP addresses for the monitors in /etc/ceph/ceph.conf, like this: [global] mon host = 10.0.1.1, 10.0.1.2, 10.0.1.3 Or like this: [mon.1] mon addr = 10.0.1.1 [mon.2] mon addr = 10.0.1.2 [mon.3] mon addr = 10.0.1.3 And every time, the monitor daemons bind to just one address. And if a ceph client wants to contact the cluster, it must contact the monitors. Here is my problem: the monitors listen on the 10.0.1.0/24 network but not on 10.0.2.0/24. Do you have monitor daemons that bind to *? Personally I don't (they always bind to just one interface). Is it possible to provide 2 IP addresses for monitors in the /etc/ceph/ceph.conf file? so adding the 3rd interface to the machine will allow you to talk to the daemons on that IP. The 3rd interface has existed since the beginning (before the creation of the cluster), but the monitors bind to only one interface. I'm not really sure how you'd set up the management network though. I'd start by setting the ceph.conf public network on the management nodes to 10.0.2.0/24, and creating an /etc/hosts file with the monitors' names on the 10.0.2.0/24 network. Make sure the management nodes can't route to the 10.0.1.0/24 network, and see what happens. For now, I can't have monitors that bind to 10.0.1.[123] *and* 10.0.2.[123]. Do you really plan on having enough traffic creating and deleting RBD images that you need a dedicated network? It seems like setting up link aggregation on 10.0.1.0/24 would be simpler and less error prone. This is not for traffic. I must have a node to manage rbd images and this node is in a different VLAN (this is an OpenStack install... I'm trying... ;). -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com