Re: [ceph-users] Help with SSDs

2014-12-18 Thread Udo Lembke
Hi Mark,

On 18.12.2014 07:15, Mark Kirkwood wrote:

 While you can't do much about the endurance lifetime being a bit low,
 you could possibly improve performance using a journal *file* that is
 located on the 840's (you'll need to symlink it - disclaimer - have
 not tried this myself, but will experiment if you are interested).
 Slightly different open() options are used in this case and these
 cheaper consumer SSDs seem to work better with them.
I had used the symlink-file method before (with different SSDs), but the
performance was much better after changing to partitions.
I first tried several different consumer SSDs with the journal as a file
and ended up with DC S3700s using partitions.
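
For reference, a rough sketch of what the journal-as-file switch could look
like (untested here; OSD id, device, mount point and journal size are just
examples, and the journal must be flushed before it is touched):

  service ceph stop osd.0
  ceph-osd -i 0 --flush-journal
  mount /dev/sdX1 /mnt/ssd840                    # filesystem on the 840
  truncate -s 10G /mnt/ssd840/osd-0-journal      # journal as a plain file
  rm /var/lib/ceph/osd/ceph-0/journal            # drop the old journal link
  ln -s /mnt/ssd840/osd-0-journal /var/lib/ceph/osd/ceph-0/journal
  ceph-osd -i 0 --mkjournal
  service ceph start osd.0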

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Any tuning of LVM-Storage inside an VM related to ceph?

2014-12-18 Thread Udo Lembke
Hi all,
I have a fileserver with insufficient read speed.
Enabling read-ahead inside the VM improves the read speed, but it
looks like this has a drawback during LVM operations like pvmove.

For test purposes, I am moving the LVM storage inside a VM from vdb to vdc1.
It takes days, because it's 3 TB of data.
After enabling read-ahead (echo 4096 >
/sys/block/vdb/queue/read_ahead_kb; echo 4096 >
/sys/block/vdc/queue/read_ahead_kb) the move speed dropped noticeably!

Are there any tunings to improve speed related to LVM on RBD storage?
Perhaps, if using partitions, align the partition on 4 MB?

Any hints?
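
For reference, aligning a partition on 4 MB could look like this (only a
sketch; /dev/vdc is the example device from above, and whether it actually
helps is exactly the open question):

  parted -s /dev/vdc mklabel gpt
  parted -s /dev/vdc mkpart primary 4MiB 100%   # start at 4 MiB so the
                                                # partition lines up with
                                                # the 4 MB RADOS objects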


Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with SSDs

2014-12-18 Thread Mark Kirkwood
The effect of this is *highly* dependent on the SSD make/model. My m550s
work vastly better if the journal is a file on a filesystem as opposed
to a partition.


Obviously the Intel S3700/S3500 are a better choice - but the OP has 
already purchased Sammy 840's, so I'm trying to suggest options to try 
that don't require him to buy new SSDs!


Cheers

Mark

On 18/12/14 21:28, Udo Lembke wrote:

On 18.12.2014 07:15, Mark Kirkwood wrote:


While you can't do much about the endurance lifetime being a bit low,
you could possibly improve performance using a journal *file* that is
located on the 840's (you'll need to symlink it - disclaimer - have
not tried this myself, but will experiment if you are interested).
Slightly different open() options are used in this case and these
cheaper consumer SSDs seem to work better with them.



I had used the symlink-file method before (with different SSDs), but the
performance was much better after changing to partitions.
I first tried several different consumer SSDs with the journal as a file
and ended up with DC S3700s using partitions.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is cache tiering production ready?

2014-12-18 Thread Yujian Peng
Gregory Farnum greg@... writes:

 
 
 Cache tiering is a stable, functioning system. Those particular commands
are for testing and development purposes, not something you should run
(although they ought to be safe).
-Greg
Thanks for your reply!
I'll put cache tiering into my production cluster!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver

2014-12-18 Thread Thomas Lemarchand
I too find Ceph fuse more stable.

However, you really should do your tests with a much more recent
kernel! 3.10 is old.
I think there have been Ceph improvements in every kernel version for a
long time.

-- 
Thomas Lemarchand
Cloud Solutions SAS - Responsable des systèmes d'information



On jeu., 2014-12-18 at 14:52 +1000, Lindsay Mathieson wrote:
 I've been experimenting with CephFS for running KVM images (Proxmox).
 
 cephfs fuse version - 0.87
 
 cephfs kernel module - kernel version 3.10
 
 
 Part of my testing involves running a Windows 7 VM up and running
 CrystalDiskMark to check the I/O in the VM. It's surprisingly good with
 both the fuse and the kernel driver, seq reads & writes are actually
 faster than the underlying disk, so I presume the FS is aggressively
 caching.
 
 With the fuse driver I have no problems.
 
 With the kernel driver, the benchmark runs fine, but when I reboot the
 VM the drive is corrupted and unreadable, every time. Rolling back to
 a snapshot fixes the disk. This does not happen unless I run the
 benchmark, which I presume is writing a lot of data.
 
 No problems with the same test for Ceph rbd, or NFS.
 
 -- 
 Lindsay
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Double-mounting of RBD

2014-12-18 Thread Olivier DELHOMME
Hello,

 I have a somewhat interesting scenario. I have an RBD of 17TB formatted using
 XFS. I would like it accessible from two different hosts, one mapped/mounted
 read-only, and one mapped/mounted as read-write. Both are shared using Samba
 4.x. One Samba server gives read-only access to the world for the data. The
 other gives read-write access to a very limited set of users who
 occasionally need to add data.
 
 
 However, when testing this, when changes are made to the read-write Samba
 server the changes don’t seem to be seen by the read-only Samba server. Is
 there some file system caching going on that will eventually be flushed?

I think this is normal behaviour, as your read-only filesystem is not
aware that writes have occurred. To achieve your goal I think you
should use a clustered filesystem [1] so that the read-only
server knows about writes to the filesystem.

[1] https://en.wikipedia.org/wiki/Clustered_file_system


Regards,

Olivier DELHOMME.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] File System stripping data

2014-12-18 Thread John Spray
Kevin,

Yes, that is just too old for vxattrs (the earliest tag with vxattr
support in fuse is v0.57~84^2~6).

In Ceph FS terms, 0.56 is pretty ancient.  Because the filesystem is
under active development, you should use a much more recent version
for clusters with Ceph FS enabled -- at least firefly, and perhaps
giant if you can tolerate a non-LTS release.

John

On Thu, Dec 18, 2014 at 12:08 AM, Kevin Shiah agan...@gmail.com wrote:
 Hi John,

 I am using 0.56.1. Could it be because data striping is not supported in
 this version?

 Kevin

 On Wed Dec 17 2014 at 4:00:15 AM PST Wido den Hollander w...@42on.com
 wrote:

 On 12/17/2014 12:35 PM, John Spray wrote:
  On Wed, Dec 17, 2014 at 10:25 AM, Wido den Hollander w...@42on.com
  wrote:
  I just tried something similar on Giant (0.87) and I saw this in the
  logs:
 
  parse_layout_vxattr name layout.pool value 'cephfs_svo'
   invalid data pool 3
  reply request -22
 
   It resolves the pool to an ID, but then it's unable to set it?
 
  Was the 'cephfs_svo' pool already added as a data pool with ceph mds
  add_data_pool?
 

 Ah, indeed. Working fine right now. Same goes for any other layout
 settings.
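
For readers following along, the full working sequence is roughly the
following (pool name, PG count and mount point are examples only):

  ceph osd pool create cephfs_svo 128
  ceph mds add_data_pool cephfs_svo
  setfattr -n ceph.dir.layout.pool -v cephfs_svo /mnt/cephfs/somedir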

  There are paths where if a pool was added very recently, MDSs/clients
  might not know about the pool yet and can generate errors like this.
 
  John
 


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with SSDs

2014-12-18 Thread Lindsay Mathieson
On Thu, 18 Dec 2014 10:05:20 PM Mark Kirkwood wrote:
 My m550 
 work vastly better if the journal is a file on a filesystem as opposed 
 to a partition.


Any particular filesystem? ext4? xfs? or doesn't matter?
-- 
Lindsay

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Double-mounting of RBD

2014-12-18 Thread John Spray
On Wed, Dec 17, 2014 at 10:31 PM, McNamara, Bradley
bradley.mcnam...@seattle.gov wrote:
 However, when testing this, when changes are made to the read-write Samba
 server the changes don’t seem to be seen by the read-only Samba server.  Is
 there some file system caching going on that will eventually be flushed?

As others have said, the read-only mount doesn't know how to poll the
block device to see updates from the read-write mount, so you won't
see updates to the data, and in general this is not a safe thing to
do.

One alternative would be taking a clone of a snapshot of the image,
and mounting that read-only -- obviously that data will only be as
up-to-date as whenever you did your last snapshot.  If the read-only
mounts are serving rarely updated files, the administrative overhead
of doing the snapshot/remount on data updates might be acceptable.
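
A rough sketch of one variant of that approach (mapping the snapshot
directly rather than a clone; image and snapshot names are placeholders,
and XFS needs norecovery to mount a crash-consistent snapshot read-only):

  rbd snap create rbd/bigimage@export-20141218
  rbd map rbd/bigimage@export-20141218          # prints e.g. /dev/rbd1
  mount -o ro,norecovery /dev/rbd1 /mnt/ro-export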

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Content-length error uploading big files to radosgw

2014-12-18 Thread Daniele Venzano
Hello,

I have been trying to upload multi-gigabyte files to CEPH via the object
gateway, using both the swift and s3 APIs.

With file up to about 2GB everything works as expected.

With files bigger than that I get back a 400 Bad Request error, both
with S3 (boto) and Swift clients.

Enabling debug I can see this:
2014-12-18 12:38:28.947499 7f5419ffb700 20 CONTENT_LENGTH=307200
...
2014-12-18 12:38:28.947539 7f5419ffb700  1 == starting new request
req=0x7f541000fee0 =
2014-12-18 12:38:28.947556 7f5419ffb700  2 req 2:0.17::PUT
/test/test::initializing
2014-12-18 12:38:28.947581 7f5419ffb700 10 bad content length, aborting
2014-12-18 12:38:28.947641 7f5419ffb700  2 req 2:0.000102::PUT
/test/test::http status=400
2014-12-18 12:38:28.947644 7f5419ffb700  1 == req done
req=0x7f541000fee0 http_status=400 ==


The content length is the right one (I created a test file with dd).
With a file 207200 bytes long, I get no error.

The gateway is running on debian, with the packages available on the
ceph repo, version 0.87-1~bpo70+1. I am using standard apache (no 100
continue).

Is there a limit on the object size? Or is there an error in my
configuration somewhere?

Thank you,
Daniele

-- 
Daniele Venzano
http://www.brownhat.org

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New Cluster (0.87), Missing Default Pools?

2014-12-18 Thread Dyweni - Ceph-Users

Hi All,


Just set up the monitor for a new cluster based on Giant (0.87) and I
find that only the 'rbd' pool was created automatically.  I don't see
the 'data' or 'metadata' pools in 'ceph osd lspools' or the log files.
I haven't set up any OSDs or MDSs yet.  I'm following the manual
deployment guide.


Would you mind looking over the setup details/logs below and letting me 
know my mistake please?




Here's my /etc/ceph/ceph.conf file:
---
[global]
fsid = xx

public network = xx.xx.xx.xx/xx
cluster network = xx.xx.xx.xx/xx

auth cluster required = cephx
auth service required = cephx
auth client required = cephx

osd pool default size = 2
osd pool default min size = 1

osd pool default pg num = 100
osd pool default pgp num = 100

[mon]
mon initial members = a

[mon.a]
host = xx
mon addr = xx.xx.xx.xx
---


Here's the commands used to setup the monitor:
---
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. 
--cap mon 'allow *'
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring 
--gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 
'allow *' --cap mds 'allow'
ceph-authtool /tmp/ceph.mon.keyring --import-keyring 
/etc/ceph/ceph.client.admin.keyring

monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap
mkdir /var/lib/ceph/mon/ceph-a
ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring 
/tmp/ceph.mon.keyring

/etc/init.d/ceph-mon.a start
---


Here's the ceph-mon.a logfile:
---
2014-12-18 12:35:45.768752 7fb00df94780  0 ceph version 0.87 
(c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 3225
2014-12-18 12:35:45.856851 7fb00df94780  0 mon.a does not exist in 
monmap, will attempt to join an existing cluster
2014-12-18 12:35:45.857069 7fb00df94780  0 using public_addr 
xx.xx.xx.xx:0/0 - xx.xx.xx.xx:6789/0
2014-12-18 12:35:45.857126 7fb00df94780  0 starting mon.a rank -1 at 
xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx
2014-12-18 12:35:45.857330 7fb00df94780  1 mon.a@-1(probing) e0 preinit 
fsid xx
2014-12-18 12:35:45.857402 7fb00df94780  1 mon.a@-1(probing) e0  
initial_members a, filtering seed monmap
2014-12-18 12:35:45.858322 7fb00df94780  0 mon.a@-1(probing) e0  my rank 
is now 0 (was -1)
2014-12-18 12:35:45.858360 7fb00df94780  1 mon.a@0(probing) e0 
win_standalone_election
2014-12-18 12:35:45.859803 7fb00df94780  0 log_channel(cluster) log 
[INF] : mon.a@0 won leader election with quorum 0
2014-12-18 12:35:45.863846 7fb008d4b700  1 
mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 - 0
2014-12-18 12:35:45.863867 7fb008d4b700  1 mon.a@0(leader).pg v0 
on_upgrade discarding in-core PGMap
2014-12-18 12:35:45.865662 7fb008d4b700  1 
mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 - 0
2014-12-18 12:35:45.865719 7fb008d4b700  1 mon.a@0(probing) e1 
win_standalone_election
2014-12-18 12:35:45.867394 7fb008d4b700  0 log_channel(cluster) log 
[INF] : mon.a@0 won leader election with quorum 0
2014-12-18 12:35:46.003223 7fb008d4b700  0 log_channel(cluster) log 
[INF] : monmap e1: 1 mons at {a=xx.xx.xx.xx:6789/0}
2014-12-18 12:35:46.040555 7fb008d4b700  1 
mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 - 0
2014-12-18 12:35:46.087081 7fb008d4b700  0 log_channel(cluster) log 
[INF] : pgmap v1: 0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2014-12-18 12:35:46.141415 7fb008d4b700  0 mon.a@0(leader).mds e1 
print_map

epoch   1
flags   0
created 0.00
modified2014-12-18 12:35:46.038418
tableserver 0
root0
session_timeout 0
session_autoclose   0
max_file_size   0
last_failure0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={}
max_mds 0
in
up  {}
failed
stopped
data_pools
metadata_pool   0
inline_data disabled

2014-12-18 12:35:46.151117 7fb008d4b700  0 log_channel(cluster) log 
[INF] : mdsmap e1: 0/0/0 up
2014-12-18 12:35:46.152873 7fb008d4b700  1 mon.a@0(leader).osd e1 e1: 0 
osds: 0 up, 0 in
2014-12-18 12:35:46.154551 7fb008d4b700  0 mon.a@0(leader).osd e1 crush 
map has features 1107558400, adjusting msgr requires
2014-12-18 12:35:46.154580 7fb008d4b700  0 mon.a@0(leader).osd e1 crush 
map has features 1107558400, adjusting msgr requires
2014-12-18 12:35:46.154588 7fb008d4b700  0 mon.a@0(leader).osd e1 crush 
map has features 1107558400, adjusting msgr requires
2014-12-18 12:35:46.154592 7fb008d4b700  0 mon.a@0(leader).osd e1 crush 
map has features 1107558400, adjusting msgr requires
2014-12-18 12:35:46.157078 7fb008d4b700  0 log_channel(cluster) log 
[INF] : osdmap e1: 0 osds: 0 up, 0 in
2014-12-18 12:35:46.220701 7fb008d4b700  1 
mon.a@0(leader).paxosservice(auth 1..1) refresh upgraded, format 0 - 1
2014-12-18 12:35:46.334457 7fb008d4b700  0 log_channel(cluster) log 
[INF] : pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 
kB avail

[ceph-users] Need help from Ceph experts

2014-12-18 Thread Debashish Das
Hi Guys,

I am very new to Ceph & have a couple of questions:

1. Can we install Ceph on a single node (both Monitor & OSD)?
2. What should be the minimum hardware requirements for the server (CPU,
memory, NIC, etc.)?
3. Is there a webpage where I can find an installation guide to install Ceph
on one node?

I will be eagerly waiting for your response.

Please note that performance & redundancy are not an issue for me, but I want
to integrate it with OpenStack in the end.

Kind Regards
Debashish Das
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New Cluster (0.87), Missing Default Pools?

2014-12-18 Thread John Spray
No mistake -- the Ceph FS pools are no longer created by default, as
not everybody needs them.  Ceph FS users now create these pools
explicitly:
http://ceph.com/docs/master/cephfs/createfs/
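
The short version of what that page describes (pool names and PG counts are
examples):

  ceph osd pool create cephfs_data 128
  ceph osd pool create cephfs_metadata 128
  ceph fs new cephfs cephfs_metadata cephfs_data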

John

On Thu, Dec 18, 2014 at 12:52 PM, Dyweni - Ceph-Users
6exbab4fy...@dyweni.com wrote:
 Hi All,


 Just setup the monitor for a new cluster based on Giant (0.87) and I find
 that only the 'rbd' pool was created automatically.  I don't see the 'data'
 or 'metadata' pools in 'ceph osd lspools' or the log files.  I haven't setup
 any OSDs or MDSs yet.  I'm following the manual deployment guide.

 Would you mind looking over the setup details/logs below and letting me know
 my mistake please?



 Here's my /etc/ceph/ceph.conf file:
 ---
 [global]
 fsid = xx

 public network = xx.xx.xx.xx/xx
 cluster network = xx.xx.xx.xx/xx

 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx

 osd pool default size = 2
 osd pool default min size = 1

 osd pool default pg num = 100
 osd pool default pgp num = 100

 [mon]
 mon initial members = a

 [mon.a]
 host = xx
 mon addr = xx.xx.xx.xx
 ---


 Here's the commands used to setup the monitor:
 ---
 ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap
 mon 'allow *'
 ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key
 -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap
 mds 'allow'
 ceph-authtool /tmp/ceph.mon.keyring --import-keyring
 /etc/ceph/ceph.client.admin.keyring
 monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap
 mkdir /var/lib/ceph/mon/ceph-a
 ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
 /etc/init.d/ceph-mon.a start
 ---


 Here's the ceph-mon.a logfile:
 ---
 2014-12-18 12:35:45.768752 7fb00df94780  0 ceph version 0.87
 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 3225
 2014-12-18 12:35:45.856851 7fb00df94780  0 mon.a does not exist in monmap,
 will attempt to join an existing cluster
 2014-12-18 12:35:45.857069 7fb00df94780  0 using public_addr xx.xx.xx.xx:0/0
 - xx.xx.xx.xx:6789/0
 2014-12-18 12:35:45.857126 7fb00df94780  0 starting mon.a rank -1 at
 xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx
 2014-12-18 12:35:45.857330 7fb00df94780  1 mon.a@-1(probing) e0 preinit fsid
 xx
 2014-12-18 12:35:45.857402 7fb00df94780  1 mon.a@-1(probing) e0
 initial_members a, filtering seed monmap
 2014-12-18 12:35:45.858322 7fb00df94780  0 mon.a@-1(probing) e0  my rank is
 now 0 (was -1)
 2014-12-18 12:35:45.858360 7fb00df94780  1 mon.a@0(probing) e0
 win_standalone_election
 2014-12-18 12:35:45.859803 7fb00df94780  0 log_channel(cluster) log [INF] :
 mon.a@0 won leader election with quorum 0
 2014-12-18 12:35:45.863846 7fb008d4b700  1
 mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:45.863867 7fb008d4b700  1 mon.a@0(leader).pg v0 on_upgrade
 discarding in-core PGMap
 2014-12-18 12:35:45.865662 7fb008d4b700  1 mon.a@0(leader).paxosservice(auth
 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:45.865719 7fb008d4b700  1 mon.a@0(probing) e1
 win_standalone_election
 2014-12-18 12:35:45.867394 7fb008d4b700  0 log_channel(cluster) log [INF] :
 mon.a@0 won leader election with quorum 0
 2014-12-18 12:35:46.003223 7fb008d4b700  0 log_channel(cluster) log [INF] :
 monmap e1: 1 mons at {a=xx.xx.xx.xx:6789/0}
 2014-12-18 12:35:46.040555 7fb008d4b700  1 mon.a@0(leader).paxosservice(auth
 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:46.087081 7fb008d4b700  0 log_channel(cluster) log [INF] :
 pgmap v1: 0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
 2014-12-18 12:35:46.141415 7fb008d4b700  0 mon.a@0(leader).mds e1 print_map
 epoch   1
 flags   0
 created 0.00
 modified2014-12-18 12:35:46.038418
 tableserver 0
 root0
 session_timeout 0
 session_autoclose   0
 max_file_size   0
 last_failure0
 last_failure_osd_epoch  0
 compat  compat={},rocompat={},incompat={}
 max_mds 0
 in
 up  {}
 failed
 stopped
 data_pools
 metadata_pool   0
 inline_data disabled

 2014-12-18 12:35:46.151117 7fb008d4b700  0 log_channel(cluster) log [INF] :
 mdsmap e1: 0/0/0 up
 2014-12-18 12:35:46.152873 7fb008d4b700  1 mon.a@0(leader).osd e1 e1: 0
 osds: 0 up, 0 in
 2014-12-18 12:35:46.154551 7fb008d4b700  0 mon.a@0(leader).osd e1 crush map
 has features 1107558400, adjusting msgr requires
 2014-12-18 12:35:46.154580 7fb008d4b700  0 mon.a@0(leader).osd e1 crush map
 has features 1107558400, adjusting msgr requires
 2014-12-18 12:35:46.154588 7fb008d4b700  0 mon.a@0(leader).osd e1 crush map
 has features 1107558400, adjusting msgr requires
 2014-12-18 12:35:46.154592 7fb008d4b700  0 mon.a@0(leader).osd e1 crush map
 has features 1107558400, adjusting msgr requires
 2014-12-18 

Re: [ceph-users] New Cluster (0.87), Missing Default Pools?

2014-12-18 Thread Thomas Lemarchand
I remember reading somewhere (maybe in changelogs) that default pools
were not created automatically anymore.

You can create pools you need yourself.

-- 
Thomas Lemarchand
Cloud Solutions SAS - Responsable des systèmes d'information



On jeu., 2014-12-18 at 06:52 -0600, Dyweni - Ceph-Users wrote:
 Hi All,
 
 
 Just setup the monitor for a new cluster based on Giant (0.87) and I 
 find that only the 'rbd' pool was created automatically.  I don't see 
 the 'data' or 'metadata' pools in 'ceph osd lspools' or the log files.  
 I haven't setup any OSDs or MDSs yet.  I'm following the manual 
 deployment guide.
 
 Would you mind looking over the setup details/logs below and letting me 
 know my mistake please?
 
 
 
 Here's my /etc/ceph/ceph.conf file:
 ---
 [global]
  fsid = xx
 
  public network = xx.xx.xx.xx/xx
  cluster network = xx.xx.xx.xx/xx
 
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
 
  osd pool default size = 2
  osd pool default min size = 1
 
  osd pool default pg num = 100
  osd pool default pgp num = 100
 
 [mon]
  mon initial members = a
 
 [mon.a]
  host = xx
  mon addr = xx.xx.xx.xx
 ---
 
 
 Here's the commands used to setup the monitor:
 ---
 ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. 
 --cap mon 'allow *'
 ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring 
 --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 
 'allow *' --cap mds 'allow'
 ceph-authtool /tmp/ceph.mon.keyring --import-keyring 
 /etc/ceph/ceph.client.admin.keyring
 monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap
 mkdir /var/lib/ceph/mon/ceph-a
 ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring 
 /tmp/ceph.mon.keyring
 /etc/init.d/ceph-mon.a start
 ---
 
 
 Here's the ceph-mon.a logfile:
 ---
 2014-12-18 12:35:45.768752 7fb00df94780  0 ceph version 0.87 
 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 3225
 2014-12-18 12:35:45.856851 7fb00df94780  0 mon.a does not exist in 
 monmap, will attempt to join an existing cluster
 2014-12-18 12:35:45.857069 7fb00df94780  0 using public_addr 
 xx.xx.xx.xx:0/0 - xx.xx.xx.xx:6789/0
 2014-12-18 12:35:45.857126 7fb00df94780  0 starting mon.a rank -1 at 
 xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx
 2014-12-18 12:35:45.857330 7fb00df94780  1 mon.a@-1(probing) e0 preinit 
 fsid xx
 2014-12-18 12:35:45.857402 7fb00df94780  1 mon.a@-1(probing) e0  
 initial_members a, filtering seed monmap
 2014-12-18 12:35:45.858322 7fb00df94780  0 mon.a@-1(probing) e0  my rank 
 is now 0 (was -1)
 2014-12-18 12:35:45.858360 7fb00df94780  1 mon.a@0(probing) e0 
 win_standalone_election
 2014-12-18 12:35:45.859803 7fb00df94780  0 log_channel(cluster) log 
 [INF] : mon.a@0 won leader election with quorum 0
 2014-12-18 12:35:45.863846 7fb008d4b700  1 
 mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:45.863867 7fb008d4b700  1 mon.a@0(leader).pg v0 
 on_upgrade discarding in-core PGMap
 2014-12-18 12:35:45.865662 7fb008d4b700  1 
 mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:45.865719 7fb008d4b700  1 mon.a@0(probing) e1 
 win_standalone_election
 2014-12-18 12:35:45.867394 7fb008d4b700  0 log_channel(cluster) log 
 [INF] : mon.a@0 won leader election with quorum 0
 2014-12-18 12:35:46.003223 7fb008d4b700  0 log_channel(cluster) log 
 [INF] : monmap e1: 1 mons at {a=xx.xx.xx.xx:6789/0}
 2014-12-18 12:35:46.040555 7fb008d4b700  1 
 mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:46.087081 7fb008d4b700  0 log_channel(cluster) log 
 [INF] : pgmap v1: 0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
 2014-12-18 12:35:46.141415 7fb008d4b700  0 mon.a@0(leader).mds e1 
 print_map
 epoch   1
 flags   0
 created 0.00
 modified2014-12-18 12:35:46.038418
 tableserver 0
 root0
 session_timeout 0
 session_autoclose   0
 max_file_size   0
 last_failure0
 last_failure_osd_epoch  0
 compat  compat={},rocompat={},incompat={}
 max_mds 0
 in
 up  {}
 failed
 stopped
 data_pools
 metadata_pool   0
 inline_data disabled
 
 2014-12-18 12:35:46.151117 7fb008d4b700  0 log_channel(cluster) log 
 [INF] : mdsmap e1: 0/0/0 up
 2014-12-18 12:35:46.152873 7fb008d4b700  1 mon.a@0(leader).osd e1 e1: 0 
 osds: 0 up, 0 in
 2014-12-18 12:35:46.154551 7fb008d4b700  0 mon.a@0(leader).osd e1 crush 
 map has features 1107558400, adjusting msgr requires
 2014-12-18 12:35:46.154580 7fb008d4b700  0 mon.a@0(leader).osd e1 crush 
 map has features 1107558400, adjusting msgr requires
 2014-12-18 12:35:46.154588 7fb008d4b700  0 mon.a@0(leader).osd e1 crush 
 map has features 1107558400, adjusting msgr requires
 2014-12-18 12:35:46.154592 7fb008d4b700  0 mon.a@0(leader).osd 

Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Patrick McGarry
Hey Debashish,


On Thu, Dec 18, 2014 at 6:21 AM, Debashish Das deba@gmail.com wrote:
 Hi Guys,

 I am very new to Ceph & have a couple of questions:

 1. Can we install Ceph on a single node (both Monitor & OSD)?

You can, but I would only recommend it for testing/experimentation. No
production (or even pre-production) cluster with any meaningful amount
of use should be a single node.
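
One detail worth keeping in mind for such a single-node test setup (an
aside; the values below are only an example): the default CRUSH rule places
replicas on different hosts, so with a single host you would typically add
something like

  osd pool default size = 2
  osd crush chooseleaf type = 0   # spread replicas across OSDs, not hosts

to ceph.conf before creating the cluster, otherwise the PGs never reach a
clean state.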


 2. What should be the minimum hardware requirement of the server (CPU,
 Memory, NIC etc)

There is no real minimum to run Ceph, it's all about what your
workload will look like and what kind of performance you need. We have
seen Ceph run on Raspberry Pis. I would suggest taking a look at some
of the hardware guides and reference architectures available though. A
few examples would be:


http://ceph.com/docs/master/start/hardware-recommendations/

https://engage.redhat.com/inktank-hardware-selection-guide-s-201409080912

https://engage.redhat.com/inktank-ceph-storage-for-dell-s-201409081132

http://karan-mj.blogspot.com/2014/01/zero-to-hero-guide-for-ceph-cluster.html

http://www.supermicro.com/solutions/datasheet_Ceph.pdf



 3. Any webpage where I can find the installation guide to install Ceph in
 one node.

Since a single node isn't really demonstrating a realistic Ceph
install we decided that a multi-node install was more effective, which
is what you'll see in our docs. However, if you'd like a contained
Ceph install to experiment with you can try the latest qemu advent
calendar image ( http://www.qemu-advent-calendar.org/#day-18 ) or try
the installation instructions from an older version of the doc:

http://ceph.com/docs/v0.67.9/start/quick-start/

While that doc is quite outdated I'm sure you can see how to adapt the
more recent install guide to that procedure if you're really set on
doing a single node install. I would probably just use the qemu image
for experimentation and then move to the multi node install.

Hope that helps!


Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Happy Holidays with Ceph QEMU Advent

2014-12-18 Thread Patrick McGarry
Howdy Ceph rangers,

Just wanted to kick off a bit of holiday cheer from our Ceph family to
yours. As a part of the QEMU advent calendar [0], we have finally
built a quick-and-dirty Ceph image for the purposes of trying and
experimenting with Ceph.

Feel free to download it [1] and try it out, or send it on to your
friends who have yet to experience the fun of Ceph! Hope this works
for those who had requested a simple Ceph image. Happy holidays and
happy tinkering!


[0] http://www.qemu-advent-calendar.org/#day-18

[1] http://www.qemu-advent-calendar.org/download/ceph.tar.xz



Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] When is the rctime updated in CephFS?

2014-12-18 Thread Wido den Hollander
Hi,

I've been playing around a bit with the recursive statistics for CephFS
today and I'm seeing some behavior with the rstats that I don't understand.

I have /A/B/C in my CephFS.

I changed a file in 'C' and the ceph.dir.rctime xattr changed
immediately. I've been waiting for 60 minutes now, but /A and /A/B still
have their old rctime.

A: 1418905422 (18-12-2014 13:23:42)
B: 1418905422 (18-12-2014 13:23:42)
C: 1418909134 (18-12-2014 14:25:34)

It's 15:21:34 right now, so after 1 hour the rctime of A and B still
hasn't updated.

How long does this take? I know the MDS is lazy in updating the rstats,
but one hour is quite long, isn't it?

Ceph version 0.89
Linux 3.18 kernel client
Ceph fuse client 0.89

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When is the rctime updated in CephFS?

2014-12-18 Thread Sage Weil
On Thu, 18 Dec 2014, Wido den Hollander wrote:
 Hi,
 
 I've been playing around a bit with the recursive statistics for CephFS
 today and I'm seeing some behavior with the rstats that I don't understand.

 I have /A/B/C in my CephFS.
 
 I changed a file in 'C' and the ceph.dir.rctime xattr changed
 immediately. I've been waiting for 60 minutes now, but /A and /A/B still
 have their old rctime.
 
 A: 1418905422 (18-12-2014 13:23:42)
 B: 1418905422 (18-12-2014 13:23:42)
 C: 1418909134 (18-12-2014 14:25:34)
 
 It's 15:21:34 right now, so after 1 hour the rctime of A and B still
 hasn't updated.
 
 How long does this take? I know the MDS is lazy in updating the rstats,
 but one hour is quite long, isn't it?

This is a bit of a loose end at the moment.  The client doesn't have any 
refresh value for these stats.  Right now an 'ls' in the parent dir will 
get you a fresh value, but repeatedly calling 'stat' will keep giving you 
the cached value.

I'm not sure what the right fix is.  The normal inode fields are all 
perfectly accurate, and the protocol is built around making sure that's 
the case.. not giving reasonably timely values to the new stuff. :/

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When is the rctime updated in CephFS?

2014-12-18 Thread Wido den Hollander
On 12/18/2014 03:37 PM, Sage Weil wrote:
 On Thu, 18 Dec 2014, Wido den Hollander wrote:
 Hi,

 I've been playing around a bit with the recursive statistics for CephFS
  today and I'm seeing some behavior with the rstats that I don't understand.

  I have /A/B/C in my CephFS.

 I changed a file in 'C' and the ceph.dir.rctime xattr changed
 immediately. I've been waiting for 60 minutes now, but /A and /A/B still
 have their old rctime.

 A: 1418905422 (18-12-2014 13:23:42)
 B: 1418905422 (18-12-2014 13:23:42)
 C: 1418909134 (18-12-2014 14:25:34)

 It's 15:21:34 right now, so after 1 hour the rctime of A and B still
 hasn't updated.

 How long does this take? I know the MDS is lazy in updating the rstats,
 but one hour is quite long, isn't it?
 
 This is a bit of a loose end at the moment.  The client doesn't have any 
 refresh value for these stats.  Right now an 'ls' in the parent dir will 
 get you a fresh value, but repeatedly calling 'stat' will keep giving you 
 the cached value.
 

The ls didn't really trigger it for me. I'm using getfattr btw:

$ getfattr -n ceph.dir.rctime /mnt/cephfs/A

I unmounted and mounted and it worked right away.

So this is probably not a real issue on an active filesystem where lots
of I/O on that client is happening, right?

I'm building a PoC backup script which uses the rctimes to back up CephFS
in a reasonable way, not having rsync scan the whole tree.

 I'm not sure what the right fix is.  The normal inode fields are all 
 perfectly accurate, and the protocol is built around making sure that's 
 the case.. not giving reasonably timely values to the new stuff. :/
 
 sage
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When is the rctime updated in CephFS?

2014-12-18 Thread Thomas Lemarchand
Hi Wido,

I'm really interested in your script.
Will you release it ? I'm sure I'm not the only one interested ;)

If you need some help (testing or something else), don't hesitate to ask
me.

-- 
Thomas Lemarchand
Cloud Solutions SAS - Responsable des systèmes d'information



On jeu., 2014-12-18 at 15:47 +0100, Wido den Hollander wrote:
 On 12/18/2014 03:37 PM, Sage Weil wrote:
  On Thu, 18 Dec 2014, Wido den Hollander wrote:
  Hi,
 
  I've been playing around a bit with the recursive statistics for CephFS
   today and I'm seeing some behavior with the rstats that I don't understand.

   I have /A/B/C in my CephFS.
 
  I changed a file in 'C' and the ceph.dir.rctime xattr changed
  immediately. I've been waiting for 60 minutes now, but /A and /A/B still
  have their old rctime.
 
  A: 1418905422 (18-12-2014 13:23:42)
  B: 1418905422 (18-12-2014 13:23:42)
  C: 1418909134 (18-12-2014 14:25:34)
 
  It's 15:21:34 right now, so after 1 hour the rctime of A and B still
  hasn't updated.
 
  How long does this take? I know the MDS is lazy in updating the rstats,
  but one hour is quite long, isn't it?
  
  This is a bit of a loose end at the moment.  The client doesn't have any 
  refresh value for these stats.  Right now an 'ls' in the parent dir will 
  get you a fresh value, but repeatedly calling 'stat' will keep giving you 
  the cached value.
  
 
 The ls didn't really trigger it for me. I'm using getfattr btw:
 
 $ getfattr -n ceph.dir.rctime /mnt/cephfs/A
 
 I unmounted and mounted and it worked right away.
 
 So this is probably not a real issue on a active filesystem where lots
 of I/O on that client is happening, right?
 
 I'm building a PoC backup script which uses the rctimes to backup CephFS
 in a reasonable way, not having rsync scan the whole tree.
 
  I'm not sure what the right fix is.  The normal inode fields are all 
  perfectly accurate, and the protocol is built around making sure that's 
  the case.. not giving reasonably timely values to the new stuff. :/
  
  sage
  
 
 
 -- 
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with SSDs

2014-12-18 Thread Lindsay Mathieson
On Thu, 18 Dec 2014 10:05:20 PM Mark Kirkwood wrote:
 The effect of this is *highly* dependent to the SSD make/model. My m550 
 work vastly better if the journal is a file on a filesystem as opposed 
 to a partition.
 
 Obviously the Intel S3700/S3500 are a better choice - but the OP has 
 already purchased Sammy 840's, so I'm trying to suggest options to try 
 that don't require him to buy new SSDs!


I have 120GB Samsung 840 EVO's with 10GB journal partitions and just gave this 
a go.

No real change unfortunately :( using rados bench.

However it does make experimenting with different journal sizes easier.
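
For reference, the kind of commands typically used for such comparisons (a
sketch only; pool name, runtime and thread count are arbitrary):

  rados -p rbd bench 60 write -t 16 --no-cleanup
  rados -p rbd bench 60 seq -t 16
  rados -p rbd cleanup        # remove the benchmark objects afterwards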
-- 
Lindsay

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When is the rctime updated in CephFS?

2014-12-18 Thread Wido den Hollander
On 12/18/2014 03:52 PM, Thomas Lemarchand wrote:
 Hi Wido,
 
 I'm really interested in your script.
 Will you release it ? I'm sure I'm not the only one interested ;)
 

Well, it's not a general script to back up CephFS with. It's a fairly
simple Bash script I'm writing for a specific situation where the
directory layout is known:

/year/month/project

The script checks which year, month or project has changed since it
last ran. If a project changed, it fires off rsync to back up that
project to an NFS mount. This saves us scanning 2,000 projects with rsync
when we know that only about 25 change every day.
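
A minimal sketch of that kind of selection logic (illustrative only, with
assumed paths; this is not the actual script):

  #!/bin/bash
  # rsync only the projects whose recursive ctime is newer than the last run.
  SRC=/mnt/cephfs
  DST=/mnt/nfs-backup
  STAMP=/var/lib/backup/last_run     # assumed to exist after the first run
  last=$(cat "$STAMP" 2>/dev/null || echo 0)

  for project in "$SRC"/*/*/*; do
      # ceph.dir.rctime looks like "1418909134.09...", keep the seconds part
      rctime=$(getfattr --only-values -n ceph.dir.rctime "$project" 2>/dev/null | cut -d. -f1)
      if [ "${rctime:-0}" -gt "$last" ]; then
          mkdir -p "$DST/${project#$SRC/}"
          rsync -a --delete "$project"/ "$DST/${project#$SRC/}"/
      fi
  done

  date +%s > "$STAMP"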

The coolest thing would be if rsync could use these xattrs and become
more clever, but some code which uses libcephfs would also be nice.

So sorry, it's not something you can use on any CephFS deployment.

 If you need some help (testing or something else), don't hesitate to ask
 me.
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New Cluster (0.87), Missing Default Pools?

2014-12-18 Thread Dyweni - Ceph-Users

Thanks!!

Looks like the manual installation instructions should be updated
to eliminate future confusion.


Dyweni



On 2014-12-18 07:11, John Spray wrote:

No mistake -- the Ceph FS pools are no longer created by default, as
not everybody needs them.  Ceph FS users now create these pools
explicitly:
http://ceph.com/docs/master/cephfs/createfs/

John

On Thu, Dec 18, 2014 at 12:52 PM, Dyweni - Ceph-Users
6exbab4fy...@dyweni.com wrote:

Hi All,


Just setup the monitor for a new cluster based on Giant (0.87) and I 
find
that only the 'rbd' pool was created automatically.  I don't see the 
'data'
or 'metadata' pools in 'ceph osd lspools' or the log files.  I haven't 
setup

any OSDs or MDSs yet.  I'm following the manual deployment guide.

Would you mind looking over the setup details/logs below and letting 
me know

my mistake please?



Here's my /etc/ceph/ceph.conf file:
---
[global]
fsid = xx

public network = xx.xx.xx.xx/xx
cluster network = xx.xx.xx.xx/xx

auth cluster required = cephx
auth service required = cephx
auth client required = cephx

osd pool default size = 2
osd pool default min size = 1

osd pool default pg num = 100
osd pool default pgp num = 100

[mon]
mon initial members = a

[mon.a]
host = xx
mon addr = xx.xx.xx.xx
---


Here's the commands used to setup the monitor:
---
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. 
--cap

mon 'allow *'
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring 
--gen-key
-n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' 
--cap

mds 'allow'
ceph-authtool /tmp/ceph.mon.keyring --import-keyring
/etc/ceph/ceph.client.admin.keyring
monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap
mkdir /var/lib/ceph/mon/ceph-a
ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring 
/tmp/ceph.mon.keyring

/etc/init.d/ceph-mon.a start
---


Here's the ceph-mon.a logfile:
---
2014-12-18 12:35:45.768752 7fb00df94780  0 ceph version 0.87
(c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 3225
2014-12-18 12:35:45.856851 7fb00df94780  0 mon.a does not exist in 
monmap,

will attempt to join an existing cluster
2014-12-18 12:35:45.857069 7fb00df94780  0 using public_addr 
xx.xx.xx.xx:0/0

- xx.xx.xx.xx:6789/0
2014-12-18 12:35:45.857126 7fb00df94780  0 starting mon.a rank -1 at
xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx
2014-12-18 12:35:45.857330 7fb00df94780  1 mon.a@-1(probing) e0 
preinit fsid

xx
2014-12-18 12:35:45.857402 7fb00df94780  1 mon.a@-1(probing) e0
initial_members a, filtering seed monmap
2014-12-18 12:35:45.858322 7fb00df94780  0 mon.a@-1(probing) e0  my 
rank is

now 0 (was -1)
2014-12-18 12:35:45.858360 7fb00df94780  1 mon.a@0(probing) e0
win_standalone_election
2014-12-18 12:35:45.859803 7fb00df94780  0 log_channel(cluster) log 
[INF] :

mon.a@0 won leader election with quorum 0
2014-12-18 12:35:45.863846 7fb008d4b700  1
mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 - 
0
2014-12-18 12:35:45.863867 7fb008d4b700  1 mon.a@0(leader).pg v0 
on_upgrade

discarding in-core PGMap
2014-12-18 12:35:45.865662 7fb008d4b700  1 
mon.a@0(leader).paxosservice(auth

0..0) refresh upgraded, format 1 - 0
2014-12-18 12:35:45.865719 7fb008d4b700  1 mon.a@0(probing) e1
win_standalone_election
2014-12-18 12:35:45.867394 7fb008d4b700  0 log_channel(cluster) log 
[INF] :

mon.a@0 won leader election with quorum 0
2014-12-18 12:35:46.003223 7fb008d4b700  0 log_channel(cluster) log 
[INF] :

monmap e1: 1 mons at {a=xx.xx.xx.xx:6789/0}
2014-12-18 12:35:46.040555 7fb008d4b700  1 
mon.a@0(leader).paxosservice(auth

0..0) refresh upgraded, format 1 - 0
2014-12-18 12:35:46.087081 7fb008d4b700  0 log_channel(cluster) log 
[INF] :

pgmap v1: 0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2014-12-18 12:35:46.141415 7fb008d4b700  0 mon.a@0(leader).mds e1 
print_map

epoch   1
flags   0
created 0.00
modified2014-12-18 12:35:46.038418
tableserver 0
root0
session_timeout 0
session_autoclose   0
max_file_size   0
last_failure0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={}
max_mds 0
in
up  {}
failed
stopped
data_pools
metadata_pool   0
inline_data disabled

2014-12-18 12:35:46.151117 7fb008d4b700  0 log_channel(cluster) log 
[INF] :

mdsmap e1: 0/0/0 up
2014-12-18 12:35:46.152873 7fb008d4b700  1 mon.a@0(leader).osd e1 e1: 
0

osds: 0 up, 0 in
2014-12-18 12:35:46.154551 7fb008d4b700  0 mon.a@0(leader).osd e1 
crush map

has features 1107558400, adjusting msgr requires
2014-12-18 12:35:46.154580 7fb008d4b700  0 mon.a@0(leader).osd e1 
crush map

has features 1107558400, adjusting msgr requires
2014-12-18 12:35:46.154588 7fb008d4b700  0 mon.a@0(leader).osd e1 
crush map

has features 1107558400, adjusting msgr requires
2014-12-18 12:35:46.154592 7fb008d4b700  0 

Re: [ceph-users] New Cluster (0.87), Missing Default Pools?

2014-12-18 Thread John Spray
Can you point out the specific page that's out of date so that we can update it?

Thanks,
John

On Thu, Dec 18, 2014 at 5:52 PM, Dyweni - Ceph-Users
6exbab4fy...@dyweni.com wrote:
 Thanks!!

 Looks like the the manual installation instructions should be updated, to
 eliminate future confusion.

 Dyweni




 On 2014-12-18 07:11, John Spray wrote:

 No mistake -- the Ceph FS pools are no longer created by default, as
 not everybody needs them.  Ceph FS users now create these pools
 explicitly:
 http://ceph.com/docs/master/cephfs/createfs/

 John

 On Thu, Dec 18, 2014 at 12:52 PM, Dyweni - Ceph-Users
 6exbab4fy...@dyweni.com wrote:

 Hi All,


 Just setup the monitor for a new cluster based on Giant (0.87) and I find
 that only the 'rbd' pool was created automatically.  I don't see the
 'data'
 or 'metadata' pools in 'ceph osd lspools' or the log files.  I haven't
 setup
 any OSDs or MDSs yet.  I'm following the manual deployment guide.

 Would you mind looking over the setup details/logs below and letting me
 know
 my mistake please?



 Here's my /etc/ceph/ceph.conf file:
 ---
 [global]
 fsid = xx

 public network = xx.xx.xx.xx/xx
 cluster network = xx.xx.xx.xx/xx

 auth cluster required = cephx
 auth service required = cephx
 auth client required = cephx

 osd pool default size = 2
 osd pool default min size = 1

 osd pool default pg num = 100
 osd pool default pgp num = 100

 [mon]
 mon initial members = a

 [mon.a]
 host = xx
 mon addr = xx.xx.xx.xx
 ---


 Here's the commands used to setup the monitor:
 ---
 ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon.
 --cap
 mon 'allow *'
 ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
 --gen-key
 -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap
 mds 'allow'
 ceph-authtool /tmp/ceph.mon.keyring --import-keyring
 /etc/ceph/ceph.client.admin.keyring
 monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap
 mkdir /var/lib/ceph/mon/ceph-a
 ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
 /etc/init.d/ceph-mon.a start
 ---


 Here's the ceph-mon.a logfile:
 ---
 2014-12-18 12:35:45.768752 7fb00df94780  0 ceph version 0.87
 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 3225
 2014-12-18 12:35:45.856851 7fb00df94780  0 mon.a does not exist in
 monmap,
 will attempt to join an existing cluster
 2014-12-18 12:35:45.857069 7fb00df94780  0 using public_addr
 xx.xx.xx.xx:0/0
 - xx.xx.xx.xx:6789/0
 2014-12-18 12:35:45.857126 7fb00df94780  0 starting mon.a rank -1 at
 xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx
 2014-12-18 12:35:45.857330 7fb00df94780  1 mon.a@-1(probing) e0 preinit
 fsid
 xx
 2014-12-18 12:35:45.857402 7fb00df94780  1 mon.a@-1(probing) e0
 initial_members a, filtering seed monmap
 2014-12-18 12:35:45.858322 7fb00df94780  0 mon.a@-1(probing) e0  my rank
 is
 now 0 (was -1)
 2014-12-18 12:35:45.858360 7fb00df94780  1 mon.a@0(probing) e0
 win_standalone_election
 2014-12-18 12:35:45.859803 7fb00df94780  0 log_channel(cluster) log [INF]
 :
 mon.a@0 won leader election with quorum 0
 2014-12-18 12:35:45.863846 7fb008d4b700  1
 mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:45.863867 7fb008d4b700  1 mon.a@0(leader).pg v0
 on_upgrade
 discarding in-core PGMap
 2014-12-18 12:35:45.865662 7fb008d4b700  1
 mon.a@0(leader).paxosservice(auth
 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:45.865719 7fb008d4b700  1 mon.a@0(probing) e1
 win_standalone_election
 2014-12-18 12:35:45.867394 7fb008d4b700  0 log_channel(cluster) log [INF]
 :
 mon.a@0 won leader election with quorum 0
 2014-12-18 12:35:46.003223 7fb008d4b700  0 log_channel(cluster) log [INF]
 :
 monmap e1: 1 mons at {a=xx.xx.xx.xx:6789/0}
 2014-12-18 12:35:46.040555 7fb008d4b700  1
 mon.a@0(leader).paxosservice(auth
 0..0) refresh upgraded, format 1 - 0
 2014-12-18 12:35:46.087081 7fb008d4b700  0 log_channel(cluster) log [INF]
 :
 pgmap v1: 0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
 2014-12-18 12:35:46.141415 7fb008d4b700  0 mon.a@0(leader).mds e1
 print_map
 epoch   1
 flags   0
 created 0.00
 modified2014-12-18 12:35:46.038418
 tableserver 0
 root0
 session_timeout 0
 session_autoclose   0
 max_file_size   0
 last_failure0
 last_failure_osd_epoch  0
 compat  compat={},rocompat={},incompat={}
 max_mds 0
 in
 up  {}
 failed
 stopped
 data_pools
 metadata_pool   0
 inline_data disabled

 2014-12-18 12:35:46.151117 7fb008d4b700  0 log_channel(cluster) log [INF]
 :
 mdsmap e1: 0/0/0 up
 2014-12-18 12:35:46.152873 7fb008d4b700  1 mon.a@0(leader).osd e1 e1: 0
 osds: 0 up, 0 in
 2014-12-18 12:35:46.154551 7fb008d4b700  0 mon.a@0(leader).osd e1 crush
 map
 has features 1107558400, adjusting msgr requires
 2014-12-18 12:35:46.154580 

Re: [ceph-users] New Cluster (0.87), Missing Default Pools?

2014-12-18 Thread Thomas Lemarchand
No!
It would have been a really bad idea. I upgraded without losing my
default pools, thankfully ;)

-- 
Thomas Lemarchand
Cloud Solutions SAS - Responsable des systèmes d'information



On jeu., 2014-12-18 at 10:10 -0800, JIten Shah wrote:
 So what happens if we upgrade from Firefly to Giant? Do we lose the pools?
 
 —Jiten
 On Dec 18, 2014, at 5:12 AM, Thomas Lemarchand 
 thomas.lemarch...@cloud-solutions.fr wrote:
 
  I remember reading somewhere (maybe in changelogs) that default pools
  were not created automatically anymore.
  
  You can create pools you need yourself.
  
  -- 
  Thomas Lemarchand
  Cloud Solutions SAS - Responsable des systèmes d'information
  
  
  
  On jeu., 2014-12-18 at 06:52 -0600, Dyweni - Ceph-Users wrote:
  Hi All,
  
  
  Just setup the monitor for a new cluster based on Giant (0.87) and I 
  find that only the 'rbd' pool was created automatically.  I don't see 
  the 'data' or 'metadata' pools in 'ceph osd lspools' or the log files.  
  I haven't setup any OSDs or MDSs yet.  I'm following the manual 
  deployment guide.
  
  Would you mind looking over the setup details/logs below and letting me 
  know my mistake please?
  
  
  
  Here's my /etc/ceph/ceph.conf file:
  ---
  [global]
  fsid = xx
  
  public network = xx.xx.xx.xx/xx
  cluster network = xx.xx.xx.xx/xx
  
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  
  osd pool default size = 2
  osd pool default min size = 1
  
  osd pool default pg num = 100
  osd pool default pgp num = 100
  
  [mon]
  mon initial members = a
  
  [mon.a]
  host = xx
  mon addr = xx.xx.xx.xx
  ---
  
  
  Here's the commands used to setup the monitor:
  ---
  ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. 
  --cap mon 'allow *'
  ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring 
  --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 
  'allow *' --cap mds 'allow'
  ceph-authtool /tmp/ceph.mon.keyring --import-keyring 
  /etc/ceph/ceph.client.admin.keyring
  monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap
  mkdir /var/lib/ceph/mon/ceph-a
  ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring 
  /tmp/ceph.mon.keyring
  /etc/init.d/ceph-mon.a start
  ---
  
  
  Here's the ceph-mon.a logfile:
  ---
  2014-12-18 12:35:45.768752 7fb00df94780  0 ceph version 0.87 
  (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 3225
  2014-12-18 12:35:45.856851 7fb00df94780  0 mon.a does not exist in 
  monmap, will attempt to join an existing cluster
  2014-12-18 12:35:45.857069 7fb00df94780  0 using public_addr 
  xx.xx.xx.xx:0/0 - xx.xx.xx.xx:6789/0
  2014-12-18 12:35:45.857126 7fb00df94780  0 starting mon.a rank -1 at 
  xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx
  2014-12-18 12:35:45.857330 7fb00df94780  1 mon.a@-1(probing) e0 preinit 
  fsid xx
  2014-12-18 12:35:45.857402 7fb00df94780  1 mon.a@-1(probing) e0  
  initial_members a, filtering seed monmap
  2014-12-18 12:35:45.858322 7fb00df94780  0 mon.a@-1(probing) e0  my rank 
  is now 0 (was -1)
  2014-12-18 12:35:45.858360 7fb00df94780  1 mon.a@0(probing) e0 
  win_standalone_election
  2014-12-18 12:35:45.859803 7fb00df94780  0 log_channel(cluster) log 
  [INF] : mon.a@0 won leader election with quorum 0
  2014-12-18 12:35:45.863846 7fb008d4b700  1 
  mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 - 0
  2014-12-18 12:35:45.863867 7fb008d4b700  1 mon.a@0(leader).pg v0 
  on_upgrade discarding in-core PGMap
  2014-12-18 12:35:45.865662 7fb008d4b700  1 
  mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 - 0
  2014-12-18 12:35:45.865719 7fb008d4b700  1 mon.a@0(probing) e1 
  win_standalone_election
  2014-12-18 12:35:45.867394 7fb008d4b700  0 log_channel(cluster) log 
  [INF] : mon.a@0 won leader election with quorum 0
  2014-12-18 12:35:46.003223 7fb008d4b700  0 log_channel(cluster) log 
  [INF] : monmap e1: 1 mons at {a=xx.xx.xx.xx:6789/0}
  2014-12-18 12:35:46.040555 7fb008d4b700  1 
  mon.a@0(leader).paxosservice(auth 0..0) refresh upgraded, format 1 - 0
  2014-12-18 12:35:46.087081 7fb008d4b700  0 log_channel(cluster) log 
  [INF] : pgmap v1: 0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
  2014-12-18 12:35:46.141415 7fb008d4b700  0 mon.a@0(leader).mds e1 
  print_map
  epoch   1
  flags   0
  created 0.00
  modified2014-12-18 12:35:46.038418
  tableserver 0
  root0
  session_timeout 0
  session_autoclose   0
  max_file_size   0
  last_failure0
  last_failure_osd_epoch  0
  compat  compat={},rocompat={},incompat={}
  max_mds 0
  in
  up  {}
  failed
  stopped
  data_pools
  metadata_pool   0
  inline_data disabled
  
  2014-12-18 12:35:46.151117 7fb008d4b700  0 log_channel(cluster) log 
  [INF] : mdsmap 

Re: [ceph-users] New Cluster (0.87), Missing Default Pools?

2014-12-18 Thread John Spray
On Thu, Dec 18, 2014 at 6:10 PM, JIten Shah jshah2...@me.com wrote:
 So what happens if we upgrade from Firefly to Giant? Do we lose the pools?

Sure, you didn't have any data you wanted to keep, right? :-D

Seriously though, no, we don't delete anything during an upgrade.
It's just newly installed clusters that would never have those pools
created.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Block device and Trim/Discard

2014-12-18 Thread Travis Rhoden
One question re: discard support for kRBD -- does it matter which format
the RBD is?  Are Format 1 and Format 2 both okay, or just Format 2?

 - Travis

On Mon, Dec 15, 2014 at 8:58 AM, Max Power 
mailli...@ferienwohnung-altenbeken.de wrote:

  Ilya Dryomov ilya.dryo...@inktank.com wrote on 12 December 2014 at
 18:00:
  Just a note, discard support went into 3.18, which was released a few
  days ago.

 I recently compiled 3.18 on Debian 7 and what can I say... It works
 perfectly well. The used memory goes up and down again. So I think this
 will be my choice. Thank you!
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Block device and Trim/Discard

2014-12-18 Thread Josh Durgin

On 12/18/2014 10:49 AM, Travis Rhoden wrote:

One question re: discard support for kRBD -- does it matter which format
the RBD is?  Are Format 1 and Format 2 both okay, or just Format 2?


It shouldn't matter which format you use.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Content-length error uploading big files to radosgw

2014-12-18 Thread Gregory Farnum
On Thu, Dec 18, 2014 at 4:04 AM, Daniele Venzano li...@brownhat.org wrote:
 Hello,

 I have been trying to upload multi-gigabyte files to CEPH via the object
 gateway, using both the swift and s3 APIs.

 With file up to about 2GB everything works as expected.

 With files bigger than that I get back a 400 Bad Request error, both
 with S3 (boto) and Swift clients.

 Enabling debug I can see this:
 2014-12-18 12:38:28.947499 7f5419ffb700 20 CONTENT_LENGTH=307200
 ...
 2014-12-18 12:38:28.947539 7f5419ffb700  1 == starting new request
 req=0x7f541000fee0 =
 2014-12-18 12:38:28.947556 7f5419ffb700  2 req 2:0.17::PUT
 /test/test::initializing
 2014-12-18 12:38:28.947581 7f5419ffb700 10 bad content length, aborting
 2014-12-18 12:38:28.947641 7f5419ffb700  2 req 2:0.000102::PUT
 /test/test::http status=400
 2014-12-18 12:38:28.947644 7f5419ffb700  1 == req done
 req=0x7f541000fee0 http_status=400 ==


 The content length is the right one (I created a test file with dd).
 With a file 207200 bytes long, I get no error.

 The gateway is running on debian, with the packages available on the
 ceph repo, version 0.87-1~bpo70+1. I am using standard apache (no 100
 continue).

 There is a limit on the object size? Or there is an error in my
 configuration somewhere?

You just stated it: you need 100-continue to upload parts larger than 2GB.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver

2014-12-18 Thread Gregory Farnum
On Wed, Dec 17, 2014 at 8:52 PM, Lindsay Mathieson
lindsay.mathie...@gmail.com wrote:
 I've been experimenting with CephFS for running KVM images (Proxmox).

 cephfs fuse version - 0.87

 cephfs kernel module - kernel version 3.10


 Part of my testing involves running a Windows 7 VM up and running
 CrystalDiskMark to check the I/O in the VM. It's surprisingly good with
 both the fuse and the kernel driver; seq reads & writes are actually
 faster than the underlying disk, so I presume the FS is aggressively
 caching.

 With the fuse driver I have no problems.

 With the kernel driver, the benchmark runs fine, but when I reboot the
 VM the drive is corrupted and unreadable, every time. Rolling back to
 a snapshot fixes the disk. This does not happen unless I run the
 benchmark, which I presume is writing a lot of data.

 No problems with the same test for Ceph rbd, or NFS.

Do you have any information about *how* the drive is corrupted; what
part Win7 is unhappy with? I don't know how Proxmox configures it, but
I assume you're storing the disk images as single files on the FS?

I'm really not sure what the kernel client could even do here, since
if you're not rebooting the host as well as the VM then it can't be
losing any of the data it's given. :/
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver

2014-12-18 Thread Udo Lembke
Hi Lindsay,
have you tried the different cache-options (no cache, write through,
...) which proxmox offer, for the drive?


Udo

On 18.12.2014 05:52, Lindsay Mathieson wrote:
 I've been experimenting with CephFS for running KVM images (proxmox).

 cephfs fuse version - 0.87

 cephfs kernel module - kernel version 3.10


 Part of my testing involves running a Windows 7 VM up and running
 CrystalDiskMark to check the I/O in the VM. It's surprisingly good with
 both the fuse and the kernel driver; seq reads & writes are actually
 faster than the underlying disk, so I presume the FS is aggressively
 caching.

 With the fuse driver I have no problems.

 With the kernel driver, the benchmark runs fine, but when I reboot the
 VM the drive is corrupted and unreadable, every time. Rolling back to
 a snapshot fixes the disk. This does not happen unless I run the
 benchmark, which I presume is writing a lot of data.

 No problems with the same test for Ceph rbd, or NFS.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Block device and Trim/Discard

2014-12-18 Thread Adeel Nazir
Discard is supported in kernel 3.18 rc1 or greater as per 
https://lkml.org/lkml/2014/10/14/450


 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Robert Sander
 Sent: Friday, December 12, 2014 7:01 AM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Ceph Block device and Trim/Discard
 
 On 12.12.2014 12:48, Max Power wrote:
 
  It would be great to shrink the used space. Is there a way to achieve
  this? Or have I done something wrong? In a professional environment
  you may can live with filesystems that only grow. But on my small
  home-cluster this really is a problem.
 
 As Wido already mentioned the kernel RBD does not support discard.
 
 When using qemu+rbd you cannot use the virtio driver as this also does not
 support discard. My best experience is with the virtual SATA driver and the
 options cache=writeback and discard=on.
 
 Regards
 --
 Robert Sander
 Heinlein Support GmbH
 Schwedter Str. 8/9b, 10119 Berlin
 
 http://www.heinlein-support.de
 
 Tel: 030 / 405051-43
 Fax: 030 / 405051-19
 
 Zwangsangaben lt. §35a GmbHG:
 HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
 Geschäftsführer: Peer Heinlein -- Sitz: Berlin
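
For reference, a qemu command-line sketch of that SATA + writeback + discard
combination (pool/image and device names below are placeholders, not a tested
config):

  qemu-system-x86_64 ... \
    -device ahci,id=ahci0 \
    -drive file=rbd:rbd/vm-disk-1,format=raw,if=none,id=drive-sata0,cache=writeback,discard=on \
    -device ide-hd,bus=ahci0.0,drive=drive-sata0

The guest still has to issue TRIM (e.g. fstrim) before any space is actually
released on the RBD side.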

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Content-length error uploading big files to radosgw

2014-12-18 Thread Yehuda Sadeh
On Thu, Dec 18, 2014 at 11:24 AM, Gregory Farnum g...@gregs42.com wrote:
 On Thu, Dec 18, 2014 at 4:04 AM, Daniele Venzano li...@brownhat.org wrote:
 Hello,

 I have been trying to upload multi-gigabyte files to CEPH via the object
 gateway, using both the swift and s3 APIs.

 With file up to about 2GB everything works as expected.

 With files bigger than that I get back a 400 Bad Request error, both
 with S3 (boto) and Swift clients.

 Enabling debug I can see this:
 2014-12-18 12:38:28.947499 7f5419ffb700 20 CONTENT_LENGTH=307200
 ...
 2014-12-18 12:38:28.947539 7f5419ffb700  1 == starting new request
 req=0x7f541000fee0 =
 2014-12-18 12:38:28.947556 7f5419ffb700  2 req 2:0.17::PUT
 /test/test::initializing
 2014-12-18 12:38:28.947581 7f5419ffb700 10 bad content length, aborting
 2014-12-18 12:38:28.947641 7f5419ffb700  2 req 2:0.000102::PUT
 /test/test::http status=400
 2014-12-18 12:38:28.947644 7f5419ffb700  1 == req done
 req=0x7f541000fee0 http_status=400 ==


 The content length is the right one (I created a test file with dd).
 With a file 207200 bytes long, I get no error.

 The gateway is running on debian, with the packages available on the
 ceph repo, version 0.87-1~bpo70+1. I am using standard apache (no 100
 continue).

 There is a limit on the object size? Or there is an error in my
 configuration somewhere?

 You just stated it: you need 100-continue to upload parts larger than 2GB.

Just a small clarification: the 100-continue is needed to get an early
error message. We should be able to support up to 5GB of a single
part, so it could well be a bug. Usually for large size uploads you
should be using the multipart upload api.
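
For completeness: with stock apache and no 100-continue support, the usual
workaround is the setting below in the gateway's section of ceph.conf (a
sketch; the section name depends on your rgw instance name):

  [client.radosgw.gateway]
      rgw print continue = false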

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver

2014-12-18 Thread Lindsay Mathieson
On Thu, 18 Dec 2014 08:41:21 PM Udo Lembke wrote:
 have you tried the different cache-options (no cache, write through,
 ...) which proxmox offer, for the drive?


I tried with writeback and it didn't corrupt.
-- 
Lindsay

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver

2014-12-18 Thread Lindsay Mathieson
On Thu, 18 Dec 2014 11:23:42 AM Gregory Farnum wrote:
 Do you have any information about *how* the drive is corrupted; what
 part Win7 is unhappy with? 

Failure to find the boot sector I think, I'll run it again and take a screen 
shot.

 I don't know how Proxmox configures it, but
 I assume you're storing the disk images as single files on the FS?

It's a single KVM QCOW2 file.

-- 
Lindsay

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reproducable Data Corruption with cephfs kernel driver

2014-12-18 Thread John Spray
On Thu, Dec 18, 2014 at 8:40 PM, Lindsay Mathieson
lindsay.mathie...@gmail.com wrote:
 I don't know how Proxmox configures it, but
 I assume you're storing the disk images as single files on the FS?

 its a single KVM QCOW2 file.

Like the cache mode, the image format might be an interesting thing to
experiment with.  There are bugs out there in all layers of the IO
stack, it's entirely possible that you're seeing a bug elsewhere in
the stack that is only being triggered when using Ceph.

This probably goes without saying, but make sure you're using the
latest/greatest versions of everything, including
kvm/qemu/proxmox/kernel/guest drivers.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What to do when a parent RBD clone becomes corrupted

2014-12-18 Thread Robert LeBlanc
Before we base thousands of VM image clones off of one or more snapshots, I
want to test what happens when the snapshot becomes corrupted. I don't
believe the snapshot will become corrupted through client access to the
snapshot, but some weird issue with PGs being lost or forced to be lost,
solar flares or alien invasions.

My initial thought was to export a snapshot image and import it over the
top of the existing snapshot so that children would be preserved. No such
luck. I was hoping there would be an i-really-really-want-to-do-this
option that would let me restore the snapshot.

Am I going about this the wrong way? I can see having to restore a number
of VMs because of a corrupted clone, but I'd hate to lose all the clones
because of corruption in the snapshot. I would be happy if the restored
snapshot would be flattened if it was a clone of another image previously.
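
In terms of commands, what I tried and what I'd settle for is roughly the
following -- pool and image names are made up:

  rbd export images/base@gold /backup/base-gold.img   # keep a copy of the snapshot's contents
  rbd import /backup/base-gold.img images/base-new    # only works into a new image; overwriting the existing snapshot is refused
  rbd flatten compute/vm-0001-disk                     # detach one clone at a time from the damaged parent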

Thanks,
Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with SSDs

2014-12-18 Thread Mark Kirkwood

On 19/12/14 03:01, Lindsay Mathieson wrote:

On Thu, 18 Dec 2014 10:05:20 PM Mark Kirkwood wrote:

The effect of this is *highly* dependent to the SSD make/model. My m550
work vastly better if the journal is a file on a filesystem as opposed
to a partition.

Obviously the Intel S3700/S3500 are a better choice - but the OP has
already purchased Sammy 840's, so I'm trying to suggest options to try
that don't require him to buy new SSDs!



I have 120GB Samsung 840 EVO's with 10GB journal partitions and just gave this
a go.

No real change unfortunately :( using rados bench.

However it does make experimenting with different journal sizes easier.




Pity. If you used xfs you can try tweaking some of the mkfs 
options...but I doubt they will make too much difference.


Looking at the data specs for the 840, it does not seem to have any on 
board capacitors. If it did you could risk switching off xfs write 
barriers...which would probably make a big difference.


You could try switching *off* the write cache (just in case the 840 
behaves like my m550's and gets - oddly- 2x faster for sync writes in 
that case)! However disabling the write cache may *considerably* 
decrease disk lifetime, so if the setting helps in your case, you'll need 
to conduct some experiments to measure by how much the lifetime is gonna 
be impacted.
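
Something like this, assuming the journal SSD shows up as /dev/sdb (check the
current state first):

  hdparm -W /dev/sdb     # show whether the volatile write cache is enabled
  hdparm -W0 /dev/sdb    # disable it for the experiment
  hdparm -W1 /dev/sdb    # turn it back on if it doesn't help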


Cheers

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Craig Lewis
On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com
wrote:


  2. What should be the minimum hardware requirement of the server (CPU,
  Memory, NIC etc)

 There is no real minimum to run Ceph, it's all about what your
 workload will look like and what kind of performance you need. We have
 seen Ceph run on Raspberry Pis.


Technically, the smallest cluster is a single node with a 10 GiB disk.
Anything smaller won't work.

That said, Ceph was envisioned to run on large clusters.  IIRC, the
reference architecture has 7 rows, each row having 10 racks, all full.

Those of us running small clusters (less than 10 nodes) are noticing that
it doesn't work quite as well.  We have to significantly scale back the
amount of backfilling and recovery that is allowed.  I try to keep all
backfill/recovery operations touching less than 20% of my OSDs.  In the
reference architecture, it could lose a whole row, and still keep under
that limit.  My 5-node cluster is noticeably better than the 3 node
cluster.  It's faster, has lower latency, and latency doesn't increase as
much during recovery operations.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Robert LeBlanc
I'm interested to know if there is a reference to this reference
architecture. It would help alleviate some of the fears we have about
scaling this thing to a massive scale (10,000's OSDs).

Thanks,
Robert LeBlanc

On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com
wrote:



 On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com
 wrote:


  2. What should be the minimum hardware requirement of the server (CPU,
  Memory, NIC etc)

 There is no real minimum to run Ceph, it's all about what your
 workload will look like and what kind of performance you need. We have
 seen Ceph run on Raspberry Pis.


 Technically, the smallest cluster is a single node with a 10 GiB disk.
 Anything smaller won't work.

 That said, Ceph was envisioned to run on large clusters.  IIRC, the
 reference architecture has 7 rows, each row having 10 racks, all full.

 Those of us running small clusters (less than 10 nodes) are noticing that
 it doesn't work quite as well.  We have to significantly scale back the
 amount of backfilling and recovery that is allowed.  I try to keep all
 backfill/recovery operations touching less than 20% of my OSDs.  In the
 reference architecture, it could lose a whole row, and still keep under
 that limit.  My 5 nodes cluster is noticeably better better than the 3 node
 cluster.  It's faster, has lower latency, and latency doesn't increase as
 much during recovery operations.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Craig Lewis
I think this is it:
https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939

You can also check out a presentation on Cern's Ceph cluster:
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern


At large scale, the biggest problem will likely be network I/O on the
inter-switch links.



On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us
wrote:

 I'm interested to know if there is a reference to this reference
 architecture. It would help alleviate some of the fears we have about
 scaling this thing to a massive scale (10,000's OSDs).

 Thanks,
 Robert LeBlanc

 On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com
 wrote:



 On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com
 wrote:


  2. What should be the minimum hardware requirement of the server (CPU,
  Memory, NIC etc)

 There is no real minimum to run Ceph, it's all about what your
 workload will look like and what kind of performance you need. We have
 seen Ceph run on Raspberry Pis.


 Technically, the smallest cluster is a single node with a 10 GiB disk.
 Anything smaller won't work.

 That said, Ceph was envisioned to run on large clusters.  IIRC, the
 reference architecture has 7 rows, each row having 10 racks, all full.

 Those of us running small clusters (less than 10 nodes) are noticing that
 it doesn't work quite as well.  We have to significantly scale back the
 amount of backfilling and recovery that is allowed.  I try to keep all
 backfill/recovery operations touching less than 20% of my OSDs.  In the
 reference architecture, it could lose a whole row, and still keep under
 that limit.  My 5 nodes cluster is noticeably better better than the 3 node
 cluster.  It's faster, has lower latency, and latency doesn't increase as
 much during recovery operations.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Have 2 different public networks

2014-12-18 Thread Francois Lafont
Hi,

Is it possible to have 2 different public networks in a Ceph cluster?
I explain my question below.

Currently, I have 3 identical nodes in my Ceph cluster. Each node has:

- only 1 monitor;
- n osds (we don't care about the value n here);
- and 3 interfaces.

One interface for the cluster network (10.0.0.0/24):
- node1 - 10.0.0.1
- node2 - 10.0.0.2
- node3 - 10.0.0.3

One interface for the public network (10.0.1.0/24):
- node1 - [mon.1] mon addr = 10.0.1.1
- node2 - [mon.2] mon addr = 10.0.1.2
- node3 - [mon.3] mon addr = 10.0.1.3

And one interface not used yet (see below).

With this configuration, if I have a Ceph client in the
public network, I can use rbd images etc. No problem,
it works.

But now I would like to use the third interface of the
nodes for a *different* public network - 10.0.2.0/24.
The Ceph clients in this network will not really use the
storage but will create and delete rbd images in a pool.
In fact it's just a network for *Ceph management*.

So, I want to have 2 different public networks:
- 10.0.1.0/24 (already exists)
- *and* 10.0.2.0/24

Am I wrong in saying that mon.1, mon.2 and mon.3
must each have one more IP address? Is it possible to
have a monitor that listens on 2 addresses? Something
like that:

- node1 - [mon.1] mon addr = 10.0.1.1 *and* 10.0.2.1
- node2 - [mon.2] mon addr = 10.0.1.2 *and* 10.0.2.2
- node3 - [mon.3] mon addr = 10.0.1.3 *and* 10.0.2.3

My environment is not a production environment, just a
lab. So, if necessary I can reinstall everything, no
problem.

Thanks for your help.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Robert LeBlanc
Thanks, I'll look into these.

On Thu, Dec 18, 2014 at 5:12 PM, Craig Lewis cle...@centraldesktop.com
wrote:

 I think this is it:
 https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939

 You can also check out a presentation on Cern's Ceph cluster:
 http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern


 At large scale, the biggest problem will likely be network I/O on the
 inter-switch links.



 On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us
 wrote:

 I'm interested to know if there is a reference to this reference
 architecture. It would help alleviate some of the fears we have about
 scaling this thing to a massive scale (10,000's OSDs).

 Thanks,
 Robert LeBlanc

 On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com
 wrote:



 On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com
 wrote:


  2. What should be the minimum hardware requirement of the server (CPU,
  Memory, NIC etc)

 There is no real minimum to run Ceph, it's all about what your
 workload will look like and what kind of performance you need. We have
 seen Ceph run on Raspberry Pis.


 Technically, the smallest cluster is a single node with a 10 GiB disk.
 Anything smaller won't work.

 That said, Ceph was envisioned to run on large clusters.  IIRC, the
 reference architecture has 7 rows, each row having 10 racks, all full.

 Those of us running small clusters (less than 10 nodes) are noticing
 that it doesn't work quite as well.  We have to significantly scale back
 the amount of backfilling and recovery that is allowed.  I try to keep all
 backfill/recovery operations touching less than 20% of my OSDs.  In the
 reference architecture, it could lose a whole row, and still keep under
 that limit.  My 5 nodes cluster is noticeably better better than the 3 node
 cluster.  It's faster, has lower latency, and latency doesn't increase as
 much during recovery operations.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Christian Balzer

Hello,

On Thu, 18 Dec 2014 16:12:09 -0800 Craig Lewis wrote:

Firstly I'd like to confirm what Craig said about small clusters.
I just changed my four storage node test cluster from 1 OSD per node to 4
and it can now saturate a 1GbE link (110MB/s) where before it peaked at
50-60MB/s. Of course now it is CPU bound and a bit tight on memory (those
nodes have 4GB RAM and 2 have just 1 CPU/core). ^o^

 I think this is it:
 https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939

Ah, the joys of corporate address packratting. 
 
 You can also check out a presentation on Cern's Ceph cluster:
 http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
 
 
 At large scale, the biggest problem will likely be network I/O on the
 inter-switch links.
 
While true, I think it will hit an equilibrium of sorts: if you actually
have enough client traffic to saturate those links, it's time for an upgrade.

Aside from mere technical questions and challenges of scaling Ceph to
those sizes (tuning all sorts of parameters, etc) I think clusters of that
scale can become an administrative nightmare first and foremost.

Let's take a look at a classic Ceph cluster with 10,000 OSDs:
It will have somewhere between 500 and 1000 nodes. That number should give
you pause already; there are bound to be dead nodes frequently.
And with 10,000 disks, you're pretty much guaranteed to have a dead OSD or
more (see the various threads about how resilient Ceph is) at any given
time.
So you'll need a team of people to swap disks on a constant/regular basis.
And unless you also have a very nice inventory and tracking system, you
will want to replace these OSDs in order, so that OSD 10 isn't on node
50 all of a sudden, etc.

There's probably a point of diminishing returns where adding more OSDs
stops making sense, for various reasons.
In fact once you reach a few hundred OSDs, to ease maintenance consider
RAIDed OSDs (no more failed OSDs, yeah! ^o^).

For me, the life cycle of a steadily growing cluster would be something
like this:
1. Start with as many nodes/OSDs as you can can afford for performance,
even if you don't need the space yet. 
2. Keep adding OSDs to satisfy space and performance requirements as
needed.
3. While performance is still good (or can't improve because of network
limitations) and space requirements increase, grow the size of your OSDs,
not the number.

Regards,

Christian
 
 
 On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us
 wrote:
 
  I'm interested to know if there is a reference to this reference
  architecture. It would help alleviate some of the fears we have about
  scaling this thing to a massive scale (10,000's OSDs).
 
  Thanks,
  Robert LeBlanc
 
  On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis
  cle...@centraldesktop.com wrote:
 
 
 
  On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com
  wrote:
 
 
   2. What should be the minimum hardware requirement of the server
   (CPU, Memory, NIC etc)
 
  There is no real minimum to run Ceph, it's all about what your
  workload will look like and what kind of performance you need. We
  have seen Ceph run on Raspberry Pis.
 
 
  Technically, the smallest cluster is a single node with a 10 GiB disk.
  Anything smaller won't work.
 
  That said, Ceph was envisioned to run on large clusters.  IIRC, the
  reference architecture has 7 rows, each row having 10 racks, all full.
 
  Those of us running small clusters (less than 10 nodes) are noticing
  that it doesn't work quite as well.  We have to significantly scale
  back the amount of backfilling and recovery that is allowed.  I try
  to keep all backfill/recovery operations touching less than 20% of my
  OSDs.  In the reference architecture, it could lose a whole row, and
  still keep under that limit.  My 5 nodes cluster is noticeably better
  better than the 3 node cluster.  It's faster, has lower latency, and
  latency doesn't increase as much during recovery operations.
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have 2 different public networks

2014-12-18 Thread Craig Lewis
The daemons bind to *, so adding the 3rd interface to the machine will
allow you to talk to the daemons on that IP.

I'm not really sure how you'd setup the management network though.  I'd
start by setting the ceph.conf public network on the  management nodes to
have the public network 10.0.2.0/24, and an /etc/hosts file with the
monitor's names on the 10.0.2.0/24 network.

Make sure the management nodes can't route to the 10.0.1.0/24 network, and
see what happens.
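
A sketch of that experiment, reusing your addressing (management nodes only):

  # /etc/ceph/ceph.conf on the management clients
  [global]
      public network = 10.0.2.0/24

  # /etc/hosts on the management clients
  10.0.2.1    node1
  10.0.2.2    node2
  10.0.2.3    node3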


Do you really plan on having enough traffic creating and deleting RBD
images that you need a dedicated network?  It seems like setting up link
aggregation on 10.0.1.0/24 would be simpler and less error prone.



On Thu, Dec 18, 2014 at 4:19 PM, Francois Lafont flafdiv...@free.fr wrote:

 Hi,

 Is it possible to have 2 different public networks in a Ceph cluster?
 I explain my question below.

 Currently, I have 3 identical nodes in my Ceph cluster. Each node has:

 - only 1 monitor;
 - n osds (we don't care about the value n here);
 - and 3 interfaces.

 One interface for the cluster network (10.0.0.0/24):
 - node1 - 10.0.0.1
 - node2 - 10.0.0.2
 - node3 - 10.0.0.3

 One interface for the public network (10.0.1.0/24):
 - node1 - [mon.1] mon addr = 10.0.1.1
 - node2 - [mon.2] mon addr = 10.0.1.2
 - node3 - [mon.3] mon addr = 10.0.1.3

 And one interface not used yet (see below).

 With this configuration, if I have a Ceph client in the
 public network, I can use rbd images etc. No problem,
 it works.

 But now I would like to use the third interface of the
 nodes for a *different* plublic network - 10.0.2.0/24.
 The Ceph clients in this network will not really use the
 storage but will create and delete rbd images in a pool.
 In fact it's just a network for *Ceph management*.

 So, I want to have 2 different public networks:
 - 10.0.1.0/24 (already exists)
 - *and* 10.0.2.0/24

 Am I wrong if I say that mon.1, mon.2 and mon.3
 must have one more IP address? Is it possible to
 have a monitor that listens on 2 addresses? Something
 like that:

 - node1 - [mon.1] mon addr = 10.0.1.1 *and* 10.0.2.1
 - node2 - [mon.2] mon addr = 10.0.1.2 *and* 10.0.2.2
 - node3 - [mon.3] mon addr = 10.0.1.3 *and* 10.0.2.3

 My environment is not a production environment, just a
 lab. So, if necessary I can reinstall everything, no
 problem.

 Thanks for your help.

 --
 François Lafont
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High CPU/Delay when Removing Layered Child RBD Image

2014-12-18 Thread Tyler Wilson
Hey All,

On a new Cent7 deployment with firefly I'm noticing a strange behavior when
deleting RBD child disks. It appears upon deletion cpu usage on each OSD
process rises to about 75% for 30+ seconds. On my previous deployments
with CentOS 6.x and Ubuntu 12/14 this was never a problem.

Each RBD Disk is 4GB created with 'rbd clone
images/136dd921-f6a2-432f-b4d6-e9902f71baa6@snap compute/test'

## Ubuntu12 3.11.0-18-generic with Ceph 0.80.7
root@node-1:~# date; rbd rm compute/test123; date
Fri Dec 19 01:09:31 UTC 2014
Removing image: 100% complete...done.
Fri Dec 19 01:09:31 UTC 2014

## Cent7 3.18.1-1.el7.elrepo.x86_64 with Ceph 0.80.7
[root@hvm003 ~]# date; rbd rm compute/test; date
Fri Dec 19 01:08:32 UTC 2014
Removing image: 100% complete...done.
Fri Dec 19 01:09:00 UTC 2014

root@cpl001 ~]# ceph -s
cluster d033718a-2cb9-409e-b968-34370bd67bd0
 health HEALTH_OK
 monmap e1: 3 mons at {cpl001=
10.0.0.1:6789/0,mng001=10.0.0.3:6789/0,net001=10.0.0.2:6789/0}, election
epoch 10, quorum 0,1,2 cpl001,net001,mng001
 osdmap e84: 9 osds: 9 up, 9 in
  pgmap v618: 1792 pgs, 12 pools, 4148 MB data, 518 kobjects
15106 MB used, 4257 GB / 4272 GB avail
1792 active+clean


Any assistance would be appreciated.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU/Delay when Removing Layered Child RBD Image

2014-12-18 Thread Tyler Wilson
Okay, this is rather unrelated to Ceph but I might as well mention how this
is fixed. When using the Juno-release OpenStack packages, the
'rbd_store_chunk_size = 8' setting now results in 8192-byte objects rather
than 8192 kB (8MB) ones, causing quite a few more objects to be stored and
deleted. Setting
this to 8192 got me the expected object size of 8MB.
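
For anyone else hitting this, the option lives in glance-api.conf; on Juno I
believe it sits in the [glance_store] section (double-check which unit your
release expects before copying this):

  [glance_store]
  default_store = rbd
  rbd_store_pool = images
  rbd_store_chunk_size = 8192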


On Thu, Dec 18, 2014 at 6:22 PM, Tyler Wilson k...@linuxdigital.net wrote:

 Hey All,

 On a new Cent7 deployment with firefly I'm noticing a strange behavior
 when deleting RBD child disks. It appears upon deletion cpu usage on each
 OSD process raises to about 75% for 30+ seconds. On my previous deployments
 with CentOS 6.x and Ubuntu 12/14 this was never a problem.

 Each RBD Disk is 4GB created with 'rbd clone
 images/136dd921-f6a2-432f-b4d6-e9902f71baa6@snap compute/test'

 ## Ubuntu12 3.11.0-18-generic with Ceph 0.80.7
 root@node-1:~# date; rbd rm compute/test123; date
 Fri Dec 19 01:09:31 UTC 2014
 Removing image: 100% complete...done.
 Fri Dec 19 01:09:31 UTC 2014

 ## Cent7 3.18.1-1.el7.elrepo.x86_64 with Ceph 0.80.7
 [root@hvm003 ~]# date; rbd rm compute/test; date
 Fri Dec 19 01:08:32 UTC 2014
 Removing image: 100% complete...done.
 Fri Dec 19 01:09:00 UTC 2014

 root@cpl001 ~]# ceph -s
 cluster d033718a-2cb9-409e-b968-34370bd67bd0
  health HEALTH_OK
  monmap e1: 3 mons at {cpl001=
 10.0.0.1:6789/0,mng001=10.0.0.3:6789/0,net001=10.0.0.2:6789/0}, election
 epoch 10, quorum 0,1,2 cpl001,net001,mng001
  osdmap e84: 9 osds: 9 up, 9 in
   pgmap v618: 1792 pgs, 12 pools, 4148 MB data, 518 kobjects
 15106 MB used, 4257 GB / 4272 GB avail
 1792 active+clean


 Any assistance would be appreciated.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Lindsay Mathieson
On 19 December 2014 at 11:14, Christian Balzer ch...@gol.com wrote:

 Hello,

 On Thu, 18 Dec 2014 16:12:09 -0800 Craig Lewis wrote:

 Firstly I'd like to confirm what Craig said about small clusters.
 I just changed my four storage node test cluster from 1 OSD per node to 4
 and it can now saturate a 1GbE link (110MB/s) where before it peaked at
 50-60MB/s.

What min/max sizes do you have set? Anything special in your crush map?

Did it improve your write speed and latency?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Christian Balzer
On Fri, 19 Dec 2014 12:28:48 +1000 Lindsay Mathieson wrote:

 On 19 December 2014 at 11:14, Christian Balzer ch...@gol.com wrote:
 
  Hello,
 
  On Thu, 18 Dec 2014 16:12:09 -0800 Craig Lewis wrote:
 
  Firstly I'd like to confirm what Craig said about small clusters.
  I just changed my four storage node test cluster from 1 OSD per node
  to 4 and it can now saturate a 1GbE link (110MB/s) where before it
  peaked at 50-60MB/s.
 
 What min//max sizes do you have set? Anything special in your crush map?
 
What specific values are you thinking about?
But no, nothing special, no tuning at all with that cluster.

The gain is simply from having more spindles to distribute the load
(remember rados bench runs 16 threads by default and I use 64) amongst.

 Did it improve your write speed and latency?
 
I was referring to write speed (bandwidth), for sequential reads a single
HDD can saturate a 1GbE link, let alone 4.

As for latency, somewhat. But this cluster isn't pure testing, no SSDs, so
it is slow no matter what. 

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New Cluster (0.87), Missing Default Pools?

2014-12-18 Thread Dyweni - Ceph-Users

Hi John,

Yes, no problem!  I have a few items that I noticed.  They are:


1.   The missing 'data' and 'metadata' pools

  http://ceph.com/docs/master/install/manual-deployment/

  Monitor Bootstrapping - Steps #17 & 18


2.   The setting 'mon initial members'

  On page 
'http://ceph.com/docs/master/rados/configuration/mon-config-ref/', 'mon 
initial members' are the IDs of the initial monitors in the cluster.


  On page 'http://ceph.com/docs/master/install/manual-deployment/'  
(Monitor Bootstrapping - Steps #6 & 14) it lists the members as the 
hostnames.




3.   Creating the default data directory for the monitors:

  On page 
'http://ceph.com/docs/master/rados/configuration/mon-config-ref/', 'mon 
data' defaults to '/var/lib/ceph/mon/$cluster-$id'.


  On page 'http://ceph.com/docs/master/install/manual-deployment/'  
(Monitor Bootstrapping - Step # 12) it uses the hostname instead.





4.   Populating the monitor daemons

  The man page for 'ceph-mon' shows that '-i' is the monitor ID.

  On page 'http://ceph.com/docs/master/install/manual-deployment/'  
(Monitor Bootstrapping - Step # 13) it uses the hostname instead.




Thanks,
Dyweni





On 2014-12-18 11:55, John Spray wrote:
Can you point out the specific page that's out of date so that we can 
update it?


Thanks,
John

On Thu, Dec 18, 2014 at 5:52 PM, Dyweni - Ceph-Users
6exbab4fy...@dyweni.com wrote:

Thanks!!

Looks like the manual installation instructions should be updated, 
to

eliminate future confusion.

Dyweni




On 2014-12-18 07:11, John Spray wrote:


No mistake -- the Ceph FS pools are no longer created by default, as
not everybody needs them.  Ceph FS users now create these pools
explicitly:
http://ceph.com/docs/master/cephfs/createfs/

John

On Thu, Dec 18, 2014 at 12:52 PM, Dyweni - Ceph-Users
6exbab4fy...@dyweni.com wrote:


Hi All,


Just setup the monitor for a new cluster based on Giant (0.87) and I find
that only the 'rbd' pool was created automatically.  I don't see the 'data'
or 'metadata' pools in 'ceph osd lspools' or the log files.  I haven't setup
any OSDs or MDSs yet.  I'm following the manual deployment guide.

Would you mind looking over the setup details/logs below and letting me know
my mistake please?



Here's my /etc/ceph/ceph.conf file:
---
[global]
fsid = xx

public network = xx.xx.xx.xx/xx
cluster network = xx.xx.xx.xx/xx

auth cluster required = cephx
auth service required = cephx
auth client required = cephx

osd pool default size = 2
osd pool default min size = 1

osd pool default pg num = 100
osd pool default pgp num = 100

[mon]
mon initial members = a

[mon.a]
host = xx
mon addr = xx.xx.xx.xx
---


Here's the commands used to setup the monitor:
---
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
monmaptool --create --add xx xx.xx.xx.xx --fsid xx /tmp/monmap
mkdir /var/lib/ceph/mon/ceph-a
ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
/etc/init.d/ceph-mon.a start
---


Here's the ceph-mon.a logfile:
---
2014-12-18 12:35:45.768752 7fb00df94780  0 ceph version 0.87
(c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-mon, pid 
3225

2014-12-18 12:35:45.856851 7fb00df94780  0 mon.a does not exist in
monmap,
will attempt to join an existing cluster
2014-12-18 12:35:45.857069 7fb00df94780  0 using public_addr
xx.xx.xx.xx:0/0
- xx.xx.xx.xx:6789/0
2014-12-18 12:35:45.857126 7fb00df94780  0 starting mon.a rank -1 at
xx.xx.xx.xx:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid xx
2014-12-18 12:35:45.857330 7fb00df94780  1 mon.a@-1(probing) e0 
preinit

fsid
xx
2014-12-18 12:35:45.857402 7fb00df94780  1 mon.a@-1(probing) e0
initial_members a, filtering seed monmap
2014-12-18 12:35:45.858322 7fb00df94780  0 mon.a@-1(probing) e0  my 
rank

is
now 0 (was -1)
2014-12-18 12:35:45.858360 7fb00df94780  1 mon.a@0(probing) e0
win_standalone_election
2014-12-18 12:35:45.859803 7fb00df94780  0 log_channel(cluster) log 
[INF]

:
mon.a@0 won leader election with quorum 0
2014-12-18 12:35:45.863846 7fb008d4b700  1
mon.a@0(leader).paxosservice(pgmap 0..0) refresh upgraded, format 1 
- 0

2014-12-18 12:35:45.863867 7fb008d4b700  1 mon.a@0(leader).pg v0
on_upgrade
discarding in-core PGMap
2014-12-18 12:35:45.865662 7fb008d4b700  1
mon.a@0(leader).paxosservice(auth
0..0) refresh upgraded, format 1 - 0
2014-12-18 12:35:45.865719 7fb008d4b700  1 mon.a@0(probing) e1
win_standalone_election
2014-12-18 12:35:45.867394 7fb008d4b700  0 log_channel(cluster) log 
[INF]

:
mon.a@0 won leader election with quorum 0

[ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
Hello Yall!

I can't figure out why my gateways are performing so poorly and I am not
sure where to start looking. My RBD mounts seem to be performing fine
(over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s
(32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing
with nuttcp shows that I can transfer from a client with 10G interface
to any node on the ceph cluster at the full 10G and ceph can transfer
close to 20G between itself. I am not really sure where to start looking;
outside of another issue, which I will mention below, I am clueless.

I have a weird setup
[osd nodes]
60 x 4TB 7200 RPM SATA Drives
12 x  400GB s3700 SSD drives
3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
the 3 cards)
512 GB of RAM
2 x CPU E5-2670 v2 @ 2.50GHz
2 x 10G interfaces  LACP bonded for cluster traffic
2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
ports)

[monitor nodes and gateway nodes]
4 x 300G 1500RPM SAS drives in raid 10
1 x SAS 2208
64G of RAM
2 x CPU E5-2630 v2
2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)


Here is a pastebin dump of my details, I am running ceph giant 0.87 
(c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
across the entire cluster.

http://pastebin.com/XQ7USGUz -- ceph health detail
http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
http://pastebin.com/BC3gzWhT -- ceph osd tree
http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)


We ran into a few issues with density (conntrack limits, pid limit, and
number of open files) all of which I adjusted by bumping the ulimits in
/etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
signs of these limits being hit so I have not included my limits or
sysctl conf. If you would like these as well, let me know and I can include them.
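
Generically, these were the knobs -- the values below are illustrative, not
our production settings:

  # /etc/security/limits.d/ceph.conf
  root    soft    nofile    65536
  root    hard    nofile    65536

  # sysctl
  kernel.pid_max = 4194303
  net.netfilter.nf_conntrack_max = 1048576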

One of the issues I am seeing is that OSDs have started to flop/ be
marked as slow. The cluster was HEALTH_OK with all of the disks added
for over 3 weeks before this behaviour started. RBD transfers seem to be
fine for the most part which makes me think that this has little baring
on the gateway issue but it may be related. Rebooting the OSD seems to
fix this issue.

I would like to figure out the root cause of both of these issues and
post the results back here if possible (perhaps it can help other
people). I am really looking for a place to start looking at as the
gateway just outputs that it is posting data and all of the logs
(outside of the monitors reporting down osds) seem to show a fully
functioning cluster.

Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as
well.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Gregory Farnum
What kind of uploads are you performing? How are you testing?
Have you looked at the admin sockets on any daemons yet? Examining the OSDs
to see if they're behaving differently on the different requests is one
angle of attack. The other is to look into whether the RGW daemons are hitting
throttler limits or something that the RBD clients aren't.
-Greg
On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan seapasu...@uchicago.edu
wrote:

 Hello Yall!

 I can't figure out why my gateways are performing so poorly and I am not
 sure where to start looking. My RBD mounts seem to be performing fine
 (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s
 (32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing
 with nuttcp shows that I can transfer from a client with 10G interface
 to any node on the ceph cluster at the full 10G and ceph can transfer
 close to 20G between itself. I am not really sure where to start looking
 as outside of another issue which I will mention below I am clueless.

 I have a weird setup
 [osd nodes]
 60 x 4TB 7200 RPM SATA Drives
 12 x  400GB s3700 SSD drives
 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
 the 3 cards)
 512 GB of RAM
 2 x CPU E5-2670 v2 @ 2.50GHz
 2 x 10G interfaces  LACP bonded for cluster traffic
 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
 ports)

 [monitor nodes and gateway nodes]
 4 x 300G 1500RPM SAS drives in raid 10
 1 x SAS 2208
 64G of RAM
 2 x CPU E5-2630 v2
 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)


 Here is a pastebin dump of my details, I am running ceph giant 0.87
 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
 across the entire cluster.

 http://pastebin.com/XQ7USGUz -- ceph health detail
 http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
 http://pastebin.com/BC3gzWhT -- ceph osd tree
 http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
 http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)


 We ran into a few issues with density (conntrack limits, pid limit, and
 number of open files) all of which I adjusted by bumping the ulimits in
 /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
 signs of these limits being hit so I have not included my limits or
 sysctl conf. If you like this as well let me know and I can include it.

 One of the issues I am seeing is that OSDs have started to flop/ be
 marked as slow. The cluster was HEALTH_OK with all of the disks added
 for over 3 weeks before this behaviour started. RBD transfers seem to be
 fine for the most part which makes me think that this has little baring
 on the gateway issue but it may be related. Rebooting the OSD seems to
 fix this issue.

 I would like to figure out the root cause of both of these issues and
 post the results back here if possible (perhaps it can help other
 people). I am really looking for a place to start looking at as the
 gateway just outputs that it is posting data and all of the logs
 (outside of the monitors reporting down osds) seem to show a fully
 functioning cluster.

 Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as
 well.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
Thanks for the reply Gregory,

Sorry if this is in the wrong direction or something. Maybe I do not
understand

To test uploads I use bash time with either python-swiftclient or
boto key.set_contents_from_filename to the radosgw. I was unaware that
radosgw had any type of throttle settings in the configuration (I can't
seem to find any either).  As for rbd mounts I test by creating a 1TB
mount and writing a file to it through time+cp or dd. Not the most
accurate test but I think should be good enough as a quick functionality
test. So for writes, it's more for functionality than performance. I
would think a basic functionality test should yield more than 8mb/s though.
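
To be concrete, the timed upload is essentially the following (endpoint and
credentials here are placeholders, not our real ones):

  dd if=/dev/zero of=/tmp/5g.bin bs=1M count=5120
  time swift -A http://rgw03.example.com/auth/v1.0 -U test:swift -K secret \
      upload bench /tmp/5g.bin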

As for checking admin sockets: I have, actually. I set the 3rd gateway's
debug_civetweb to 10 as well as debug_rgw to 5 but I still do not see
anything that stands out. The snippet of the log I pasted has these
values set. I did the same for an osd that is marked as slow (1112). All
I can see in the log for the osd are ticks and heartbeat responses
though, nothing that shows any issues. Finally I did it for the primary
monitor node to see if I would see anything there with debug_mon set to
5 (http://pastebin.com/hhnaFac1). I do not really see anything that
would stand out as a failure (like a fault or timeout error).

What kind of throttler limits do you mean? I didn't/don't see any
mention of rgw throttler limits in the ceph.com docs or admin socket,
just osd/filesystem throttles like inode/flusher limits. Do you mean
these? I have not messed with these limits yet on this cluster; do you
think it would help?
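
For reference, the kind of poking I mean (daemon and socket names below are
just examples):

  ceph daemon osd.1112 perf dump | grep -i -A3 throttle
  ceph daemon osd.1112 dump_historic_ops
  ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.rgw03.asok perf dump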

On 12/18/2014 10:24 PM, Gregory Farnum wrote:
 What kind of uploads are you performing? How are you testing?
 Have you looked at the admin sockets on any daemons yet? Examining the
 OSDs to see if they're behaving differently on the different requests
 is one angle of attack. The other is look into is if the RGW daemons
 are hitting throttler limits or something that the RBD clients aren't.
 -Greg
 On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan seapasu...@uchicago.edu
 mailto:seapasu...@uchicago.edu wrote:

 Hello Yall!

 I can't figure out why my gateways are performing so poorly and I
 am not
 sure where to start looking. My RBD mounts seem to be performing fine
 (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s
 (32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing
 with nuttcp shows that I can transfer from a client with 10G interface
 to any node on the ceph cluster at the full 10G and ceph can transfer
 close to 20G between itself. I am not really sure where to start
 looking
 as outside of another issue which I will mention below I am clueless.

 I have a weird setup
 [osd nodes]
 60 x 4TB 7200 RPM SATA Drives
 12 x  400GB s3700 SSD drives
 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly
 across
 the 3 cards)
 512 GB of RAM
 2 x CPU E5-2670 v2 @ 2.50GHz
 2 x 10G interfaces  LACP bonded for cluster traffic
 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
 ports)

 [monitor nodes and gateway nodes]
 4 x 300G 1500RPM SAS drives in raid 10
 1 x SAS 2208
 64G of RAM
 2 x CPU E5-2630 v2
 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G
 ports)


 Here is a pastebin dump of my details, I am running ceph giant 0.87
 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel
 3.13.0-40-generic
 across the entire cluster.

 http://pastebin.com/XQ7USGUz -- ceph health detail
 http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
 http://pastebin.com/BC3gzWhT -- ceph osd tree
 http://pastebin.com/eRyY4H4c --
 /var/log/radosgw/client.radosgw.rgw03.log
 http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't
 let me)


 We ran into a few issues with density (conntrack limits, pid
 limit, and
 number of open files) all of which I adjusted by bumping the
 ulimits in
 /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
 signs of these limits being hit so I have not included my limits or
 sysctl conf. If you like this as well let me know and I can
 include it.

 One of the issues I am seeing is that OSDs have started to flop/ be
 marked as slow. The cluster was HEALTH_OK with all of the disks added
 for over 3 weeks before this behaviour started. RBD transfers seem
 to be
 fine for the most part which makes me think that this has little
 baring
 on the gateway issue but it may be related. Rebooting the OSD seems to
 fix this issue.

 I would like to figure out the root cause of both of these issues and
 post the results back here if possible (perhaps it can help other
 people). I am really looking for a place to start looking at as the
 gateway just outputs that it is posting data and all of the 

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Christian Balzer

Hello,

Nice cluster, I wouldn't mind getting my hands on her ample nacelles, er,
wrong movie. ^o^

On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote:

 Hello Yall!
 
 I can't figure out why my gateways are performing so poorly and I am not
 sure where to start looking. My RBD mounts seem to be performing fine
 (over 300 MB/s) 

I wouldn't call 300MB/s writes fine with a cluster of this size. 
How are you testing this (which tool, settings, from where)?

 while uploading a 5G file to Swift/S3 takes 2m32s
 (32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing
 with nuttcp shows that I can transfer from a client with 10G interface
 to any node on the ceph cluster at the full 10G and ceph can transfer
 close to 20G between itself. I am not really sure where to start looking
 as outside of another issue which I will mention below I am clueless.
 
I know nuttin about radosgw, but I wouldn't be surprised that the
difference you see here is based on how that is eventually written to the
storage (smaller chunks than what you're using to test RBD performance).

 I have a weird setup
I'm always interested in monster storage nodes, care to share what case
this is?

 [osd nodes]
 60 x 4TB 7200 RPM SATA Drives
What maker/model?

 12 x  400GB s3700 SSD drives
Journals, one assumes. 

 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
 the 3 cards)
I smell a port-expander or 3 on your backplane. 
And while making sure that your SSDs get undivided 6Gb/s love would
probably help, you still have plenty of bandwidth here (4.5Gb/s per
drive), so no real issue.

 512 GB of RAM
Sufficient.

 2 x CPU E5-2670 v2 @ 2.50GHz
Vastly, and I mean VASTLY insufficient.
It would still be 10GHz short of the (optimistic IMHO) recommendation of
1GHz per OSD w/o SSD journals. 
With SSD journals my experience shows that with certain write patterns
even 3.5GHz per OSD isn't sufficient. (there are several threads
about this here)

 2 x 10G interfaces  LACP bonded for cluster traffic
 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
 ports)
 
Your journals could handle 5.5GB/s, so you're limiting yourself here a
bit, but not too horribly.

If I had been given this hardware, I would have RAIDed things (different
controller) to keep the number of OSDs per node to something the CPUs (any
CPU really!) can handle. 
Something like 16 x 4HDD RAID10 + SSDs +spares (if possible) for
performance and  8 x 8HDD RAID6 + SSDs +spares for capacity.
That still gives you 336 or 168 OSDs, allows for a replication size of 2
and as bonus you'll probably never have to deal with a failed OSD. ^o^

 [monitor nodes and gateway nodes]
 4 x 300G 1500RPM SAS drives in raid 10
I would have used Intel DC S3700s here as well, mons love their leveldb to
be fast but
 1 x SAS 2208
combined with this it should be fine.

 64G of RAM
 2 x CPU E5-2630 v2
 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)
 
 
 Here is a pastebin dump of my details, I am running ceph giant 0.87 
 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
 across the entire cluster.
 
 http://pastebin.com/XQ7USGUz -- ceph health detail
That looks positively scary, blocked requests for hours...

 http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
 http://pastebin.com/BC3gzWhT -- ceph osd tree
scroll, scroll, woah! ^o^

 http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
 http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)
 
 
 We ran into a few issues with density (conntrack limits, pid limit, and
 number of open files) all of which I adjusted by bumping the ulimits in
 /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
 signs of these limits being hit so I have not included my limits or
 sysctl conf. If you like this as well let me know and I can include it.
 
 One of the issues I am seeing is that OSDs have started to flop/ be
 marked as slow. The cluster was HEALTH_OK with all of the disks added
 for over 3 weeks before this behaviour started. 
Anything changed? 
In particular I assume this is a new cluster; has much data been added?
A ceph -s output would be nice and educational.

Can you correlate the time when you start seeing slow, blocked requests
with scrubs or deep-scrubs? If so try setting your cluster temporarily to
noscrub and nodeep-scrub and see if that helps.  In case it does, setting  
osd_scrub_sleep (start with something high like 1.0 or 0.5 and lower until it 
hurts again) should help permanently.
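
For example (a sketch; tune the sleep value to your cluster):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # if the slow requests disappear, re-enable scrubbing and throttle it instead:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub
  ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'   # plus osd_scrub_sleep = 0.5 in the [osd] section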

I have a cluster that could scrub things in minutes until the amount of
objects/data and steady load reached a threshold and now its hours.

In this context, check the fragmentation of your OSDs.

How busy (ceph.log ops/s) is your cluster at these times?

 RBD transfers seem to be
 fine for the most part which makes me think that this has little baring
 on the gateway issue but it may be related. Rebooting the OSD seems to
 fix this 

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan

thanks!
It would be really great in the right hands. Through some stroke of luck 
it's in mine. The flapping osd is becoming a real issue at this point as it 
is the only possible lead I have to why the gateways are transferring so 
slowly. The weird issue is that I can have 8 or 60 transfers going to the 
radosgw and they are all at roughly 8mbps. To work around this right now I 
am starting 60+ clients across 10 boxes to get roughly 1gbps per gateway 
across gw1 and gw2.


I have been staring at logs for hours trying to get a handle on what the 
issue may be with no luck.


The third gateway was made last minute to test and rule out the hardware.


On December 18, 2014 10:57:41 PM Christian Balzer ch...@gol.com wrote:



Hello,

Nice cluster, I wouldn't mind getting my hand or her ample nacelles, er,
wrong movie. ^o^

On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote:

 Hello Yall!

 I can't figure out why my gateways are performing so poorly and I am not
 sure where to start looking. My RBD mounts seem to be performing fine
 (over 300 MB/s)

I wouldn't call 300MB/s writes fine with a cluster of this size.
How are you testing this (which tool, settings, from where)?

 while uploading a 5G file to Swift/S3 takes 2m32s
 (32MBps i believe). If we try a 1G file it's closer to 8MBps. Testing
 with nuttcp shows that I can transfer from a client with 10G interface
 to any node on the ceph cluster at the full 10G and ceph can transfer
 close to 20G between itself. I am not really sure where to start looking
 as outside of another issue which I will mention below I am clueless.

I know nuttin about radosgw, but I wouldn't be surprised that the
difference you see here is based how that is eventually written to the
storage (smaller chunks than what you're using to test RBD performance).

 I have a weird setup
I'm always interested in monster storage nodes, care to share what case
this is?

 [osd nodes]
 60 x 4TB 7200 RPM SATA Drives
What maker/model?

 12 x  400GB s3700 SSD drives
Journals, one assumes.

 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
 the 3 cards)
I smell a port-expander or 3 on your backplane.
And while making sure that your SSDs get undivided 6Gb/s love would
probably help, you still have plenty of bandwidth here (4.5Gb/s per
drive), so no real issue.

 512 GB of RAM
Sufficient.

 2 x CPU E5-2670 v2 @ 2.50GHz
Vastly, and I mean VASTLY insufficient.
It would still be 10GHz short of the (optimistic IMHO) recommendation of
1GHz per OSD w/o SSD journals.
With SSD journals my experience shows that with certain write patterns
even 3.5GHz per OSD isn't sufficient. (there are several threads
about this here)

 2 x 10G interfaces  LACP bonded for cluster traffic
 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
 ports)

Your journals could handle 5.5GB/s, so you're limiting yourself here a
bit, but not too horribly.

If I had been given this hardware, I would have RAIDed things (different
controller) to keep the number of OSDs per node to something the CPUs (any
CPU really!) can handle.
Something like 16 x 4HDD RAID10 + SSDs +spares (if possible) for
performance and  8 x 8HDD RAID6 + SSDs +spares for capacity.
That still gives you 336 or 168 OSDs, allows for a replication size of 2
and as bonus you'll probably never have to deal with a failed OSD. ^o^

 [monitor nodes and gateway nodes]
 4 x 300G 15000RPM SAS drives in RAID 10
I would have used Intel DC S3700s here as well, mons love their leveldb to
be fast but
 1 x SAS 2208
combined with this it should be fine.

 64G of RAM
 2 x CPU E5-2630 v2
 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)


 Here is a pastebin dump of my details, I am running ceph giant 0.87
 (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
 across the entire cluster.

 http://pastebin.com/XQ7USGUz -- ceph health detail
That looks positively scary, blocked requests for hours...

 http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
 http://pastebin.com/BC3gzWhT -- ceph osd tree
scroll, scroll, woah! ^o^

 http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
 http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)


 We ran into a few issues with density (conntrack limits, pid limit, and
 number of open files) all of which I adjusted by bumping the ulimits in
 /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
 signs of these limits being hit so I have not included my limits or
 sysctl conf. If you like this as well let me know and I can include it.
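
For anyone following along, those bumps usually look something like the
following; the values are purely illustrative, not the OP's actual config:

  # /etc/sysctl.d/90-ceph.conf
  kernel.pid_max = 4194303
  fs.file-max = 26234859
  net.netfilter.nf_conntrack_max = 1048576

  # /etc/security/limits.d/ceph.conf
  root  soft  nofile  131072
  root  hard  nofile  131072
  *     soft  nofile  131072
  *     hard  nofile  131072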

 One of the issues I am seeing is that OSDs have started to flop/ be
 marked as slow. The cluster was HEALTH_OK with all of the disks added
 for over 3 weeks before this behaviour started.
Anything changed?
In particular, I assume this is a new cluster; has much data been added?
A ceph -s output would be nice and educational.

Can you correlate the time when you start seeing slow, 

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan

Wow Christian,

Sorry I missed these inline replies. Give me a minute to gather some data.
Thanks a million for the in-depth responses!


I thought about RAIDing it, but I needed the space unfortunately. I had a
3 x 60-OSD node test cluster that we tried before this, and it didn't have
this flopping issue or the RGW issue I am seeing.


I can quickly answer the case/make questions; the model will need to wait
till I get home :)


The case is a 72-disk Supermicro chassis; I'll grab the exact model in my
next reply.


The drives are HGST 4TB drives; I'll grab the model once I get home as well.

The 300 MB/s figure wasn't accurate and it can push more; it was just meant
for a quick comparison, but I agree it should be higher.


Thank you so much. Please hold on and I'll grab the extra info ^~^


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Christian Balzer

Hello,

On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote:

 Wow Christian,
 
 Sorry I missed these inline replies. Give me a minute to gather some
 data. Thanks a million for the in-depth responses!
 
No worries.

 I thought about RAIDing it, but I needed the space unfortunately. I had a
 3 x 60-OSD node test cluster that we tried before this, and it didn't have
 this flopping issue or the RGW issue I am seeing.

I think I remember that...

You do realize that the RAID6 configuration option I mentioned would
actually give you MORE space (replication of 2 is sufficient with reliable
OSDs) than what you have now? 
Albeit probably at reduced performance; how much would also depend on the
controllers used, but at worst the RAID6 OSD performance would be
equivalent to that of a single disk.
So, performance-wise, a cluster with 21 nodes and 8 disks each.
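
Rough numbers, assuming your current pools use 3x replication (you did not
say, so correct me if that is wrong):

  now:    3x replication on raw disks              -> 1/3     = ~33% of raw usable
  RAID6:  8-disk RAID6 (6 data) + 2x replication   -> 6/8 / 2 = 37.5% of raw usable

So the same 60 x 4TB per node comes out slightly ahead on usable space,
even with a few drives set aside as spares.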
 
 I can quickly answer the case/make questions; the model will need to
 wait till I get home :)
 
 Case is a 72-disk Supermicro chassis, I'll grab the exact model in my
 next reply.

No need, now that strange monitor configuration makes sense, you (or
whoever spec'ed this) went for the Supermicro Ceph solution, right?

In my not-so-humble opinion, this is the worst storage chassis ever
designed, by a long shot, and totally unsuitable for Ceph.
I told the Supermicro GM for Japan as much. ^o^

Every time a HDD dies, you will have to go and shut down the other OSD
that resides on the same tray (and set the cluster to noout).
Even worse, of course, if an SSD should fail.
And if somebody should just go and hotswap things w/o that step first,
hello data movement storm (2 or 10 OSDs instead of 1 or 5 respectively).
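
A safe disk swap on that chassis therefore looks roughly like this (a
sketch; the OSD ids are made up, and I'm assuming Ubuntu/upstart as on your
nodes):

  ceph osd set noout            # keep the cluster from rebalancing
  stop ceph-osd id=41           # the failed OSD, if it is still running
  stop ceph-osd id=42           # the healthy OSD sharing the same tray
  # pull the tray, swap the dead disk, reinsert
  start ceph-osd id=42
  # ... then remove/recreate the replacement OSD as usual ...
  ceph osd unset noout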

Christian
 
 The drives are HGST 4TB drives; I'll grab the model once I get home as well.
 
 The 300 MB/s figure wasn't accurate and it can push more; it was just meant
 for a quick comparison, but I agree it should be higher.
 
 Thank you so much. Please hold on and I'll grab the extra info ^~^
 
 
 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recovering from PG in down+incomplete state

2014-12-18 Thread Mallikarjun Biradar
Hi all,

I had 12 OSDs in my cluster across 2 OSD nodes. One of the OSDs was in the
down state; I removed that OSD from the cluster by removing the crush rule
for that OSD.

Now, with 11 OSDs, the cluster started rebalancing. After some time, the
cluster status was:

ems@rack6-client-5:~$ sudo ceph -s
cluster eb5452f4-5ce9-4b97-9bfd-2a34716855f1
 health HEALTH_WARN 1 pgs down; 252 pgs incomplete; 10 pgs peering; 73
pgs stale; 262 pgs stuck inactive; 73 pgs stuck stale; 262 pgs stuck
unclean; clock skew detected on mon.rack6-client-5, mon.rack6-client-6
 monmap e1: 3 mons at {rack6-client-4=
10.242.43.105:6789/0,rack6-client-5=10.242.43.106:6789/0,rack6-client-6=10.242.43.107:6789/0},
election epoch 12, quorum 0,1,2 rack6-client-4,rack6-client-5,rack6-client-6
 osdmap e2648: 11 osds: 11 up, 11 in
  pgmap v554251: 846 pgs, 3 pools, 4383 GB data, 1095 kobjects
11668 GB used, 26048 GB / 37717 GB avail
  63 stale+active+clean
   1 down+incomplete
 521 active+clean
 251 incomplete
  10 stale+peering
ems@rack6-client-5:~$


To fix this, I can't run ceph osd lost <osd.id> to clear the PG which is
in the down state, as the OSD has already been removed from the cluster.

ems@rack6-client-4:~$ sudo ceph pg dump all | grep down
dumped all in format plain
1.38    15480   0   0   0   6492782592  3001    3001
    down+incomplete 2014-12-18 15:58:29.681708  1118'508438  2648:1073892
    [6,3,1] 6   [6,3,1] 6   76'437184   2014-12-16 12:38:35.322835
    76'437184   2014-12-16 12:38:35.322835
ems@rack6-client-4:~$

ems@rack6-client-4:~$ sudo ceph pg 1.38 query
.
recovery_state: [
{ name: Started\/Primary\/Peering,
  enter_time: 2014-12-18 15:58:29.681666,
  past_intervals: [
{ first: 1109,
  last: 1118,
  maybe_went_rw: 1,
...
...
down_osds_we_would_probe: [
7],
  peering_blocked_by: []},
...
...

ems@rack6-client-4:~$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      36.85   root default
-2      20.1            host rack2-storage-1
0       3.35                    osd.0   up      1
1       3.35                    osd.1   up      1
2       3.35                    osd.2   up      1
3       3.35                    osd.3   up      1
4       3.35                    osd.4   up      1
5       3.35                    osd.5   up      1
-3      16.75           host rack2-storage-5
6       3.35                    osd.6   up      1
8       3.35                    osd.8   up      1
9       3.35                    osd.9   up      1
10      3.35                    osd.10  up      1
11      3.35                    osd.11  up      1
ems@rack6-client-4:~$ sudo ceph osd lost 7 --yes-i-really-mean-it
osd.7 is not down or doesn't exist
ems@rack6-client-4:~$
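
For reference, the workaround I have seen suggested for situations like this
(untested here, and force_create_pg throws away whatever data only lived on
osd.7) is to recreate the missing OSD id so that it can be marked lost:

  ceph osd create                     # should hand back the lowest free id, i.e. 7
  ceph osd lost 7 --yes-i-really-mean-it
  ceph pg 1.38 query                  # check whether peering is now unblocked
  # only if the PG stays incomplete and its data is expendable:
  # ceph pg force_create_pg 1.38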


Can somebody suggest any other recovery step to come out of this?

-Thanks & Regards,
Mallikarjun Biradar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have 2 different public networks

2014-12-18 Thread Francois Lafont
On 19/12/2014 02:18, Craig Lewis wrote:
 The daemons bind to *, 

Yes but *only* for the OSD daemon. Am I wrong?

Personally I must provide IP addresses for the monitors
in the /etc/ceph/ceph.conf, like this:

[global]
mon host = 10.0.1.1, 10.0.1.2, 10.0.1.3

Or like this:

[mon.1]
mon addr = 10.0.1.1
[mon.2]
mon addr = 10.0.1.2
[mon.3]
mon addr = 10.0.1.3

And every time, the monitor daemons bind to just one address. And if a
ceph client wants to contact the cluster, it must contact the monitors.
Here is my problem: the monitors listen only on the 10.0.1.0/24 network,
not on 10.0.2.0/24.

Do you have monitor daemons that bind to * ? Personally
I don't (always just one interface).

Is it possible to provide 2 IP addresses for monitors in
the /etc/ceph/ceph.conf file?

 so adding the 3rd interface to the machine will
 allow you to talk to the daemons on that IP.

The 3rd interface has existed since the beginning (before the creation of
the cluster), but the monitors bind to only one interface.
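
For what it's worth, this is easy to verify on a monitor node; something
like the following (a sketch) only ever shows the 10.0.1.x address here:

  ceph mon dump               # the single address each mon has in the monmap
  ss -lntp | grep ceph-mon    # the one socket the daemon is listening on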

 I'm not really sure how you'd set up the management network though. I'd
 start by setting the public network in ceph.conf on the management nodes
 to 10.0.2.0/24, and an /etc/hosts file with the monitors' names on the
 10.0.2.0/24 network.
 
 Make sure the management nodes can't route to the 10.0.1.0/24 network, and
 see what happens.

For now, I can't have monitors that bind to 10.0.1.[123] *and* 10.0.2.[123].

 Do you really plan on having enough traffic creating and deleting RBD
 images that you need a dedicated network?  It seems like setting up link
 aggregation on 10.0.1.0/24 would be simpler and less error prone.

This is not about traffic. I must have a node to manage rbd images, and
this node is in a different VLAN (this is an OpenStack install... I'm
trying... ;).

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com