[ceph-users] TRIM / DISCARD run at low priority by the OSDs?

2015-08-21 Thread Chad William Seys
Hi All,

Is it possible to give TRIM / DISCARD initiated by krbd low priority on the 
OSDs?

I know it is possible to run fstrim at Idle priority on the rbd mount point, 
e.g. ionice -c Idle fstrim -v $MOUNT .  

But this Idle priority (it appears) only applies within the context of the node 
executing fstrim.  Even when the node executing fstrim is idle, the OSDs are 
still very busy and performance suffers.

Is it possible to tell the OSD daemons (or whatever) to perform the TRIMs at 
low priority also?
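
(Short of an OSD-side control, one client-side way to soften the impact is to
spread the trim out over time by trimming the filesystem in chunks and pausing
between them.  A rough sketch, where the mount point, the 16 GiB chunk size and
the 30 second pause are only example values:

#!/bin/bash
# Sketch only: trim an rbd-backed filesystem in chunks at idle I/O priority,
# pausing between chunks so the OSDs get room to service normal client I/O.
MOUNT=/mnt/rbd
CHUNK=$((16 * 1024 * 1024 * 1024))                                  # bytes trimmed per fstrim call
FS_SIZE=$(df -B1 --output=size "$MOUNT" | tail -n1 | tr -dc '0-9')  # filesystem size in bytes

offset=0
while [ "$offset" -lt "$FS_SIZE" ]; do
    ionice -c 3 fstrim -v -o "$offset" -l "$CHUNK" "$MOUNT"
    offset=$((offset + CHUNK))
    sleep 30                                                        # let client I/O catch up between chunks
done
)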

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw hanging - blocking rgw.bucket_list ops

2015-08-21 Thread Sam Wouters
tried removing, but no luck:

rados -p .be-east.rgw.buckets rm
be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity
error removing .be-east.rgw.buckets>be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity: (2) No such file or directory

anyone?
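
(For anyone else debugging this: the bucket index object itself can also be
inspected directly with rados.  The index pool and .dir.<bucket-id> names below
are only guesses pieced together from the names in this thread:

radosgw-admin bucket stats --bucket=<bucketname> | grep '"id"'                      # find the bucket id / marker
rados -p .be-east.rgw.buckets.index listomapkeys .dir.be-east.5436.1 | grep sanity  # look for the suspect index entries
)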

On 21-08-15 13:06, Sam Wouters wrote:
 I suspect these to be the cause:

 rados ls -p .be-east.rgw.buckets | grep sanity
 be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity
 be-east.5436.1__sanity
 be-east.5436.1__:2vBijaGnVQF4Q0IjZPeyZSKeUmBGn9X__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:4JTCVFxB1qoDWPu1nhuMDuZ3QNPaq5n__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:9jFwd8xvqJMdrqZuM8Au4mi9M62ikyo__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:BlfbGYGvLi92QPSiabT2mP7OeuETz0P__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:MigpcpJKkan7Po6vBsQsSD.hEIRWuim__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:QDTxD5p0AmVlPW4v8OPU3vtDLzenj4y__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:S43EiNAk5hOkzgfbOynbOZOuLtUv0SB__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:UKlOVMQBQnlK20BHJPyvnG6m.2ogBRW__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:kkb6muzJgREie6XftdEJdFHxR2MaFeB__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:oqPhWzFDSQ-sNPtppsl1tPjoryaHNZY__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:pLhygPGKf3uw7C7OxSJNCw8rQEMOw5l__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:tO1Nf3S2WOfmcnKVPv0tMeXbwa5JR36__sanity   
 be-east.5436.1__sanity
 be-east.5436.1__:ye4oRwDDh1cGckbMbIo56nQvM7OEyPM__sanity   
 be-east.5436.1__sanity
 be-east.5436.1___sanitybe-east.5436.1__sanity

 would it be safe and/or helpful to remove those with rados rm, and try a
 bucket check --fix --check-objects?

 On 21-08-15 11:28, Sam Wouters wrote:
 Hi,

 We are running hammer 0.94.2 and have an increasing amount of
 "heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f38c77e6700' had
 timed out after 600" messages in our radosgw logs, with radosgw
 eventually stalling. A restart of the radosgw helps for a few minutes,
 but after that it hangs again.

 ceph daemon /var/run/ceph/ceph-client.*.asok objecter_requests shows
 "call rgw.bucket_list" ops. No new bucket lists are requested, so those
 ops seem to stay there. Does anyone have any idea how to get rid of them?
 Restarting the affected osd didn't help either.

 I'm not sure if it's related, but we have an object called _sanity in
 the bucket the listing was performed on. I know there is some bug
 with objects whose names start with _.

 Any help would be much appreciated.

 r,
 Sam
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw hanging - blocking rgw.bucket_list ops

2015-08-21 Thread Sam Wouters
I suspect these to be the cause:

rados ls -p .be-east.rgw.buckets | grep sanity
be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity
be-east.5436.1__sanity
be-east.5436.1__:2vBijaGnVQF4Q0IjZPeyZSKeUmBGn9X__sanity   
be-east.5436.1__sanity
be-east.5436.1__:4JTCVFxB1qoDWPu1nhuMDuZ3QNPaq5n__sanity   
be-east.5436.1__sanity
be-east.5436.1__:9jFwd8xvqJMdrqZuM8Au4mi9M62ikyo__sanity   
be-east.5436.1__sanity
be-east.5436.1__:BlfbGYGvLi92QPSiabT2mP7OeuETz0P__sanity   
be-east.5436.1__sanity
be-east.5436.1__:MigpcpJKkan7Po6vBsQsSD.hEIRWuim__sanity   
be-east.5436.1__sanity
be-east.5436.1__:QDTxD5p0AmVlPW4v8OPU3vtDLzenj4y__sanity   
be-east.5436.1__sanity
be-east.5436.1__:S43EiNAk5hOkzgfbOynbOZOuLtUv0SB__sanity   
be-east.5436.1__sanity
be-east.5436.1__:UKlOVMQBQnlK20BHJPyvnG6m.2ogBRW__sanity   
be-east.5436.1__sanity
be-east.5436.1__:kkb6muzJgREie6XftdEJdFHxR2MaFeB__sanity   
be-east.5436.1__sanity
be-east.5436.1__:oqPhWzFDSQ-sNPtppsl1tPjoryaHNZY__sanity   
be-east.5436.1__sanity
be-east.5436.1__:pLhygPGKf3uw7C7OxSJNCw8rQEMOw5l__sanity   
be-east.5436.1__sanity
be-east.5436.1__:tO1Nf3S2WOfmcnKVPv0tMeXbwa5JR36__sanity   
be-east.5436.1__sanity
be-east.5436.1__:ye4oRwDDh1cGckbMbIo56nQvM7OEyPM__sanity   
be-east.5436.1__sanity
be-east.5436.1___sanitybe-east.5436.1__sanity

would it be safe and/or helpful to remove those with rados rm, and try a
bucket check --fix --check-objects?
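
(For reference, the check referred to here is, as far as I understand it, the
radosgw-admin one; the bucket name is a placeholder:

radosgw-admin bucket check --bucket=<bucketname>                        # report inconsistencies only
radosgw-admin bucket check --bucket=<bucketname> --fix --check-objects  # also rebuild the index / fix object stats
)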

On 21-08-15 11:28, Sam Wouters wrote:
 Hi,

 We are running hammer 0.94.2 and have an increasing amount of
 "heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f38c77e6700' had
 timed out after 600" messages in our radosgw logs, with radosgw
 eventually stalling. A restart of the radosgw helps for a few minutes,
 but after that it hangs again.

 ceph daemon /var/run/ceph/ceph-client.*.asok objecter_requests shows
 "call rgw.bucket_list" ops. No new bucket lists are requested, so those
 ops seem to stay there. Does anyone have any idea how to get rid of them?
 Restarting the affected osd didn't help either.

 I'm not sure if it's related, but we have an object called _sanity in
 the bucket the listing was performed on. I know there is some bug
 with objects whose names start with _.

 Any help would be much appreciated.

 r,
 Sam
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Testing CephFS

2015-08-21 Thread Gregory Farnum
On Thu, Aug 20, 2015 at 11:07 AM, Simon  Hallam s...@pml.ac.uk wrote:
 Hey all,



 We are currently testing CephFS on a small (3 node) cluster.



 The setup is currently:



 Each server has 12 OSDs, 1 Monitor and 1 MDS running on it:

 The servers are running: 0.94.2-0.el7

 The clients are running: Ceph: 0.80.10-1.fc21, Kernel: 4.0.6-200.fc21.x86_64



 ceph -s

 cluster 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd

  health HEALTH_OK

  monmap e1: 3 mons at
 {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}

 election epoch 20, quorum 0,1,2 ceph1,ceph2,ceph3

  mdsmap e12: 1/1/1 up {0=ceph3=up:active}, 2 up:standby

  osdmap e389: 36 osds: 36 up, 36 in

   pgmap v19370: 8256 pgs, 3 pools, 51217 MB data, 14035 objects

 95526 MB used, 196 TB / 196 TB avail

 8256 active+clean



 Our Ceph.conf is relatively simple at the moment:



 cat /etc/ceph/ceph.conf

 [global]

 fsid = 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd

 mon_initial_members = ceph1, ceph2, ceph3

 mon_host = 10.15.0.1,10.15.0.2,10.15.0.3

 mon_pg_warn_max_per_osd = 1000

 auth_cluster_required = cephx

 auth_service_required = cephx

 auth_client_required = cephx

 filestore_xattr_use_omap = true

 osd_pool_default_size = 2



 When I pulled the plug on the master MDS last time (ceph1), it stopped all
 IO until I plugged it back in. I was under the assumption that the MDS would
 fail over to one of the other 2 MDSs and IO would continue?



 Is there something I need to do to allow the MDSs to fail over to each
 other without too much interruption? Or is this because of the clients' ceph
 version?

That's quite strange. How long did you wait for it to fail over? Did
the output of ceph -s (or ceph -w, whichever) change during that
time?
By default the monitors should have detected the MDS was dead after 30
seconds and put one of the other MDS nodes into replay and active.

...I wonder if this is because you lost a monitor at the same time as
the MDS. What kind of logging do you have available from during your
test?
-Greg
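
(A hedged suggestion of what to capture on the next pull-the-plug test; the
commands below are generic and the log path is just an example:

ceph -w > /tmp/ceph-w.log &    # record cluster state changes for the duration of the test
ceph mds stat                  # note which MDS is active before the failure
# ... pull the plug on the active MDS host ...
ceph mds stat                  # a standby should move through replay to active
ceph quorum_status             # confirm the surviving mons still form a quorum
)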




 Cheers,



 Simon Hallam

 Linux Support & Development Officer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados: Undefined symbol error

2015-08-21 Thread Jason Dillaman
It sounds like you have the rados CLI tool from an earlier Ceph release (< Hammer) 
installed and it is attempting to use the librados shared library from a newer 
(>= Hammer) version of Ceph.
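
(A quick way to confirm that kind of mismatch, assuming a Linux install; paths may differ:

rados --version                        # version of the CLI binary being run
ldd "$(which rados)" | grep librados   # which librados.so it actually loads
ceph --version                         # version of the rest of the installation
)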

Jason 


- Original Message - 

 From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
 To: ceph-us...@ceph.com
 Sent: Thursday, August 20, 2015 11:47:26 PM
 Subject: [ceph-users] Rados: Undefined symbol error

 Hello,

 I cloned the master branch of Ceph and after setting up the cluster, when I
 tried to use the rados commands, I got this error:

 rados: symbol lookup error: rados: undefined symbol:
 _ZN5MutexC1ERKSsbbbP11CephContext

 I saw a similar post here: http://tracker.ceph.com/issues/12563 but I am not
 clear on the solution for this problem. I am not performing an upgrade here
 but the error seems to be similar. Could anybody shed more light on the
 issue and how to solve it? Thanks a lot!

 Aakanksha

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad performances in recovery

2015-08-21 Thread J-P Methot
Hi,

First of all, we are sure that the return to the default configuration
fixed it. As soon as we restarted only one of the ceph nodes with the
default configuration, it sped up recovery tremendously. We had already
restarted before with the old conf and recovery was never that fast.

Regarding the configuration, here's the old one with comments :

[global]
fsid = *
mon_initial_members = cephmon1
mon_host = ***
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true         // lets you use xattributes of xfs/ext4/btrfs filesystems
osd_pool_default_pgp_num = 450          // default pgp number for new pools
osd_pg_bits = 12                        // number of bits used to designate pgps. Lets you have 2^12 pgps
osd_pool_default_size = 3               // default copy number for new pools
osd_pool_default_pg_num = 450           // default pg number for new pools
public_network = *
cluster_network = ***
osd_pgp_bits = 12                       // number of bits used to designate pgps. Lets you have 2^12 pgps

[osd]
filestore_queue_max_ops = 5000          // set to 500 by default. Defines the maximum number of in-progress
                                        //   operations the file store accepts before blocking on queuing new operations
filestore_fd_cache_random = true        //
journal_queue_max_ops = 100             // set to 500 by default. Number of operations allowed in the journal queue
filestore_omap_header_cache_size = 100  // determines the size of the LRU used to cache object omap headers.
                                        //   Larger values use more memory but may reduce lookups on omap
filestore_fd_cache_size = 100           // not in the ceph documentation. Seems to be a common tweak for SSD clusters though
max_open_files = 100                    // lets ceph set the max file descriptor in the OS to prevent running out of file descriptors
osd_journal_size = 1                    // journal max size for each OSD

New conf:

[global]
fsid = *
mon_initial_members = cephmon1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = **
cluster_network = **

You might notice, I have a few undocumented settings in the old
configuration. These are settings I took from a certain openstack summit
presentation and they may have contributed to this whole problem. Here's
a list of settings that I think might be a possible cause for these
speed issues:

filestore_fd_cache_random = true
filestore_fd_cache_size = 100

Additionally, my colleague thinks these settings may have contributed :

filestore_queue_max_ops = 5000
journal_queue_max_ops = 100

We will do further tests on these settings once we have our lab ceph
test environment as we are also curious as to exactly what caused this.
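
(One way to double-check what a running OSD has actually loaded for the suspect
options is the admin socket; osd.0 below is just an example id, run on that OSD's host:

for opt in filestore_fd_cache_random filestore_fd_cache_size \
           filestore_queue_max_ops journal_queue_max_ops; do
    ceph daemon osd.0 config get "$opt"
done
)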


On 2015-08-20 11:43 AM, Alex Gorbachev wrote:

 Just to update the mailing list, we ended up going back to default
 ceph.conf without any additional settings than what is mandatory. We are
 now reaching speeds we never reached before, both in recovery and in
 regular usage. There was definitely something we set in the ceph.conf
 bogging everything down.
 
 Could you please share the old and new ceph.conf, or the section that
 was removed?
 
 Best regards,
 Alex
 


 On 2015-08-20 4:06 AM, Christian Balzer wrote:

 Hello,

 from all the pertinent points by Somnath, the one about pre-conditioning
 would be pretty high on my list, especially if this slowness persists and
 nothing else (scrub) is going on.

 This might be fixed by doing a fstrim.

 Additionally the levelDB's per OSD are of course sync'ing heavily during
 reconstruction, so that might not be the favorite thing for your type of
 SSDs.

 But ultimately situational awareness is very important, as in what is
 actually going and slowing things down.
 As usual my recommendations would be to use atop, iostat or similar on all
 your nodes and see if your OSD SSDs are indeed the bottleneck or if it is
 maybe just one of them or something else entirely.

 Christian

 On Wed, 19 Aug 2015 20:54:11 + Somnath Roy wrote:

 Also, check if scrubbing started in the cluster or not. That may
 considerably slow down the cluster.

 -Original Message-
 From: Somnath Roy
 Sent: Wednesday, August 19, 2015 1:35 PM
 To: 'J-P Methot'; ceph-us...@ceph.com
 Subject: RE: [ceph-users] Bad performances in recovery

 All the writes will go through the journal.
It may happen that your SSDs are not preconditioned well, and after a lot of
writes during recovery IOs stabilize at a lower number. This is quite
common for SSDs if that is the 

Re: [ceph-users] Bad performances in recovery

2015-08-21 Thread Shinobu Kinjo
 filestore_fd_cache_random = true

not true

Shinobu

On Fri, Aug 21, 2015 at 10:20 PM, Jan Schermer j...@schermer.cz wrote:

 Thanks for the config,
 few comments inline:, not really related to the issue

  On 21 Aug 2015, at 15:12, J-P Methot jpmet...@gtcomm.net wrote:
 
  Hi,
 
  First of all, we are sure that the return to the default configuration
  fixed it. As soon as we restarted only one of the ceph nodes with the
  default configuration, it sped up recovery tremendously. We had already
  restarted before with the old conf and recovery was never that fast.
 
  Regarding the configuration, here's the old one with comments :
 
  [global]
  fsid = *
  mon_initial_members = cephmon1
  mon_host = ***
  auth_cluster_required = cephx
  auth_service_required = cephx
  auth_client_required = cephx
  filestore_xattr_use_omap = true   //
   Let's you use xattributes of xfs/ext4/btrfs filesystems

 This actually did the opposite, but this option doesn't exist anymore

  osd_pool_default_pgp_num = 450   //
  default pgp number for new pools
  osd_pg_bits = 12  //
  number of bits used to designate pgps. Lets you have 2^12 pgps

 Could someone comment on those? What exactly does it do? What if I have
 more PGs than num_osds*osd_pg_bits?

  osd_pool_default_size = 3   //
  default copy number for new pools
  osd_pool_default_pg_num = 450//
  default pg number for new pools
  public_network = *
  cluster_network = ***
  osd_pgp_bits = 12   //
  number of bits used to designate pgps. Let's you have 2^12 pgps
 
  [osd]
  filestore_queue_max_ops = 5000// set to 500 by default Defines the
  maximum number of in progress operations the file store accepts before
  blocking on queuing new operations.
  filestore_fd_cache_random = true//  

 No docs, I don't see this in my ancient cluster :-)

  journal_queue_max_ops = 100   //   set
  to 500 by default. Number of operations allowed in the journal queue
  filestore_omap_header_cache_size = 100  //   Determines
  the size of the LRU used to cache object omap headers. Larger values use
  more memory but may reduce lookups on omap.
  filestore_fd_cache_size = 100 //

 You don't really need to set this so high, but not sure what the
 implications are if you go too high (it probably doesn't eat more memory
 until it opens so many files). If you have 4MB objects on a 1TB drive then
 you really only need 250K to keep all files open.
  not in the ceph documentation. Seems to be a common tweak for SSD
  clusters though.
  max_open_files = 100 //
   lets ceph set the max file descriptor in the OS to prevent running out
  of file descriptors

 This is too low if you were really using all of the fd_cache. There are
 going to be thousands of tcp connections which need to be accounted for as
 well.
 (in my experience there can be hundreds to thousands of tcp connections from
 just one RBD client and 200 OSDs, which is a lot).


  osd_journal_size = 1   //
 journal max size for each OSD
 
  New conf:
 
  [global]
  fsid = *
  mon_initial_members = cephmon1
  mon_host = 
  auth_cluster_required = cephx
  auth_service_required = cephx
  auth_client_required = cephx
  public_network = **
  cluster_network = **
 
  You might notice, I have a few undocumented settings in the old
  configuration. These are settings I took from a certain openstack summit
  presentation and they may have contributed to this whole problem. Here's
  a list of settings that I think might be a possible cause for these
  speed issues:
 
  filestore_fd_cache_random = true
  filestore_fd_cache_size = 100
 
  Additionally, my colleague thinks these settings may have contributed :
 
  filestore_queue_max_ops = 5000
  journal_queue_max_ops = 100
 
  We will do further tests on these settings once we have our lab ceph
  test environment as we are also curious as to exactly what caused this.
 
 
  On 2015-08-20 11:43 AM, Alex Gorbachev wrote:
 
  Just to update the mailing list, we ended up going back to default
  ceph.conf without any additional settings than what is mandatory. We
 are
  now reaching speeds we never reached before, both in recovery and in
  regular usage. There was definitely something we set in the ceph.conf
  bogging everything down.
 
  Could you please share the old and new ceph.conf, or the section that
  was removed?
 
  Best regards,
  Alex
 
 
 
  On 2015-08-20 4:06 AM, Christian Balzer wrote:
 
  Hello,
 
  from all the pertinent points by Somnath, the one 

Re: [ceph-users] Bad performances in recovery

2015-08-21 Thread Jan Schermer
Thanks for the config,
few comments inline:, not really related to the issue 

 On 21 Aug 2015, at 15:12, J-P Methot jpmet...@gtcomm.net wrote:
 
 Hi,
 
 First of all, we are sure that the return to the default configuration
 fixed it. As soon as we restarted only one of the ceph nodes with the
 default configuration, it sped up recovery tremendously. We had already
 restarted before with the old conf and recovery was never that fast.
 
 Regarding the configuration, here's the old one with comments :
 
 [global]
 fsid = *
 mon_initial_members = cephmon1
 mon_host = ***
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true   //
  Let's you use xattributes of xfs/ext4/btrfs filesystems

This actually did the opposite, but this option doesn't exist anymore

 osd_pool_default_pgp_num = 450   //
 default pgp number for new pools
 osd_pg_bits = 12  //
 number of bits used to designate pgps. Lets you have 2^12 pgps

Could someone comment on those? What exactly does it do? What if I have more 
PGs than num_osds*osd_pg_bits?

 osd_pool_default_size = 3   //
 default copy number for new pools
 osd_pool_default_pg_num = 450//
 default pg number for new pools
 public_network = *
 cluster_network = ***
 osd_pgp_bits = 12   //
 number of bits used to designate pgps. Let's you have 2^12 pgps
 
 [osd]
 filestore_queue_max_ops = 5000// set to 500 by default Defines the
 maximum number of in progress operations the file store accepts before
 blocking on queuing new operations.
 filestore_fd_cache_random = true//  

No docs, I don't see this in my ancient cluster :-)

 journal_queue_max_ops = 100   //   set
 to 500 by default. Number of operations allowed in the journal queue
 filestore_omap_header_cache_size = 100  //   Determines
 the size of the LRU used to cache object omap headers. Larger values use
 more memory but may reduce lookups on omap.
 filestore_fd_cache_size = 100 //

You don't really need to set this so high, but not sure what the implications 
are if you go too high (it probably doesn't eat more memory until it opens so 
many files). If you have 4MB objects on a 1TB drive then you really only need 
250K to keep all files open.
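
(The arithmetic behind that figure, for the record:

echo $(( (1024 * 1024) / 4 ))   # 1 TB / 4 MB objects = 262144 files, i.e. roughly the 250K above
)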
 not in the ceph documentation. Seems to be a common tweak for SSD
 clusters though.
 max_open_files = 100 //
  lets ceph set the max file descriptor in the OS to prevent running out
 of file descriptors

This is too low if you were really using all of the fd_cache. There are going 
to be thousands of tcp connections which need to be accounted for as well.
(in my experience there can be hundreds to thousands of tcp connections from just 
one RBD client and 200 OSDs, which is a lot).


 osd_journal_size = 1   //
journal max size for each OSD
 
 New conf:
 
 [global]
 fsid = *
 mon_initial_members = cephmon1
 mon_host = 
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 public_network = **
 cluster_network = **
 
 You might notice, I have a few undocumented settings in the old
 configuration. These are settings I took from a certain openstack summit
 presentation and they may have contributed to this whole problem. Here's
 a list of settings that I think might be a possible cause for these
 speed issues:
 
 filestore_fd_cache_random = true
 filestore_fd_cache_size = 100
 
 Additionally, my colleague thinks these settings may have contributed :
 
 filestore_queue_max_ops = 5000
 journal_queue_max_ops = 100
 
 We will do further tests on these settings once we have our lab ceph
 test environment as we are also curious as to exactly what caused this.
 
 
 On 2015-08-20 11:43 AM, Alex Gorbachev wrote:
 
 Just to update the mailing list, we ended up going back to default
 ceph.conf without any additional settings than what is mandatory. We are
 now reaching speeds we never reached before, both in recovery and in
 regular usage. There was definitely something we set in the ceph.conf
 bogging everything down.
 
 Could you please share the old and new ceph.conf, or the section that
 was removed?
 
 Best regards,
 Alex
 
 
 
 On 2015-08-20 4:06 AM, Christian Balzer wrote:
 
 Hello,
 
 from all the pertinent points by Somnath, the one about pre-conditioning
 would be pretty high on my list, especially if this slowness persists and
 nothing else (scrub) is going on.
 
 This might be fixed by doing a fstrim.
 
 Additionally the levelDB's per OSD are of 

Re: [ceph-users] Broken snapshots... CEPH 0.94.2

2015-08-21 Thread Ilya Dryomov
On Fri, Aug 21, 2015 at 5:59 PM, Samuel Just sj...@redhat.com wrote:
 Odd, did you happen to capture osd logs?

No, but the reproducer is trivial to cut & paste.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken snapshots... CEPH 0.94.2

2015-08-21 Thread Samuel Just
Odd, did you happen to capture osd logs?
-Sam

On Thu, Aug 20, 2015 at 8:10 PM, Ilya Dryomov idryo...@gmail.com wrote:
 On Fri, Aug 21, 2015 at 2:02 AM, Samuel Just sj...@redhat.com wrote:
 What's supposed to happen is that the client transparently directs all
 requests to the cache pool rather than the cold pool when there is a
 cache pool.  If the kernel is sending requests to the cold pool,
 that's probably where the bug is.  Odd.  It could also be a bug
 specific to 'forward' mode either in the client or on the osd.  Why did
 you have it in that mode?

 I think I reproduced this on today's master.

 Setup, cache mode is writeback:

 $ ./ceph osd pool create foo 12 12
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 pool 'foo' created
 $ ./ceph osd pool create foo-hot 12 12
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 pool 'foo-hot' created
 $ ./ceph osd tier add foo foo-hot
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 pool 'foo-hot' is now (or already was) a tier of 'foo'
 $ ./ceph osd tier cache-mode foo-hot writeback
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 set cache-mode for pool 'foo-hot' to writeback
 $ ./ceph osd tier set-overlay foo foo-hot
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 overlay for 'foo' is now (or already was) 'foo-hot'

 Create an image:

 $ ./rbd create --size 10M --image-format 2 foo/bar
 $ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt
 $ sudo mkfs.ext4 /mnt/bar
 $ sudo umount /mnt

 Create a snapshot, take md5sum:

 $ ./rbd snap create foo/bar@snap
 $ ./rbd export foo/bar /tmp/foo-1
 Exporting image: 100% complete...done.
 $ ./rbd export foo/bar@snap /tmp/snap-1
 Exporting image: 100% complete...done.
 $ md5sum /tmp/foo-1
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-1
 $ md5sum /tmp/snap-1
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/snap-1

 Set the cache mode to forward and do a flush, hashes don't match - the
 snap is empty - we bang on the hot tier and don't get redirected to the
 cold tier, I suspect:

 $ ./ceph osd tier cache-mode foo-hot forward
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 set cache-mode for pool 'foo-hot' to forward
 $ ./rados -p foo-hot cache-flush-evict-all
 rbd_data.100a6b8b4567.0002
 rbd_id.bar
 rbd_directory
 rbd_header.100a6b8b4567
 bar.rbd
 rbd_data.100a6b8b4567.0001
 rbd_data.100a6b8b4567.
 $ ./rados -p foo-hot cache-flush-evict-all
 $ ./rbd export foo/bar /tmp/foo-2
 Exporting image: 100% complete...done.
 $ ./rbd export foo/bar@snap /tmp/snap-2
 Exporting image: 100% complete...done.
 $ md5sum /tmp/foo-2
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-2
 $ md5sum /tmp/snap-2
 f1c9645dbc14efddc7d8a322685f26eb  /tmp/snap-2
 $ od /tmp/snap-2
 000 00 00 00 00 00 00 00 00
 *
 5000

 Disable the cache tier and we are back to normal:

 $ ./ceph osd tier remove-overlay foo
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 there is now (or already was) no overlay for 'foo'
 $ ./rbd export foo/bar /tmp/foo-3
 Exporting image: 100% complete...done.
 $ ./rbd export foo/bar@snap /tmp/snap-3
 Exporting image: 100% complete...done.
 $ md5sum /tmp/foo-3
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-3
 $ md5sum /tmp/snap-3
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/snap-3

 I first reproduced it with the kernel client, rbd export was just to
 take it out of the equation.


 Also, Igor sort of raised a question in his second message: if, after
 setting the cache mode to forward and doing a flush, I open an image
 (not a snapshot, so may not be related to the above) for write (e.g.
 with rbd-fuse), I get an rbd header object in the hot pool, even though
 it's in forward mode:

 $ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt
 $ sudo mount /mnt/bar /media
 $ sudo umount /media
 $ sudo umount /mnt
 $ ./rados -p foo-hot ls
 rbd_header.100a6b8b4567
 $ ./rados -p foo ls | grep rbd_header
 rbd_header.100a6b8b4567

 It's been a while since I looked into tiering, is that how it's
 supposed to work?  It looks like it happens because rbd_header op
 replies don't redirect?

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw only delivers whats cached if latency between keyrequest and actual download is above 90s

2015-08-21 Thread Sean
We heavily use radosgw here for most of our work and we have seen a 
weird truncation issue with radosgw/s3 requests.


We have noticed that if the time between the initial ticket to grab 
the object key and grabbing the data is greater than 90 seconds the 
object returned is truncated to whatever RGW has grabbed/cached after 
the initial connection and this seems to be around 512k.


Here is some PoC. This will work on most objects; I have tested mostly 1G 
to 5G keys in RGW::




#!/usr/bin/env python

import os
import sys
import json
import time

import boto
import boto.s3.connection

if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(description='Delayed download.')

    parser.add_argument('credentials', type=argparse.FileType('r'),
                        help='Credentials file.')

    parser.add_argument('endpoint')
    parser.add_argument('bucket')
    parser.add_argument('key')

    args = parser.parse_args()

    credentials = json.load(args.credentials)[args.endpoint]

    conn = boto.connect_s3(
        aws_access_key_id     = credentials.get('access_key'),
        aws_secret_access_key = credentials.get('secret_key'),
        host                  = credentials.get('host'),
        port                  = credentials.get('port'),
        is_secure             = credentials.get('is_secure', False),
        calling_format        = boto.s3.connection.OrdinaryCallingFormat(),
    )

    key = conn.get_bucket(args.bucket).get_key(args.key)

    key.BufferSize = 1048576
    key.open_read(headers={})
    time.sleep(120)

    key.get_contents_to_file(sys.stdout)



The format of the credentials file is just standard::

=
=
{
    "cluster": {
        "access_key": "blahblahblah",
        "secret_key": "blahblahblah",
        "host": "blahblahblah",
        "port": 443,
        "is_secure": true
    }
}

=
=


From here your object will almost always be truncated to whatever the 
gateway has cached in the time after the initial key request.


This can be a huge issue, as some requests can take minutes when the radosgw 
or cluster is heavily tasked. You can end up grabbing the rest of the 
object by doing a range request against the gateway, so I know the data 
is intact, but I don't think the radosgw should act as if the 
download completed successfully; it should instead return 
an error of some kind if it can no longer service the request.
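
(For reference, the range-request workaround mentioned above looks roughly like
this; it assumes a public-read object or a pre-signed URL, and the offset, host
and names are made-up examples:

curl -H "Range: bytes=524288-" -o remainder.dat \
     "https://rgw.example.com/mybucket/mykey"
)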


We are using hammer (ceph version 0.94.2 
(5fb85614ca8f354284c713a2f9c610860720bbf3)) and using civetweb as our 
gateway.


This is on a 3 node test cluster but I have tried on our larger cluster 
with the same behavior. If I can provide any other information please 
let me know.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken snapshots... CEPH 0.94.2

2015-08-21 Thread Samuel Just
I think I found the bug -- need to whiteout the snapset (or decache
it) upon evict.

http://tracker.ceph.com/issues/12748
-Sam

On Fri, Aug 21, 2015 at 8:04 AM, Ilya Dryomov idryo...@gmail.com wrote:
 On Fri, Aug 21, 2015 at 5:59 PM, Samuel Just sj...@redhat.com wrote:
 Odd, did you happen to capture osd logs?

 No, but the reproducer is trivial to cut & paste.

 Thanks,

 Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object Storage and POSIX Mix

2015-08-21 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Shouldn't this already be possible with HTTP Range requests? I don't
work with RGW or S3 so please ignore me if I'm talking crazy.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Aug 21, 2015 at 3:27 PM, Scottix  wrote:
 I saw this article on Linux Today and immediately thought of Ceph.

 http://www.enterprisestorageforum.com/storage-management/object-storage-vs.-posix-storage-something-in-the-middle-please-1.html

 I was thinking would it theoretically be possible with RGW to do a GET and
 set a BEGIN_SEEK and OFFSET to only retrieve a specific portion of the file.

 The other option to append data to a RGW object instead of rewriting the
 entire object.
 And so on...

 Just food for thought.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV15/ICRDmVDuy+mK58QAAnkAP/3q804Y7xJDqadNxFjWd
A1hzTcRfN6oqzZCf0T8stteTTG93Jt1R01ae2ZoVCM8EsefbovaPX68qy6kC
sw4JN+G9h2Ow01X5nWD1mvQIPde0+kdTqK6jejTPr8tWQ/J1/98kkkqH4FGp
TI3bOVBHik38RMt1G+yzVOS8E2lmckujzUsoQqA8kOyodsglQqAVj3kD8KAc
me+BlcOvZhP2eV0Tg8FtAjaUp22bJbh/V+a2ycwoNKKS5YsiP3bQHbaI8FAK
DYzndaS6UiwAhYjszmADRCqLXfmo8KkNYCr6xzr8oHSdPR33V87eFnkkaNmX
pkGSuwblA19QT0PiVan8B5XRUd7HcdcjUPrbGtjmRsrF2QtzHD+Fda6qw48/
TljMye6rnMX6A87UuIVpIj33OZiJRdiFwjMXQuSWCMl7WIYXU75KZKR5rsss
zX6NRIF3tSq0TBjcOFQN3+531XuCgsjwe3/zu2f1a/1JaGMAmMCO6vMdPhxU
dgkk31Ou7BbIuOzZmfagnNvRSdNLu5AUXZLlu5D+BhrH28kxzW0fXtoqyqU5
tGk83pP+sr6sJaAk4nfzEQWLE8LHxtkS21CE5Aa0u1av9Sg0T5R84hYfPw+W
skc67t2TVPHnphuLF2x2+xPArG3Ghuf2qD2Roz6zwkhpKQVprI8eiuu1lIfd
Yl/b
=w+bI
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD GHz vs. Cores Question

2015-08-21 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are looking to purchase our next round of Ceph hardware and based
off the work by Nick Fisk [1] our previous thought of cores over clock
is being revisited.

I have two camps of thoughts and would like to get some feedback, even
if it is only theoretical. We currently have 12 disks per node (2
SSD/10 4TB spindle), but we may adjust that to 4/8. SSD would be used
for journals and cache tier (when [2] and fstrim are resolved). We
also want to stay with a single processor for cost, power and NUMA
considerations.

1. For 12 disks with three threads each (2 client and 1 background),
lots of slower cores would allow I/O (ceph code) to be scheduled as
soon as a core is available.

2. Faster cores would get through the Ceph code faster but there would
be less cores and so some I/O may have to wait to be scheduled.

I'm leaning towards #2 for these reasons, please expose anything I may
be missing:
* The latency will only really be improved in the SSD I/O with faster
clock speed, all writes and any reads from the cache tier. So 8 fast
cores might be sufficient, reading from spindle and flushing the
journal will have a substantial amount of sleep to allow other Ceph
I/O to be hyperthreaded.
* Even though SSDs are much faster than spindles they are still orders
of magnitude slower than the processor, so it is still possible to get
more lines of code executed between SSD I/O with a faster processor
even with less cores.
* As the Ceph code is improved through optimization and less code has
to be executed for each I/O, faster clock speeds will only provide
even more benefit (lower latency, less waiting for cores) as the delay
shifts more from CPU to disk.

Since our workload is typically small I/O 12K-18K, latency means a lot
to our performance.

Our current processors are Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz

[1] http://www.spinics.net/lists/ceph-users/msg19305.html
[2] http://article.gmane.org/gmane.comp.file-systems.ceph.user/22713

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV16pfCRDmVDuy+mK58QAA9cgP/RwsZESriIMWZHeC0PmS
CH8iEFCXCRCzvW+lYMwB9FOvPmBLlhayp39Z93Djv3sef02t3Z9NFPq7fUmb
ZwZ9SnH9oVmRElbQyNtt8MfJ2cqXRU6JtYsTHnZ5G0+sFvv+BY+mYD89nULw
xwbsosUCBA9Rp8geq++XLSbuEBt8AfreYaSBzY1kg51Ovtmb97R0hB7bQBWP
oUgi/ET24w4sUqLSo4WBNBZ0WeWsRA4w5PEzHk28ynBY0B/GAtiGadtZWOFX
6bNz3KjMbLEWU9UF+7WyL+ppru6RIUZeayFp3tdIzqQdMbeBDPO54miOezwv
9iFNuzxj2P6jqlp18W2SZYN2JF5qCgrG5mXlU2bOM9k4IlQAqG2V3iD/rSF8
LmL/FSzU6C4k8PffaNis/grZAtjN4tCLRAoWUmsXSRW1NpSNm13l6wJfg5xq
XGLQ4CfGMV/o3a1Oz1M7jfMLWb0b6TeYlqC8eeHUp9ipa8IaVKsGNDJYQOnM
LvyRuyB7yIM6dEXmJjE5ZQPwbh0se3+hUhNolQ949aKrY2u8Q2kHhKqOyzuw
EAAyHkeqBtAZFW+DActHYVCi9lJO8shmeWuVKxAuzKYJGYzD8yVIS+AVqZ2k
OH2/NNAXzBKefsL1gd8DT4QuYqDoEN2arO+PN0vZeEruQ4vg6qZvabqeB/4o
kUd4
=F5Sx
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw only delivers whats cached if latency between keyrequest and actual download is above 90s

2015-08-21 Thread Ben Hines
I just tried this (with some smaller objects, maybe 4.5 MB, as well as
with a 16 GB file) and it worked fine.

However, i am using apache + fastcgi interface to rgw, rather than civetweb.

-Ben

On Fri, Aug 21, 2015 at 12:19 PM, Sean seapasu...@uchicago.edu wrote:
 We heavily use radosgw here for most of our work and we have seen a weird
 truncation issue with radosgw/s3 requests.

 We have noticed that if the time between the initial ticket to grab the
 object key and grabbing the data is greater than 90 seconds the object
 returned is truncated to whatever RGW has grabbed/cached after the initial
 connection and this seems to be around 512k.

 Here is some PoC. This will work on most objects I have tested mostly 1G to
 5G keys in RGW::

 
 
 #!/usr/bin/env python

 import os
 import sys
 import json
 import time

 import boto
 import boto.s3.connection

 if __name__ == '__main__':
     import argparse

     parser = argparse.ArgumentParser(description='Delayed download.')

     parser.add_argument('credentials', type=argparse.FileType('r'),
                         help='Credentials file.')

     parser.add_argument('endpoint')
     parser.add_argument('bucket')
     parser.add_argument('key')

     args = parser.parse_args()

     credentials = json.load(args.credentials)[args.endpoint]

     conn = boto.connect_s3(
         aws_access_key_id     = credentials.get('access_key'),
         aws_secret_access_key = credentials.get('secret_key'),
         host                  = credentials.get('host'),
         port                  = credentials.get('port'),
         is_secure             = credentials.get('is_secure', False),
         calling_format        = boto.s3.connection.OrdinaryCallingFormat(),
     )

     key = conn.get_bucket(args.bucket).get_key(args.key)

     key.BufferSize = 1048576
     key.open_read(headers={})
     time.sleep(120)

     key.get_contents_to_file(sys.stdout)
 
 

 The format of the credentials file is just standard::

 =
 =
 {
     "cluster": {
         "access_key": "blahblahblah",
         "secret_key": "blahblahblah",
         "host": "blahblahblah",
         "port": 443,
         "is_secure": true
     }
 }

 =
 =


 From here your object will almost always be truncated to whatever the
 gateway has cached in the time after the initial key request.

 This can be a huge issue as if the radosgw or cluster is tasked some
 requests can be minutes long. You can end up grabbing the rest of the object
 by doing a range request against the gateway so I know the data is intact
 but I don't think the radosgw should be acting as if the download is
 completed successfully and I think it should instead return an error of some
 kind if it can no longer service the request.

 We are using hammer (ceph version 0.94.2
 (5fb85614ca8f354284c713a2f9c610860720bbf3)) and using civetweb as our
 gateway.

 This is on a 3 node test cluster but I have tried on our larger cluster with
 the same behavior. If I can provide any other information please let me
 know.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Object Storage and POSIX Mix

2015-08-21 Thread Scottix
I saw this article on Linux Today and immediately thought of Ceph.

http://www.enterprisestorageforum.com/storage-management/object-storage-vs.-posix-storage-something-in-the-middle-please-1.html

I was thinking would it theoretically be possible with RGW to do a GET and
set a BEGIN_SEEK and OFFSET to only retrieve a specific portion of the
file.

The other option to append data to a RGW object instead of rewriting the
entire object.
And so on...

Just food for thought.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object Storage and POSIX Mix

2015-08-21 Thread Gregory Farnum
On Fri, Aug 21, 2015 at 10:27 PM, Scottix scot...@gmail.com wrote:
 I saw this article on Linux Today and immediately thought of Ceph.

 http://www.enterprisestorageforum.com/storage-management/object-storage-vs.-posix-storage-something-in-the-middle-please-1.html

 I was thinking would it theoretically be possible with RGW to do a GET and
 set a BEGIN_SEEK and OFFSET to only retrieve a specific portion of the file.

 The other option to append data to a RGW object instead of rewriting the
 entire object.
 And so on...

 Just food for thought.

Raw RADOS (ie, librados users) get access significantly more powerful
than what he's describing in that article. :) I don't know if anybody
will ever punch more of that functionality through RGW or not.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about reliability model result

2015-08-21 Thread dahan
Hi,
I have cross-posted this issue here and on GitHub,
but no response yet.

Any advice?

On Mon, Aug 10, 2015 at 10:21 AM, dahan dahan...@gmail.com wrote:


 Hi all, I have tried the reliability model:
 https://github.com/ceph/ceph-tools/tree/master/models/reliability

 I run the tool with default configuration, and cannot understand the
 result.

 ```
 storage              durability   PL(site)    PL(copies)   PL(NRE)     PL(rep)     loss/PiB
 -------------------  -----------  ----------  -----------  ----------  ----------  ----------
 Disk: Enterprise     99.119%      0.000e+00   0.721457%    0.159744%   0.000e+00   8.812e+12
 RADOS: 1 cp          99.279%      0.000e+00   0.721457%    0.000865%   0.000e+00   5.411e+12
 RADOS: 2 cp          7-nines      0.000e+00   0.49%        0.003442%   0.000e+00   9.704e+06
 RADOS: 3 cp          11-nines     0.000e+00   5.090e-11    3.541e-09   0.000e+00   6.655e+02
 ```

 ```
 storage               durability   PL(site)    PL(copies)   PL(NRE)     PL(rep)     loss/PiB
 --------------------  -----------  ----------  -----------  ----------  ----------  ----------
 Site (1 PB)           99.900%      0.099950%   0.000e+00    0.000e+00   0.000e+00   9.995e+11
 RADOS: 1-site, 1-cp   99.179%      0.099950%   0.721457%    0.000865%   0.000e+00   1.010e+12
 RADOS: 1-site, 2-cp   99.900%      0.099950%   0.49%        0.003442%   0.000e+00   9.995e+11
 RADOS: 1-site, 3-cp   99.900%      0.099950%   5.090e-11    3.541e-09   0.000e+00   9.995e+11

 ```

 The two result tables show different trends. In the first table, durability
 is 1 cp < 2 cp < 3 cp. However, the second table results in 1 cp < 2
 cp = 3 cp.

 The two tables have the same PL(copies), PL(NRE), and PL(rep).
 The only difference is PL(site). PL(site) is constant, since the number of sites
 is constant, so the trend should be the same.

 How to explain the result?

 Anything I missed out? Thanks


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RE: Question

2015-08-21 Thread Vickie ch
Hi,
 I've done that before, and when I tried to write a file into the rbd it
froze.
Besides resources, is there any other reason it is not recommended to combine mon
and osd?



Best wishes,
Mika


2015-08-18 15:52 GMT+08:00 Межов Игорь Александрович me...@yuterra.ru:

 Hi!

 You can run mons on the same hosts, though it is not recommended. The MON
 daemon itself is not resource hungry - 1-2 cores and 2-4 GB RAM are enough in most
 small installs. But there are some pitfalls:
 - MONs use LevelDB as a backing store, and make heavy use of direct writes to ensure
 DB consistency. So, if the MON daemon coexists with OSDs not only on the same host,
 but on the same volume/disk/controller, it will severely reduce the disk IO available
 to the OSDs and thus greatly reduce overall performance. Moving the MON's root to a
 separate spindle, or better a separate SSD, will keep MONs running fine with OSDs
 on the same host (a minimal sketch of this follows below).
 - When the cluster is in a healthy state, MONs are not resource consuming, but
 when the cluster is in a changing state (adding/removing OSDs, backfilling, etc.)
 the CPU and memory usage of a MON can rise significantly.

 And yes, in a small cluster, it is not always possible to get 3 separate
 hosts for MONs only.
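
(A minimal sketch of the separate-spindle/SSD-for-the-MON-store suggestion above;
the device and mount point are examples, and an existing mon store would have to
be stopped and copied over first:

mkfs.xfs /dev/sdg1                  # small SSD partition dedicated to the mon stores
mount /dev/sdg1 /var/lib/ceph/mon   # default parent directory of the mon data dirs (ceph-<id>)
echo '/dev/sdg1 /var/lib/ceph/mon xfs defaults,noatime 0 2' >> /etc/fstab
)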


 Megov Igor
 CIO, Yuterra

 --
 *From:* ceph-users ceph-users-boun...@lists.ceph.com on behalf of Luis
 Periquito periqu...@gmail.com
 *Sent:* 17 August 2015 17:09
 *To:* Kris Vaes
 *Cc:* ceph-users@lists.ceph.com
 *Subject:* Re: [ceph-users] Question

 yes. The issue is resource sharing as usual: the MONs will use disk I/O,
 memory and CPU. If the cluster is small (test?) then there's no problem in
 using the same disks. If the cluster starts to get bigger you may want to
 dedicate resources (e.g. the disk for the MONs isn't used by an OSD). If
 the cluster is big enough you may want to dedicate a node for being a MON.

 On Mon, Aug 17, 2015 at 2:56 PM, Kris Vaes k...@s3s.eu wrote:

 Hi,

 Maybe this seems like a strange question but I could not find this info
 in the docs, so I have the following question:

 For the ceph cluster you need osd daemons and monitor daemons,

 On a host you can run several osd daemons (best one per drive as read in
 the docs) on one host

 But now my question: can you run the monitor daemon on the same host where
 you already run some osd daemons?

 Is this possible and what are the implications of doing this



 Met Vriendelijke Groeten
 Cordialement
 Kind Regards
 Cordialmente
 С приятелски поздрави


 This message (including any attachments) may be privileged or
 confidential. If you have received it by mistake, please notify the sender
 by return e-mail and delete this message from your system. Any unauthorized
 use or dissemination of this message in whole or in part is strictly
 prohibited. S3S rejects any liability for the improper, incomplete or
 delayed transmission of the information contained in this message, as well
 as for damages resulting from this e-mail message. S3S cannot guarantee
 that the message received by you has not been intercepted by third parties
 and/or manipulated by computer programs used to transmit messages and
 viruses.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken snapshots... CEPH 0.94.2

2015-08-21 Thread Voloshanenko Igor
Exactly as in our case.

Ilya, same for images on our side. Headers are opened from the hot tier.

On Friday, 21 August 2015, Ilya Dryomov wrote:

 On Fri, Aug 21, 2015 at 2:02 AM, Samuel Just sj...@redhat.com wrote:
  What's supposed to happen is that the client transparently directs all
  requests to the cache pool rather than the cold pool when there is a
  cache pool.  If the kernel is sending requests to the cold pool,
  that's probably where the bug is.  Odd.  It could also be a bug
  specific to 'forward' mode either in the client or on the osd.  Why did
  you have it in that mode?

 I think I reproduced this on today's master.

 Setup, cache mode is writeback:

 $ ./ceph osd pool create foo 12 12
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 pool 'foo' created
 $ ./ceph osd pool create foo-hot 12 12
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 pool 'foo-hot' created
 $ ./ceph osd tier add foo foo-hot
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 pool 'foo-hot' is now (or already was) a tier of 'foo'
 $ ./ceph osd tier cache-mode foo-hot writeback
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 set cache-mode for pool 'foo-hot' to writeback
 $ ./ceph osd tier set-overlay foo foo-hot
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 overlay for 'foo' is now (or already was) 'foo-hot'

 Create an image:

 $ ./rbd create --size 10M --image-format 2 foo/bar
 $ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt
 $ sudo mkfs.ext4 /mnt/bar
 $ sudo umount /mnt

 Create a snapshot, take md5sum:

 $ ./rbd snap create foo/bar@snap
 $ ./rbd export foo/bar /tmp/foo-1
 Exporting image: 100% complete...done.
 $ ./rbd export foo/bar@snap /tmp/snap-1
 Exporting image: 100% complete...done.
 $ md5sum /tmp/foo-1
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-1
 $ md5sum /tmp/snap-1
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/snap-1

 Set the cache mode to forward and do a flush, hashes don't match - the
 snap is empty - we bang on the hot tier and don't get redirected to the
 cold tier, I suspect:

 $ ./ceph osd tier cache-mode foo-hot forward
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 set cache-mode for pool 'foo-hot' to forward
 $ ./rados -p foo-hot cache-flush-evict-all
 rbd_data.100a6b8b4567.0002
 rbd_id.bar
 rbd_directory
 rbd_header.100a6b8b4567
 bar.rbd
 rbd_data.100a6b8b4567.0001
 rbd_data.100a6b8b4567.
 $ ./rados -p foo-hot cache-flush-evict-all
 $ ./rbd export foo/bar /tmp/foo-2
 Exporting image: 100% complete...done.
 $ ./rbd export foo/bar@snap /tmp/snap-2
 Exporting image: 100% complete...done.
 $ md5sum /tmp/foo-2
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-2
 $ md5sum /tmp/snap-2
 f1c9645dbc14efddc7d8a322685f26eb  /tmp/snap-2
 $ od /tmp/snap-2
 000 00 00 00 00 00 00 00 00
 *
 5000

 Disable the cache tier and we are back to normal:

 $ ./ceph osd tier remove-overlay foo
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 there is now (or already was) no overlay for 'foo'
 $ ./rbd export foo/bar /tmp/foo-3
 Exporting image: 100% complete...done.
 $ ./rbd export foo/bar@snap /tmp/snap-3
 Exporting image: 100% complete...done.
 $ md5sum /tmp/foo-3
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-3
 $ md5sum /tmp/snap-3
 83f5d244bb65eb19eddce0dc94bf6dda  /tmp/snap-3

 I first reproduced it with the kernel client, rbd export was just to
 take it out of the equation.


 Also, Igor sort of raised a question in his second message: if, after
 setting the cache mode to forward and doing a flush, I open an image
 (not a snapshot, so may not be related to the above) for write (e.g.
 with rbd-fuse), I get an rbd header object in the hot pool, even though
 it's in forward mode:

 $ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt
 $ sudo mount /mnt/bar /media
 $ sudo umount /media
 $ sudo umount /mnt
 $ ./rados -p foo-hot ls
 rbd_header.100a6b8b4567
 $ ./rados -p foo ls | grep rbd_header
 rbd_header.100a6b8b4567

 It's been a while since I looked into tiering, is that how it's
 supposed to work?  It looks like it happens because rbd_header op
 replies don't redirect?

 Thanks,

 Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw hanging - blocking rgw.bucket_list ops

2015-08-21 Thread Sam Wouters
Hi,

We are running hammer 0.94.2 and have an increasing amount of
"heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f38c77e6700' had
timed out after 600" messages in our radosgw logs, with radosgw
eventually stalling. A restart of the radosgw helps for a few minutes,
but after that it hangs again.

ceph daemon /var/run/ceph/ceph-client.*.asok objecter_requests shows
"call rgw.bucket_list" ops. No new bucket lists are requested, so those
ops seem to stay there. Does anyone have any idea how to get rid of them?
Restarting the affected osd didn't help either.

I'm not sure if it's related, but we have an object called _sanity in
the bucket the listing was performed on. I know there is some bug
with objects whose names start with _.

Any help would be much appreciated.

r,
Sam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com