[ceph-users] Hostname not found error

2014-02-04 Thread Sahana
Hi

I am trying to form ceph cluster using ceph-deploy.

I generated a key pair with ssh-keygen, copied the public key to all ceph nodes, then added
the user and hostname in the ~/.ssh/config file:

cat /home/ems/.ssh/config
Host host1
Hostname aa.bbb.cc.d  #ip
User user

As a test, I ran the purge command and got this error:

ceph-deploy purge host1
[ceph_deploy.cli][INFO  ] Invoked (1.3.4): /usr/bin/ceph-deploy purge host1
[ceph_deploy.install][DEBUG ] Purging from cluster ceph hosts host1
[ceph_deploy.install][DEBUG ] Detecting platform for host host1...
/home/ems/.ssh/config line 2: garbage at end of line; "#ip".
[ceph_deploy][ERROR ] RuntimeError: connecting to host: host1 resulted in
errors: HostNotFound host1

When I tried with other IPs like aa.bbb.cc.dd or a hostname, it worked.

Example: 192.168.25.1 did not work, but 192.168.25.11 worked.
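
The "garbage at end of line" error points at the inline "#ip" comment: ssh_config treats
only whole lines starting with '#' as comments, so anything after the Hostname value is
rejected (the IP value itself is probably not the issue). A minimal sketch of a config
that should parse cleanly, keeping the placeholder values from above:

cat /home/ems/.ssh/config
# ip of host1
Host host1
Hostname aa.bbb.cc.d
User user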

thanks
Sahana
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Replication

2014-02-04 Thread Josh Durgin

On 02/04/2014 07:44 PM, Craig Lewis wrote:



On 2/4/14 17:06 , Craig Lewis wrote:


On 2/4/14 14:43 , Yehuda Sadeh wrote:

Does it ever catch up? You mentioned before that most of the writes
went to the same two buckets, so that's probably one of them. Note
that writes to the same bucket are being handled in-order by the
agent.

Yehuda


... I think so.  This is what my graph looks like:

I think it's still catching up, despite the graph.  radosgw-admin bucket
stats shows more objects are being created in the slave zone than the
master zone during the same time period.  Not a large number though.  If
it is catching up, it'll take months at this rate.  It's not entirely
clear, because the slave zone trails the master, but it's been pretty
consistent for the past hour.

It's not doing all of the missing objects though.  I have a bucket that
I stopped writing to a few days ago.  The slave is missing ~500k
objects, and none have been added to the slave in the past hour.



You can run

$ radosgw-admin bilog list --bucket=<bucket> --marker=<marker>

E.g.,

$ radosgw-admin bilog list --bucket=live-2 --marker=0127871.328492.2

The entries there should have timestamp info.


Thanks!  I'll see what I can figure out.


From the log it looks like you're hitting the default maximum number of
entries to be processed at once per shard. This was intended to prevent
one really busy shard from blocking progress on syncing other shards,
since the remainder will be synced the next time the shard is processed.
Perhaps the default is too low though, or the idea should be scrapped 
altogether since you can sync other shards in parallel.


For your particular usage, since you're updating the same few buckets,
the max entries limit is hit constantly. You can increase it with
max-entries: 100 in the config file or --max-entries 1000 on the 
command line.
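
A hedged sketch of the two options above; the config path and exact option spelling are
assumptions here, so check them against your radosgw-agent install:

# one-off, on the command line
radosgw-agent -c /etc/ceph/radosgw-agent/default.conf --max-entries 1000

# or persistently, in the agent's YAML config file
max-entries: 1000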


Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Replication

2014-02-04 Thread Craig Lewis



On 2/4/14 17:06 , Craig Lewis wrote:


On 2/4/14 14:43 , Yehuda Sadeh wrote:

Does it ever catch up? You mentioned before that most of the writes
went to the same two buckets, so that's probably one of them. Note
that writes to the same bucket are being handled in-order by the
agent.

Yehuda


... I think so.  This is what my graph looks like:
I think it's still catching up, despite the graph.  radosgw-admin bucket 
stats shows more objects are being created in the slave zone than the 
master zone during the same time period.  Not a large number though.  If 
it is catching up, it'll take months at this rate.  It's not entirely 
clear, because the slave zone trails the master, but it's been pretty 
consistent for the past hour.


It's not doing all of the missing objects though.  I have a bucket that 
I stopped writing to a few days ago.  The slave is missing ~500k 
objects, and none have been added to the slave in the past hour.




You can run

$ radosgw-admin bilog list --bucket=<bucket> --marker=<marker>

E.g.,

$ radosgw-admin bilog list --bucket=live-2 --marker=0127871.328492.2

The entries there should have timestamp info.


Thanks!  I'll see what I can figure out.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Replication

2014-02-04 Thread Yehuda Sadeh
On Tue, Feb 4, 2014 at 5:06 PM, Craig Lewis  wrote:
>
>
> On 2/4/14 14:43 , Yehuda Sadeh wrote:
>
> Now that objects are missing in the slave, how do I fix it?  radosgw-agent
> --sync-scope=full ?
>
> That would do it, yes.
>
>
> I'm hesitant to do this, at least until I understand what's going on better.  
> I know something is wrong, but I don't know what is wrong.
> I want to solve that before using a --sync-scope=full.  Otherwise it'll just 
> happen again next time I start importing data.
>
> I'm going to shut down replication cleanly and leave it off.  I'll import
> enough objects that I hit > 1000 entries, then I'll start up replication with 
> --verbose.  Then I'll check if all the imported objects exist in both 
> clusters.  Repeat until I find missing objects in the slave cluster.
>
>
>
>
> A shard was locked by the agent, but the agent never unlocked it
> (maybe because you took it down?).  The lock itself has a timeout, so
> it's supposed to get released after a while, and then processing
> should resume as usual. However, when it happens you can try playing
> with the rados lock commands (rados lock list, rados lock info, rados
> lock break) to release it (as long as there's no agent running that
> has locked the shard).
>
>
> The rados lock command requires an object name.  I'll see if I can figure out 
> how to map "shard 36" to a rados object in the .rgw.buckets pool.
>
> Thanks!
>
> Does it ever catch up? You mentioned before that most of the writes
> went to the same two buckets, so that's probably one of them. Note
> that writes to the same bucket are being handled in-order by the
> agent.
>
> Yehuda
>
>
> ... I think so.  This is what my graph looks like:
>
>
>
> Being able to answer that question is really what this graph is about.  If 
> you have any suggestions for generic ways to answer that question, I'm open 
> to suggestions.  If you'd like to see what I'm doing, take a look at 
> https://github.com/ceph/radosgw-agent/pull/7
>
> Now that I've started seeing missing objects, I'm not able to download 
> objects that should be on the slave if replication is up to date.  Either 
> it's not up to date, or it's skipping objects every pass.
>
> I'm trying to get the radosgw-agent --verbose output I mentioned above, but 
> this question is more fundamental.  If I don't know if it's up to date or 
> not, looking for missing objects isn't going to do me any good.  I'll work on 
> this now, and get back to the other experiment later.


You can run

$ radosgw-admin bilog list --bucket=<bucket> --marker=<marker>

E.g.,

$ radosgw-admin bilog list --bucket=live-2 --marker=0127871.328492.2

The entries there should have timestamp info.

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Replication

2014-02-04 Thread Yehuda Sadeh
On Tue, Feb 4, 2014 at 2:21 PM, Craig Lewis  wrote:
>
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
>
> Central Desktop. Work together in ways you never thought possible.
> Connect with us   Website  |  Twitter  |  Facebook  |  LinkedIn  |  Blog
>
> On 2/4/14 11:36 , Yehuda Sadeh wrote:
>
> Also, verify whether any objects are missing. Start with just counting
> the total number of objects in the buckets (radosgw-admin bucket stats
> can give you that info).
>
> Yehuda
>
>
> Thanks, I didn't know about bucket stats.
>
> bucket stats reports that the slave has fewer objects and kB than the
> master.
>
> Now that objects are missing in the slave, how do I fix it?  radosgw-agent
> --sync-scope=full ?
>

That would do it, yes.
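
A hedged sketch of that invocation, assuming the agent is driven by its usual YAML config
file (the path below is an assumption, not taken from this thread):

radosgw-agent -c /etc/ceph/radosgw-agent/default.conf --sync-scope=full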

>
>
> I figured out why replication went so quickly after the restart.  I missed
> an error in the radosgw-agent logs:
> 2014-02-04T08:16:28.936 14145:WARNING:radosgw_agent.worker:error locking
> shard 36 log,  skipping for now. Traceback:
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/worker.py", line 58,
> in lock_shard
> self.lock.acquire()
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/lock.py", line 65, in
> acquire
> self.zone_id, self.timeout, self.locker_id)
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 241,
> in lock_shard
> expect_json=False)
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 155,
> in request
> check_result_status(result)
>   File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 116,
> in check_result_status
> HttpError)(result.status_code, result.content)
> HttpError: Http error code 423 content {"Code":"Locked"}
> 2014-02-04T08:16:28.939 12730:ERROR:radosgw_agent.sync:error syncing shard
> 36

A shard was locked by the agent, but the agent never unlocked it
(maybe because you took it down?).  The lock itself has a timeout, so
it's supposed to get released after a while, and then processing
should resume as usual. However, when it happens you can try playing
with the rados lock commands (rados lock list, rados lock info, rados
lock break) to release it (as long as there's no agent running that
has locked the shard).
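
A hedged sketch of that procedure. The object and pool names are assumptions: the agent's
shard locks are typically held on the replication log shard objects in the zone's log pool
(something like data_log.36), so list the pool first to confirm, and only break the lock
while no agent is running:

rados -p <log pool> ls | grep 'log.36'
rados -p <log pool> lock list data_log.36
rados -p <log pool> lock info data_log.36 <lock name>
rados -p <log pool> lock break data_log.36 <lock name> <locker id>
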
>
> Full radosgw-agent.log, starting at restart:
> https://cd.centraldesktop.com/p/eAAAC60_AAia_J0
>
>
>
> I shut down radosgw-agent and restarted all radosgw daemons in the slave
> cluster.  Replication is proceeding again on shard 36, but I'm seeing the
> same behavior.  The slave is catching up much too quickly.
>
> Before the stall:
> root@ceph1c:/var/log/ceph# zegrep '(live-2:us-west-1|shard 36)'
> radosgw-agent.us-west-1.us-central-1.log.1.gz | grep -v
> 'WARNING:radosgw_agent.sync:shard 36 log has fallen behind' | tail
> 2014-02-03T23:19:11.434 11783:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "0115883.315938.2"
> 2014-02-03T23:24:51.246 11783:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:25:30.185 6419:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:25:46.826 6468:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "0116882.316964.3"
> 2014-02-03T23:30:13.648 6468:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:30:50.132 29240:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:31:06.808 29390:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "0117881.317984.2"
> 2014-02-03T23:38:56.830 29390:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:39:58.408 3744:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-03T23:40:15.049 3837:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "0118880.319057.3"
>
> After the radosgw and radosgw-agent restart (contained in the full logs
> linked above):
> root@ceph1c:/var/log/ceph# egrep '(live-2:us-west-1|shard 36)'
> radosgw-agent.us-west-1.us-central-1.log | grep -v
> 'WARNING:radosgw_agent.sync:shard 36 log has fallen behind'
> 2014-02-04T08:15:58.966 14045:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T08:16:28.936 14145:WARNING:radosgw_agent.worker:error locking
> shard 36 log,  skipping for now. Traceback:
> 2014-02-04T08:16:28.939 12730:ERROR:radosgw_agent.sync:error syncing shard
> 36
> 2014-02-04T08:23:50.318 15231:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T08:24:05.970 15288:INFO:radosgw_agent.worker:bucket instance
> "live-2:us-west-1.35026898.2" has 1000 entries after "0118880.319057.3"
> 2014-02-04T08:42:20.351 15288:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T08:48:36.509 24250:INFO:radosgw_agent.worker:finished processing
> shard 36
> 2014-02-04T08:48:53.145 24280:INFO:ra

Re: [ceph-users] RGW Replication

2014-02-04 Thread Craig Lewis


Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com

Central Desktop. Work together in ways you never thought possible.
Connect with us   Website  |  Twitter  |  Facebook  |  LinkedIn  |  Blog


On 2/4/14 11:36 , Yehuda Sadeh wrote:

Also, verify whether any objects are missing. Start with just counting
the total number of objects in the buckets (radosgw-admin bucket stats
can give you that info).

Yehuda


Thanks, I didn't know about bucket stats.

bucket stats reports that the slave has fewer objects and kB than the
master.
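
A minimal sketch of that comparison, run against each zone's gateway. The bucket name is
taken from the agent logs below, and the field names are an assumption about the bucket
stats JSON:

radosgw-admin bucket stats --bucket=live-2 | grep -E 'num_objects|size_kb'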


Now that objects are missing in the slave, how do I fix it? 
radosgw-agent --sync-scope=full ?




I figured out why replication went so quickly after the restart.  I 
missed an error in the radosgw-agent logs:
2014-02-04T08:16:28.936 14145:WARNING:radosgw_agent.worker:error locking 
shard 36 log,  skipping for now. Traceback:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/worker.py", line 
58, in lock_shard

self.lock.acquire()
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/lock.py", line 
65, in acquire

self.zone_id, self.timeout, self.locker_id)
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 
241, in lock_shard

expect_json=False)
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 
155, in request

check_result_status(result)
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 
116, in check_result_status

HttpError)(result.status_code, result.content)
HttpError: Http error code 423 content {"Code":"Locked"}
2014-02-04T08:16:28.939 12730:ERROR:radosgw_agent.sync:error syncing 
shard 36


Full radosgw-agent.log, starting at restart: 
https://cd.centraldesktop.com/p/eAAAC60_AAia_J0




I shut down radosgw-agent and restarted all radosgw daemons in the slave
cluster.  Replication is proceeding again on shard 36, but I'm seeing 
the same behavior.  The slave is catching up much too quickly.


Before the stall:
root@ceph1c:/var/log/ceph# zegrep '(live-2:us-west-1|shard 36)' 
radosgw-agent.us-west-1.us-central-1.log.1.gz | grep -v 
'WARNING:radosgw_agent.sync:shard 36 log has fallen behind' | tail
2014-02-03T23:19:11.434 11783:INFO:radosgw_agent.worker:bucket instance 
"live-2:us-west-1.35026898.2" has 1000 entries after "0115883.315938.2"
2014-02-03T23:24:51.246 11783:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-03T23:25:30.185 6419:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-03T23:25:46.826 6468:INFO:radosgw_agent.worker:bucket instance 
"live-2:us-west-1.35026898.2" has 1000 entries after "0116882.316964.3"
2014-02-03T23:30:13.648 6468:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-03T23:30:50.132 29240:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-03T23:31:06.808 29390:INFO:radosgw_agent.worker:bucket instance 
"live-2:us-west-1.35026898.2" has 1000 entries after "0117881.317984.2"
2014-02-03T23:38:56.830 29390:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-03T23:39:58.408 3744:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-03T23:40:15.049 3837:INFO:radosgw_agent.worker:bucket instance 
"live-2:us-west-1.35026898.2" has 1000 entries after "0118880.319057.3"


After the radosgw and radosgw-agent restart (contained in the full logs 
linked above):
root@ceph1c:/var/log/ceph# egrep '(live-2:us-west-1|shard 36)' 
radosgw-agent.us-west-1.us-central-1.log | grep -v 
'WARNING:radosgw_agent.sync:shard 36 log has fallen behind'
2014-02-04T08:15:58.966 14045:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-04T08:16:28.936 14145:WARNING:radosgw_agent.worker:error locking 
shard 36 log, skipping for now. Traceback:
2014-02-04T08:16:28.939 12730:ERROR:radosgw_agent.sync:error syncing 
shard 36
2014-02-04T08:23:50.318 15231:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-04T08:24:05.970 15288:INFO:radosgw_agent.worker:bucket instance 
"live-2:us-west-1.35026898.2" has 1000 entries after "0118880.319057.3"
2014-02-04T08:42:20.351 15288:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-04T08:48:36.509 24250:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-04T08:48:53.145 24280:INFO:radosgw_agent.worker:bucket instance 
"live-2:us-west-1.35026898.2" has 1000 entries after "0119879.320127.2"
2014-02-04T08:57:22.429 24280:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-04T09:03:35.292 23586:INFO:radosgw_agent.worker:finished 
processing shard 36
2014-02-04T09:03:53.561 23744:INFO:radosgw_agent.worker:bucket instance 
"live-2:us-west-1.35026898.2" has 1000 entries after "0120878.3

Re: [ceph-users] ceph interactive mode tab completion

2014-02-04 Thread Ahmed Kamal
That would indeed be awesome!


On Tue, Feb 4, 2014 at 12:28 AM, Ben Sherman  wrote:

> Hello all,
>
> I noticed ceph has an interactive mode.
>
> I did a quick search and I don't see that tab completion is in there,
> but there are some mentions of readline in the source, so I'm
> wondering if it is on the horizon.
>
>
>
> --ben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph MON can no longer join quorum

2014-02-04 Thread Greg Poirier
I have a MON that at some point lost connectivity to the rest of the
cluster and now cannot rejoin.

Each time I restart it, it looks like it's attempting to create a new MON
and join the cluster, but the rest of the cluster rejects it, because the
new one isn't in the monmap.

I don't know why it suddenly decided it needed to be a new MON.

I am not really sure where to start.

root@ceph-mon-2003:/var/log/ceph# ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
 health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2
pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests
are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1
ceph-mon-2001,ceph-mon-2002
 monmap e2: 3 mons at {ceph-mon-2001=
10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0},
election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Notice ceph-mon-2003:6800

If I try to start ceph-mon-all, it will be listening on some other port...

root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all
ceph-mon-all start/running
root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
root  6930 1 31 15:49 ?00:00:00 /usr/bin/ceph-mon
--cluster=ceph -i ceph-mon-2003 -f
root  6931 1  3 15:49 ?00:00:00 python
/usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003

root@ceph-mon-2003:/var/log/ceph# ceph -s
2014-02-04 15:49:56.854866 7f9cf422d700  0 -- :/1007028 >>
10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1
c=0x7f9cf00215d0).fault
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
 health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2
pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests
are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1
ceph-mon-2001,ceph-mon-2002
 monmap e2: 3 mons at {ceph-mon-2001=
10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0},
election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Suggestions?
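
A hedged first step (diagnostic only, not a fix): compare the cluster's monmap with the
stopped daemon's local copy, to see whether ceph-mon-2003 and its expected :6789 address
appear in both. The file path below is an assumption:

# cluster view, from a mon in quorum
ceph mon dump

# local copy, on ceph-mon-2003 with the mon stopped
ceph-mon -i ceph-mon-2003 --extract-monmap /tmp/local-monmap
monmaptool --print /tmp/local-monmap
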
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low RBD Performance

2014-02-04 Thread Gruher, Joseph R
>> Ultimately this seems to be an FIO issue.  If I use "--iodepth X" or "--
>iodepth=X" on the FIO command line I always get queue depth 1.  After
>switching to specifying "iodepth=X" in the body of the FIO workload file I do
>get the desired queue depth and I can immediately see performance is much
>higher (a full re-test is underway, I can share some results when complete if
>anyone is curious).  This seems to have effectively worked around the
>problem, although I'm still curious why the command line parameters don't
>have the desired effect.  Thanks for the responses!
>>
>
>Strange!  I do most of our testing using the command line parameters as well.
>What version of fio are you using?  Maybe there is a bug.  For what it's worth,
>I'm using --iodepth=X, and fio version 1.59 from the Ubuntu precise
>repository.
>
>Mark

FIO --version reports 2.0.8.  Installed on Ubuntu 13.04 from the default 
repositories (just did an 'apt-get install fio').
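
For reference, a minimal sketch of that workaround: the job file posted earlier in this
thread, with the queue depth moved into the file body (iodepth=32 is just an example
value):

[global]
ioengine=libaio
direct=1
iodepth=32
ramp_time=300
runtime=300

[4k-rw]
description=4k-rw
filename=/dev/rbd1
rw=randwrite
bs=4k
stonewall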

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues running vmfs on top of Ceph

2014-02-04 Thread Maciej Bonin
Hello Brad,

We are using iscsi via tgt with ESXi:
> And yes, we are using iscsi via tgtd from the ceph-extras repo, I believe
> (in response to a message I just noticed come in while I was typing)
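
For context, a hedged sketch of what such an export looks like with tgt's rbd backing
store (ceph-extras build). The IQN, pool, and image names are placeholders, not taken
from this thread:

# e.g. dropped into /etc/tgt/conf.d/, which targets.conf includes by default
<target iqn.2014-02.com.example:rbd-vmfs>
    driver iscsi
    bs-type rbd
    # backing-store takes <pool>/<image>
    backing-store rbd/vmfs-image
</target>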

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements
ISO 27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 
500 EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: McNamara, Bradley [mailto:bradley.mcnam...@seattle.gov] 
Sent: 04 February 2014 19:22
To: Maciej Bonin; Mark Nelson; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Performance issues running vmfs on top of Ceph

Just for clarity since I didn't see it explained, but how are you accessing 
Ceph using ESXI?  Is it via iscsi or NFS?  Thanks.

Brad McNamara

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Maciej Bonin
Sent: Tuesday, February 04, 2014 11:01 AM
To: Maciej Bonin; Mark Nelson; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

Hello again,

Having said that we seem to have improved the performance by following 
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665
 after we figured out there might be a mismatch between reported and actually 
supported capabilities.
Thank you for your time Mark and Neil.

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements ISO 
27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 500 
EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Maciej Bonin
Sent: 04 February 2014 18:21
To: Mark Nelson; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

Hello Mark,

Thanks for getting back to me. We do have a couple of vms running that were
migrated off xen that are fine; performance in rados bench is what one would
expect (maxing the 4x gigabit bond).
The only other time I've noticed similar issues is when running mkfs.ext[3-4]
on new images, which took ridiculously long on xen-pv and kvm and even longer
under esxi. We have a vmfs image with configuration files for the guests, and
when we try to wget an iso into the shared config volume to install another vm
via esxi we don't get very far (we checked the uplink etc., and everything up
to the way vmfs works on top of ceph seems ok). My thought is that something in
the way vmfs thin-provisions space is causing problems with ceph's own thin
provisioning; my colleague is testing different block sizes, with no luck so
far in getting any sort of improvement.
And yes, we are using iscsi via tgtd from the ceph-extras repo, I believe (in
response to a message I just noticed come in while I was typing).

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements ISO 
27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 500 
EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: 04 February 2014 18:11
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

On 02/04/2014 11:55 AM, Maciej Bonin wrote:
> Hello guys,
>
> We're testing running an esxi hv on top of a ceph backend and we're getting
> abysmal performance when using vmfs. Has anyone else tried this successfully?
> Any advice?
> Would be really thankful for any hints.

Hi!

I don't have a ton of experience with esxi, but if you can do some rados bench 
or smalliobenchfs tests, that might help give you an idea if the problem is 
Ceph (or lower), or more related to something higher up closer to esxi.  Can
you describe a little more what you are seeing and what you expect?

Thanks,
Mark

>
> Regards,
> Maciej Bonin
> Systems Engineer | M247 Limited
> M247.com  Connected with our Customers Contact us today to discuss 
> your hosting and connectivity requirements ISO 27001 | ISO 9001 | 
> Deloitte Technology Fast 50 | Deloitte Technology Fast 500 EMEA | 
> Sun

Re: [ceph-users] RGW Replication

2014-02-04 Thread Yehuda Sadeh
On Tue, Feb 4, 2014 at 10:07 AM, Craig Lewis  wrote:
>
>
>
> On 2/3/14 14:34 , Craig Lewis wrote:
>
>
> On 2/3/14 10:51 , Gregory Farnum wrote:
>
> On Mon, Feb 3, 2014 at 10:43 AM, Craig Lewis  
> wrote:
>
> I've been noticing something strange with my RGW federation.  I added some
> statistics to radosgw-agent to try and get some insight
> (https://github.com/ceph/radosgw-agent/pull/7), but that just showed me that
> I don't understand how replication works.
>
> When PUT traffic was relatively slow to the master zone, replication had no
> issues keeping up.  Now I'm trying to cause replication to fall behind, by
> deliberately exceeding the amount of bandwidth between the two zones
> (they're in different datacenters).  Instead of falling behind, both the
> radosgw-agent logs and the stats I added say that slave zone is keeping up.
>
> But some of the numbers don't add up.  I'm not using enough bandwidth
> between the two facilities, and I'm not using enough disk space in the slave
> zone.  The disk usage in the slave zone continues to fall further and
> further behind the master.  Despite this, I'm always able to download
> objects from both zones.
>
>
> How does radosgw-agent actually replicate metadata and data?  Does
> radosgw-agent actually copy all the bytes, or does it create placeholders in
> the slave zone?  If radosgw-agent is creating placeholders in the slave
> zone, and radosgw populates the placeholder in the background, then that
> would explain the behavior I'm seeing.  If this is true, how can I tell if
> replication is keeping up?
>
> Are you overwriting the same objects? Replication copies over the
> "present" version of an object, not all the versions which have ever
> existed. Similarly, the slave zone doesn't keep all the
> (garbage-collected) logs that the master zone has to, so those factors
> would be one way to get differing disk counts.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>
> Before I started this import, the master zone was using 3.54TB (raw), and the 
> slave zone was using 3.42 TB (raw).  I did overwrite some objects, and the 
> 120GB is plausible for overwrites.
>
> I haven't deleted anything yet, so the only garbage collection would be 
> overwritten objects.  Right?
>
>
> I imported 1.93TB of data.  Replication is currently 2x, so that's 3.86TB 
> (raw).  Now the master is using 7.48TB (raw), and the slave is using 4.89TB 
> (raw).  The master zone looks correct, but the slave zone is missing 2.59TB 
> (raw).  That's 66% of my imported data.
>
> The 33% of data the slave does have is in line with the amount of bandwidth I 
> see between the two facilities.  I see an increase of ~150 Mbps when the 
> import is running on the master, and ~50 Mbps on the slave.
>
>
>
> Just to verify that I'm not over writing objects, I checked the apache logs.  
> Since I started the import, there have been 1328542 PUTs (including normal 
> site traffic).  1301511 of those are unique.  I'll investigate the 27031 
> duplicates, but the dups are only 34GB.  Not nearly enough to account for the 
> discrepancy.
>
>
> From your answer, I'll assume there are no placeholders involved.  If 
> radosgw-agent says we're up to date, the data should exist in the slave zone.
>
> Now I'm really confused.
>
>
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
>
> Central Desktop. Work together in ways you never thought possible.
> Connect with us   Website  |  Twitter  |  Facebook  |  LinkedIn  |  Blog
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> Here's another example.  At 23:40 PST, radosgw-agent stalled.  This is 
> something that's been happening, but I haven't dug into it yet.  Ignore that 
> for now.  The point is that replication now has a backlog of 106000 PUTs 
> requests, roughly 125GB.
>
> A graph using the --stats patch I submitted to radosgw-agent:
>
>
> X axis is time in UTC.
> The left axis is (to use pseudo code) sum( length( DataSyncer.get_log_entries(
> *allshards))).  This is the scale for the green line.
> The right axis is the delta between now and the oldest entry's timestamp from 
> all shards.  This is the scale for the blue area.
>
> I only have 2 buckets that are actively being written to.  Replication 
> entries cap at 1000 entries per shard, so that's why the green line levels 
> off at 2000.
>
>
> I restarted replication at 08:15 PDT:
>
>
>
> Both this graph and the radosgw-agent logs say that it took about 1h15m to 
> catch up.  If true, that would require a transfer rate (uncompressed) of 
> about 225 Mbps.  My inter-datacenter bandwidth graphs show a peak of 50 Mbps, 
> and an average of 20 Mbps.  This link is capped at 100 Mbps.
>
> During the 8h30m that replication was stalled,  the master zone's raw cluster 
> storage went from 7.99TB to 8.34TB.  The import

Re: [ceph-users] Performance issues running vmfs on top of Ceph

2014-02-04 Thread McNamara, Bradley
Just for clarity since I didn't see it explained, but how are you accessing 
Ceph using ESXI?  Is it via iscsi or NFS?  Thanks.

Brad McNamara

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Maciej Bonin
Sent: Tuesday, February 04, 2014 11:01 AM
To: Maciej Bonin; Mark Nelson; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

Hello again,

Having said that we seem to have improved the performance by following 
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665
 after we figured out there might be a mismatch between reported and actually 
supported capabilities.
Thank you for your time Mark and Neil.

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements ISO 
27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 500 
EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Maciej Bonin
Sent: 04 February 2014 18:21
To: Mark Nelson; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

Hello Mark,

Thanks for getting back to me. We do have a couple of vms running that were
migrated off xen that are fine; performance in rados bench is what one would
expect (maxing the 4x gigabit bond).
The only other time I've noticed similar issues is when running mkfs.ext[3-4]
on new images, which took ridiculously long on xen-pv and kvm and even longer
under esxi. We have a vmfs image with configuration files for the guests, and
when we try to wget an iso into the shared config volume to install another vm
via esxi we don't get very far (we checked the uplink etc., and everything up
to the way vmfs works on top of ceph seems ok). My thought is that something in
the way vmfs thin-provisions space is causing problems with ceph's own thin
provisioning; my colleague is testing different block sizes, with no luck so
far in getting any sort of improvement.
And yes, we are using iscsi via tgtd from the ceph-extras repo, I believe (in
response to a message I just noticed come in while I was typing).

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements ISO 
27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 500 
EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: 04 February 2014 18:11
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

On 02/04/2014 11:55 AM, Maciej Bonin wrote:
> Hello guys,
>
> We're testing running an esxi hv on top of a ceph backend and we're getting
> abysmal performance when using vmfs. Has anyone else tried this successfully?
> Any advice?
> Would be really thankful for any hints.

Hi!

I don't have a ton of experience with esxi, but if you can do some rados bench 
or smalliobenchfs tests, that might help give you an idea if the problem is 
Ceph (or lower), or more related to something higher up closer to esxi.  Can
you describe a little more what you are seeing and what you expect?

Thanks,
Mark

>
> Regards,
> Maciej Bonin
> Systems Engineer | M247 Limited
> M247.com  Connected with our Customers Contact us today to discuss 
> your hosting and connectivity requirements ISO 27001 | ISO 9001 | 
> Deloitte Technology Fast 50 | Deloitte Technology Fast 500 EMEA | 
> Sunday Times Tech Track 100
> M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra 
> Court, Manchester, M32 0QT
>
> ISO 27001 Data Protection Classification: A - Public
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-

Re: [ceph-users] Low RBD Performance

2014-02-04 Thread Mark Nelson

On 02/04/2014 01:08 PM, Gruher, Joseph R wrote:




-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com]
Sent: Tuesday, February 04, 2014 9:46 AM
To: Gruher, Joseph R
Cc: Mark Nelson; ceph-users@lists.ceph.com; Ilya Dryomov
Subject: Re: [ceph-users] Low RBD Performance

On Tue, Feb 4, 2014 at 9:29 AM, Gruher, Joseph R
 wrote:




-Original Message-
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: Monday, February 03, 2014 6:48 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Low RBD Performance

On 02/03/2014 07:29 PM, Gruher, Joseph R wrote:

Hi folks-

I'm having trouble demonstrating reasonable performance of RBDs.
I'm running Ceph 0.72.2 on Ubuntu 13.04 with the 3.12 kernel.  I
have four dual-Xeon servers, each with 24GB RAM, and an Intel 320
SSD for journals and four WD 10K RPM SAS drives for OSDs, all
connected with an LSI 1078.  This is just a lab experiment using
scrounged hardware so everything isn't sized to be a Ceph cluster,
it's just what I have lying around, but I should have more than
enough CPU and memory

resources.

Everything is connected with a single 10GbE.

When testing with RBDs from four clients (also running Ubuntu 13.04
with
3.12 kernel) I am having trouble breaking 300 IOPS on a 4KB random
read or write workload (cephx set to none, replication set to one).
IO is generated using FIO from four clients, each hosting a single
1TB RBD, and I've experimented with queue depths and increasing the
number of RBDs without any benefit.  300 IOPS for a pool of 16 10K
RPM HDDs seems quite low, not to mention the journal should provide
a good boost on write workloads.  When I run a 4KB object write
workload in Cosbench I can approach 3500 Obj/Sec which seems more

reasonable.


Sample FIO configuration:

[global]

ioengine=libaio

direct=1

ramp_time=300

runtime=300

[4k-rw]

description=4k-rw

filename=/dev/rbd1

rw=randwrite

bs=4k

stonewall

I use --iodepth=X on the FIO command line to set the queue depth
when testing.

I notice in the FIO output despite the iodepth setting it seems to
be reporting an IO depth of only 1, which would certainly help
explain poor performance, but I'm at a loss as to why, I wonder if
it could be something specific to RBD behavior, like I need to use a
different IO engine to establish queue depth.

IO depths: 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%

Any thoughts appreciated!


Interesting results with the io depth at 1.  I Haven't seen that
behaviour when using libaio, direct=1, and higher io depths.  Is this kernel

RBD or QEMU/KVM?

If it's QEMU/KVM, is it the libvirt driver?

Certainly 300 IOPS is low for that kind of setup compared to what
we've seen for RBD on other systems (especially with 1x replication).
Given that you are seeing more reasonable performance with RGW, I
guess I'd look at a couple
things:

- Figure out why fio is reporting queue depth = 1


Yup, I agree, I will work on this and report back.  First thought is to try

specifying the queue depth in the FIO workload file instead of on the
command line.



- Does increasing the num jobs help (ie get concurrency another way)?


I will give this a shot.


- Do you have enough PGs in the RBD pool?


I should, for 16 OSDs and no replication I use 2048 PGs/PGPs (100 * 16 / 1

rounded up to power of 2).



- Are you using the virtio driver if QEMU/KVM?


No virtualization, clients are bare metal using kernel RBD.


I believe that directIO via the kernel client will go all the way to the OSDs 
and
to disk before returning. I imagine that something in the stack is preventing
the dispatch from actually happening asynchronously in that case, and the
reason you're getting 300 IOPS is because your total RTT is about 3 ms with
that code...

Ilya, is that assumption of mine correct? One thing that occurs to me is that 
for
direct IO it's fair to use the ack instead of on-disk response from the OSDs,
although that would only help us for people using btrfs.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Ultimately this seems to be an FIO issue.  If I use "--iodepth X" or "--iodepth=X" on the 
FIO command line I always get queue depth 1.  After switching to specifying "iodepth=X" in the body 
of the FIO workload file I do get the desired queue depth and I can immediately see performance is much 
higher (a full re-test is underway, I can share some results when complete if anyone is curious).  This seems 
to have effectively worked around the problem, although I'm still curious why the command line parameters 
don't have the desired effect.  Thanks for the responses!



Strange!  I do most of our testing using the command line parameters as 
well.  What version of fio are you using?  Maybe there is a bug.  For 
what it's worth, I'm using --iodepth=X, and fio version 1.59 from the 
Ubuntu precise repository.


Mark


_

Re: [ceph-users] Low RBD Performance

2014-02-04 Thread Gruher, Joseph R


>-Original Message-
>From: Gregory Farnum [mailto:g...@inktank.com]
>Sent: Tuesday, February 04, 2014 9:46 AM
>To: Gruher, Joseph R
>Cc: Mark Nelson; ceph-users@lists.ceph.com; Ilya Dryomov
>Subject: Re: [ceph-users] Low RBD Performance
>
>On Tue, Feb 4, 2014 at 9:29 AM, Gruher, Joseph R
> wrote:
>>
>>
>>>-Original Message-
>>>From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
>>>boun...@lists.ceph.com] On Behalf Of Mark Nelson
>>>Sent: Monday, February 03, 2014 6:48 PM
>>>To: ceph-users@lists.ceph.com
>>>Subject: Re: [ceph-users] Low RBD Performance
>>>
>>>On 02/03/2014 07:29 PM, Gruher, Joseph R wrote:
 Hi folks-

 I'm having trouble demonstrating reasonable performance of RBDs.
 I'm running Ceph 0.72.2 on Ubuntu 13.04 with the 3.12 kernel.  I
 have four dual-Xeon servers, each with 24GB RAM, and an Intel 320
 SSD for journals and four WD 10K RPM SAS drives for OSDs, all
 connected with an LSI 1078.  This is just a lab experiment using
 scrounged hardware so everything isn't sized to be a Ceph cluster,
 it's just what I have lying around, but I should have more than
 enough CPU and memory
>>>resources.
 Everything is connected with a single 10GbE.

 When testing with RBDs from four clients (also running Ubuntu 13.04
 with
 3.12 kernel) I am having trouble breaking 300 IOPS on a 4KB random
 read or write workload (cephx set to none, replication set to one).
 IO is generated using FIO from four clients, each hosting a single
 1TB RBD, and I've experimented with queue depths and increasing the
 number of RBDs without any benefit.  300 IOPS for a pool of 16 10K
 RPM HDDs seems quite low, not to mention the journal should provide
 a good boost on write workloads.  When I run a 4KB object write
 workload in Cosbench I can approach 3500 Obj/Sec which seems more
>reasonable.

 Sample FIO configuration:

 [global]

 ioengine=libaio

 direct=1

 ramp_time=300

 runtime=300

 [4k-rw]

 description=4k-rw

 filename=/dev/rbd1

 rw=randwrite

 bs=4k

 stonewall

 I use --iodepth=X on the FIO command line to set the queue depth
 when testing.

 I notice in the FIO output despite the iodepth setting it seems to
 be reporting an IO depth of only 1, which would certainly help
 explain poor performance, but I'm at a loss as to why, I wonder if
 it could be something specific to RBD behavior, like I need to use a
 different IO engine to establish queue depth.

 IO depths: 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%

 Any thoughts appreciated!
>>>
>>>Interesting results with the io depth at 1.  I Haven't seen that
>>>behaviour when using libaio, direct=1, and higher io depths.  Is this kernel
>RBD or QEMU/KVM?
>>>If it's QEMU/KVM, is it the libvirt driver?
>>>
>>>Certainly 300 IOPS is low for that kind of setup compared to what
>>>we've seen for RBD on other systems (especially with 1x replication).
>>>Given that you are seeing more reasonable performance with RGW, I
>>>guess I'd look at a couple
>>>things:
>>>
>>>- Figure out why fio is reporting queue depth = 1
>>
>> Yup, I agree, I will work on this and report back.  First thought is to try
>specifying the queue depth in the FIO workload file instead of on the
>command line.
>>
>>>- Does increasing the num jobs help (ie get concurrency another way)?
>>
>> I will give this a shot.
>>
>>>- Do you have enough PGs in the RBD pool?
>>
>> I should, for 16 OSDs and no replication I use 2048 PGs/PGPs (100 * 16 / 1
>rounded up to power of 2).
>>
>>>- Are you using the virtio driver if QEMU/KVM?
>>
>> No virtualization, clients are bare metal using kernel RBD.
>
>I believe that directIO via the kernel client will go all the way to the OSDs 
>and
>to disk before returning. I imagine that something in the stack is preventing
>the dispatch from actually happening asynchronously in that case, and the
>reason you're getting 300 IOPS is because your total RTT is about 3 ms with
>that code...
>
>Ilya, is that assumption of mine correct? One thing that occurs to me is that 
>for
>direct IO it's fair to use the ack instead of on-disk response from the OSDs,
>although that would only help us for people using btrfs.
>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com

Ultimately this seems to be an FIO issue.  If I use "--iodepth X" or 
"--iodepth=X" on the FIO command line I always get queue depth 1.  After 
switching to specifying "iodepth=X" in the body of the FIO workload file I do 
get the desired queue depth and I can immediately see performance is much 
higher (a full re-test is underway, I can share some results when complete if 
anyone is curious).  This seems to have effectively worked around the problem, 
although I'm still curious why the command line parameters don't have the 
des

Re: [ceph-users] Performance issues running vmfs on top of Ceph

2014-02-04 Thread Maciej Bonin
Hello again,

Having said that we seem to have improved the performance by following 
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665
 after we figured out there might be a mismatch between reported and actually 
supported capabilities.
Thank you for your time Mark and Neil.

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements
ISO 27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 
500 EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Maciej Bonin
Sent: 04 February 2014 18:21
To: Mark Nelson; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

Hello Mark,

Thanks for getting back to me. We do have a couple of vms running that were
migrated off xen that are fine; performance in rados bench is what one would
expect (maxing the 4x gigabit bond).
The only other time I've noticed similar issues is when running mkfs.ext[3-4]
on new images, which took ridiculously long on xen-pv and kvm and even longer
under esxi. We have a vmfs image with configuration files for the guests, and
when we try to wget an iso into the shared config volume to install another vm
via esxi we don't get very far (we checked the uplink etc., and everything up
to the way vmfs works on top of ceph seems ok). My thought is that something in
the way vmfs thin-provisions space is causing problems with ceph's own thin
provisioning; my colleague is testing different block sizes, with no luck so
far in getting any sort of improvement.
And yes, we are using iscsi via tgtd from the ceph-extras repo, I believe (in
response to a message I just noticed come in while I was typing).

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements ISO 
27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 500 
EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: 04 February 2014 18:11
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

On 02/04/2014 11:55 AM, Maciej Bonin wrote:
> Hello guys,
>
> We're testing running an esxi hv on top of a ceph backend and we're getting
> abysmal performance when using vmfs. Has anyone else tried this successfully?
> Any advice?
> Would be really thankful for any hints.

Hi!

I don't have a ton of experience with esxi, but if you can do some rados bench 
or smalliobenchfs tests, that might help give you an idea if the problem is 
Ceph (or lower), or more related to something higher up closer to esxi.  Can
you describe a little more what you are seeing and what you expect?

Thanks,
Mark

>
> Regards,
> Maciej Bonin
> Systems Engineer | M247 Limited
> M247.com  Connected with our Customers Contact us today to discuss 
> your hosting and connectivity requirements ISO 27001 | ISO 9001 | 
> Deloitte Technology Fast 50 | Deloitte Technology Fast 500 EMEA | 
> Sunday Times Tech Track 100
> M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra 
> Court, Manchester, M32 0QT
>
> ISO 27001 Data Protection Classification: A - Public
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues running vmfs on top of Ceph

2014-02-04 Thread Maciej Bonin
Hello Mark,

Thanks for getting back to me. We do have a couple of vms running that were
migrated off xen that are fine; performance in rados bench is what one would
expect (maxing the 4x gigabit bond).
The only other time I've noticed similar issues is when running mkfs.ext[3-4]
on new images, which took ridiculously long on xen-pv and kvm and even longer
under esxi. We have a vmfs image with configuration files for the guests, and
when we try to wget an iso into the shared config volume to install another vm
via esxi we don't get very far (we checked the uplink etc., and everything up
to the way vmfs works on top of ceph seems ok). My thought is that something in
the way vmfs thin-provisions space is causing problems with ceph's own thin
provisioning; my colleague is testing different block sizes, with no luck so
far in getting any sort of improvement.
And yes, we are using iscsi via tgtd from the ceph-extras repo, I believe (in
response to a message I just noticed come in while I was typing).

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements
ISO 27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 
500 EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: 04 February 2014 18:11
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

On 02/04/2014 11:55 AM, Maciej Bonin wrote:
> Hello guys,
>
> We're testing running an esxi hv on top of a ceph backend and we're getting
> abysmal performance when using vmfs. Has anyone else tried this successfully?
> Any advice?
> Would be really thankful for any hints.

Hi!

I don't have a ton of experience with esxi, but if you can do some rados bench 
or smalliobenchfs tests, that might help give you an idea if the problem is 
Ceph (or lower), or more related to something higher up closer to esxi.  Can
you describe a little more what you are seeing and what you expect?

Thanks,
Mark

>
> Regards,
> Maciej Bonin
> Systems Engineer | M247 Limited
> M247.com  Connected with our Customers Contact us today to discuss 
> your hosting and connectivity requirements ISO 27001 | ISO 9001 | 
> Deloitte Technology Fast 50 | Deloitte Technology Fast 500 EMEA | 
> Sunday Times Tech Track 100
> M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra 
> Court, Manchester, M32 0QT
>
> ISO 27001 Data Protection Classification: A - Public
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues running vmfs on top of Ceph

2014-02-04 Thread Neil Levine
Also, how are you accessing Ceph - is it using the TGT iSCSI package?


On Tue, Feb 4, 2014 at 10:10 AM, Mark Nelson wrote:

> On 02/04/2014 11:55 AM, Maciej Bonin wrote:
>
>> Hello guys,
>>
>> We're testing running an esxi hv on top of a ceph backend and we're
>> getting abysmal performance when using vmfs. Has anyone else tried this
>> successfully? Any advice?
>> Would be really thankful for any hints.
>>
>
> Hi!
>
> I don't have a ton of experience with esxi, but if you can do some rados
> bench or smalliobenchfs tests, that might help give you an idea if the
> problem is Ceph (or lower), or more related to something higher up closer
> to esxi.  Can you describe a little more what you are seeing and what you
> expect?
>
> Thanks,
> Mark
>
>
>
>> Regards,
>> Maciej Bonin
>> Systems Engineer | M247 Limited
>> M247.com  Connected with our Customers
>> Contact us today to discuss your hosting and connectivity requirements
>> ISO 27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology
>> Fast 500 EMEA | Sunday Times Tech Track 100
>> M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra
>> Court, Manchester, M32 0QT
>>
>> ISO 27001 Data Protection Classification: A - Public
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues running vmfs on top of Ceph

2014-02-04 Thread Mark Nelson

On 02/04/2014 11:55 AM, Maciej Bonin wrote:

Hello guys,

We're testing running an esxi hv on top of a ceph backend and we're getting
abysmal performance when using vmfs. Has anyone else tried this successfully?
Any advice?
Would be really thankful for any hints.


Hi!

I don't have a ton of experience with esxi, but if you can do some rados 
bench or smalliobenchfs tests, that might help give you an idea if the 
problem is Ceph (or lower), or more related to something higher up 
closer to esxi.  Can you describe a little more what you are seeing and
what you expect?


Thanks,
Mark



Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements
ISO 27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 
500 EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT

ISO 27001 Data Protection Classification: A - Public


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance issues running vmfs on top of Ceph

2014-02-04 Thread Maciej Bonin
Hello guys,

We're testing running an esxi hv on top of a ceph backend and we're getting
abysmal performance when using vmfs. Has anyone else tried this successfully?
Any advice?
Would be really thankful for any hints.

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements
ISO 27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 
500 EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low RBD Performance

2014-02-04 Thread Gregory Farnum
On Tue, Feb 4, 2014 at 9:29 AM, Gruher, Joseph R
 wrote:
>
>
>>-Original Message-
>>From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
>>boun...@lists.ceph.com] On Behalf Of Mark Nelson
>>Sent: Monday, February 03, 2014 6:48 PM
>>To: ceph-users@lists.ceph.com
>>Subject: Re: [ceph-users] Low RBD Performance
>>
>>On 02/03/2014 07:29 PM, Gruher, Joseph R wrote:
>>> Hi folks-
>>>
>>> I'm having trouble demonstrating reasonable performance of RBDs.  I'm
>>> running Ceph 0.72.2 on Ubuntu 13.04 with the 3.12 kernel.  I have four
>>> dual-Xeon servers, each with 24GB RAM, and an Intel 320 SSD for
>>> journals and four WD 10K RPM SAS drives for OSDs, all connected with
>>> an LSI 1078.  This is just a lab experiment using scrounged hardware
>>> so everything isn't sized to be a Ceph cluster, it's just what I have
>>> lying around, but I should have more than enough CPU and memory
>>resources.
>>> Everything is connected with a single 10GbE.
>>>
>>> When testing with RBDs from four clients (also running Ubuntu 13.04
>>> with
>>> 3.12 kernel) I am having trouble breaking 300 IOPS on a 4KB random
>>> read or write workload (cephx set to none, replication set to one).
>>> IO is generated using FIO from four clients, each hosting a single 1TB
>>> RBD, and I've experimented with queue depths and increasing the number
>>> of RBDs without any benefit.  300 IOPS for a pool of 16 10K RPM HDDs
>>> seems quite low, not to mention the journal should provide a good
>>> boost on write workloads.  When I run a 4KB object write workload in
>>> Cosbench I can approach 3500 Obj/Sec which seems more reasonable.
>>>
>>> Sample FIO configuration:
>>>
>>> [global]
>>>
>>> ioengine=libaio
>>>
>>> direct=1
>>>
>>> ramp_time=300
>>>
>>> runtime=300
>>>
>>> [4k-rw]
>>>
>>> description=4k-rw
>>>
>>> filename=/dev/rbd1
>>>
>>> rw=randwrite
>>>
>>> bs=4k
>>>
>>> stonewall
>>>
>>> I use --iodepth=X on the FIO command line to set the queue depth when
>>> testing.
>>>
>>> I notice in the FIO output despite the iodepth setting it seems to be
>>> reporting an IO depth of only 1, which would certainly help explain
>>> poor performance, but I'm at a loss as to why, I wonder if it could be
>>> something specific to RBD behavior, like I need to use a different IO
>>> engine to establish queue depth.
>>>
>>> IO depths: 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> >=64=0.0%
>>>
>>> Any thoughts appreciated!
>>
>>Interesting results with the io depth at 1.  I haven't seen that behaviour 
>>when
>>using libaio, direct=1, and higher io depths.  Is this kernel RBD or QEMU/KVM?
>>If it's QEMU/KVM, is it the libvirt driver?
>>
>>Certainly 300 IOPS is low for that kind of setup compared to what we've seen
>>for RBD on other systems (especially with 1x replication).  Given that you are
>>seeing more reasonable performance with RGW, I guess I'd look at a couple
>>things:
>>
>>- Figure out why fio is reporting queue depth = 1
>
> Yup, I agree, I will work on this and report back.  First thought is to try 
> specifying the queue depth in the FIO workload file instead of on the command 
> line.
>
>>- Does increasing the num jobs help (ie get concurrency another way)?
>
> I will give this a shot.
>
>>- Do you have enough PGs in the RBD pool?
>
> I should, for 16 OSDs and no replication I use 2048 PGs/PGPs (100 * 16 / 1 
> rounded up to power of 2).
>
>>- Are you using the virtio driver if QEMU/KVM?
>
> No virtualization, clients are bare metal using kernel RBD.

I believe that directIO via the kernel client will go all the way to
the OSDs and to disk before returning. I imagine that something in the
stack is preventing the dispatch from actually happening
asynchronously in that case, and the reason you're getting 300 IOPS is
because your total RTT is about 3 ms with that code...
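
(As a rough sanity check on that: with an effective queue depth of 1, IOPS is
roughly 1 / per-IO latency, and 1 / 0.003 s is ~333 IOPS, which lines up with
the ~300 IOPS being observed.)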

Ilya, is that assumption of mine correct? One thing that occurs to me
is that for direct IO it's fair to use the ack instead of on-disk
response from the OSDs, although that would only help us for people
using btrfs.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low RBD Performance

2014-02-04 Thread Gruher, Joseph R


>-Original Message-
>From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
>boun...@lists.ceph.com] On Behalf Of Mark Nelson
>Sent: Monday, February 03, 2014 6:48 PM
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] Low RBD Performance
>
>On 02/03/2014 07:29 PM, Gruher, Joseph R wrote:
>> Hi folks-
>>
>> I'm having trouble demonstrating reasonable performance of RBDs.  I'm
>> running Ceph 0.72.2 on Ubuntu 13.04 with the 3.12 kernel.  I have four
>> dual-Xeon servers, each with 24GB RAM, and an Intel 320 SSD for
>> journals and four WD 10K RPM SAS drives for OSDs, all connected with
>> an LSI 1078.  This is just a lab experiment using scrounged hardware
>> so everything isn't sized to be a Ceph cluster, it's just what I have
>> lying around, but I should have more than enough CPU and memory
>resources.
>> Everything is connected with a single 10GbE.
>>
>> When testing with RBDs from four clients (also running Ubuntu 13.04
>> with
>> 3.12 kernel) I am having trouble breaking 300 IOPS on a 4KB random
>> read or write workload (cephx set to none, replication set to one).
>> IO is generated using FIO from four clients, each hosting a single 1TB
>> RBD, and I've experimented with queue depths and increasing the number
>> of RBDs without any benefit.  300 IOPS for a pool of 16 10K RPM HDDs
>> seems quite low, not to mention the journal should provide a good
>> boost on write workloads.  When I run a 4KB object write workload in
>> Cosbench I can approach 3500 Obj/Sec which seems more reasonable.
>>
>> Sample FIO configuration:
>>
>> [global]
>>
>> ioengine=libaio
>>
>> direct=1
>>
>> ramp_time=300
>>
>> runtime=300
>>
>> [4k-rw]
>>
>> description=4k-rw
>>
>> filename=/dev/rbd1
>>
>> rw=randwrite
>>
>> bs=4k
>>
>> stonewall
>>
>> I use --iodepth=X on the FIO command line to set the queue depth when
>> testing.
>>
>> I notice in the FIO output despite the iodepth setting it seems to be
>> reporting an IO depth of only 1, which would certainly help explain
>> poor performance, but I'm at a loss as to why, I wonder if it could be
>> something specific to RBD behavior, like I need to use a different IO
>> engine to establish queue depth.
>>
>> IO depths: 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>> >=64=0.0%
>>
>> Any thoughts appreciated!
>
>Interesting results with the io depth at 1.  I haven't seen that behaviour when
>using libaio, direct=1, and higher io depths.  Is this kernel RBD or QEMU/KVM?
>If it's QEMU/KVM, is it the libvirt driver?
>
>Certainly 300 IOPS is low for that kind of setup compared to what we've seen
>for RBD on other systems (especially with 1x replication).  Given that you are
>seeing more reasonable performance with RGW, I guess I'd look at a couple
>things:
>
>- Figure out why fio is reporting queue depth = 1

Yup, I agree, I will work on this and report back.  First thought is to try 
specifying the queue depth in the FIO workload file instead of on the command 
line.
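
Something roughly like this is what I'll try (just a sketch; iodepth=32 and
numjobs=4 are arbitrary values picked simply to confirm fio actually reports
a deeper queue):

[global]
ioengine=libaio
direct=1
iodepth=32
numjobs=4
ramp_time=300
runtime=300

[4k-randwrite]
description=4k-randwrite
filename=/dev/rbd1
rw=randwrite
bs=4k
stonewall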

>- Does increasing the num jobs help (ie get concurrency another way)?

I will give this a shot.

>- Do you have enough PGs in the RBD pool?

I should, for 16 OSDs and no replication I use 2048 PGs/PGPs (100 * 16 / 1 
rounded up to power of 2).
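(That is, 100 * 16 / 1 = 1600, and the next power of two above that is 2048,
hence pg_num = pgp_num = 2048.)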

>- Are you using the virtio driver if QEMU/KVM?

No virtualization, clients are bare metal using kernel RBD.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor data distribution

2014-02-04 Thread Dominik Mostowiec
Hi,
Thanks for your help!!
We've run 'ceph osd reweight-by-utilization 105' again.
The cluster is stuck at 10387 active+clean, 237 active+remapped;
more info is in the attachments.

--
Regards
Dominik


2014-02-04 Sage Weil :
> Hi,
>
> I spent a couple hours looking at your map because it did look like there
> was something wrong.  After some experimentation and adding a bunch of
> improvements to osdmaptool to test the distribution, though, I think
> everything is working as expected.  For pool 3, your map has a standard
> deviation in utilizations of ~8%, and we should expect ~9% for this number
> of PGs.  For all pools, it is slightly higher (~9% vs expected ~8%).
> This is either just in the noise, or slightly confounded by the lack of
> the hashpspool flag on the pools (which slightly amplifies placement
> nonuniformity with multiple pools... not enough that it is worth changing
> anything though).
>
> The bad news is that that order of standard deviation results in a pretty
> wide min/max range of 118 to 202 PGs.  That seems a *bit* higher than what a
> perfectly random placement generates (I'm seeing a spread that is
> usually 50-70 PGs), but I think *that* is where the pool overlap (no
> hashpspool) is rearing its head; for just pool three the spread of 50 is
> about what is expected.
>
> Long story short: you have two options.  One is increasing the number of
> PGs.  Note that this helps but has diminishing returns (doubling PGs
> only takes you from ~8% to ~6% standard deviation, quadrupling to ~4%).
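> (Roughly speaking, the relative standard deviation falls off like
> 1/sqrt(PGs per OSD), so doubling the PG count gives about 8%/sqrt(2) ~= 5.7%
> and quadrupling gives 8%/sqrt(4) = 4%, which is where those numbers come
> from.)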
>
> The other is to use reweight-by-utilization.  That is the best approach,
> IMO.  I'm not sure why you were seeing PGs stuck in the remapped state
> after you did that, though, but I'm happy to dig into that too.
>
> BTW, the osdmaptool addition I was using to play with is here:
> https://github.com/ceph/ceph/pull/1178
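>
> With that branch, reproducing the check looks roughly like this (a sketch;
> the --test-map-pgs option comes from that pull request, so check
> osdmaptool --help on your build first):
>
>  $ ceph osd getmap -o /tmp/osdmap
>  $ osdmaptool /tmp/osdmap --test-map-pgs --pool 3
>
> That should report the per-OSD PG counts for the pool, i.e. the min/max
> spread discussed above.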
>
> sage
>
>
> On Mon, 3 Feb 2014, Dominik Mostowiec wrote:
>
>> In other words,
>> 1. we've got 3 racks ( 1 replica per rack )
>> 2. in every rack we have 3 hosts
>> 3. every host has 22 OSDs
>> 4. all pg_num values are 2^n for every pool
>> 5. we enabled "crush tunables optimal".
>> 6. on every machine we disabled 4 unused disks (osd out, osd reweight
>> 0 and osd rm)
>>
>> Pool ".rgw.buckets": one OSD has 105 PGs and another one (on the same
>> machine) has 144 PGs (37% more!).
>> Other pools also have this problem. This is not efficient placement.
>>
>> --
>> Regards
>> Dominik
>>
>>
>> 2014-02-02 Dominik Mostowiec :
>> > Hi,
>> > For more info:
>> >   crush: http://dysk.onet.pl/link/r4wGK
>> >   osd_dump: http://dysk.onet.pl/link/I3YMZ
>> >   pg_dump: http://dysk.onet.pl/link/4jkqM
>> >
>> > --
>> > Regards
>> > Dominik
>> >
>> > 2014-02-02 Dominik Mostowiec :
>> >> Hi,
>> >> Hmm,
>> >> You are thinking about summing PGs from different pools on one OSD, I think.
>> >> But for one pool (.rgw.buckets), where I have almost all of my data, the PG
>> >> count on OSDs is also different.
>> >> For example, 105 vs 144 PGs from pool .rgw.buckets. In the first case it is
>> >> 52% disk usage, in the second 74%.
>> >>
>> >> --
>> >> Regards
>> >> Dominik
>> >>
>> >>
>> >> 2014-02-02 Sage Weil :
>> >>> It occurs to me that this (and other unexplained variance reports) could
>> >>> easily be the 'hashpspool' flag not being set.  The old behavior had the
>> >>> misfeature where consecutive pools' PGs would 'line up' on the same
>> >>> OSDs,
>> >>> so that 1.7 == 2.6 == 3.5 == 4.4 etc would map to the same nodes.  This
>> >>> tends to 'amplify' any variance in the placement.  The default is still 
>> >>> to
>> >>> use the old behavior for compatibility (this will finally change in
>> >>> firefly).
>> >>>
>> >>> You can do
>> >>>
>> >>>  ceph osd pool set <pool> hashpspool true
>> >>>
>> >>> to enable the new placement logic on an existing pool, but be warned that
>> >>> this will rebalance *all* of the data in the pool, which can be a very
>> >>> heavyweight operation...
>> >>>
>> >>> sage
>> >>>
>> >>>
>> >>> On Sun, 2 Feb 2014, Dominik Mostowiec wrote:
>> >>>
>>  Hi,
>>  After scrubbing, almost all PGs have an equal(~) number of objects.
>>  I found something else.
>>  On one host, the PG count on OSDs:
>>  OSD with small(52%) disk usage:
>>  count, pool
>>  105 3
>>   18 4
>>    3 5
>> 
>>  OSD with larger(74%) disk usage:
>>  144 3
>>   31 4
>>    2 5
>> 
>>  Pool 3 is .rgw.buckets (where almost all of the data is).
>>  Pool 4 is .log, where there is no data.
>> 
>>  Shouldn't the count of PGs be the same per OSD?
>>  Or maybe the PG hash algorithm is disrupted by the wrong PG count for pool
>>  '4'. There are 1440 PGs (this is not a power of 2).
>> 
>>  ceph osd dump:
>>  pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
>>  rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
>>  crash_replay_interval 45
>>  pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
>>  rjenkins pg_num 64 pgp_num 

[ceph-users] i'm stuck with one stuck pg

2014-02-04 Thread Ingo Ebel
Hey,

I've got a mini cluster and one pg I can't get clean.
I tried everything.

Maybe someone here has an idea?

Or can I just delete this pg somehow, since it has no objects...

I'm using these tunables:

tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50


> ceph health detail
HEALTH_WARN 1 pgs stuck unclean
pg 0.34 is stuck unclean for 113026.470082, current state
active+remapped, last acting [2,1]


> ceph osd tree
# id    weight  type name       up/down reweight
-1      4.5     root default
-2      4               host rokix
0       2                       osd.0   up      1
1       2                       osd.1   up      1
-3      0               host prometheus
-4      0               host linux-90as
-5      0.5             host classix
2       0.28                    osd.2   up      1


> ceph pg dump_stuck unclean
ok
pg_stat objects mip  degr unf  bytes  log  disklog  state            state_stamp                 v    reported  up   acting  last_scrub  scrub_stamp                 last_deep_scrub  deep_scrub_stamp
0.34    0       0    0    0    0      0    0        active+remapped  2014-02-02 17:38:47.604242  0'0  741:134   [2]  [2,1]   0'0         2014-02-02 10:28:15.657027  0'0              2014-02-02 10:28:15.657027



> ceph pg 0.34 query
{ "state": "active+remapped",
  "epoch": 741,
  "up": [
2],
  "acting": [
2,
1],
  "info": { "pgid": "0.34",
  "last_update": "0'0",
  "last_complete": "0'0",
  "log_tail": "0'0",
  "last_user_version": 0,
  "last_backfill": "MAX",
  "purged_snaps": "[]",
  "history": { "epoch_created": 1,
  "last_epoch_started": 654,
  "last_epoch_clean": 654,
  "last_epoch_split": 0,
  "same_up_since": 652,
  "same_interval_since": 653,
  "same_primary_since": 636,
  "last_scrub": "0'0",
  "last_scrub_stamp": "2014-02-02 10:28:15.657027",
  "last_deep_scrub": "0'0",
  "last_deep_scrub_stamp": "2014-02-02 10:28:15.657027",
  "last_clean_scrub_stamp": "2014-02-02 10:28:15.657027"},
  "stats": { "version": "0'0",
  "reported_seq": "134",
  "reported_epoch": "741",
  "state": "active+remapped",
  "last_fresh": "2014-02-03 13:10:51.994639",
  "last_change": "2014-02-02 17:38:47.604242",
  "last_active": "2014-02-03 13:10:51.994639",
  "last_clean": "2014-02-02 16:17:21.537236",
  "last_became_active": "0.00",
  "last_unstale": "2014-02-03 13:10:51.994639",
  "mapping_epoch": 652,
  "log_start": "0'0",
  "ondisk_log_start": "0'0",
  "created": 1,
  "last_epoch_clean": 654,
  "parent": "0.0",
  "parent_split_bits": 0,
  "last_scrub": "0'0",
  "last_scrub_stamp": "2014-02-02 10:28:15.657027",
  "last_deep_scrub": "0'0",
  "last_deep_scrub_stamp": "2014-02-02 10:28:15.657027",
  "last_clean_scrub_stamp": "2014-02-02 10:28:15.657027",
  "log_size": 0,
  "ondisk_log_size": 0,
  "stats_invalid": "0",
  "stat_sum": { "num_bytes": 0,
  "num_objects": 0,
  "num_object_clones": 0,
  "num_object_copies": 0,
  "num_objects_missing_on_primary": 0,
  "num_objects_degraded": 0,
  "num_objects_unfound": 0,
  "num_read": 0,
  "num_read_kb": 0,
  "num_write": 0,
  "num_write_kb": 0,
  "num_scrub_errors": 0,
  "num_shallow_scrub_errors": 0,
  "num_deep_scrub_errors": 0,
  "num_objects_recovered": 0,
  "num_bytes_recovered": 0,
  "num_keys_recovered": 0},
  "stat_cat_sum": {},
  "up": [
2],
  "acting": [
2,
1]},
  "empty": 1,
  "dne": 0,
  "incomplete": 0,
  "last_epoch_started": 654},
  "recovery_state": [
{ "name": "Started\/Primary\/Active",
  "enter_time": "2014-02-02 17:38:47.604212",
  "might_have_unfound": [],
  "recovery_progress": { "backfill_target": -1,
  "waiting_on_backfill": 0,
  "last_backfill_started": "0\/\/0\/\/-1",
  "backfill_info": { "begin": "0\/\/0\/\/-1",
  "end": "0\/\/0\/\/-1",
  "objects": []},
  "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
  "end": "0\/\/0\/\/-1",
  "objects": []},
  "backfills_in_flight": [],
  "recovering": [],
  "pg_backend": { "pull_from_peer": [],
  "pushing": []}},
  "scrub": { "scrubber.epoch_start": "0",
  "scrubber.active": 0,
  "scrubber.block_writes": 0,
  "scrubber.finalizing": 0,
  "scrubber.waiting_on": 0,
  "scru