Re: [ceph-users] scrub error with ceph

2015-12-07 Thread GuangYang
Before issuing a repair, you may want to check whether those scrub errors point to one 
disk/OSD (or a small subset of them), and if so, whether the affected objects were all 
written within a specific time interval.

That is a large number of scrub errors for such a small cluster, which might be caused 
by a hardware issue.
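For example, something along these lines should show whether the inconsistent PGs share 
a common OSD (hammer-era CLI; the exact output format may differ slightly by release):

  ceph health detail | grep inconsistent   # lists each inconsistent PG with its acting set of OSDs
  ceph pg <pg_id> query                    # per-PG detail, including which OSDs hold the replicas
  ceph pg repair <pg_id>                   # repair one PG at a time, after ruling out a failing disk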


> Date: Mon, 7 Dec 2015 14:15:07 -0700 
> From: erm...@ualberta.ca 
> To: ceph-users@lists.ceph.com 
> Subject: [ceph-users] scrub error with ceph 
> 
> 
> Hi, 
> 
> I found there are 128 scrub errors in my ceph system. Checked with 
> health detail and found many pgs with stuck unclean issue. Should I 
> repair all of them? Or what I should do? 
> 
> [root@gcloudnet ~]# ceph -s 
> 
> cluster a4d0879f-abdc-4f9d-8a4b-53ce57d822f1 
> 
> health HEALTH_ERR 128 pgs inconsistent; 128 scrub errors; mds1: 
> Client HTRC:cephfs_data failing to respond to cache pressure; mds0: 
> Client physics-007:cephfs_data failing to respond to cache pressure; 
> pool 'cephfs_data' is full 
> 
> monmap e3: 3 mons at 
> {gcloudnet=xxx.xxx.xxx.xxx:6789/0,gcloudsrv1=xxx.xxx.xxx.xxx:6789/0,gcloudsrv2=xxx.xxx.xxx.xxx:6789/0},
>  
> election epoch 178, quorum 0,1,2 gcloudnet,gcloudsrv1,gcloudsrv2 
> 
> mdsmap e51000: 2/2/2 up {0=gcloudsrv1=up:active,1=gcloudnet=up:active} 
> 
> osdmap e2821: 18 osds: 18 up, 18 in 
> 
> pgmap v10457877: 3648 pgs, 23 pools, 10501 GB data, 38688 kobjects 
> 
> 14097 GB used, 117 TB / 130 TB avail 
> 
> 6 active+clean+scrubbing+deep 
> 
> 3513 active+clean 
> 
> 128 active+clean+inconsistent 
> 
> 1 active+clean+scrubbing 
> 
> 
> P.S. I am increasing the pg and pgp numbers for cephfs_data pool. 
> 
> Thanks, 
> 
> Erming 
> 
> 
> 
> -- 
> 
>  
> Erming Pei, Ph.D, Senior System Analyst 
> HPC Grid/Cloud Specialist, ComputeCanada/WestGrid 
> 
> Research Computing Group, IST 
> University of Alberta, Canada T6G 2H1 
> Email: erm...@ualberta.ca 
> erming@cern.ch 
> Tel. : +1 7804929914 Fax: +1 7804921729 
> 
> ___ ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd wasn't marked as down/out when it's storage folder was deleted

2015-12-07 Thread GuangYang
Detecting that is actually not something Ceph does.

Some of the files under that folder are only accessed while the OSD boots, so removing 
them does not cause a problem right away. For other files, the OSD keeps an open handle; 
even if you remove them from the filesystem, they are not erased because a handle still 
points to them (similar to a hard link).
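If you want to confirm that, a rough check (assuming lsof is available on the OSD node) 
is to look for deleted-but-still-open files, and then force the problem to surface:

  lsof -nP +L1 | grep ceph-osd    # files with link count 0 that the ceph-osd process still holds open
  ceph osd deep-scrub osd.0       # a deep scrub (or an OSD restart) will expose the missing objects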


> From: kane.ist...@gmail.com 
> Date: Mon, 7 Dec 2015 14:01:26 -0800 
> To: ceph-users@lists.ceph.com 
> Subject: [ceph-users] osd wasn't marked as down/out when it's storage 
> folder was deleted 
> 
> I've deleted: 
> rm -rf /var/lib/ceph/osd/ceph-0/current 
> 
> folder but looks like ceph never noticed that: 
> ceph osd df 
> ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 
> 4 0.09270 1.0 97231M 54661M 42570M 56.22 1.34 95 
> 0 0.09270 1.0 97231M 139M 97091M 0.14 0.00 69 
> 5 0.09270 1.0 97231M 42603M 54627M 43.82 1.05 73 
> 1 0.09270 1.0 97231M 51950M 45280M 53.43 1.28 91 
> 6 0.09270 1.0 97231M 50656M 46575M 52.10 1.25 88 
> 2 0.09270 1.0 97231M 43896M 53334M 45.15 1.08 76 
> TOTAL 569G 238G 331G 41.81 
> MIN/MAX VAR: 0.00/1.34 STDDEV: 19.15 
> 
> ceph -w: 
> health HEALTH_OK 
> 
> Waited approximately an hour and it never reported warning, nor copied 
> data back. 
> OSD log constantly filled with: 
> 2015-12-07 15:31:06.096017 7f007505d700 -1 
> filestore(/var/lib/ceph/osd/ceph-0) could not find 
> 1/08ce93cf/rbd_data.853f1f358346.13ca/head in index: (2) No 
> such file or directory 
> 2015-12-07 15:31:06.096023 7f007505d700 0 
> filestore(/var/lib/ceph/osd/ceph-0) write couldn't open 
> 1.4f_head/1/08ce93cf/rbd_data.853f1f358346.13ca/head: (2) 
> No such file or directory 
> 2015-12-07 15:31:06.096044 7f007505d700 -1 
> filestore(/var/lib/ceph/osd/ceph-0) could not find 
> 1/08ce93cf/rbd_data.853f1f358346.13ca/head in index: (2) No 
> such file or directory 
> 2015-12-07 15:31:06.249584 7f007585e700 -1 
> filestore(/var/lib/ceph/osd/ceph-0) could not find 
> 1/08ce93cf/rbd_data.853f1f358346.13ca/head in index: (2) No 
> such file or directory 
> 2015-12-07 15:31:06.249641 7f007585e700 -1 
> filestore(/var/lib/ceph/osd/ceph-0) could not find 
> 1/08ce93cf/rbd_data.853f1f358346.13ca/head in index: (2) No 
> such file or directory 
> 2015-12-07 15:31:06.249653 7f007585e700 0 
> filestore(/var/lib/ceph/osd/ceph-0) write couldn't open 
> 1.4f_head/1/08ce93cf/rbd_data.853f1f358346.13ca/head: (2) 
> No such file or directory 
> 2015-12-07 15:31:06.249692 7f007585e700 -1 
> filestore(/var/lib/ceph/osd/ceph-0) could not find 
> 1/08ce93cf/rbd_data.853f1f358346.13ca/head in index: (2) No 
> such file or directory 
> 
> Is it a bug or intended behavior? 
> 
> ___ ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgmap question

2015-09-17 Thread GuangYang
IIRC, the version is incremented whenever the stats of a PG change; that is probably why 
you see it changing with client I/O.
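A quick way to see this (just a sketch) is to watch the pgmap version while starting and 
stopping client I/O:

  watch -n 1 ceph pg stat   # the leading vNNNN is the pgmap version; it should only tick while PG stats change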

Thanks,
Guang


> Date: Thu, 17 Sep 2015 16:55:41 -0600
> From: rob...@leblancnet.us
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] pgmap question
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> My understanding was that pgmap changes only when the location of a PG
> changes due to backfill or recovery. However, watching ceph -w shows
> that it increments about every second even with a healthly cluster and
> client I/O. If there is no client I/O, it seems to not increment.
>
> What am I missing that causes the pgmap to change? Do these pgmap
> changes have to be computed by the monitors and distributed to the
> clients? Does the pgmap change constitute a CRUSH algorithm change?
>
> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.0.2
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJV+0ToCRDmVDuy+mK58QAA05sP/ji8sLjUKzmjobZNEqob
> ZPce1AAC3GazNu5XSUsf+mTHQD3A8T7OdqyLkw1qqMMauZW4qWYPQGtFFUKJ
> yTGTzwX1T8upd9mySIJ0rh1j/ZVscYlnTwt2iabooWv0syQ8fly6dCxZQyQn
> e6qMIqzMKeIxq2uwvXM/r3ft2RCTiCgmyGAPDx7IUUWNIoF7Nkkzezy5tJOF
> aGA6P3ibYgFQcKDqEafRm4WsPh7HqyDd/MC0vrw0QQsCtZSxIjoKzL5ZASQh
> +Fb046ewDAVRtViYsny27kvMwjNcSAEGESM8horZ7cDDLl9wmNqu2gXqE5/A
> GgKsvHc5ZkwZR4PwXyeO6XQUIaHoxuzUNYbyPXH0mIrLrmPWlL6FFjvSt5Qk
> segXcVF0p8STKMowEH9bn2K8ytKc0dxdWXptcdS8Zh90S2Xabnzdhlz8aUcC
> 1Z2xSlTN8aBwHtrlJxDvdKuN3XR4sZEYCtolhzgetH71aP35uUaFSoITLFX5
> ZrjuadXvwjEGPD6K0TM3L33D72G7gomH6Ws231Xpt9H1TG4YHI1Dz4SLNzcj
> ikbN30WyOcNha8QavSIlz7wd/RnR094E/9Su6NMEJcu14NsmhO+ykPc7hhMk
> SPe40P/AZzHJ87coC5jB0bd0rJrBVOyN262oSM48GEIsmDX+wyPiVRWI727x
> d+jS
> =gSWw
> -END PGP SIGNATURE-
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-17 Thread GuangYang
Which version are you using?

My guess is that the request (op) is waiting for a lock (possibly the ondisk_read_lock 
of the object); debug_osd=20 should be helpful to tell what actually happened to the op.

How do you tell that the I/O wait is near 0 (by top?)? 
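A rough way to capture that (assuming the default log locations) is:

  ceph tell osd.<id> injectargs '--debug_osd 20 --debug_ms 1'     # raise the log level at runtime on the primary OSD
  grep 'slow request' /var/log/ceph/ceph-osd.<id>.log             # find the op, then follow its id through the log
  ceph tell osd.<id> injectargs '--debug_osd 0/5 --debug_ms 0/5'  # turn it back down afterwards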

Thanks,
Guang

> From: ceph.l...@daevel.fr
> To: ceph-users@lists.ceph.com
> Date: Fri, 18 Sep 2015 02:43:49 +0200
> Subject: Re: [ceph-users] Lot of blocked operations
>
> Some additionnal informations :
> - I have 4 SSD per node.
> - the CPU usage is near 0
> - IO wait is near 0 too
> - bandwith usage is also near 0
>
> The whole cluster seems waiting for something... but I don't see what.
>
>
> Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a écrit :
>> Hi,
>>
>> I have a cluster with lot of blocked operations each time I try to
>> move
>> data (by reweighting a little an OSD).
>>
>> It's a full SSD cluster, with 10GbE network.
>>
>> In logs, when I have blocked OSD, on the main OSD I can see that :
>> 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
>> requests, 1 included below; oldest blocked for> 33.976680 secs
>> 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow request
>> 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
>> osd_op(client.29760717.1:18680817544
>> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
>> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
>> reached pg
>> 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
>> requests, 1 included below; oldest blocked for> 63.981596 secs
>> 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow request
>> 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
>> osd_op(client.29760717.1:18680817544
>> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
>> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
>> reached pg
>>
>> How should I read that ? What this OSD is waiting for ?
>>
>> Thanks for any help,
>>
>> Olivier
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-11 Thread GuangYang
If we are talking about requests being blocked for 60+ seconds, those tunings might not 
help (they do help a lot with average latency during recovery/backfill).

It would be interesting to see the OSD-side logs for those blocked requests (they are 
logged at level 0); a pattern to search for is "slow request \d+ seconds old".

I once had a problem where, for a recovery-candidate object, all updates to that object 
were stuck until it was recovered, which could take an extremely long time when there are 
a large number of PGs and objects to recover. But I believe that was resolved by Sam by 
allowing writes to degraded objects in Hammer.
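For reference, a sketch of how to dig through the logs and apply the usual runtime 
throttles (adjust the log path to your setup):

  grep -E 'slow request [0-9.]+ seconds old' /var/log/ceph/ceph-osd.*.log
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'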


> Date: Thu, 10 Sep 2015 14:56:12 -0600
> From: rob...@leblancnet.us
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are trying to add some additional OSDs to our cluster, but the
> impact of the backfilling has been very disruptive to client I/O and
> we have been trying to figure out how to reduce the impact. We have
> seen some client I/O blocked for more than 60 seconds. There has been
> CPU and RAM head room on the OSD nodes, network has been fine, disks
> have been busy, but not terrible.
>
> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
> (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
> S51G-1UL.
>
> Clients are QEMU VMs.
>
> [ulhglive-root@ceph5 current]# ceph --version
> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3
>
> [ulhglive-root@ceph5 current]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> health HEALTH_WARN
> 3 pgs backfill
> 1 pgs backfilling
> 4 pgs stuck unclean
> recovery 2382/33044847 objects degraded (0.007%)
> recovery 50872/33044847 objects misplaced (0.154%)
> noscrub,nodeep-scrub flag(s) set
> monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
> pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> 128 TB used, 322 TB / 450 TB avail
> 2382/33044847 objects degraded (0.007%)
> 50872/33044847 objects misplaced (0.154%)
> 2300 active+clean
> 3 active+remapped+wait_backfill
> 1 active+remapped+backfilling
> recovery io 70401 kB/s, 16 objects/s
> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed
> after setting up our pools, so our PGs are really out of wack. Our
> most active pool has only 256 PGs and each PG is about 120 GB is size.
> We are trying to clear out a pool that has way too many PGs so that we
> can split the PGs in that pool. I think these large PGs is part of our
> issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> the max latency sometimes up to 3000 ms down to a max of 500-700 ms.
> it has also reduced the huge swings in latency, but has also reduced
> throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the the
> OSD process gives the recovery threads a different disk priority or if
> changing the scheduler without restarting the OSD allows the OSD to
> use disk priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and
> peer before starting the backfill. This caused more problems than
> solved as we had blocked I/O (over 200 seconds) until we set the new
> OSDs to in.
>
> Even adding one OSD disk into the cluster is causing these slow I/O
> messages. We still have 5 more disks to add from this server and four
> more servers to add.
>
> In addition to trying to minimize these impacts, would it be better to
> split the PGs then add the rest of the servers, or add the servers
> then do the PG split. I'm thinking splitting first would be better,
> but I'd like to get other opinions.
>
> No spindle stays at high utilization for long and the await drops
> below 20 ms usually within 10 seconds so I/O should be serviced
> "pretty quick". My next guess is that the journals are getting full
> and blocking while waiting for flushes, but I'm not exactly sure how
> to identify that. We are using the defaults for the journal except for
> size (10G). We'd like to have journals large to handle bursts, but if
> they are getting filled with backfill traffic, it may be counter
> productive. Can/does backfill/recovery bypass the journal?
>
> Thanks,
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.0.2
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJV8e5qCRDmVDuy+mK58QAAaIwQAMN5DJlhrZkqwqsVXaKB
> 

Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread GuangYang

> Date: Fri, 4 Sep 2015 20:31:59 -0400
> From: ski...@redhat.com
> To: yguan...@outlook.com
> CC: bhi...@gmail.com; n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
>> IIRC, it only triggers the move (merge or split) when that folder is hit by 
>> a request, so most likely it happens gradually.
>
> Do you know what causes this?
A request (read/write/setxattr, etc.) hitting objects in that folder.
> I would like to be more clear "gradually".
>
> Shinobu
>
> - Original Message -
> From: "GuangYang" <yguan...@outlook.com>
> To: "Ben Hines" <bhi...@gmail.com>, "Nick Fisk" <n...@fisk.me.uk>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Saturday, September 5, 2015 9:27:31 AM
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
> IIRC, it only triggers the move (merge or split) when that folder is hit by a 
> request, so most likely it happens gradually.
>
> Another thing might be helpful (and we have had good experience with), is 
> that we do the folder splitting at the pool creation time, so that we avoid 
> the performance impact with runtime splitting (which is high if you have a 
> large cluster). In order to do that:
>
> 1. You will need to configure "filestore merge threshold" with a negative 
> value so that it disables merging.
> 2. When creating the pool, there is a parameter named "expected_num_objects", 
> by specifying that number, the folder will splitted to the right level with 
> the pool creation.
>
> Hope that helps.
>
> Thanks,
> Guang
>
>
> 
>> From: bhi...@gmail.com
>> Date: Fri, 4 Sep 2015 12:05:26 -0700
>> To: n...@fisk.me.uk
>> CC: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>
>> Yeah, i'm not seeing stuff being moved at all. Perhaps we should file
>> a ticket to request a way to tell an OSD to rebalance its directory
>> structure.
>>
>> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>> I've just made the same change ( 4 and 40 for now) on my cluster which is a 
>>> similar size to yours. I didn't see any merging happening, although most of 
>>> the directory's I looked at had more files in than the new merge threshold, 
>>> so I guess this is to be expected
>>>
>>> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
>>> bring things back into order.
>>>
>>>> -Original Message-
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Wang, Warren
>>>> Sent: 04 September 2015 01:21
>>>> To: Mark Nelson <mnel...@redhat.com>; Ben Hines <bhi...@gmail.com>
>>>> Cc: ceph-users <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>>
>>>> I'm about to change it on a big cluster too. It totals around 30 million, 
>>>> so I'm a
>>>> bit nervous on changing it. As far as I understood, it would indeed move
>>>> them around, if you can get underneath the threshold, but it may be hard to
>>>> do. Two more settings that I highly recommend changing on a big prod
>>>> cluster. I'm in favor of bumping these two up in the defaults.
>>>>
>>>> Warren
>>>>
>>>> -Original Message-
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Mark Nelson
>>>> Sent: Thursday, September 03, 2015 6:04 PM
>>>> To: Ben Hines <bhi...@gmail.com>
>>>> Cc: ceph-users <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>>
>>>> Hrm, I think it will follow the merge/split rules if it's out of whack 
>>>> given the
>>>> new settings, but I don't know that I've ever tested it on an existing 
>>>> cluster to
>>>> see that it actually happens. I guess let it sit for a while and then 
>>>> check the
>>>> OSD PG directories to see if the object counts make sense given the new
>>>> settings? :D
>>>>
>>>> Mark
>>>>
>>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
>>>>> Hey Mark,
>>>>>
>>>>> I've just tweaked these filestore settings for my clus

Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread GuangYang
IIRC, it only triggers the move (merge or split) when that folder is hit by a 
request, so most likely it happens gradually.

Another thing that might be helpful (and that we have had good experience with) is doing 
the folder splitting at pool creation time, so that we avoid the performance impact of 
runtime splitting (which can be high on a large cluster). In order to do that (a sketch 
follows the steps below):

1. Configure "filestore merge threshold" with a negative value so that merging is disabled.
2. When creating the pool, specify the parameter named "expected_num_objects"; with that 
number the folders are split to the right level at pool creation time.
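A minimal sketch (the pool name, PG counts, ruleset and object count are placeholders, 
not recommendations):

  # in ceph.conf on the OSD nodes, before creating the pool:
  #   filestore merge threshold = -10
  #   filestore split multiple = 8
  ceph osd pool create <pool> <pg_num> <pgp_num> replicated <ruleset> <expected_num_objects>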

Hope that helps.

Thanks,
Guang



> From: bhi...@gmail.com
> Date: Fri, 4 Sep 2015 12:05:26 -0700
> To: n...@fisk.me.uk
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
> Yeah, i'm not seeing stuff being moved at all. Perhaps we should file
> a ticket to request a way to tell an OSD to rebalance its directory
> structure.
>
> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk  wrote:
>> I've just made the same change ( 4 and 40 for now) on my cluster which is a 
>> similar size to yours. I didn't see any merging happening, although most of 
>> the directory's I looked at had more files in than the new merge threshold, 
>> so I guess this is to be expected
>>
>> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
>> bring things back into order.
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Wang, Warren
>>> Sent: 04 September 2015 01:21
>>> To: Mark Nelson ; Ben Hines 
>>> Cc: ceph-users 
>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>
>>> I'm about to change it on a big cluster too. It totals around 30 million, 
>>> so I'm a
>>> bit nervous on changing it. As far as I understood, it would indeed move
>>> them around, if you can get underneath the threshold, but it may be hard to
>>> do. Two more settings that I highly recommend changing on a big prod
>>> cluster. I'm in favor of bumping these two up in the defaults.
>>>
>>> Warren
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Mark Nelson
>>> Sent: Thursday, September 03, 2015 6:04 PM
>>> To: Ben Hines 
>>> Cc: ceph-users 
>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>
>>> Hrm, I think it will follow the merge/split rules if it's out of whack 
>>> given the
>>> new settings, but I don't know that I've ever tested it on an existing 
>>> cluster to
>>> see that it actually happens. I guess let it sit for a while and then check 
>>> the
>>> OSD PG directories to see if the object counts make sense given the new
>>> settings? :D
>>>
>>> Mark
>>>
>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
 Hey Mark,

 I've just tweaked these filestore settings for my cluster -- after
 changing this, is there a way to make ceph move existing objects
 around to new filestore locations, or will this only apply to newly
 created objects? (i would assume the latter..)

 thanks,

 -Ben

 On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
>>> wrote:
> Basically for each PG, there's a directory tree where only a certain
> number of objects are allowed in a given directory before it splits
> into new branches/leaves. The problem is that this has a fair amount
> of overhead and also there's extra associated dentry lookups to get at any
>>> given object.
>
> You may want to try something like:
>
> "filestore merge threshold = 40"
> "filestore split multiple = 8"
>
> This will dramatically increase the number of objects per directory
>>> allowed.
>
> Another thing you may want to try is telling the kernel to greatly
> favor retaining dentries and inodes in cache:
>
> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>
> Mark
>
>
> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>>
>> If I create a new pool it is generally fast for a short amount of time.
>> Not as fast as if I had a blank cluster, but close to.
>>
>> Bryn
>>>
>>> On 8 Jul 2015, at 13:55, Gregory Farnum  wrote:
>>>
>>> I think you're probably running into the internal PG/collection
>>> splitting here; try searching for those terms and seeing what your
>>> OSD folder structures look like. You could test by creating a new
>>> pool and seeing if it's faster or slower than the one you've already 
>>> filled
>>> up.
>>> -Greg
>>>
>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>>> 

Re: [ceph-users] Opensource plugin for pulling out cluster recovery and client IO metric

2015-08-29 Thread GuangYang

 Date: Fri, 28 Aug 2015 12:07:39 +0100
 From: gfar...@redhat.com
 To: vickey.singh22...@gmail.com
 CC: ceph-users@lists.ceph.com; ceph-us...@ceph.com; ceph-de...@vger.kernel.org
 Subject: Re: [ceph-users] Opensource plugin for pulling out cluster recovery 
 and client IO metric

 On Mon, Aug 24, 2015 at 4:03 PM, Vickey Singh
 vickey.singh22...@gmail.com wrote:
 Hello Ceph Geeks

 I am planning to develop a python plugin that pulls out cluster recovery IO
 and client IO operation metrics , that can be further used with collectd.

 For example , i need to take out these values

 recovery io 814 MB/s, 101 objects/s
 client io 85475 kB/s rd, 1430 kB/s wr, 32 op/s
The calculation *window* for those stats is very small; IIRC it is two PG versions, which 
most likely maps to about two seconds (an average over the last two seconds). You may 
increase mon_stat_smooth_intervals to enlarge the window, but I didn't try that myself.

I found that 'ceph status -f json' has better-formatted output and more information.


 Could you please help me in understanding how ceph -s and ceph -w outputs
 prints cluster recovery IO and client IO information.
 Where this information is coming from. Is it coming from perf dump ? If yes
 then which section of perf dump output is should focus on. If not then how
 can i get this values.

 I tried ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf dump , but
 it generates hell lot of information and i am confused which section of
 output should i use.
The perf counters contain a ton of information and it takes time to understand the 
details, but if the purpose is just to dump them as they are and do better 
aggregation/reporting, you can check 'perf schema' first to get the type of each field, 
then cross-check the perf counter definition for each type to decide how to 
collect/aggregate the data.
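For example (a sketch, reusing the admin socket path from your mail):

  ceph status -f json                                              # recovery/client IO rates are under the 'pgmap' section
  ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf schema   # per-field types, so you know which counters are cumulative
  ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf dump     # the raw values to aggregate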

 This information is generated only on the monitors based on pg stats
 from the OSDs, is slightly laggy, and can be most easily accessed by
 calling ceph -s on a regular basis. You can get it with json output
 that is easier to parse, and you can optionally set up an API server
 for more programmatic access. I'm not sure on the details of doing
 that last, though.
 -Greg
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

2015-08-13 Thread GuangYang
I don't see anything obvious, sorry.

It looks like something is off with osd.{5, 76, 38}, which are absent from the *up* set 
even though they are up. How about increasing the log level ('debug_osd = 20') on osd.76 
and restarting that OSD?
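For example (assuming a sysvinit-style setup; adjust the restart command to your distro):

  # in ceph.conf on the node hosting osd.76, so the setting survives the restart:
  #   [osd.76]
  #   debug osd = 20
  service ceph restart osd.76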

Thanks,
Guang



 Date: Thu, 13 Aug 2015 09:10:31 -0700
 Subject: Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 
 active+remapped
 From: sdain...@spd1.com
 To: yguan...@outlook.com
 CC: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com

 OSD tree: http://pastebin.com/3z333DP4
 Crushmap: http://pastebin.com/DBd9k56m

 I realize these nodes are quite large, I have plans to break them out
 into 12 OSD's/node.

 On Thu, Aug 13, 2015 at 9:02 AM, GuangYang yguan...@outlook.com wrote:
 Could you share the 'ceph osd tree dump' and CRUSH map dump ?

 Thanks,
 Guang


 
 Date: Thu, 13 Aug 2015 08:16:09 -0700
 From: sdain...@spd1.com
 To: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cluster health_warn 1 
 active+undersized+degraded/1 active+remapped

 I decided to set OSD 76 out and let the cluster shuffle the data off
 that disk and then brought the OSD back in. For the most part this
 seemed to be working, but then I had 1 object degraded and 88xxx
 objects misplaced:

 # ceph health detail
 HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded
 (0.000%); recovery 88844/66089446 objects misplaced (0.134%)
 pg 2.e7f is stuck unclean for 88398.251351, current state
 active+remapped, last acting [58,5]
 pg 2.143 is stuck unclean for 13892.364101, current state
 active+remapped, last acting [16,76]
 pg 2.968 is stuck unclean for 13892.363521, current state
 active+remapped, last acting [44,76]
 pg 2.5f8 is stuck unclean for 13892.377245, current state
 active+remapped, last acting [17,76]
 pg 2.81c is stuck unclean for 13892.363443, current state
 active+remapped, last acting [25,76]
 pg 2.1a3 is stuck unclean for 13892.364400, current state
 active+remapped, last acting [16,76]
 pg 2.2cb is stuck unclean for 13892.374390, current state
 active+remapped, last acting [14,76]
 pg 2.d41 is stuck unclean for 13892.373636, current state
 active+remapped, last acting [27,76]
 pg 2.3f9 is stuck unclean for 13892.373147, current state
 active+remapped, last acting [35,76]
 pg 2.a62 is stuck unclean for 86283.741920, current state
 active+remapped, last acting [2,38]
 pg 2.1b0 is stuck unclean for 13892.363268, current state
 active+remapped, last acting [3,76]
 recovery 1/66089446 objects degraded (0.000%)
 recovery 88844/66089446 objects misplaced (0.134%)

 I say apparently because with one object degraded, none of the pg's
 are showing degraded:
 # ceph pg dump_stuck degraded
 ok

 # ceph pg dump_stuck unclean
 ok
 pg_stat state up up_primary acting acting_primary
 2.e7f active+remapped [58] 58 [58,5] 58
 2.143 active+remapped [16] 16 [16,76] 16
 2.968 active+remapped [44] 44 [44,76] 44
 2.5f8 active+remapped [17] 17 [17,76] 17
 2.81c active+remapped [25] 25 [25,76] 25
 2.1a3 active+remapped [16] 16 [16,76] 16
 2.2cb active+remapped [14] 14 [14,76] 14
 2.d41 active+remapped [27] 27 [27,76] 27
 2.3f9 active+remapped [35] 35 [35,76] 35
 2.a62 active+remapped [2] 2 [2,38] 2
 2.1b0 active+remapped [3] 3 [3,76] 3

 All of the OSD filesystems are below 85% full.

 I then compared a 0.94.2 cluster that was new and had not been updated
 (current cluster is 0.94.2 which had been updated a couple times) and
 noticed the crush map had 'tunable straw_calc_version 1' so I added it
 to the current cluster.

 After the data moved around for about 8 hours or so I'm left with this 
 state:

 # ceph health detail
 HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects
 misplaced (0.025%)
 pg 2.e7f is stuck unclean for 149422.331848, current state
 active+remapped, last acting [58,5]
 pg 2.782 is stuck unclean for 64878.002464, current state
 active+remapped, last acting [76,31]
 recovery 16357/66089446 objects misplaced (0.025%)

 I attempted a pg repair on both of the pg's listed above, but it
 doesn't look like anything is happening. The doc's reference an
 inconsistent state as a use case for the repair command so that's
 likely why.

 These 2 pg's have been the issue throughout this process so how can I
 dig deeper to figure out what the problem is?

 # ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
 # ceph pg 2.e7f query: http://pastebin.com/0ntBfFK5


 On Wed, Aug 12, 2015 at 6:52 PM, yangyongp...@bwstor.com.cn
 yangyongp...@bwstor.com.cn wrote:
 You can try ceph pg repair pg_idto repair the unhealth pg.ceph health
 detail command is very useful to detect unhealth pgs.

 
 yangyongp...@bwstor.com.cn


 From: Steve Dainard
 Date: 2015-08-12 23:48
 To: ceph-users
 Subject: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1
 active+remapped
 I ran a ceph osd reweight-by-utilization yesterday and partway

Re: [ceph-users] OSD space imbalance

2015-08-13 Thread GuangYang
There are three factors that impact the disk utilization of an OSD:
 1. the number of PGs on the OSD (determined by CRUSH)
 2. the number of objects within each PG (it is better to pick a power-of-two PG count to 
make this more even)
 3. the deviation in object sizes

With 'ceph osd reweight-by-pg' you can tune (1). If you would like a better understanding 
of the root cause in your cluster, you can find more information in 'pg dump', which 
gives you the raw data for (1) and (2).

Once the cluster is filled, you probably have to go with 'ceph osd 
reweight-by-utilization'; be careful with that, since it can incur a lot of data 
movement...
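To pull the raw data, a rough sketch (column positions in 'pg dump' can vary by release):

  ceph osd df                                        # per-OSD utilization and PG count
  ceph pg dump | awk '/^[0-9]+\./ {print $1, $2}'    # PG id and object count, for spotting uneven PGs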


 To: ceph-users@lists.ceph.com
 From: vedran.fu...@gmail.com
 Date: Fri, 14 Aug 2015 00:15:17 +0200
 Subject: Re: [ceph-users] OSD space imbalance

 On 13.08.2015 18:01, GuangYang wrote:
 Try 'ceph osd reweight-by-pg <int>' right after creating the pools?

 Would it do any good now when pool is in use and nearly full as I can't
 re-create it now. Also, what's the integer argument in the command
 above? I failed to find proper explanation in the docs.
Please check it out here - 
https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L469
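My reading of that code (please double-check against your version) is that the integer is 
an oversubscription threshold in percent: OSDs holding more than that percentage of the 
average PG count get their reweight reduced. For example:

  ceph osd reweight-by-pg 110 <pool>   # reweight OSDs carrying more than 110% of the average number of PGs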

 What is the typical object size in the cluster?

 Around 50 MB.


 Thanks,
 Vedran

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD space imbalance

2015-08-13 Thread GuangYang
Try 'ceph osd reweight-by-pg int' right after creating the pools? What is the 
typical object size in the cluster?


Thanks,
Guang



 To: ceph-users@lists.ceph.com
 From: vedran.fu...@gmail.com
 Date: Thu, 13 Aug 2015 14:58:11 +0200
 Subject: [ceph-users] OSD space imbalance

 Hello,

 I'm having an issue where disk usages between OSDs aren't well balanced
 thus causing disk space to be wasted. Ceph is latest 0.94.2, used
 exclusively through cephfs. Re-weighting helps, but just slightly, and
 it has to be done on a daily basis causing constant refills. In the end
 I get OSD with 65% usage with some other going over 90%. I also set the
 ceph osd crush tunables optimal, but I didn't notice any changes when
 it comes to disk usage. Is there anything I can do to get them within
 10% range at least?

 health HEALTH_OK
 mdsmap e2577: 1/1/1 up, 2 up:standby
 osdmap e25239: 48 osds: 48 up, 48 in
 pgmap v3188836: 5184 pgs, 3 pools, 18028 GB data, 6385 kobjects
 36156 GB used, 9472 GB / 45629 GB avail
 5184 active+clean


 ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
 37 0.92999 1.0 950G 625G 324G 65.85 0.83
 21 0.92999 1.0 950G 649G 300G 68.35 0.86
 32 0.92999 1.0 950G 670G 279G 70.58 0.89
 7 0.92999 1.0 950G 676G 274G 71.11 0.90
 17 0.92999 1.0 950G 681G 268G 71.73 0.91
 40 0.92999 1.0 950G 689G 260G 72.55 0.92
 20 0.92999 1.0 950G 690G 260G 72.62 0.92
 25 0.92999 1.0 950G 691G 258G 72.76 0.92
 2 0.92999 1.0 950G 694G 256G 73.03 0.92
 39 0.92999 1.0 950G 697G 253G 73.35 0.93
 18 0.92999 1.0 950G 703G 247G 74.00 0.93
 47 0.92999 1.0 950G 703G 246G 74.05 0.93
 23 0.92999 0.86693 950G 704G 245G 74.14 0.94
 6 0.92999 1.0 950G 726G 224G 76.39 0.96
 8 0.92999 1.0 950G 727G 223G 76.54 0.97
 5 0.92999 1.0 950G 728G 222G 76.62 0.97
 35 0.92999 1.0 950G 728G 221G 76.66 0.97
 11 0.92999 1.0 950G 730G 220G 76.82 0.97
 43 0.92999 1.0 950G 730G 219G 76.87 0.97
 33 0.92999 1.0 950G 734G 215G 77.31 0.98
 38 0.92999 1.0 950G 736G 214G 77.49 0.98
 12 0.92999 1.0 950G 737G 212G 77.61 0.98
 31 0.92999 0.85184 950G 742G 208G 78.09 0.99
 28 0.92999 1.0 950G 745G 205G 78.41 0.99
 27 0.92999 1.0 950G 751G 199G 79.04 1.00
 10 0.92999 1.0 950G 754G 195G 79.40 1.00
 13 0.92999 1.0 950G 762G 188G 80.21 1.01
 9 0.92999 1.0 950G 763G 187G 80.29 1.01
 16 0.92999 1.0 950G 764G 186G 80.37 1.01
 0 0.92999 1.0 950G 778G 171G 81.94 1.03
 3 0.92999 1.0 950G 780G 170G 82.11 1.04
 41 0.92999 1.0 950G 780G 169G 82.13 1.04
 34 0.92999 0.87303 950G 783G 167G 82.43 1.04
 14 0.92999 1.0 950G 784G 165G 82.56 1.04
 42 0.92999 1.0 950G 786G 164G 82.70 1.04
 46 0.92999 1.0 950G 788G 162G 82.93 1.05
 30 0.92999 1.0 950G 790G 160G 83.12 1.05
 45 0.92999 1.0 950G 804G 146G 84.59 1.07
 44 0.92999 1.0 950G 807G 143G 84.92 1.07
 1 0.92999 1.0 950G 817G 132G 86.05 1.09
 22 0.92999 1.0 950G 825G 125G 86.81 1.10
 15 0.92999 1.0 950G 826G 123G 86.97 1.10
 19 0.92999 1.0 950G 829G 120G 87.30 1.10
 36 0.92999 1.0 950G 831G 119G 87.48 1.10
 24 0.92999 1.0 950G 831G 118G 87.50 1.10
 26 0.92999 1.0 950G 851G 101692M 89.55 1.13
 29 0.92999 1.0 950G 851G 101341M 89.59 1.13
 4 0.92999 1.0 950G 860G 92164M 90.53 1.14
 MIN/MAX VAR: 0.83/1.14 STDDEV: 5.94
 TOTAL 45629G 36156G 9473G 79.24

 Thanks,
 Vedran


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

2015-08-13 Thread GuangYang
Could you share the output of 'ceph osd tree' and a dump of the CRUSH map?
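i.e., something like:

  ceph osd tree
  ceph osd getcrushmap -o /tmp/crushmap.bin
  crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt   # decompile to a readable text map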

Thanks,
Guang



 Date: Thu, 13 Aug 2015 08:16:09 -0700
 From: sdain...@spd1.com
 To: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 
 active+remapped

 I decided to set OSD 76 out and let the cluster shuffle the data off
 that disk and then brought the OSD back in. For the most part this
 seemed to be working, but then I had 1 object degraded and 88xxx
 objects misplaced:

 # ceph health detail
 HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded
 (0.000%); recovery 88844/66089446 objects misplaced (0.134%)
 pg 2.e7f is stuck unclean for 88398.251351, current state
 active+remapped, last acting [58,5]
 pg 2.143 is stuck unclean for 13892.364101, current state
 active+remapped, last acting [16,76]
 pg 2.968 is stuck unclean for 13892.363521, current state
 active+remapped, last acting [44,76]
 pg 2.5f8 is stuck unclean for 13892.377245, current state
 active+remapped, last acting [17,76]
 pg 2.81c is stuck unclean for 13892.363443, current state
 active+remapped, last acting [25,76]
 pg 2.1a3 is stuck unclean for 13892.364400, current state
 active+remapped, last acting [16,76]
 pg 2.2cb is stuck unclean for 13892.374390, current state
 active+remapped, last acting [14,76]
 pg 2.d41 is stuck unclean for 13892.373636, current state
 active+remapped, last acting [27,76]
 pg 2.3f9 is stuck unclean for 13892.373147, current state
 active+remapped, last acting [35,76]
 pg 2.a62 is stuck unclean for 86283.741920, current state
 active+remapped, last acting [2,38]
 pg 2.1b0 is stuck unclean for 13892.363268, current state
 active+remapped, last acting [3,76]
 recovery 1/66089446 objects degraded (0.000%)
 recovery 88844/66089446 objects misplaced (0.134%)

 I say apparently because with one object degraded, none of the pg's
 are showing degraded:
 # ceph pg dump_stuck degraded
 ok

 # ceph pg dump_stuck unclean
 ok
 pg_stat state up up_primary acting acting_primary
 2.e7f active+remapped [58] 58 [58,5] 58
 2.143 active+remapped [16] 16 [16,76] 16
 2.968 active+remapped [44] 44 [44,76] 44
 2.5f8 active+remapped [17] 17 [17,76] 17
 2.81c active+remapped [25] 25 [25,76] 25
 2.1a3 active+remapped [16] 16 [16,76] 16
 2.2cb active+remapped [14] 14 [14,76] 14
 2.d41 active+remapped [27] 27 [27,76] 27
 2.3f9 active+remapped [35] 35 [35,76] 35
 2.a62 active+remapped [2] 2 [2,38] 2
 2.1b0 active+remapped [3] 3 [3,76] 3

 All of the OSD filesystems are below 85% full.

 I then compared a 0.94.2 cluster that was new and had not been updated
 (current cluster is 0.94.2 which had been updated a couple times) and
 noticed the crush map had 'tunable straw_calc_version 1' so I added it
 to the current cluster.

 After the data moved around for about 8 hours or so I'm left with this state:

 # ceph health detail
 HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects
 misplaced (0.025%)
 pg 2.e7f is stuck unclean for 149422.331848, current state
 active+remapped, last acting [58,5]
 pg 2.782 is stuck unclean for 64878.002464, current state
 active+remapped, last acting [76,31]
 recovery 16357/66089446 objects misplaced (0.025%)

 I attempted a pg repair on both of the pg's listed above, but it
 doesn't look like anything is happening. The doc's reference an
 inconsistent state as a use case for the repair command so that's
 likely why.

 These 2 pg's have been the issue throughout this process so how can I
 dig deeper to figure out what the problem is?

 # ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
 # ceph pg 2.e7f query: http://pastebin.com/0ntBfFK5


 On Wed, Aug 12, 2015 at 6:52 PM, yangyongp...@bwstor.com.cn
 yangyongp...@bwstor.com.cn wrote:
 You can try ceph pg repair pg_idto repair the unhealth pg.ceph health
 detail command is very useful to detect unhealth pgs.

 
 yangyongp...@bwstor.com.cn


 From: Steve Dainard
 Date: 2015-08-12 23:48
 To: ceph-users
 Subject: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1
 active+remapped
 I ran a ceph osd reweight-by-utilization yesterday and partway through
 had a network interruption. After the network was restored the cluster
 continued to rebalance but this morning the cluster has stopped
 rebalance and status will not change from:

 # ceph status
 cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
 health HEALTH_WARN
 1 pgs degraded
 1 pgs stuck degraded
 2 pgs stuck unclean
 1 pgs stuck undersized
 1 pgs undersized
 recovery 8163/66089054 objects degraded (0.012%)
 recovery 8194/66089054 objects misplaced (0.012%)
 monmap e24: 3 mons at
 {mon1=10.0.231.53:6789/0,mon2=10.0.231.54:6789/0,mon3=10.0.231.55:6789/0}
 election epoch 250, quorum 0,1,2 mon1,mon2,mon3
 osdmap e184486: 100 osds: 100 up, 100 in; 1 remapped pgs
 pgmap v3010985: 4144 pgs, 7 pools, 125 TB data, 32270 kobjects
 251 TB used, 111 TB / 363 TB avail
 

Re: [ceph-users] osd out

2015-08-12 Thread GuangYang
If you are using the default configuration for the pool (3 replicas), then after marking 
1 OSD out so that only 2 remain in, CRUSH cannot find enough OSDs (at least 3) to map 
each PG, so the PGs stay stuck unclean.
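To let the PGs go clean you can either lower the replication factor of the pool, or 
remove the OSD entirely instead of just marking it out. A sketch (the pool name is a 
placeholder):

  ceph osd pool set <pool> size 2   # with only two OSDs left in, size 3 can never be satisfied
  # or, to remove osd.0 for good:
  ceph osd out 0
  ceph osd crush remove osd.0
  ceph auth del osd.0
  ceph osd rm 0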


Thanks,
Guang



 From: chm...@yandex.ru
 Date: Wed, 12 Aug 2015 19:46:01 +0300
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] osd out

 Hello.
 Could you please help me to remove osd from cluster;

 # ceph osd tree
 ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 0.02998 root default
 -2 0.00999 host ceph1
 0 0.00999 osd.0 up 1.0 1.0
 -3 0.00999 host ceph2
 1 0.00999 osd.1 up 1.0 1.0
 -4 0.00999 host ceph3
 2 0.00999 osd.2 up 1.0 1.0


 # ceph -s
 cluster 64f87255-d56e-499d-8ebc-65e0f577e0aa
 health HEALTH_OK
 monmap e1: 3 mons at 
 {ceph1=10.0.0.101:6789/0,ceph2=10.0.0.102:6789/0,ceph3=10.0.0.103:6789/0}
 election epoch 10, quorum 0,1,2 ceph1,ceph2,ceph3
 osdmap e76: 3 osds: 3 up, 3 in
 pgmap v328: 128 pgs, 1 pools, 10 bytes data, 1 objects
 120 MB used, 45926 MB / 46046 MB avail
 128 active+clean


 # ceph osd out 0
 marked out osd.0.

 # ceph -w
 cluster 64f87255-d56e-499d-8ebc-65e0f577e0aa
 health HEALTH_WARN
 128 pgs stuck unclean
 recovery 1/3 objects misplaced (33.333%)
 monmap e1: 3 mons at 
 {ceph1=10.0.0.101:6789/0,ceph2=10.0.0.102:6789/0,ceph3=10.0.0.103:6789/0}
 election epoch 10, quorum 0,1,2 ceph1,ceph2,ceph3
 osdmap e79: 3 osds: 3 up, 2 in; 128 remapped pgs
 pgmap v332: 128 pgs, 1 pools, 10 bytes data, 1 objects
 89120 kB used, 30610 MB / 30697 MB avail
 1/3 objects misplaced (33.333%)
 128 active+remapped

 2015-08-12 18:43:12.412286 mon.0 [INF] pgmap v332: 128 pgs: 128 
 active+remapped; 10 bytes data, 89120 kB used, 30610 MB / 30697 MB avail; 1/3 
 objects misplaced (33.333%)
 2015-08-12 18:43:20.362337 mon.0 [INF] HEALTH_WARN; 128 pgs stuck unclean; 
 recovery 1/3 objects misplaced (33.333%)
 2015-08-12 18:44:15.055825 mon.0 [INF] pgmap v333: 128 pgs: 128 
 active+remapped; 10 bytes data, 89120 kB used, 30610 MB / 30697 MB avail; 1/3 
 objects misplaced (33.333%)


 and it never become active+clean .
 What I’m doing wrong ?
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw crash within libfcgi

2015-06-26 Thread GuangYang
Sadly we don't have a core dump from when the crash happened, so we are not able to dump 
the registers.

The latest status: we changed the rgw thread count from 600 to 300 and haven't seen the 
same crash since, but it is still hard to tell whether, and how, that is related.
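For reference, the knob we changed (a ceph.conf snippet; the instance name is whatever 
your gateway section is called):

  [client.radosgw.<instance>]
      rgw thread pool size = 300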

Thanks,
Guang

 Date: Wed, 24 Jun 2015 17:21:04 -0400
 From: yeh...@redhat.com
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: radosgw crash within libfcgi
 
 
 
 - Original Message -
 From: GuangYang 
 To: Yehuda Sadeh-Weinraub 
 Cc: ceph-de...@vger.kernel.org, ceph-users@lists.ceph.com
 Sent: Wednesday, June 24, 2015 2:12:23 PM
 Subject: RE: radosgw crash within libfcgi
 
 
 Date: Wed, 24 Jun 2015 17:04:05 -0400
 From: yeh...@redhat.com
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: radosgw crash within libfcgi



 - Original Message -
 From: GuangYang 
 To: Yehuda Sadeh-Weinraub 
 Cc: ceph-de...@vger.kernel.org, ceph-users@lists.ceph.com
 Sent: Wednesday, June 24, 2015 1:53:20 PM
 Subject: RE: radosgw crash within libfcgi

 Thanks Yehuda for the response.

 We already patched libfcgi to use poll instead of select to overcome the
 limitation.

 Thanks,
 Guang


 
 Date: Wed, 24 Jun 2015 14:40:25 -0400
 From: yeh...@redhat.com
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: radosgw crash within libfcgi



 - Original Message -
 From: GuangYang 
 To: ceph-de...@vger.kernel.org, ceph-users@lists.ceph.com,
 yeh...@redhat.com
 Sent: Wednesday, June 24, 2015 10:09:58 AM
 Subject: radosgw crash within libfcgi

 Hello Cephers,
 Recently we have several radosgw daemon crashes with the same following
 kernel log:

 Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
 7ffa069996f2 sp 7ff55c432710 error 6 in

 error 6 is sigabrt, right? With invalid pointer I'd expect to get segfault.
 Is the pointer actually invalid?
 With (ip - {address_load_the_sharded_library}) to get the instruction which
 caused this crash, the objdump shows the crash happened at instruction 46f2
 (see below), which was to assign '-1' to the CGX_Request::ipcFd to -1, but I
 don't quite understand how/why it could crash there.
 
 4690 :
 4690:   48 89 5c 24 f0  mov%rbx,-0x10(%rsp)
 4695:   48 89 6c 24 f8  mov%rbp,-0x8(%rsp)
 469a:   48 83 ec 18 sub$0x18,%rsp
 469e:   48 85 fftest   %rdi,%rdi
 46a1:   48 89 fbmov%rdi,%rbx
 46a4:   89 f5   mov%esi,%ebp
 46a6:   74 28   je 46d0 
 46a8:   48 8d 7f 08 lea0x8(%rdi),%rdi
 46ac:   e8 67 e3 ff ff  callq  2a18 
 46b1:   48 8d 7b 10 lea0x10(%rbx),%rdi
 46b5:   e8 5e e3 ff ff  callq  2a18 
 46ba:   48 8d 7b 18 lea0x18(%rbx),%rdi
 46be:   e8 55 e3 ff ff  callq  2a18 
 46c3:   48 8d 7b 28 lea0x28(%rbx),%rdi
 46c7:   e8 d4 f4 ff ff  callq  3ba0 
 46cc:   85 ed   test   %ebp,%ebp
 46ce:   75 10   jne46e0 
 46d0:   48 8b 5c 24 08  mov0x8(%rsp),%rbx
 46d5:   48 8b 6c 24 10  mov0x10(%rsp),%rbp
 46da:   48 83 c4 18 add$0x18,%rsp
 46de:   c3  retq
 46df:   90  nop
 46e0:   31 f6   xor%esi,%esi
 46e2:   83 7b 4c 00 cmpl   $0x0,0x4c(%rbx)
 46e6:   8b 7b 30mov0x30(%rbx),%edi
 46e9:   40 0f 94 c6 sete   %sil
 46ed:   e8 86 e6 ff ff  callq  2d78 
 46f2:   c7 43 30 ff ff ff ffmovl   $0x,0x30(%rbx)
 
 info registers?
 
 Not too familiar with the specific message, but it could be that 
 OS_IpcClose() aborts (not highly unlikely) and it only dumps the return 
 address of the current function (shouldn't be referenced as ip though).
 
 What's rbx? Is the memory at %rbx + 0x30 valid?
 
 Also, did you by any chance upgrade the binaries while the code was running? 
 is the code running over nfs?
 
 Yehuda
 

 Yehuda


 libfcgi.so.0.0.0[7ffa06995000+a000] in
 libfcgi.so.0.0.0[7ffa06995000+a000]

 Looking at the assembly, it seems crashing at this point -
 http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035, which
 confused me. I tried to see if there is any other reference holding the
 FCGX_Request which release the handle without any luck.

 There are also other observations:
 1 Several radosgw daemon across different hosts crashed around the same
 time.
 2 Apache's error log has some fcgi error complaining ##idle

Re: [ceph-users] radosgw crash within libfcgi

2015-06-24 Thread GuangYang
Thanks Yehuda for the response.

We already patched libfcgi to use poll instead of select to overcome the 
limitation.

Thanks,
Guang



 Date: Wed, 24 Jun 2015 14:40:25 -0400
 From: yeh...@redhat.com
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: radosgw crash within libfcgi



 - Original Message -
 From: GuangYang yguan...@outlook.com
 To: ceph-de...@vger.kernel.org, ceph-users@lists.ceph.com, yeh...@redhat.com
 Sent: Wednesday, June 24, 2015 10:09:58 AM
 Subject: radosgw crash within libfcgi

 Hello Cephers,
 Recently we have several radosgw daemon crashes with the same following
 kernel log:

 Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
 7ffa069996f2 sp 7ff55c432710 error 6 in
 libfcgi.so.0.0.0[7ffa06995000+a000] in libfcgi.so.0.0.0[7ffa06995000+a000]

 Looking at the assembly, it seems crashing at this point -
 http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035, which
 confused me. I tried to see if there is any other reference holding the
 FCGX_Request which release the handle without any luck.

 There are also other observations:
 1 Several radosgw daemon across different hosts crashed around the same
 time.
 2 Apache's error log has some fcgi error complaining ##idle timeout##
 during the time.

 Does anyone experience similar issue?


 In the past we've had issues with libfcgi that were related to the number of 
 open fds on the process ( 1024). The issue was a buggy libfcgi that was 
 using select() instead of poll(), so this might be the issue you're noticing.

 Yehuda
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at http://vger.kernel.org/majordomo-info.html
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw crash within libfcgi

2015-06-24 Thread GuangYang
Hello Cephers,
Recently we have had several radosgw daemon crashes, all with the same kernel log:

Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip 7ffa069996f2 
sp 7ff55c432710 error 6 in libfcgi.so.0.0.0[7ffa06995000+a000] in 
libfcgi.so.0.0.0[7ffa06995000+a000]

Looking at the assembly, it seems to be crashing at this point - 
http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035, which confused me. I 
tried to find any other reference holding the FCGX_Request that could release the handle, 
without any luck.

There are also some other observations:
 1. Several radosgw daemons across different hosts crashed around the same time.
 2. Apache's error log contains fcgi errors complaining about an ##idle timeout## during 
that period.

Has anyone experienced a similar issue? 

Thanks,
Guang 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] xattrs vs. omap with radosgw

2015-06-16 Thread GuangYang
Hi Cephers,
While looking at disk utilization on the OSDs, I noticed the disks were constantly busy 
with a large number of small writes. Further investigation showed that radosgw uses 
xattrs to store metadata (e.g. etag, content-type, etc.), which pushes the xattrs from 
inline (local) storage out to extents and incurs extra I/O.

I would like to check if anybody has experience with offloading the metadata to omap:
  1. Offload everything to omap? If so, should we make the inode size 512 bytes (instead 
of 2k)?
  2. Partially offload the metadata to omap, e.g. only the rgw-specific metadata.

Any sharing is deeply appreciated. Thanks!
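For anyone wanting to reproduce the observation, a rough sketch (paths and device names 
are placeholders, and changing the inode size requires reformatting the OSD):

  getfattr -d -m '.*' <path to an object file under current/>   # list the xattrs rgw leaves on an object
  xfs_info /var/lib/ceph/osd/ceph-<id> | grep isize              # current XFS inode size
  mkfs.xfs -i size=2048 /dev/<device>                            # larger inodes keep more xattr data inline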

Thanks,
Guang 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xattrs vs. omap with radosgw

2015-06-16 Thread GuangYang
After back-porting Sage's patch to Giant, the radosgw xattrs can be inlined. I haven't 
run extensive testing yet; I will update once I have some performance data to share.

Thanks,
Guang

 Date: Tue, 16 Jun 2015 15:51:44 -0500
 From: mnel...@redhat.com
 To: yguan...@outlook.com; s...@newdream.net
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: xattrs vs. omap with radosgw
 
 
 
 On 06/16/2015 03:48 PM, GuangYang wrote:
  Thanks Sage for the quick response.
 
  It is on Firefly v0.80.4.
 
  While trying to put with *rados* directly, the xattrs can be inline. The 
  problem comes to light when using radosgw, since we have a bunch of 
  metadata to keep via xattrs, including:
  rgw.idtag  : 15 bytes
  rgw.manifest :  381 bytes
 
 Ah, that manifest will push us over the limit afaik resulting in every 
 inode getting a new extent.
 
  rgw.acl : 121 bytes
  rgw.etag : 33 bytes
 
  Given the background, it looks like the problem is that the rgw.manifest is 
  too large so that XFS make it extents. If I understand correctly, if we 
  port the change to Firefly, we should be able to inline the inode since the 
  accumulated size is still less than 2K (please correct me if I am wrong 
  here).
 
 I think you are correct so long as the patch breaks that manifest down 
 into 254 byte or smaller chunks.
 
 
  Thanks,
  Guang
 
 
  
  Date: Tue, 16 Jun 2015 12:43:08 -0700
  From: s...@newdream.net
  To: yguan...@outlook.com
  CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
  Subject: Re: xattrs vs. omap with radosgw
 
  On Tue, 16 Jun 2015, GuangYang wrote:
  Hi Cephers,
  While looking at disk utilization on OSD, I noticed the disk was 
  constantly busy with large number of small writes, further investigation 
  showed that, as radosgw uses xattrs to store metadata (e.g. etag, 
  content-type, etc.), which made the xattrs get from local to extents, 
  which incurred extra I/O.
 
  I would like to check if anybody has experience with offloading the 
  metadata to omap:
  1 Offload everything to omap? If this is the case, should we make the 
  inode size as 512 (instead of 2k)?
  2 Partial offload the metadata to omap, e.g. only offloading the rgw 
  specified metadata to omap.
 
  Any sharing is deeply appreciated. Thanks!
 
  Hi Guang,
 
  Is this hammer or firefly?
 
  With hammer the size of object_info_t crossed the 255 byte boundary, which
  is the max xattr value that XFS can inline. We've since merged something
  that stripes over several small xattrs so that we can keep things inline,
  but it hasn't been backported to hammer yet. See
  c6cdb4081e366f471b372102905a1192910ab2da. Perhaps this is what you're
  seeing?
 
  I think we're still better off with larger XFS inodes and inline xattrs if
  it means we avoid leveldb at all for most objects.
 
  sage
--
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xattrs vs. omap with radosgw

2015-06-16 Thread GuangYang
Hi Yuan,
Thanks for sharing the link; it is an interesting read. My understanding of the test 
results is that, for a fixed total xattr size, using a smaller stripe size incurs higher 
read latency, which kind of makes sense since there are more key-value pairs and, at that 
total size, it has to go to extents anyway.

Correct me if I am wrong here...

Thanks,
Guang

 From: yuan.z...@intel.com
 To: s...@newdream.net; yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: RE: xattrs vs. omap with radosgw
 Date: Wed, 17 Jun 2015 01:32:35 +
 
 FWIW, there was some discussion in OpenStack Swift and their performance 
 tests showed 255 is not the best in recent XFS. They decided to use large 
 xattr boundary size(65535).
 
 https://gist.github.com/smerritt/5e7e650abaa20599ff34
 
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, June 17, 2015 3:43 AM
 To: GuangYang
 Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: xattrs vs. omap with radosgw
 
 On Tue, 16 Jun 2015, GuangYang wrote:
 Hi Cephers,
 While looking at disk utilization on OSD, I noticed the disk was constantly 
 busy with large number of small writes, further investigation showed that, 
 as radosgw uses xattrs to store metadata (e.g. etag, content-type, etc.), 
 which made the xattrs get from local to extents, which incurred extra I/O.
 
 I would like to check if anybody has experience with offloading the metadata 
 to omap:
   1 Offload everything to omap? If this is the case, should we make the 
 inode size as 512 (instead of 2k)?
   2 Partial offload the metadata to omap, e.g. only offloading the rgw 
 specified metadata to omap.
 
 Any sharing is deeply appreciated. Thanks!
 
 Hi Guang,
 
 Is this hammer or firefly?
 
 With hammer the size of object_info_t crossed the 255 byte boundary, which is 
 the max xattr value that XFS can inline. We've since merged something that 
 stripes over several small xattrs so that we can keep things inline, but it 
 hasn't been backported to hammer yet. See 
 c6cdb4081e366f471b372102905a1192910ab2da. Perhaps this is what you're seeing?
 
 I think we're still better off with larger XFS inodes and inline xattrs if it 
 means we avoid leveldb at all for most objects.
 
 sage
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at http://vger.kernel.org/majordomo-info.html
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] unfound object(s)

2015-06-15 Thread GuangYang
Thanks to Sam, we can use:
  ceph pg pg_id list_missing
to get the list of unfound objects.
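
For the archives, a slightly fuller sketch of how to dig the names out (the PG id 
is just an example, and the JSON field names may differ a bit by release):

  ceph health detail | grep unfound        # find the affected PG, e.g. 3.2f
  ceph pg 3.2f list_missing                # dumps the missing/unfound objects as JSON
  ceph pg 3.2f list_missing | grep '"oid"' # quick way to pull out just the object names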

Thanks,
Guang



 From: yguan...@outlook.com
 To: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Date: Mon, 15 Jun 2015 16:46:53 +
 Subject: [ceph-users] unfound object(s)

 Hello Cephers,
 On one of our production clusters, there is one *unfound* object reported 
 which makes the PG stuck in recovery. While trying to recover the object, I 
 failed to find a way to tell which object is unfound.

 I tried:
 1 PG query
 2 Grep from monitor log

 Did I miss anything?

 Thanks,
 Guang
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] unfound object(s)

2015-06-15 Thread GuangYang
Hello Cephers,
On one of our production clusters, there is one *unfound* object reported which 
makes the PG stuck in recovery. While trying to recover the object, I failed 
to find a way to tell which object is unfound.

I tried:
  1 PG query
  2 Grep from monitor log

Did I miss anything?  

Thanks,
Guang 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rgw geo-replication

2015-04-24 Thread GuangYang
Hi cephers,
Recently I have been investigating the geo-replication of rgw. From the example at 
[1], it looks like if we want to do geo data replication between us east and us 
west, we will need to build *one* (super) RADOS cluster that spans us east and 
west, and only deploy two different radosgw instances. Is my understanding 
correct here?

If that is the case, is there any reason preventing us from deploying two completely 
isolated clusters (not only rgw, but also mon and osd) and replicating data 
between them?

[1] 
https://ceph.com/docs/master/radosgw/federated-config/#multi-site-data-replication
 


Thanks,
Guang 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw geo-replication

2015-04-24 Thread GuangYang
___
 Date: Fri, 24 Apr 2015 17:29:40 +0530 
 From: vum...@redhat.com 
 To: ceph-users@lists.ceph.com 
 Subject: Re: [ceph-users] rgw geo-replication 
 
 
 On 04/24/2015 05:17 PM, GuangYang wrote: 
 
 Hi cephers, 
 Recently I have been investigating the geo-replication of rgw. From the example at 
 [1], it looks like if we want to do geo data replication between us east and 
 us west, we will need to build *one* (super) RADOS cluster that spans us 
 east and west, and only deploy two different radosgw instances. Is my 
 understanding correct here? 
 
 You can do that but it is not recommended , I think doc says it would 
 be very good if you have two clusters with different radosgw servers. 
 https://ceph.com/docs/master/radosgw/federated-config/#background 
 
 1. You may deploy a single Ceph Storage Cluster with a federated 
 architecture if you have low latency network connections (this isn’t 
 recommended). 
 
 2. You may also deploy one Ceph Storage Cluster per region with a 
 separate set of pools for each zone (typical). 
This confuses me. In the typical recommendation, would we need to deploy one 
cluster whose mon/osd span us east and us west (say, using CRUSH to create 
two sets of pools corresponding to the zones)? From a failure point of view, an 
outage of that single cluster would impact availability, as opposed to having two 
clusters and replicating data between them, where an outage of one cluster lets 
us redirect all traffic to the other.
 
 3. You may also deploy a separate Ceph Storage Cluster for each zone if 
 your requirements and resources warrant this level of redundancy.
I think this makes more sense: one region (which is only logical), two zones 
within that region, each zone (physically) mapping to a standalone cluster, 
with data replicated between those zones/clusters.

Is that supported?
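
For reference, a minimal sketch of how we would inspect the federated layout on 
each gateway host while setting this up (firefly-era radosgw-admin; the instance 
name is illustrative):

  radosgw-admin region get --name client.radosgw.us-east-1
  radosgw-admin zone get --name client.radosgw.us-east-1
  # the actual cross-zone data sync is then driven by a separate radosgw-agent
  # process pointed at the source and destination zone endpoints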

Thanks,
Guang

 
 Regards, 
 Vikhyat 
 
 
 
 If that is the case, is there any reason preventing us from deploying two 
 completely isolated clusters (not only rgw, but also mon and osd) and 
 replicating data between them? 
 
 [1] 
 https://ceph.com/docs/master/radosgw/federated-config/#multi-site-data-replication
  
 
 
 Thanks, 
 Guang 
 ___ 
 ceph-users mailing list 
 ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
 
 
 
 ___ ceph-users mailing list 
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph data not well distributed.

2015-04-14 Thread GuangYang
We have a tiny script which does the CRUSH re-weight based on PGs/OSD to 
achieve balance across OSDs, and we run the script right after setting up the 
cluster to avoid data migration after the cluster is filled up.

A couple of experiences to share:
 1 As suggested, it is helpful to choose a power-of-two PG number so that 
objects/PG is even (it is pretty even in our deployment, given the object size 
and disk size we have).
 2 With running the script, we try to achieve even PGs/OSD (for the data 
pool), so that the disk utilization is most likely to be even after the cluster 
is filled up.
 3 With the disk replacement procedure (depending on the procedure you have), you 
may need some extra steps to make sure the CRUSH weight persists across disk 
replacement.


Sage has a built-in version (reweight-by-pg) for that, and here is our script - 
https://github.com/guangyy/ceph_misc/blob/master/osd_crush_reweight/ceph_osd_crush_reweight.pl
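
For illustration, a minimal sketch of the idea (not the script itself; the osd id, 
weight step and pool name are made up, and 'ceph osd df' / reweight-by-pg need a 
recent enough release):

  ceph osd df                              # the PGS column shows PGs per OSD
  ceph osd crush reweight osd.12 0.95      # nudge an over-subscribed OSD's CRUSH weight down a bit
  # or let ceph adjust the (non-CRUSH) reweight override in one shot,
  # targeting OSDs above 110% of the mean PG count for the given pool:
  ceph osd reweight-by-pg 110 .rgw.buckets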

Hope that helps.

Thanks,
Guang



 To: ceph-users@lists.ceph.com
 From: pengyujian5201...@126.com
 Date: Wed, 15 Apr 2015 01:58:08 +
 Subject: [ceph-users] ceph data not well distributed.

 I have a ceph cluster with 125 osds with the same weight.
 But I found that data is not well distributed.
 df
 Filesystem 1K-blocks Used Available Use% Mounted on
 /dev/sda1 47929224 2066208 43405264 5% /
 udev 16434372 4 16434368 1% /dev
 tmpfs 6578584 728 6577856 1% /run
 none 5120 0 5120 0% /run/lock
 none 16446460 0 16446460 0% /run/shm
 /dev/sda6 184307 62921 107767 37% /boot
 /dev/mapper/osd-104 877797376 354662904 523134472 41% /ceph-osd/osd-104
 /dev/mapper/osd-105 877797376 596911248 280886128 69% /ceph-osd/osd-105
 /dev/mapper/osd-106 877797376 497968080 379829296 57% /ceph-osd/osd-106
 /dev/mapper/osd-107 877797376 640225368 237572008 73% /ceph-osd/osd-107
 /dev/mapper/osd-108 877797376 509972412 367824964 59% /ceph-osd/osd-108
 /dev/mapper/osd-109 877797376 581435864 296361512 67% /ceph-osd/osd-109
 /dev/mapper/osd-110 877797376 724248740 153548636 83% /ceph-osd/osd-110
 /dev/mapper/osd-111 877797376 495883796 381913580 57% /ceph-osd/osd-111
 /dev/mapper/osd-112 877797376 488635912 389161464 56% /ceph-osd/osd-112
 /dev/mapper/osd-113 877797376 613807596 263989780 70% /ceph-osd/osd-113
 /dev/mapper/osd-114 877797376 633144408 244652968 73% /ceph-osd/osd-114
 /dev/mapper/osd-115 877797376 519702956 358094420 60% /ceph-osd/osd-115
 /dev/mapper/osd-116 877797376 449834752 427962624 52% /ceph-osd/osd-116
 /dev/mapper/osd-117 877797376 641484036 236313340 74% /ceph-osd/osd-117
 /dev/mapper/osd-118 877797376 519416488 358380888 60% /ceph-osd/osd-118
 /dev/mapper/osd-119 877797376 599926788 277870588 69% /ceph-osd/osd-119
 /dev/mapper/osd-120 877797376 460384476 417412900 53% /ceph-osd/osd-120
 /dev/mapper/osd-121 877797376 646286724 231510652 74% /ceph-osd/osd-121
 /dev/mapper/osd-122 877797376 647260752 230536624 74% /ceph-osd/osd-122
 /dev/mapper/osd-123 877797376 432367436 445429940 50% /ceph-osd/osd-123
 /dev/mapper/osd-124 877797376 595846772 281950604 68% /ceph-osd/osd-124

 The osd.104 is 41% full, but osd.110 is 83%.
 Can I move some pgs from osd.110 to osd.104 manually?

 Thanks!

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread GuangYang
We have had good experience so far keeping each bucket to less than 0.5 million 
objects, via client-side sharding. But I think it would be nice for you to test at 
your scale, with your hardware configuration, as well as against your expectations 
for tail latency.

Generally the bucket sharding should help, both for write throughput and for the 
*stalls during recovery/scrubbing*, but it comes with a price - with X shards 
per bucket, listing/trimming becomes X times as heavy from the OSD load's point 
of view. There was discussion to implement: 1) blind buckets (for use cases where 
bucket listing is not needed), and 2) un-ordered listing, which could improve the 
problem I mentioned above. They are on the roadmap...
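
(If you are on a release that already has the bucket index sharding feature, the 
knob is set in ceph.conf on the radosgw hosts before the bucket is created - a 
sketch, with a made-up section name; it only affects newly created buckets:

  [client.radosgw.gateway]
      rgw override bucket index max shards = 8
)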

Thanks,
Guang



 From: bhi...@gmail.com
 Date: Mon, 2 Mar 2015 18:13:25 -0800
 To: erdem.agao...@gmail.com
 CC: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Some long running ops may lock osd

 We're seeing a lot of this as well. (as i mentioned to sage at
 SCALE..) Is there a rule of thumb at all for how big is safe to let a
 RGW bucket get?

 Also, is this theoretically resolved by the new bucket-sharding
 feature in the latest dev release?

 -Ben

 On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu erdem.agao...@gmail.com 
 wrote:
 Hi Gregory,

 We are not using listomapkeys that way or in any way to be precise. I used
 it here just to reproduce the behavior/issue.

 What i am really interested in is if scrubbing-deep actually mitigates the
 problem and/or is there something that can be further improved.

 Or i guess we should go upgrade now and hope for the best :)

 On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum g...@gregs42.com wrote:

 On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu erdem.agao...@gmail.com
 wrote:
 Hi all, especially devs,

 We have recently pinpointed one of the causes of slow requests in our
 cluster. It seems deep-scrubs on pg's that contain the index file for a
  large radosgw bucket lock the osds. Increasing op threads and/or disk threads
  helps a little bit, but we need to increase them beyond reason in order to
  completely get rid of the problem. A somewhat similar (and more severe)
  version of the issue occurs when we call listomapkeys for the index file,
  and since the logs for deep-scrubbing were much harder to read, this
  inspection was based on listomapkeys.

 In this example osd.121 is the primary of pg 10.c91 which contains file
 .dir.5926.3 in .rgw.buckets pool. OSD has 2 op threads. Bucket contains
 ~500k objects. Standard listomapkeys call take about 3 seconds.

  time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
 real 0m2.983s
 user 0m0.760s
 sys 0m0.148s

 In order to lock the osd we request 2 of them simultaneously with
 something
 like:

  rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
  sleep 1
  rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &

 'debug_osd=30' logs show the flow like:

 At t0 some thread enqueue_op's my omap-get-keys request.
 Op-Thread A locks pg 10.c91 and dequeue_op's it and starts reading ~500k
 keys.
 Op-Thread B responds to several other requests during that 1 second
 sleep.
 They're generally extremely fast subops on other pgs.
 At t1 (about a second later) my second omap-get-keys request gets
 enqueue_op'ed. But it does not start probably because of the lock held
 by
 Thread A.
 After that point other threads enqueue_op other requests on other pgs
 too
 but none of them starts processing, in which i consider the osd is
 locked.
 At t2 (about another second later) my first omap-get-keys request is
 finished.
 Op-Thread B locks pg 10.c91 and dequeue_op's my second request and
 starts
 reading ~500k keys again.
 Op-Thread A continues to process the requests enqueued in t1-t2.

 It seems Op-Thread B is waiting on the lock held by Op-Thread A while it
 can
 process other requests for other pg's just fine.

  My guess is a somewhat larger scenario happens in deep-scrubbing, like on
  the pg containing the index for the bucket of 20M objects. A disk/op thread
 starts reading through the omap which will take say 60 seconds. During
 the
 first seconds, other requests for other pgs pass just fine. But in 60
 seconds there are bound to be other requests for the same pg, especially
 since it holds the index file. Each of these requests lock another
 disk/op
 thread to the point where there are no free threads left to process any
 requests for any pg. Causing slow-requests.
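
  (A quick way to see and bump the thread counts mentioned above - a sketch; the
  socket path and values are illustrative, and runtime injection may or may not
  take effect depending on the release, so persisting in ceph.conf is the safe
  route:

    ceph --admin-daemon /var/run/ceph/ceph-osd.121.asok config show \
      | grep -E 'osd_op_threads|osd_disk_threads'
    ceph tell osd.121 injectargs '--osd_op_threads 8 --osd_disk_threads 2'
  )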

 So first of all thanks if you can make it here, and sorry for the
 involved
 mail, i'm exploring the problem as i go.
 Now, is that deep-scrubbing situation i tried to theorize even possible?
 If
 not can you point us where to look further.
 We are currently running 0.72.2 and know about newer ioprio settings in
 Firefly and such. While we are planning to upgrade in a few weeks but i
 don't think those options will help us in any way. Am i correct?
 Are there any other improvements that we are not aware?

 

Re: [ceph-users] who is using radosgw with civetweb?

2015-02-26 Thread GuangYang
Hi Sage,
Is there any timeline around the switch? So that we can plan ahead for the 
testing.

We are running apache + mod-fastcgi in production at scale (540 OSDs, 9 RGW 
hosts) and it looks good so far. Although at the beginning we came across a 
problem with a large volume of 500 errors, which we tracked down to mod-fastcgi 
using select(), which is limited to 1024 FDs. We replaced select with poll and 
the problem was solved.

Thanks,
Guang



 Date: Wed, 25 Feb 2015 11:31:54 -0800
 From: sw...@redhat.com
 To: ceph-us...@ceph.com; ceph-de...@vger.kernel.org
 Subject: [ceph-users] who is using radosgw with civetweb?

 Hey,

 We are considering switching to civetweb (the embedded/standalone rgw web
 server) as the primary supported RGW frontend instead of the current
 apache + mod-fastcgi or mod-proxy-fcgi approach. Supported here means
 both the primary platform the upstream development focuses on and what the
 downstream Red Hat product will officially support.

 How many people are using RGW standalone using the embedded civetweb
 server instead of apache? In production? At what scale? What
 version(s) (civetweb first appeared in firefly and we've backported most
 fixes).

 Have you seen any problems? Any other feedback? The hope is to (vastly)
 simplify deployment.

 Thanks!
 sage
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistency

2014-11-09 Thread GuangYang
Thanks Sage!


 Date: Fri, 7 Nov 2014 02:19:06 -0800
 From: s...@newdream.net
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: PG inconsistency

 On Thu, 6 Nov 2014, GuangYang wrote:
 Hello Cephers,
  Recently we observed a couple of inconsistencies in our Ceph cluster;
  there were two major patterns leading to inconsistency as I observed: 1)
  EIO when reading the file, 2) the digest is inconsistent (for EC) even though
  there is no read error.

 While ceph has built-in tool sets to repair the inconsistencies, I also
 would like to check with the community in terms of what is the best ways
 to handle such issues (e.g. should we run fsck / xfs_repair when such
 issue happens).

 In more details, I have the following questions:
 1. When there is inconsistency detected, what is the chance there is
 some hardware issues which need to be repaired physically, or should I
 run some disk/filesystem tools to further check?

  I'm not really an operator so I'm not as familiar with these tools as I
  should be :(, but I suspect the prudent route is to check the SMART info
  on the disk, and/or trigger a scrub of everything else on the OSD (ceph
 osd scrub N). For DreamObjects, I think they usually just fail the OSD
 once it starts throwing bad sectors (most of the hardware is already
 reasonably aged).
Google's data also shows a strong correlation between scrub errors (and especially 
several SMART parameters) and disk failure - 
https://www.usenix.org/legacy/event/fast07/tech/full_papers/pinheiro/pinheiro.pdf.
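
For what it's worth, what we run when a scrub error shows up (a sketch; the device 
and osd id are illustrative):

  smartctl -a /dev/sdh | egrep -i 'reallocated|pending|uncorrect'   # roughly the attributes the paper calls out
  ceph osd scrub 12          # scrub everything else on that OSD, as suggested above
  ceph osd deep-scrub 12     # or a deep scrub, if the extra I/O is acceptable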

 2. Should we use fsck / xfs_repair to fix the inconsistencies, or should
 we solely relay on Ceph's repair tool sets?

 That might not be a bad idea, but I would urge caution if xfs_repair finds
 any issues or makes any changes, as subtle changes to the fs contents can
 confuse ceph-osd. At an absolute minimum, do a full scrub after, but
 even better would be to fail the OSD.

 (FWIW I think we should document a recommended safe process for
 failing/replacing an OSD that takes the suspect data offline but waits for
 the cluster to heal before destroying any data. Simply marking the OSD
 out will work, but then when a fresh drive is added there will be a second
 repair/rebalance event, which isn't ideal.)
Yeah, that would be very helpful. I think the first decision to make is whether 
we should replace the disk. In our clusters, there is data corruption (EIO) along 
with SMART warnings, which is an indicator of a bad disk; meanwhile, we also 
observed that xattrs were lost (http://tracker.ceph.com/issues/10018) without 
any SMART warnings. After talking to Sam, we suspected it might be due to 
unexpected host rebooting (or a mis-configured RAID controller), in which case we 
probably do not need to replace the disk but only repair via Ceph.

In terms of disk replacement, to avoid migrating data back and forth, are the 
below two approaches reasonable?
 1. Keep the OSD in, do an ad-hoc disk replacement and provision a new OSD 
(so as to keep the same OSD id), and then trigger data migration. In this 
way the data migration only happens once; however, it does require operators to 
replace the disk very quickly (a rough sketch of this is below).
 2. Move the data on the broken disk to a new disk completely and use Ceph to 
repair the bad objects.
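
A rough sketch of approach 1 (firefly-era sysvinit; the osd id is illustrative, and 
this assumes the disk can be swapped quickly):

  ceph osd set noout            # keep the OSD from being marked out while the disk is swapped
  service ceph stop osd.12
  # physically replace the disk, rebuild the filesystem and the OSD data dir for id 12,
  # then bring it back:
  service ceph start osd.12
  ceph osd unset noout          # recovery now only has to repopulate the replaced disk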

Thanks,
Guang


 sage


 It would be great to hear you experience and suggestions.

 BTW, we are using XFS in the cluster.

 Thanks,
 Guang 
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG inconsistency

2014-11-06 Thread GuangYang
Hello Cephers,
Recently we observed a couple of inconsistencies in our Ceph cluster; there 
were two major patterns leading to inconsistency as I observed: 1) EIO when reading 
the file, 2) the digest is inconsistent (for EC) even though there is no read error.

While ceph has built-in tool sets to repair the inconsistencies, I also would 
like to check with the community in terms of what is the best ways to handle 
such issues (e.g. should we run fsck / xfs_repair when such issue happens).

In more details, I have the following questions:
1. When there is inconsistency detected, what is the chance there is some 
hardware issues which need to be repaired physically, or should I run some 
disk/filesystem tools to further check?
2. Should we use fsck / xfs_repair to fix the inconsistencies, or should we 
solely relay on Ceph's repair tool sets?

It would be great to hear you experience and suggestions.

BTW, we are using XFS in the cluster.

Thanks,
Guang 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistency

2014-11-06 Thread GuangYang
Thanks Dan. By killed/formatted/replaced the OSD, did you replace the disk? 
Not a filesystem expert here, but I would like to understand what happened 
underneath to cause the EIO, and whether that reveals something (e.g. a hardware issue).

In our case, we are using 6TB drives, so there is a lot of data to migrate, 
and as backfilling/recovery increases latency, we hope to avoid that 
as much as we can.

Thanks,
Guang


 From: daniel.vanders...@cern.ch 
 Date: Thu, 6 Nov 2014 13:36:46 + 
 Subject: Re: PG inconsistency 
 To: yguan...@outlook.com; ceph-users@lists.ceph.com 
 
 Hi, 
 I've only ever seen (1), EIO to read a file. In this case I've always 
 just killed / formatted / replaced that OSD completely -- that moves 
 the PG to a new master and the new replication fixes the 
 inconsistency. This way, I've never had to pg repair. I don't know if 
 this is a best or even good practise, but it works for us. 
 Cheers, Dan 
 
 On Thu Nov 06 2014 at 2:24:32 PM GuangYang 
 yguan...@outlook.commailto:yguan...@outlook.com wrote: 
 Hello Cephers, 
 Recently we observed a couple of inconsistencies in our Ceph cluster, 
 there were two major patterns leading to inconsistency as I observed: 
 1) EIO to read the file, 2) the digest is inconsistent (for EC) even 
 there is no read error). 
 
 While ceph has built-in tool sets to repair the inconsistencies, I also 
 would like to check with the community in terms of what is the best 
 ways to handle such issues (e.g. should we run fsck / xfs_repair when 
 such issue happens). 
 
 In more details, I have the following questions: 
 1. When there is inconsistency detected, what is the chance there is 
 some hardware issues which need to be repaired physically, or should I 
 run some disk/filesystem tools to further check? 
 2. Should we use fsck / xfs_repair to fix the inconsistencies, or 
 should we solely relay on Ceph's repair tool sets? 
 
 It would be great to hear you experience and suggestions. 
 
 BTW, we are using XFS in the cluster. 
 
 Thanks, 
 Guang 
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistency

2014-11-06 Thread GuangYang
We are using v0.80.4. Just would like to ask for general suggestion here :)

Thanks,
Guang


 From: malm...@gmail.com 
 Date: Thu, 6 Nov 2014 13:46:12 + 
 Subject: Re: [ceph-users] PG inconsistency 
 To: yguan...@outlook.com; ceph-de...@vger.kernel.org; 
 ceph-users@lists.ceph.com 
 
 What is your version of the ceph? 
 0.80.0 - 0.80.3 
 https://github.com/ceph/ceph/commit/7557a8139425d1705b481d7f010683169fd5e49b 
 
 Thu Nov 06 2014 at 16:24:21, GuangYang 
 yguan...@outlook.commailto:yguan...@outlook.com: 
 Hello Cephers, 
 Recently we observed a couple of inconsistencies in our Ceph cluster, 
 there were two major patterns leading to inconsistency as I observed: 
 1) EIO to read the file, 2) the digest is inconsistent (for EC) even 
 there is no read error). 
 
 While ceph has built-in tool sets to repair the inconsistencies, I also 
 would like to check with the community in terms of what is the best 
 ways to handle such issues (e.g. should we run fsck / xfs_repair when 
 such issue happens). 
 
 In more details, I have the following questions: 
 1. When there is inconsistency detected, what is the chance there is 
 some hardware issues which need to be repaired physically, or should I 
 run some disk/filesystem tools to further check? 
 2. Should we use fsck / xfs_repair to fix the inconsistencies, or 
 should we solely relay on Ceph's repair tool sets? 
 
 It would be great to hear you experience and suggestions. 
 
 BTW, we are using XFS in the cluster. 
 
 Thanks, 
 Guang 
 ___ 
 ceph-users mailing list 
 ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Filestore throttling

2014-10-28 Thread GuangYang



 Date: Thu, 23 Oct 2014 21:26:07 -0700
 From: s...@newdream.net
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: RE: Filestore throttling

 On Fri, 24 Oct 2014, GuangYang wrote:
 commit 44dca5c8c5058acf9bc391303dc77893793ce0be
 Author: Sage Weil s...@inktank.com
 Date: Sat Jan 19 17:33:25 2013 -0800

 filestore: disable extra committing queue allowance

 The motivation here is if there is a problem draining the op queue
 during a sync. For XFS and ext4, this isn't generally a problem: you
 can continue to make writes while a syncfs(2) is in progress. There
 are currently some possible implementation issues with btrfs, but we
 have not demonstrated them recently.

 Meanwhile, this can cause queue length spikes that screw up latency.
 During a commit, we allow too much into the queue (say, recovery
 operations). After the sync finishes, we have to drain it out before
 we can queue new work (say, a higher priority client request). Having
 a deep queue below the point where priorities order work limits the
 value of the priority queue.

 Signed-off-by: Sage Weil s...@inktank.com

 I'm not sure it makes sense to increase it in the general case. It might
 make sense for your workload, or we may want to make peering transactions
 some sort of special case...?
 It is actually another commit:

 commit 40654d6d53436c210b2f80911217b044f4d7643a
  filestore: filestore_queue_max_ops 500 -> 50
 Having a deep queue limits the effectiveness of the priority queues
 above by adding additional latency.

 Ah, you're right.

 I don't quite understand the use case that it might add additional
 latency by increasing this value, would you mind elaborating?

 There is a priority queue a bit further up the stack OpWQ, in which high
 priority items (e.g., client IO) can move ahead of low priority items
 (e.g., recovery). If the queue beneath that (the filestore one) is very
 deep, the client IO will only have a marginal advantage over the recovery
 IO since it will still sit in the second queue for a long time. Ideally,
 we want the priority queue to be the deepest one (so that we maximize the
 amount of stuff we can reorder) and the queues above and below to be as
 shallow as possible.

That makes perfect sense, thanks for explaining the details.


 I think the peering operations are different because they can't be
 reordered with respect to anything else in the same PG (unlike, say,
 client vs recovery io for that pg). On the other hand, there may be
 client IO on other PGs that we want to reorder and finish more quickly.
 Allowing all of the right reordering and also getting the priority
 inheritence right here is probably a hugely complex undertaking, so we
 probably just want to go for a reasonably simple strategy that avoids the
 worst instances of priority inversion (where an important thing is stuck
 behind a slow thing). :/

We mainly observed the following issues during peering:
  1. For several peering OPs (pg_info, pg_notify, pg_log), the dispatcher 
thread needs to queue a filestore transaction, which in turn needs to acquire 
filestore ops/bytes budget, so once the OSD hits the upper limit of those 
throttlers, the dispatcher thread hangs, which blocks all OPs. In this 
regard, it is very dangerous to hit those thresholds as it could severely impact 
performance (a sketch of how we watch those throttles is below). 
   2. If the OSD was down for a while, the peering OP that searches for missing 
objects can take up to several minutes, during which time the PG is 
inactive and all traffic to that PG is stuck. I am not sure if there is a 
chance to improve such a situation given the strong consistency model; 
increasing the op thread number could help a little bit, as sometimes those 
peering OPs can eat up all op threads.
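
(For reference, how we watch those filestore queue throttles on a live OSD - a 
sketch; the socket path is illustrative and the counter names may differ a bit 
by release:

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | python -mjson.tool \
    | grep -E '"op_queue_(max_)?(ops|bytes)"'
)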

 In any case, though, I'm skeptical that making the lowest-level queue
 deeper is going to help in general, even if it addresses the peering
 case specifically...

 sage
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Filestore throttling

2014-10-23 Thread GuangYang
---
 Date: Thu, 23 Oct 2014 06:58:58 -0700
 From: s...@newdream.net
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: RE: Filestore throttling

 On Thu, 23 Oct 2014, GuangYang wrote:
 Thanks Sage for the quick response!

 We are using firefly (v0.80.4 with a couple of back-ports). One
 observation we have is that during peering stage (especially if the OSD
 got down/in for several hours with high load), the peering OPs are in
 contention with normal OPs and thus bring extremely long latency (up to
 minutes) for client OPs, the contention happened in filestore for
 throttling budget, it also happened at dispatcher/op threads, I will
 send another email with more details after more investigation.

  It sounds like the problem here is that when the pg logs are long (1000's
  of entries) the MOSDPGLog messages are big and generate a big
 ObjectStore::Transaction. This can be mitigated by shortening the logs,
 but that means shortening the duration that an OSD can be down without
 triggering a backfill. Part of the answer is probably to break the PGLog
 messages into smaller pieces.
Making the transaction small should help, let me test that and get back with 
more information.
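
(For reference, the knobs that control how many log entries each PG keeps - a 
sketch; defaults differ between releases, and shorter logs mean an OSD that is 
down longer will backfill instead of doing log-based recovery:

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep pg_log_entries
  # osd_min_pg_log_entries / osd_max_pg_log_entries
)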

 As for this one, I created a pull request #2779 to change the default
  value of filestore_queue_max_ops to 500 (which is specified in the
 document but code is inconsistent), do you think we should make others
 as default as well?

 We reduced it to 50 almost 2 years ago, in this commit:

 commit 44dca5c8c5058acf9bc391303dc77893793ce0be
 Author: Sage Weil s...@inktank.com
 Date: Sat Jan 19 17:33:25 2013 -0800

 filestore: disable extra committing queue allowance

 The motivation here is if there is a problem draining the op queue
 during a sync. For XFS and ext4, this isn't generally a problem: you
 can continue to make writes while a syncfs(2) is in progress. There
 are currently some possible implementation issues with btrfs, but we
 have not demonstrated them recently.

 Meanwhile, this can cause queue length spikes that screw up latency.
 During a commit, we allow too much into the queue (say, recovery
 operations). After the sync finishes, we have to drain it out before
 we can queue new work (say, a higher priority client request). Having
 a deep queue below the point where priorities order work limits the
 value of the priority queue.

 Signed-off-by: Sage Weil s...@inktank.com

 I'm not sure it makes sense to increase it in the general case. It might
 make sense for your workload, or we may want to make peering transactions
 some sort of special case...?
It is actually another commit:

commit 40654d6d53436c210b2f80911217b044f4d7643a
filestore: filestore_queue_max_ops 500 -> 50
Having a deep queue limits the effectiveness of the priority queues
above by adding additional latency.
I don't quite understand how increasing this value might add additional 
latency; would you mind elaborating?


 sage



 Thanks,
 Guang

 
 Date: Wed, 22 Oct 2014 21:06:21 -0700
 From: s...@newdream.net
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: Filestore throttling

 On Thu, 23 Oct 2014, GuangYang wrote:
 Hello Cephers,
 During our testing, I found that the filestore throttling became a 
 limiting factor for performance, the four settings (with default value) 
 are:
  filestore queue max ops = 50
  filestore queue max bytes = 100 << 20
  filestore queue committing max ops = 500
  filestore queue committing max bytes = 100 << 20

 My understanding is, if we lift the threshold, the response for op (end to 
 end) could be improved a lot during high load, and that is one reason to 
 have journal. The downside is that if there is a read following a 
 successful write, the read might stuck longer as the object is not flushed.

 Is my understanding correct here?

 If that is the tradeoff and read after write is not a concern in our use 
 case, can I lift the parameters to below values?
  filestore queue max ops = 500
  filestore queue max bytes = 200 << 20
  filestore queue committing max ops = 500
  filestore queue committing max bytes = 200 << 20

 It turns out very helpful during PG peering stage (e.g. OSD down and up).

 That looks reasonable to me.

 For peering, I think there isn't really any reason to block sooner rather
 than later. I wonder if we should try to mark those transactions such
 that they don't run up against the usual limits...

 Is this firefly or something later? Sometime after firefly Sam made some
 changes so that the OSD is more careful about waiting for PG metadata to
 be persisted before sharing state. I wonder if you will still see the
 same improvement now...

 sage
 Nyb?v?{.n??z??ayj???f:+v??zZ+???!?
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body

[ceph-users] Filestore throttling

2014-10-22 Thread GuangYang
Hello Cephers,
During our testing, I found that the filestore throttling became a limiting 
factor for performance; the four settings (with default values) are:
 filestore queue max ops = 50
 filestore queue max bytes = 100 << 20
 filestore queue committing max ops = 500
 filestore queue committing max bytes = 100 << 20

My understanding is that if we lift the thresholds, the end-to-end op response 
could be improved a lot under high load, and that is one reason to have the 
journal. The downside is that if there is a read following a successful write, 
the read might stall longer as the object has not been flushed yet.

Is my understanding correct here?

If that is the tradeoff and read after write is not a concern in our use case, 
can I lift the parameters to below values?
 filestore queue max ops = 500
 filestore queue max bytes = 200 << 20
 filestore queue committing max ops = 500
 filestore queue committing max bytes = 200 << 20

It turns out to be very helpful during the PG peering stage (e.g. an OSD going down and coming back up).
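
For anyone trying the same thing, a sketch of applying these at runtime before 
committing them to ceph.conf (injectargs changes are not persistent; 200 << 20 
is 209715200 bytes):

  ceph tell osd.\* injectargs '--filestore_queue_max_ops 500 --filestore_queue_max_bytes 209715200 --filestore_queue_committing_max_bytes 209715200'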

Thanks,
Guang


  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Filestore throttling

2014-10-22 Thread GuangYang
Thanks Sage for the quick response!

We are using firefly (v0.80.4 with a couple of back-ports). One observation we 
have is that during the peering stage (especially if the OSD was down/in for 
several hours under high load), the peering OPs contend with normal 
OPs and thus bring extremely long latency (up to minutes) for client OPs. The 
contention happened in the filestore over the throttling budget, and it also 
happened at the dispatcher/op threads; I will send another email with more 
details after more investigation.

As for this one, I created a pull request #2779 to change the default value of 
filestore_queue_max_ops to 500 (which is specified in the documentation but the code is 
inconsistent); do you think we should make the others default as well?

Thanks,
Guang


 Date: Wed, 22 Oct 2014 21:06:21 -0700
 From: s...@newdream.net
 To: yguan...@outlook.com
 CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
 Subject: Re: Filestore throttling

 On Thu, 23 Oct 2014, GuangYang wrote:
 Hello Cephers,
 During our testing, I found that the filestore throttling became a limiting 
 factor for performance, the four settings (with default value) are:
  filestore queue max ops = 50
  filestore queue max bytes = 100 << 20
  filestore queue committing max ops = 500
  filestore queue committing max bytes = 100 << 20

 My understanding is, if we lift the threshold, the response for op (end to 
 end) could be improved a lot during high load, and that is one reason to 
 have journal. The downside is that if there is a read following a successful 
 write, the read might stuck longer as the object is not flushed.

 Is my understanding correct here?

 If that is the tradeoff and read after write is not a concern in our use 
 case, can I lift the parameters to below values?
  filestore queue max ops = 500
  filestore queue max bytes = 200 << 20
  filestore queue committing max ops = 500
  filestore queue committing max bytes = 200 << 20

 It turns out very helpful during PG peering stage (e.g. OSD down and up).

 That looks reasonable to me.

 For peering, I think there isn't really any reason to block sooner rather
 than later. I wonder if we should try to mark those transactions such
 that they don't run up against the usual limits...

 Is this firefly or something later? Sometime after firefly Sam made some
 changes so that the OSD is more careful about waiting for PG metadata to
 be persisted before sharing state. I wonder if you will still see the
 same improvement now...

 sage
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph GET latency

2014-02-21 Thread GuangYang
Thanks Greg for the response, my comments inline…

Thanks,
Guang
On Feb 20, 2014, at 11:16 PM, Gregory Farnum g...@inktank.com wrote:

 On Tue, Feb 18, 2014 at 7:24 AM, Guang Yang yguan...@yahoo.com wrote:
 Hi ceph-users,
 We are using Ceph (radosgw) to store user generated images, as GET latency
 is critical for us, most recently I did some investigation over the GET path
 to understand where time spend.
 
 I first confirmed that the latency came from OSD (read op), so that we
 instrumented code to trace the GET request (read op at OSD side, to be more
 specific,
 
 How'd you instrument it? Are you aware of the OpTracker system that
 records and can output important events?
I just added some time-measurement code before and after some calls; here is the 
diff (I ported the lock contention fix to dumpling for my test, please 
ignore that) - https://github.com/ceph/ceph/pull/1281/files
Yeah, we are using OpTracker as well (dump historic ops); however, from the dumped 
history it is a little bit hard to tell where the time is spent (it is 
between 'started' and 'finish'). As read latency is critical for us, we would like 
to see the next level of detail, so we added some checks around locks / file system 
calls.
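
For reference, the OpTracker dumps we look at (a sketch; the admin socket path is 
illustrative):

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops    # recent ops with per-event timestamps
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight   # ops currently being tracked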

I am also thinking of refining the code with more logging to reflect where time is 
spent on the filestore side; are you interested in that type of code change? (It 
took me some time to troubleshoot the GET latency, and in the end I needed to add 
some more logging to get a picture.)

 each object with size [512K + 4M * x]  are splitted into [1 + x]
 chunks, each chunk needs one read op ), for each read op, it needs to go
 through the following steps:
1. Dispatch and take by a op thread to process (process not started).
 0   - 20 ms,94%
 20 - 50 ms,2%
 50 - 100 ms,  2%
  100ms+   , 2%
 For those having 20ms+ latency, half of them are due to waiting for
 pg lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089),
 another half are yet to be investigated.
 
 The PG lock conflict means that there's something else happening in
 the PG at the same time; that's a logical contention issue. However,
 20ms is a long time, so if you can figure out what else the PG is
 doing during that time it'd be interesting
I haven't deep-dived into that yet (it is a relatively small percentage compared 
to the second part, which comes from filestore I/O slowness due to dentry/inode 
cache misses). My interpretation is that since we serialize operations per PG and 
some ops take a long time (like the cases below), there is a chance that ops 
trying to operate on the same PG have to wait for some time.
BTW, what is the long-term strategy for Ceph to support small files (the problem 
that motivated Facebook to build Haystack)?

 
2. Get file xattr ('-'), which open the file and populate fd cache
 (https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
  0   - 20 ms,  80%
  20 - 50 ms,   8%
  50 - 100 ms, 7%
  100ms+   ,  5%
 The latency either comes from (from more to less): file path lookup
 (https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file
 open, or fd cache lookup /add.
 Currently objects are store in level 6 or level 7 folder (due to
 http://tracker.ceph.com/issues/7207, I stopped folder splitting).
 
 FYI there's been some community and Inktank work to try and speed this
 up recently. None of it's been merged into master yet, but we'll
 definitely have some improvements to this post-Firefly.
Yeah, I am aware of that and we are extremely interested in the effort as well. 
Basically our use case is:
  1. GET latency is more critical than PUT latency.
  2. Our workload is mainly small files (95% are less than 512KB). 
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 3. Get more xattrs, this is fast due to previous fd cache (rarely > 1ms).
 
4. Read the data.
0   - 20 ms,   84%
20 - 50 ms, 10%
50 - 100 ms, 4%
100ms+, 2%
 
 I decreased vfs_cache_pressure from its default value 100 to 5 to make VFS
 favor dentry/inode cache over page cache, unfortunately it does not help.
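 
  (For reference, the exact knob - persist it via /etc/sysctl.conf if it ever
  does help:
 
    sysctl -w vm.vfs_cache_pressure=5
  )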
 
 Long story short, most of the long latency read op comes from file system
 call (for cold data), as our workload mainly stores objects less than 500KB,
 so that it generates a large bunch of objects.
 
 I would like to ask if people experienced similar issue and if there is any
 suggestion I can try to boost the GET performance. On the other hand, PUT
 could be sacrificed.
 
 Thanks,
 Guang
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph GET latency

2014-02-19 Thread GuangYang
Thanks Yehuda.

 
 Try looking at the perfcounters, see if there's any other throttling
 happening. Also, make sure you have enough pgs for your data pool. One
 other thing to try is disabling leveldb xattrs and see if it affects
 your latency.
1. There is no throttling happening.
2. According to the ceph recommendation, there should be ~100 PGs per OSD, and we 
bumped that to 200.
3. Can you elaborate on the xattrs point - how could that affect the GET latency?

Thanks,
Guang

On Feb 20, 2014, at 3:09 AM, Yehuda Sadeh yeh...@inktank.com wrote:

 On Tue, Feb 18, 2014 at 7:24 AM, Guang Yang yguan...@yahoo.com wrote:
 Hi ceph-users,
 We are using Ceph (radosgw) to store user generated images, as GET latency
 is critical for us, most recently I did some investigation over the GET path
 to understand where time spend.
 
 I first confirmed that the latency came from OSD (read op), so that we
 instrumented code to trace the GET request (read op at OSD side, to be more
 specific, each object with size [512K + 4M * x]  are splitted into [1 + x]
 chunks, each chunk needs one read op ), for each read op, it needs to go
 through the following steps:
1. Dispatch and take by a op thread to process (process not started).
 0   - 20 ms,94%
 20 - 50 ms,2%
 50 - 100 ms,  2%
  100ms+   , 2%
 For those having 20ms+ latency, half of them are due to waiting for
 pg lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089),
 another half are yet to be investigated.
 
2. Get file xattr ('-'), which open the file and populate fd cache
 (https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
  0   - 20 ms,  80%
  20 - 50 ms,   8%
  50 - 100 ms, 7%
  100ms+   ,  5%
 The latency either comes from (from more to less): file path lookup
 (https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file
 open, or fd cache lookup /add.
 Currently objects are store in level 6 or level 7 folder (due to
 http://tracker.ceph.com/issues/7207, I stopped folder splitting).
 
 3. Get more xattrs, this is fast due to previous fd cache (rarely > 1ms).
 
4. Read the data.
0   - 20 ms,   84%
20 - 50 ms, 10%
50 - 100 ms, 4%
100ms+, 2%
 
 I decreased vfs_cache_pressure from its default value 100 to 5 to make VFS
 favor dentry/inode cache over page cache, unfortunately it does not help.
 
 Long story short, most of the long latency read op comes from file system
 call (for cold data), as our workload mainly stores objects less than 500KB,
 so that it generates a large bunch of objects.
 
 I would like to ask if people experienced similar issue and if there is any
 suggestion I can try to boost the GET performance. On the other hand, PUT
 could be sacrificed.
 
 
 Try looking at the perfcounters, see if there's any other throttling
 happening. Also, make sure you have enough pgs for your data pool. One
 other thing to try is disabling leveldb xattrs and see if it affects
 your latency.
 
 Yehuda

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster is unreachable because of authentication failure

2014-01-14 Thread GuangYang
Hi ceph-users and ceph-devel,
I came across an issue after restarting the monitors of the cluster: 
authentication fails, which prevents running any ceph command.

After we did some maintenance work, I restarted an OSD; however, I found that the 
OSD would not join the cluster automatically after being restarted, though a TCP 
dump showed it had already sent a message to the monitor asking to be added back 
into the cluster.

So I suspected there might be some issue with the monitors and restarted the 
monitors one by one (3 in total); however, after restarting the monitors, all ceph 
commands fail with an authentication timeout…

2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate 
timed out after 300
2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin 
authentication error (110) Connection timed out
Error connecting to cluster: Error

Any idea why such an error happened (restarting an OSD results in the same 
error)?

I am thinking the authentication information is persisted on the mon's local disk; 
is there a chance that data got corrupted?
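
One thing I will check first (a sketch; the admin socket path is illustrative) is 
whether the restarted monitors actually formed quorum, since cephx cannot hand out 
tickets without a quorum, and the admin socket works even when 'ceph -s' cannot 
authenticate:

  ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok mon_status
  ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok quorum_status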

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com