Re: [ceph-users] Long peering - throttle at FileStore::queue_transactions

2016-01-05 Thread Guang Yang
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil <s...@newdream.net> wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I have a question regarding long PG peering.
>>
>> Over the last several days I have been looking into the *long peering*
>> problem when we start an OSD / OSD host. What I observed was that the
>> two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), so the peering process
>> slowed down dramatically.
>>
>> The first question that came to me was: what were the transactions in the
>> queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_map. This happens if the OSD had
>> been down for a while (in a large cluster), or when the cluster got
>> upgraded, which left the osd_map epoch the down OSD had far behind
>> the latest osd_map epoch. During OSD boot, it would need to
>> persist all those osd_maps and generate lots of filestore transactions
>> (linear with the epoch gap).
>> > As the PG was not involved in most of those epochs, could we take and
>> > persist only those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists.  There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot.  If you're
> seeing this in the real world, we probably just need to adjust that value
> way down to something small(er).
It would queue the transactions and then send out the MOSDBoot, so
there is still a chance that it could contend with the peering
ops (especially on large clusters where there is a lot of activity
generating many osdmap epochs). Is there any chance we could change
*queue_transactions* to *apply_transactions*, so that we block there
waiting for the osdmap to be persisted? At least we may be able to
do that during OSD booting. The concern is that, if the OSD is active,
apply_transaction would take longer while holding the osd_lock.
I can't find such a tunable, could you elaborate? Thanks!
>
> sage
>


[ceph-users] Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Guang Yang
Hi Cephers,
Happy New Year! I have a question regarding long PG peering.

Over the last several days I have been looking into the *long peering*
problem when we start an OSD / OSD host. What I observed was that the
two peering worker threads were throttled (stuck) when trying to
queue new transactions (writing the pg log), so the peering process
slowed down dramatically.

The first question that came to me was: what were the transactions in the
queue? The major ones, as I saw, included:

- The osd_map and incremental osd_map. This happens if the OSD had
been down for a while (in a large cluster), or when the cluster got
upgraded, which left the osd_map epoch the down OSD had far behind
the latest osd_map epoch. During OSD boot, it would need to
persist all those osd_maps and generate lots of filestore transactions
(linear with the epoch gap).
> As the PG was not involved in most of those epochs, could we take and
> persist only those osd_maps which matter to the PGs on the OSD?

- There are lots of deletion transactions: as the PG boots, it
needs to merge the PG log from its peers, and for each deletion PG log
entry, it needs to queue the deletion transaction immediately.
> Could we delay queueing those transactions until all PGs on the host are
> peered?

Thanks,
Guang


[ceph-users] OSD disk replacement best practise

2014-08-14 Thread Guang Yang
Hi cephers,
Most recently I have been drafting the runbook for OSD disk replacement. I think the
rule of thumb is to reduce data migration (recovery/backfill), and I thought the
following procedure should achieve that purpose:
  1. ceph osd out osd.XXX (mark it out to trigger data migration)
  2. ceph osd rm osd.XXX
  3. ceph auth rm osd.XXX
  4. provision a new OSD which will take XXX as the OSD id and migrate data 
back.

With the above procedure, the crush weight of the host never changes, so we can
limit the data migration to only what is necessary.
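
For reference, a minimal sketch of the sequence I have in mind (osd.12 is a made-up
id, stopping the daemon is an assumed extra step, and the final provisioning step
depends on your deployment tooling):

  # mark the OSD out so its PGs are re-replicated elsewhere
  ceph osd out osd.12
  # once recovery settles, stop the daemon and remove the OSD from the cluster
  sudo service ceph stop osd.12
  ceph osd rm osd.12
  ceph auth rm osd.12
  # replace the disk, then re-provision an OSD that reuses id 12;
  # the host's crush entry and weight stay untouched, so only the
  # re-created OSD backfills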

Does it make sense?

Thanks,
Guang


[ceph-users] rgw geo-replication to another data store?

2014-07-17 Thread Guang Yang
Hi cephers,
We are investigating a backup solution for Ceph. In short, we would like a
solution to back up a Ceph cluster to another data store (not a Ceph cluster;
assume it has a SWIFT API). We would like to have both full backups and
incremental backups on top of the full backup.

After going through the geo-replication blueprint [1], I am thinking that we
can leverage that effort and, instead of replicating the data into another Ceph
cluster, make it replicate to another data store. At the same time, I have a
couple of questions which need your help:

1) How does the radosgw-agent scale to multiple hosts? Our first investigation
shows it only works on a single host, but I would like to confirm.
2) Can we configure the interval for incremental backups, like 1 hour / 1 day
/ 1 month?

[1] 
https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery

Thanks,
Guang


Re: [ceph-users] Ask a performance question for the RGW

2014-06-30 Thread Guang Yang
Hello,
There is a known limitation of bucket scalability, and there is a blueprint 
tracking it - 
https://wiki.ceph.com/Planning/Blueprints/Submissions/rgw%3A_bucket_index_scalability.

For the time being, I would recommend doing the sharding at the application level
(create multiple buckets) to work around this limitation.
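
A trivial sketch of what I mean by application-level sharding (16 buckets and the
naming scheme are arbitrary; any stable hash of the object key works):

  # choose one of 16 buckets from a hash of the object key
  key="images/2014/06/30/foo.jpg"
  bucket="mybucket-$(( $(printf '%s' "$key" | cksum | cut -d' ' -f1) % 16 ))"

and then read/write the object against $bucket instead of a single shared bucket.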

Thanks,
Guang

On Jun 30, 2014, at 2:54 PM, baijia...@126.com wrote:

  
 Hello, everyone!
  
 When I use rest-bench to test RGW performance, the command is:
 ./rest-bench --access-key=ak --secret=sk --bucket=bucket_name --seconds=600
 -t 200 -b 524288 --no-cleanup write
  
 test result:
 Total time run: 362.962324
 Total writes made: 48189
 Write size: 524288
 Bandwidth (MB/sec): 66.383
 Stddev Bandwidth: 40.7776
 Max bandwidth (MB/sec): 173
 Min bandwidth (MB/sec): 0
 Average Latency: 1.50435
 Stddev Latency: 0.910731
 Max latency: 9.12276
 Min latency: 0.19867
  
 My environment is 4 hosts and 40 disks (OSDs), but the test result is very bad:
 the average latency is 1.5 seconds, and I find that writing the object metadata is
 very slow. Because it puts so many objects into one bucket, and we know writing
 object metadata calls the method "bucket_prepare_op", the test finds this op is
 very slow. I found the OSD which contains the bucket object and looked at
 "bucket_prepare_op" via dump_historic_ops:
 { description: osd_op(client.4742.0:87613 .dir.default.4243.3 [call 
 rgw.bucket_prepare_op] 3.3670fe74 e317),
   received_at: 2014-06-30 13:35:55.409597,
   age: 51.148026,
   duration: 4.130137,
   type_data: [
 commit sent; apply or cleanup,
 { client: client.4742,
   tid: 87613},
 [
 { time: 2014-06-30 13:35:55.409660,
   event: waiting_for_osdmap},
 { time: 2014-06-30 13:35:55.409669,
   event: queue op_wq},
 { time: 2014-06-30 13:35:55.896766,
   event: reached_pg},
 { time: 2014-06-30 13:35:55.896793,
   event: started},
 { time: 2014-06-30 13:35:55.896796,
   event: started},
 { time: 2014-06-30 13:35:55.899450,
   event: waiting for subops from [40,43]},
 { time: 2014-06-30 13:35:55.899757,
   event: commit_queued_for_journal_write},
 { time: 2014-06-30 13:35:55.899799,
   event: write_thread_in_journal_buffer},
 { time: 2014-06-30 13:35:55.899910,
   event: journaled_completion_queued},
 { time: 2014-06-30 13:35:55.899936,
   event: journal first callback},
 { time: 2014-06-30 13:35:55.899944,
   event: queuing ondisk},
 { time: 2014-06-30 13:35:56.142104,
   event: sub_op_commit_rec},
 { time: 2014-06-30 13:35:56.176950,
   event: sub_op_commit_rec},
 { time: 2014-06-30 13:35:59.535301,
   event: op_commit},
 { time: 2014-06-30 13:35:59.535331,
   event: commit_sent},
 { time: 2014-06-30 13:35:59.539723,
   event: op_applied},
 { time: 2014-06-30 13:35:59.539734,
   event: done}]]},
  
 So why is it so slow from journaled_completion_queued to op_commit, and
 what happened?
 Thanks
  
 baijia...@126.com



Re: [ceph-users] Ask a performance question for the RGW

2014-06-30 Thread Guang Yang
On Jun 30, 2014, at 3:59 PM, baijia...@126.com wrote:

 Hello,
 Thanks for answering the question.
 But even when there are fewer than 50 thousand objects, the latency is very high. I
 see the write ops for the bucket index object: from
 journaled_completion_queued to op_commit costs 3.6 seconds, which means that
 from "journal write finished" to op_commit costs 3.6 seconds.
 So I can't understand this; what happened?
The operations updating the same bucket index object get serialized; one
possibility is that those operations were stuck there waiting for other ops to
finish their work.
  
 thanks
 baijia...@126.com
  
 From: Guang Yang
 Sent: 2014-06-30 14:57
 To: baijiaruo
 Cc: ceph-users
 Subject: Re: [ceph-users] Ask a performance question for the RGW
 Hello,
 There is a known limitation of bucket scalability, and there is a blueprint 
 tracking it - 
 https://wiki.ceph.com/Planning/Blueprints/Submissions/rgw%3A_bucket_index_scalability.
  
 For the time being, I would recommend doing the sharding at the application level
 (create multiple buckets) to work around this limitation.
  
 Thanks,
 Guang
  
 On Jun 30, 2014, at 2:54 PM, baijia...@126.com wrote:
  
  
  Hello, everyone!
  
  When I use rest-bench to test RGW performance, the command is:
  ./rest-bench --access-key=ak --secret=sk --bucket=bucket_name --seconds=600
  -t 200 -b 524288 --no-cleanup write
  
  test result:
  Total time run: 362.962324
  Total writes made: 48189
  Write size: 524288
  Bandwidth (MB/sec): 66.383
  Stddev Bandwidth: 40.7776
  Max bandwidth (MB/sec): 173
  Min bandwidth (MB/sec): 0
  Average Latency: 1.50435
  Stddev Latency: 0.910731
  Max latency: 9.12276
  Min latency: 0.19867
  
  My environment is 4 hosts and 40 disks (OSDs), but the test result is very bad:
  the average latency is 1.5 seconds, and I find that writing the object metadata is
  very slow. Because it puts so many objects into one bucket, and we know writing
  object metadata calls the method "bucket_prepare_op", the test finds this op is
  very slow. I found the OSD which contains the bucket object and looked at
  "bucket_prepare_op" via dump_historic_ops:
  { description: osd_op(client.4742.0:87613 .dir.default.4243.3 [call 
  rgw.bucket_prepare_op] 3.3670fe74 e317),
received_at: 2014-06-30 13:35:55.409597,
age: 51.148026,
duration: 4.130137,
type_data: [
  commit sent; apply or cleanup,
  { client: client.4742,
tid: 87613},
  [
  { time: 2014-06-30 13:35:55.409660,
event: waiting_for_osdmap},
  { time: 2014-06-30 13:35:55.409669,
event: queue op_wq},
  { time: 2014-06-30 13:35:55.896766,
event: reached_pg},
  { time: 2014-06-30 13:35:55.896793,
event: started},
  { time: 2014-06-30 13:35:55.896796,
event: started},
  { time: 2014-06-30 13:35:55.899450,
event: waiting for subops from [40,43]},
  { time: 2014-06-30 13:35:55.899757,
event: commit_queued_for_journal_write},
  { time: 2014-06-30 13:35:55.899799,
event: write_thread_in_journal_buffer},
  { time: 2014-06-30 13:35:55.899910,
event: journaled_completion_queued},
  { time: 2014-06-30 13:35:55.899936,
event: journal first callback},
  { time: 2014-06-30 13:35:55.899944,
event: queuing ondisk},
  { time: 2014-06-30 13:35:56.142104,
event: sub_op_commit_rec},
  { time: 2014-06-30 13:35:56.176950,
event: sub_op_commit_rec},
  { time: 2014-06-30 13:35:59.535301,
event: op_commit},
  { time: 2014-06-30 13:35:59.535331,
event: commit_sent},
  { time: 2014-06-30 13:35:59.539723,
event: op_applied},
  { time: 2014-06-30 13:35:59.539734,
event: done}]]},
  
  So why is it so slow from journaled_completion_queued to op_commit,
  and what happened?
  thanks
  
  baijia...@126.com



[ceph-users] XFS - number of files in a directory

2014-06-23 Thread Guang Yang
Hello Cephers,
We used to have a Ceph cluster with our data pool set up with 3 replicas. We
estimated the number of files (given disk size and object size) for each PG to be
around 8K, and we disabled folder splitting, which means all files live in the
root PG folder. Our testing showed good performance with such a setup.
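
(For context, the splitting behaviour I am referring to is governed by the FileStore
options sketched below; the values are only the kind of thing we set, not a
recommendation. My understanding is that a sub-folder is split once it holds more
than roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 files,
so pushing that product above the expected per-PG object count effectively keeps
everything in the root PG folder.)

  [osd]
      # illustrative values only
      filestore merge threshold = 40
      filestore split multiple = 16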

Right now we are evaluating erasure coding, which splits each object into a
number of chunks and increases the number of files several times. Although XFS
claims good support for large directories [1], some testing also showed that
we may expect performance degradation for large directories.

I would like to hear about your experience with this on your Ceph cluster
if you are using XFS. Thanks.

[1] http://www.scs.stanford.edu/nyu/02fa/sched/xfs.pdf

Thanks,
Guang


Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-29 Thread Guang Yang
On May 28, 2014, at 5:31 AM, Gregory Farnum g...@inktank.com wrote:

 On Sun, May 25, 2014 at 6:24 PM, Guang Yang yguan...@yahoo.com wrote:
 On May 21, 2014, at 1:33 AM, Gregory Farnum g...@inktank.com wrote:
 
 This failure means the messenger subsystem is trying to create a
 thread and is getting an error code back — probably due to a process
 or system thread limit that you can turn up with ulimit.
 
 This is happening because a replicated PG primary needs a connection
 to only its replicas (generally 1 or 2 connections), but with an
 erasure-coded PG the primary requires a connection to m+n-1 replicas
 (everybody who's in the erasure-coding set, including itself). Right
 now our messenger requires a thread for each connection, so kerblam.
 (And it actually requires a couple such connections because we have
 separate heartbeat, cluster data, and client data systems.)
 Hi Greg,
 Is there any plan to refactor the messenger component to reduce the num of 
 threads? For example, use event-driven mode.
 
 We've discussed it in very broad terms, but there are no concrete
 designs and it's not on the schedule yet. If anybody has conclusive
 evidence that it's causing them trouble they can't work around, that
 would be good to know…
Thanks for the response!

We used to have a cluster with each OSD host having 11 disks (daemons); on each
host there are around 15K threads. The system is stable, but when there is a
cluster-wide change (e.g. OSD down / out, recovery), we observed the system load
increasing, though there was no cascading failure.

Most recently we have been evaluating Ceph on high-density hardware with each
OSD host having 33 disks (daemons); on each host there are around 40K-50K
threads. With some OSD hosts down/out, we started seeing the load increase
sharply and a large volume of thread creation/joining.

We don’t have strong evidence that the messenger thread model is the problem,
nor of how much an event-driven approach would help, but I think as we move to
high-density hardware (for cost-saving purposes), the issue could be amplified.

If there is any plan, it would be good to know, and we are very interested in getting involved.

Thanks,
Guang

 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com



Re: [ceph-users] Firefly 0.80 rados bench cleanup / object removal broken?

2014-05-19 Thread Guang Yang
Hi Matt,
The problem you came across was due to a change made to rados bench in the
Firefly release; it aimed to solve the problem that if there were multiple rados
bench instances (for writing), we want to be able to do a rados bench read for
each of those runs as well.

Unfortunately, that change broke your use case. Here is my suggestion to solve
your problem:
1. Remove the pre-defined metadata file by
$ rados -p {pool_name} rm benchmark_last_metadata
2. Cleanup by prefix
$ sudo rados -p {pool_name} cleanup --prefix bench

Moving forward, you can use the new parameter ‘--run-name’ to name each run and
clean up on that basis; if you still want to do a slow linear search to clean up,
be sure to remove the benchmark_last_metadata object before you kick off the
cleanup.
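
For example (a rough sketch; the pool name and run name are placeholders):

$ rados -p {pool_name} bench 60 write --no-cleanup --run-name client1
$ rados -p {pool_name} cleanup --run-name client1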

Let me know if that helps.

Thanks,
Guang

On May 20, 2014, at 6:45 AM, matt.lat...@hgst.com wrote:

 
 I was experimenting previously with 0.72 , and could easily cleanup pool
 objects from several previous rados bench (write) jobs with :
 
 rados -p poolname cleanup bench  (would remove all objects starting
 with bench)
 
 I quickly realised when I moved to 0.80 that my script was broken and
 theoretically I now need:
 
 rados -p poolname cleanup --prefix benchmark_data
 
 But this only works sometimes, and sometimes partially. Issuing the command
 line twice seems to help a bit !  Also if I do rados -p poolname ls
 before hand, it seems to increase my chances of success, but often I am
 still left with benchmark objects undeleted. I also tried using the
 --run-name option to no avail.
 
 The story gets more bizarre now I have set up a hot SSD cachepool in
 front of the backing OSD (SATA) pool. Objects won't delete from either pool
 with rados cleanup  I tried
 
 rados -p cachepoolname cache-flush-evict-all
 
 which worked (rados df shows all objects now on the backing pool). Then
 bizarrely trying cleanup from the backing OSD pool just appears to copy
 them back into the cachepool, and they remain on the backing pool.
 
 I can list individual object names with
 
 rados -p poolname ls
 
 but rados rm objectname will not remove individual objects stating file
 or directory not found.
 
 Are others seeing these things and any ways to work around or am I doing
 something wrong?  Are these commands now deprecated in which case what
 should I use?
 
 Ubuntu 12.04, Kernel 3.14.0
 
 Matt Latter
 



[ceph-users] Docs - trouble shooting mon

2014-04-24 Thread Guang
Hello,
Today I read the monitor troubleshooting doc
(https://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/), which
has this section:
Scrap the monitor and create a new one
    You should only take this route if you are positive that you won’t lose the
information kept by that monitor; that you have other monitors and that they are
running just fine so that your new monitor is able to synchronize from the
remaining monitors. Keep in mind that destroying a monitor, if there are no
other copies of its contents, may lead to loss of data.

I would like to ask how to check whether “there are other copies of its contents”
for a given monitor instance.

Thanks,
Guang


[ceph-users] CEPH's data durability with different configurations

2014-04-18 Thread Guang
Hi all,
One goal of a storage system is to achieve certain durability SLAs, so we
replicate data with multiple copies and check consistency on a regular basis
(e.g. scrubbing). However, replication increases cost (a tradeoff between cost
and durability), and cluster-wide consistency checking can impact performance
(a tradeoff between performance and durability).

Most recently I have been trying to figure out the best configuration for this,
including:
  1) how many copies do I need? (pool min_size and size)
  2) how frequently should I run scrubbing and deep scrubbing? (a sketch of the
knobs I mean follows below)
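
A rough sketch of those knobs (the values are placeholders, not recommendations):

  # replica count for a pool
  ceph osd pool set .rgw.buckets size 3
  ceph osd pool set .rgw.buckets min_size 2

  # scrub scheduling in ceph.conf, under [osd] (values in seconds)
  osd scrub min interval = 86400
  osd scrub max interval = 604800
  osd deep scrub interval = 604800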

Can someone share your experience tuning those numbers, and what durability you
can achieve with them?

BTW, S3 claims they have 99.9% durability of objects over a given year, 
that seems super high on commodity hardware.

Thanks,
Guang


[ceph-users] A simple tool to do osd crush reweigh after creating pool to gain better PG distribution across OSDs

2014-04-14 Thread Guang
Hi all,
In order to deal with the PG unevenness problem [1, 2], which further leads to
uneven disk usage, I recently developed a simple script which does *osd crush
reweight* right after creating the pool that holds the most significant data
(e.g. .rgw.buckets). We have had good experience tuning the distribution
difference down to less than 10% with this tool.

Here is the tool - 
https://github.com/guangyy/ceph_misc/blob/master/osd_crush_reweight/ceph_osd_crush_reweight.pl

If you have similar experience with a relatively uneven PG distribution across
OSDs by default, you can check out the script and see whether it serves your
purpose or not. All reviews and suggestions (especially on the algorithm, for an
even better distribution) are welcome.
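
Under the hood the script essentially just issues plain crush reweight commands,
e.g. (the id and weight here are made-up values):

  ceph osd crush reweight osd.12 3.45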


[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg04216.html
[2] http://www.spinics.net/lists/ceph-devel/msg17509.html

Thanks,
Guang


[ceph-users] XFS tunning on OSD

2014-03-05 Thread Guang Yang
Hello all,
Recently I have been working on Ceph performance analysis on our cluster. Our OSD
hardware looks like:
  11 SATA disks, 4TB each, 7200RPM
  48GB RAM

When breaking down the latency, we found that half of it (the average latency
is around 60 milliseconds via radosgw) comes from file lookup and open (there
could be a couple of disk seeks there). When looking at the file system cache
(slabtop), we found that around 5M dentries / inodes are cached; however, the
host has around 110 million files (and directories) in total.

I am wondering if there is any good experience within the community tuning for
the same workload, e.g. changing the inode size, or using the mkfs.xfs -n
size=64k option [1]?

[1] 
http://xfs.org/index.php/XFS_FAQ#Q:_Performance:_mkfs.xfs_-n_size.3D64k_option
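
For reference, the kind of mkfs invocation I have in mind (the device and values
are examples only, not a recommendation):

  mkfs.xfs -f -i size=2048 -n size=64k /dev/sdX1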

Thanks,
Guang


[ceph-users] PG folder hierarchy

2014-02-25 Thread Guang
Hello,
Most recently, when looking at a PG’s folder splitting, I found that there was
only one sub-folder in each of the top 3 / 4 levels, and it starts having 16
sub-folders from level 6. What is the design consideration behind this?

For example, if the PG root folder is ‘3.1905_head’, the first level only has
one sub-folder ‘DIR_5’, then one sub-folder ‘DIR_0’, and then ‘DIR_9’, under
which there are two sub-folders ‘DIR_1’ and ‘DIR_9’; starting from there, the
next level has 16 sub-folders.

If we started splitting into 16 sub-folders at the very first level, we might
potentially gain better performance with fewer dentry lookups (though most
likely the root levels are cached).

Thanks,
Guang


Re: [ceph-users] PG folder hierarchy

2014-02-25 Thread Guang
Got it. Thanks Greg for the response!

Thanks,
Guang

On Feb 26, 2014, at 11:51 AM, Gregory Farnum g...@inktank.com wrote:

 On Tue, Feb 25, 2014 at 7:13 PM, Guang yguan...@yahoo.com wrote:
 Hello,
 Most recently when looking at PG's folder splitting, I found that there was
 only one sub folder in the top 3 / 4 levels and start having 16 sub folders
 starting from level 6, what is the design consideration behind this?
 
 For example, if the PG root folder is '3.1905_head', in the first level, it
 only has one sub folder 'DIR_5' and then one sub folder 'DIR_0', and then
 'DIR_9', under which there are two sub folders 'DIR_1' and 'DIR_9', starting
 from which, the next level has 16 sub folders.
 
 If we start splitting into 16 sub folders in the very first level, we may
 potential gain better performance with less dentry lookup (though most
 likely the root level been cached).
 
 It's an implementation detail of the FileStore (the part of the OSD
 that stores data in the filesystem). Each of those folders represents
 an ever-smaller division of the hash space that objects live in. The
 more PGs you have, the less hash space each one covers, so there's
 that trail of folders.
 It's a bit unfortunate, because as you mention it involves more
 metadata memory caching, but fixing it would require some fairly
 detailed code in a critical path. The cost of fixing it and the risk
 of breaking things haven't been worth it yet.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com



[ceph-users] Ceph GET latency

2014-02-18 Thread Guang Yang
Hi ceph-users,
We are using Ceph (radosgw) to store user-generated images. As GET latency is
critical for us, most recently I did some investigation over the GET path to
understand where the time is spent.

I first confirmed that the latency came from the OSD (read op), so we
instrumented the code to trace the GET request (the read op at the OSD side; to
be more specific, each object with size [512K + 4M * x] is split into [1 + x]
chunks, and each chunk needs one read op). Each read op needs to go through the
following steps:
    1. Dispatch and take by a op thread to process (process not started).
             0   – 20 ms,    94%
             20 – 50 ms,    2%
             50 – 100 ms,  2%
              100ms+   ,         2%
         For those having 20ms+ latency, half of them are due to waiting for pg 
lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089), another 
half are yet to be investigated.

    2. Get the file xattr (‘-’), which opens the file and populates the fd cache
(https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
              0   – 20 ms,  80%
              20 – 50 ms,   8%
              50 – 100 ms, 7%
              100ms+   ,      5%
          The latency comes from (in decreasing order): file path lookup
(https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file
open, or fd cache lookup / add.
          Currently objects are stored in level 6 or level 7 folders (due to
http://tracker.ceph.com/issues/7207, I stopped folder splitting).

    3. Get more xattrs; this is fast due to the previous fd cache (rarely > 1ms).

    4. Read the data.
            0   – 20 ms,   84%
            20 – 50 ms, 10%
            50 – 100 ms, 4%
            100ms+        , 2%

I decreased vfs_cache_pressure from its default value 100 to 5 to make the VFS
favor the dentry/inode cache over the page cache; unfortunately it did not help.
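
(For reference, that was done with the usual sysctl knob:

  sudo sysctl -w vm.vfs_cache_pressure=5

plus the matching entry in /etc/sysctl.conf to make it persistent.)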

Long story short, most of the long-latency read ops come from file system calls
(for cold data); as our workload mainly stores objects smaller than 500KB, it
generates a very large number of objects.

I would like to ask whether people have experienced a similar issue and whether
there is any suggestion I can try to boost GET performance. On the other hand,
PUT performance could be sacrificed.

Thanks,
Guang


Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2014-02-10 Thread Guang Yang
Thanks all for the help.

We finally identified the root cause of the issue: lock contention happening
during folder splitting. Here is the tracking ticket (thanks Inktank for the
fix!): http://tracker.ceph.com/issues/7207

Thanks,
Guang


On Tuesday, December 31, 2013 8:22 AM, Guang Yang yguan...@yahoo.com wrote:
 
Thanks Wido, my comments inline...

Date: Mon, 30 Dec 2013 14:04:35 +0100
From: Wido den Hollander w...@42on.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
    after running some time

On 12/30/2013 12:45 PM, Guang wrote:
 Hi ceph-users and ceph-devel,
 Merry Christmas and Happy New Year!

 We have a ceph cluster with radosgw, our customer is using S3 API to
 access the cluster.

 The basic information of the cluster is:
 bash-4.1$ ceph -s
    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    monmap e1: 3 mons at
 {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
 election epoch 40, quorum 0,1,2 osd151,osd152,osd153
    osdmap e129885: 787 osds: 758 up, 758 in
      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
 active+clean+scrubbing, 1 active+clean+inconsistent, 76
 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
 TB avail
    mdsmap e1: 0/0/1 up

 #When the latency peak happened, there was no scrubbing, recovering or
 backfilling at the moment.#

 While the performance of the cluster (only with WRITE traffic) is stable
 until Dec 25th, our monitoring (for radosgw access log) shows a
 significant increase of average latency and 99% latency.

 And then I chose one OSD and try to grep slow requests logs and find
 that most of the slow requests were waiting for subop, I take osd22 for
 example.

 osd[561-571] are hosted by osd22.
  -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | \
      grep "2013-12-25 16" | grep osd_op | grep -oP "\d+,\d+" ; done > ~/slow_osd.txt
  -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort -nr
    3586 656,598
      289 467,629
      284 598,763
      279 584,598
      203 172,598
      182 598,6
      155 629,646
      83 631,598
      65 631,593
      21 616,629
      20 609,671
      20 609,390
      13 609,254
      12 702,629
      12 629,641
      11 665,613
      11 593,724
      11 361,591
      10 591,709
        9 681,609
        9 609,595
        9 591,772
        8 613,662
        8 575,591
        7 674,722
        7 609,603
        6 585,605
        5 613,691
        5 293,629
        4 774,591
        4 717,591
        4 613,776
        4 538,629
        4 485,629
        3 702,641
        3 608,629
        3 593,580
        3 591,676

 It turns out most of the slow requests were waiting for osd 598, 629, I
 ran the procedure on another host osd22 and got the same pattern.

 Then I turned to the host having osd598 and dump the perf counter to do
 comparision.

 -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
 /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
 op_latency,subop_latency,total_ops
 0.192097526753471,0.0344513450167198,7549045
 1.99137797628122,1.42198426157216,9184472
 0.198062399664129,0.0387090378926376,6305973
 0.621697271315762,0.396549768986993,9726679
 29.5222496247375,18.246379615, 10860858
 0.229250239525916,0.0557482067611005,8149691
 0.208981698303654,0.0375553180438224,6623842
 0.47474766302086,0.292583928601509,9838777
 0.339477790083925,0.101288409388438,9340212
 0.186448840141895,0.0327296517417626,7081410
 0.807598201207144,0.0139762289702332,6093531
 (osd 598 is op hotspot as well)

 This double confirmed that osd 598 was having some performance issues
 (it has around *30 seconds average op latency*!).
 sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
 latency difference is not as significant as we saw from osd perf.
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3

 Another disk at the same time for comparison (/dev/sdb).
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8

 Any idea why a couple of OSDs are so slow that impact the performance of
 the entire cluster?


What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.

[yguang] We are running on xfs, journal and data share the same disk with 
different partitions.

Wido

 Thanks

Re: [ceph-users] RADOS + deep scrubbing performance issues in production environment

2014-02-03 Thread Guang
+ceph-users.

Does anybody have similar experience with scrubbing / deep-scrubbing?

Thanks,
Guang

On Jan 29, 2014, at 10:35 AM, Guang yguan...@yahoo.com wrote:

 Glad to see there are some discussion around scrubbing / deep-scrubbing.
 
 We are experiencing the same: scrubbing can affect latency quite a bit, and so
 far I have found two slow patterns (via dump_historic_ops): 1) waiting to be
 dispatched, 2) waiting in the op work queue to be fetched by an available op
 thread. For the first slow pattern, it looks like there is a lock involved (the
 dispatcher stops working for 2 seconds and then resumes, same for the scrubber
 thread); that needs further investigation. For the second slow pattern, as
 scrubbing brings in more ops (for the scrub checks), the op threads' workload
 increases (client ops have a lower priority). I think that could be improved by
 increasing the number of op threads, and I will confirm this analysis by adding
 more op threads and turning on scrubbing on a per-OSD basis.
 
 Does the above observation and analysis make sense?
 
 Thanks,
 Guang
 
 On Jan 29, 2014, at 2:13 AM, Filippos Giannakos philipg...@grnet.gr wrote:
 
 On Mon, Jan 27, 2014 at 10:45:48AM -0800, Sage Weil wrote:
 There is also 
 
 ceph osd set noscrub
 
 and then later
 
 ceph osd unset noscrub
 
 I forget whether this pauses an in-progress PG scrub or just makes it stop 
 when it gets to the next PG boundary.
 
 sage
 
 I bumped into those settings but I couldn't find any documentation about 
 them.
 When I first tried them, they didn't do anything immediately, so I thought 
 they
 weren't the answer. After your mention, I tried them again, and after a while
 the deep-scrubbing stopped. So I'm guessing they stop scrubbing on the next 
 PG
 boundary.
 
 I see from this thread and others before, that some people think it is a 
 spindle
 issue. I'm not sure that it is just that. Replicating it to an idle cluster 
 that
 can do more than 250MiB/seconds and pausing for 4-5 seconds on a single 
 request,
 sounds like an issue by itself. Maybe there is too much locking or not enough
 priority to the actual I/O ? Plus, that idea of throttling deep scrubbing 
 based
 on the iops sounds appealing.
 
 Kind Regards,
 -- 
 Filippos
 philipg...@grnet.gr
 



Re: [ceph-users] Ceph cluster is unreachable because of authentication failure

2014-01-19 Thread Guang
Thanks Sage.

I just captured part of the log (it was growing fast); the process did not hang,
but I saw the same pattern repeatedly. Should I increase the log level and send
the log over email (it reproduces constantly)?

Thanks,
Guang

On Jan 18, 2014, at 12:05 AM, Sage Weil s...@inktank.com wrote:

 On Fri, 17 Jan 2014, Guang wrote:
 Thanks Sage.
 
 I have further narrowed the problem down to #any command using the paxos
 service would hang#; the details follow:
 
 1. I am able to run ceph status / osd dump, etc.; however, the results are
 out of date (though I stopped all OSDs, that is not reflected in the ceph
 status report).
 
 -bash-4.1$ sudo ceph -s
  cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
   health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 
 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck 
 inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 
 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near 
 full osd(s); 57/751 in osds are down; 
 noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
   monmap e1: 3 mons at 
 {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
  election epoch 123278, quorum 0,1,2 osd151,osd152,osd153
   osdmap e134893: 781 osds: 694 up, 751 in
pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 
 stale+active+recovering, 5020 active+clean, 242 stale, 4352 
 active+recovery_wait, 616 stale+active+clean, 177 
 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 
 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 
 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 
 active+recovery_wait+degraded, 30 remapped+peering, 151 
 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 
 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 
 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
   mdsmap e1: 0/0/1 up
 
 2. If I run a command which uses paxos, the command will hang forever, this 
 includes, ceph osd set noup (and also including those commands osd send to 
 monitor when being started (create-or-add)).
 
 I attached the corresponding monitor log (it is like a bug).
 
 I see the osd set command coming through, but it arrives while paxos is 
 converging and the log seems to end before the mon would normally process 
 the delayed messages.  Is there a reason why the log fragment you attached
 ends there, or did the process hang or something?
 
 Thanks-
 sage
 
 I 
 
 On Jan 17, 2014, at 1:35 AM, Sage Weil s...@inktank.com wrote:
 
 Hi Guang,
 
 On Thu, 16 Jan 2014, Guang wrote:
 I still haven't had any luck figuring out what the problem causing the
 authentication failure is, so in order to get the cluster back, I tried:
 1. stop all daemons (mon  osd)
 2. change the configuration to disable cephx
 3. start mon daemons (3 in total)
 4. start osd daemon one by one
 
 After finishing step 3, the cluster can be reachable ('ceph -s' give 
 results):
 -bash-4.1$ sudo ceph -s
 cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
  health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 
 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck 
 inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 
 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near 
 full osd(s); 57/751 in osds are down; 
 noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
  monmap e1: 3 mons at 
 {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
  election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
  osdmap e134893: 781 osds: 694 up, 751 in
   pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 
 stale+active+recovering, 5020 active+clean, 242 stale, 4352 
 active+recovery_wait, 616 stale+active+clean, 177 
 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 
 86 down+peering, 1547 active+degraded, 32 
 stale+active+recovering+degraded, 648 stale+peering, 21 
 stale+down+peering, 239 stale+active+degraded, 651 
 active+recovery_wait+degraded, 30 remapped+peering, 151 
 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 
 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 
 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
  mdsmap e1: 0/0/1 up
 (at this point, all OSDs should be down).
 
 When I tried to start an OSD daemon, the startup script hung, and the
 hanging process is:
 root  80497  80496  0 08:18 pts/000:00:00 python /usr/bin/ceph 
 --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush 
 create-or-move -- 22 0.40 root=default host=osd173
 
 When I straced the startup script, I got the following trace (process
 75873 is the above process); it failed with a futex call and then went into an
 infinite loop:
  select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
 Any idea what might

Re: [ceph-users] Ceph cluster is unreachable because of authentication failure

2014-01-14 Thread Guang
Thanks Sage.

-bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok 
mon_status
{ name: osd151,
  rank: 2,
  state: electing,
  election_epoch: 85469,
  quorum: [],
  outside_quorum: [],
  extra_probe_peers: [],
  sync_provider: [],
  monmap: { epoch: 1,
  fsid: b9cb3ea9-e1de-48b4-9e86-6921e2c537d2,
  modified: 0.00,
  created: 0.00,
  mons: [
{ rank: 0,
  name: osd152,
  addr: 10.193.207.130:6789\/0},
{ rank: 1,
  name: osd153,
  addr: 10.193.207.131:6789\/0},
{ rank: 2,
  name: osd151,
  addr: 10.194.0.68:6789\/0}]}}

And:

-bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok 
quorum_status
{ election_epoch: 85480,
  quorum: [
0,
1,
2],
  quorum_names: [
osd151,
osd152,
osd153],
  quorum_leader_name: osd152,
  monmap: { epoch: 1,
  fsid: b9cb3ea9-e1de-48b4-9e86-6921e2c537d2,
  modified: 0.00,
  created: 0.00,
  mons: [
{ rank: 0,
  name: osd152,
  addr: 10.193.207.130:6789\/0},
{ rank: 1,
  name: osd153,
  addr: 10.193.207.131:6789\/0},
{ rank: 2,
  name: osd151,
  addr: 10.194.0.68:6789\/0}]}}


According to the above status, the election has finished and a leader has been selected.

Thanks,
Guang

On Jan 14, 2014, at 10:55 PM, Sage Weil s...@inktank.com wrote:

 On Tue, 14 Jan 2014, GuangYang wrote:
 Hi ceph-users and ceph-devel,
 I came across an issue after restarting monitors of the cluster, that 
 authentication fails which prevents running any ceph command.
 
 After we did some maintenance work, I restarted an OSD; however, I found that the
 OSD would not join the cluster automatically after being restarted, though a
 TCP dump showed it had already sent a message to the monitor asking to be added
 into the cluster.
 
 So I suspected there might be some issue with the monitors, and I restarted the
 monitors one by one (3 in total); however, after restarting the monitors, all
 ceph commands would fail with an authentication timeout:
 
 2014-01-14 12:00:30.499397 7fc7f195e700  0 monclient(hunting): authenticate 
 timed out after 300
 2014-01-14 12:00:30.499440 7fc7f195e700  0 librados: client.admin 
 authentication error (110) Connection timed out
 Error connecting to cluster: Error
 
 Any idea why such error happened (restarting OSD would result in the same 
 error)?
 
 I am thinking the authentication information is persisted on the mon's local
 disk; is there a chance that data got corrupted?
 
 That sounds unlikely, but you're right that the core problem is with the 
 mons.  What does 
 
 ceph daemon mon.`hostname` mon_status
 
 say?  Perhaps they are not forming a quorum and that is what is preventing 
 authentication.
 
 sage



[ceph-users] Fwd: [rgw - Bug #7073] (New) rgw gc max objs should have a prime number as default value

2014-01-01 Thread Guang
Hi ceph-users,
After reading through the GC-related code, I am thinking of using a much larger
value for rgw gc max objs (like 997), and I don't see any side effect if we
increase this value. Did I miss anything?
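
Concretely, the change would just be the following in ceph.conf on the radosgw
nodes (the section name depends on how your gateway instance is named; 997 is
simply the prime I picked):

  [client.radosgw.gateway]
      rgw gc max objs = 997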

Thanks,
Guang

Begin forwarded message:

 From: redm...@tracker.ceph.com
 Subject: [rgw - Bug #7073] (New) rgw gc max objs should have a prime number 
 as default value
 Date: December 31, 2013 3:28:53 PM GMT+08:00
 
 Issue #7073 has been reported by Guang Yang.
 Bug #7073: rgw gc max objs should have a prime number as default value
 Author: Guang Yang
 Status: New
 Priority: Normal
 Assignee:
 Category:
 Target version:
 Source: other
 Backport:
 Tags:
 Severity: 3 - minor
 Reviewed:
  Recently, when we were troubleshooting increasing latency on our Ceph cluster, we
  observed that a couple of gc objects were hotspots which slowed down the entire
  OSD. After checking the .rgw.gc pool, we found a couple of gc objects had tens of
  thousands of entries while the other gc objects had zero entries.
 
  The problem is that we have a bad default value (32) for rgw gc max objs.
 
 The data flow is:
  1. Each object has an object ID with the pattern
  {client_id}.{per-request increasing number}; a sample is: 0_default.4351.24557.
  2. For each delete request, it needs to set a gc entry for the object, which
  works as follows:
  2.1 hash the object ID to figure out which gc object to use (0 – 31)
  2.2 set two entries on that gc object.
  
  The problem comes from step 2.1: as the default max objs is 32, each string's
  (object tag's) hashed value is taken mod 32, which results in an uneven
  distribution; it should definitely use a prime number to get an even
  distribution.
 
  I wrote a small program to simulate the above:
  #include <iostream>
  #include <sstream>
  #include <string>
  using namespace std;
  
  // string hash used for the simulation
  unsigned str_hash(const char* str, unsigned length) {
      unsigned long hash = 0;
      while (length--) {
          unsigned char c = *str++;
          hash = (hash + (c << 4) + (c >> 4)) * 11;
      }
      return hash;
  }
  
  int main() {
      int gc_old[32] = {0};   // per-gc-object counts with 32 gc objects
      int gc_new[31] = {0};   // per-gc-object counts with 31 (prime) gc objects
      string base("0_default.4351.");
      ostringstream os;
      for (int i = 0; i < 10000; ++i) {
          os.clear();
          os << i;
          string tag = base + os.str();
          unsigned n = str_hash(tag.c_str(), tag.size());
          gc_old[n%32]++;
          gc_new[n%31]++;
      }
  
      cout << "with use max objs 32..." << endl;
      for (int i = 0; i < 32; ++i) {
          cout << "gc." << i << ": " << gc_old[i] << endl;
      }
      cout << "with use max objs 31..." << endl;
      for (int i = 0; i < 31; ++i) {
          cout << "gc." << i << ": " << gc_new[i] << endl;
      }
      return 0;
  }
 output of the program is:
  with use max objs 32...
  gc.0: 0
  gc.1: 0
  gc.2: 2317
  gc.3: 58
  gc.4: 0
  gc.5: 0
  gc.6: 68
  gc.7: 57
  gc.8: 0
  gc.9: 0
  gc.10: 68
  gc.11: 57
  gc.12: 0
  gc.13: 0
  gc.14: 67
  gc.15: 57
  gc.16: 0
  gc.17: 0
  gc.18: 2319
  gc.19: 55
  gc.20: 0
  gc.21: 0
  gc.22: 69
  gc.23: 57
  gc.24: 0
  gc.25: 0
  gc.26: 4569
  gc.27: 58
  gc.28: 0
  gc.29: 0
  gc.30: 68
  gc.31: 56
  with use max objs 31...
  gc.0: 322
  gc.1: 287
  gc.2: 307
  gc.3: 315
  gc.4: 345
  gc.5: 333
  gc.6: 333
  gc.7: 323
  gc.8: 297
  gc.9: 324
  gc.10: 316
  gc.11: 354
  gc.12: 313
  gc.13: 331
  gc.14: 314
  gc.15: 312
  gc.16: 335
  gc.17: 320
  gc.18: 337
  gc.19: 317
  gc.20: 316
  gc.21: 340
  gc.22: 330
  gc.23: 322
  gc.24: 306
  gc.25: 350
  gc.26: 332
  gc.27: 327
  gc.28: 309
  gc.29: 292
  gc.30: 341
  In order to avoid the hotspot, we should choose a prime number as the default
  value and clearly document that if users need to change the value, they should
  choose a prime number to get better performance.
 



Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-31 Thread Guang Yang
Thanks Wido, my comments inline...

Date: Mon, 30 Dec 2013 14:04:35 +0100
From: Wido den Hollander w...@42on.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
    after running some time

On 12/30/2013 12:45 PM, Guang wrote:
 Hi ceph-users and ceph-devel,
 Merry Christmas and Happy New Year!

 We have a ceph cluster with radosgw, our customer is using S3 API to
 access the cluster.

 The basic information of the cluster is:
 bash-4.1$ ceph -s
    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    monmap e1: 3 mons at
 {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
 election epoch 40, quorum 0,1,2 osd151,osd152,osd153
    osdmap e129885: 787 osds: 758 up, 758 in
      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
 active+clean+scrubbing, 1 active+clean+inconsistent, 76
 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
 TB avail
    mdsmap e1: 0/0/1 up

 #When the latency peak happened, there was no scrubbing, recovering or
 backfilling at the moment.#

 While the performance of the cluster (only with WRITE traffic) is stable
 until Dec 25th, our monitoring (for radosgw access log) shows a
 significant increase of average latency and 99% latency.

 And then I chose one OSD and try to grep slow requests logs and find
 that most of the slow requests were waiting for subop, I take osd22 for
 example.

 osd[561-571] are hosted by osd22.
 -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | \
     grep "2013-12-25 16" | grep osd_op | grep -oP "\d+,\d+" ; done > ~/slow_osd.txt
 -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort -nr
    3586 656,598
      289 467,629
      284 598,763
      279 584,598
      203 172,598
      182 598,6
      155 629,646
      83 631,598
      65 631,593
      21 616,629
      20 609,671
      20 609,390
      13 609,254
      12 702,629
      12 629,641
      11 665,613
      11 593,724
      11 361,591
      10 591,709
        9 681,609
        9 609,595
        9 591,772
        8 613,662
        8 575,591
        7 674,722
        7 609,603
        6 585,605
        5 613,691
        5 293,629
        4 774,591
        4 717,591
        4 613,776
        4 538,629
        4 485,629
        3 702,641
        3 608,629
        3 593,580
        3 591,676

 It turns out most of the slow requests were waiting for osd 598, 629, I
 ran the procedure on another host osd22 and got the same pattern.

 Then I turned to the host having osd598 and dump the perf counter to do
 comparision.

 -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
 /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
 op_latency,subop_latency,total_ops
 0.192097526753471,0.0344513450167198,7549045
 1.99137797628122,1.42198426157216,9184472
 0.198062399664129,0.0387090378926376,6305973
 0.621697271315762,0.396549768986993,9726679
 29.5222496247375,18.246379615, 10860858
 0.229250239525916,0.0557482067611005,8149691
 0.208981698303654,0.0375553180438224,6623842
 0.47474766302086,0.292583928601509,9838777
 0.339477790083925,0.101288409388438,9340212
 0.186448840141895,0.0327296517417626,7081410
 0.807598201207144,0.0139762289702332,6093531
 (osd 598 is op hotspot as well)

 This double confirmed that osd 598 was having some performance issues
 (it has around *30 seconds average op latency*!).
 sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
 latency difference is not as significant as we saw from osd perf.
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3

 Another disk at the same time for comparison (/dev/sdb).
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8

 Any idea why a couple of OSDs are so slow that impact the performance of
 the entire cluster?


What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.

[yguang] We are running on xfs, journal and data share the same disk with 
different partitions.

Wido

 Thanks,
 Guang





Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-31 Thread Guang Yang
Thanks Mark, my comments inline...

Date: Mon, 30 Dec 2013 07:36:56 -0600
From: Mark Nelson mark.nel...@inktank.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
    after running some time

On 12/30/2013 05:45 AM, Guang wrote:
 Hi ceph-users and ceph-devel,
 Merry Christmas and Happy New Year!

 We have a ceph cluster with radosgw, our customer is using S3 API to
 access the cluster.

 The basic information of the cluster is:
 bash-4.1$ ceph -s
    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    monmap e1: 3 mons at
 {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
 election epoch 40, quorum 0,1,2 osd151,osd152,osd153
    osdmap e129885: 787 osds: 758 up, 758 in
      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
 active+clean+scrubbing, 1 active+clean+inconsistent, 76
 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
 TB avail
    mdsmap e1: 0/0/1 up

 #When the latency peak happened, there was no scrubbing, recovering or
 backfilling at the moment.#

 While the performance of the cluster (only with WRITE traffic) is stable
 until Dec 25th, our monitoring (for radosgw access log) shows a
 significant increase of average latency and 99% latency.

 And then I chose one OSD and try to grep slow requests logs and find
 that most of the slow requests were waiting for subop, I take osd22 for
 example.

 osd[561-571] are hosted by osd22.
 -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | \
     grep "2013-12-25 16" | grep osd_op | grep -oP "\d+,\d+" ; done > ~/slow_osd.txt
 -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort -nr
    3586 656,598
      289 467,629
      284 598,763
      279 584,598
      203 172,598
      182 598,6
      155 629,646
      83 631,598
      65 631,593
      21 616,629
      20 609,671
      20 609,390
      13 609,254
      12 702,629
      12 629,641
      11 665,613
      11 593,724
      11 361,591
      10 591,709
        9 681,609
        9 609,595
        9 591,772
        8 613,662
        8 575,591
        7 674,722
        7 609,603
        6 585,605
        5 613,691
        5 293,629
        4 774,591
        4 717,591
        4 613,776
        4 538,629
        4 485,629
        3 702,641
        3 608,629
        3 593,580
        3 591,676

 It turns out most of the slow requests were waiting for osd 598, 629, I
 ran the procedure on another host osd22 and got the same pattern.

 Then I turned to the host having osd598 and dump the perf counter to do
 comparision.

 -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
 /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
 op_latency,subop_latency,total_ops
 0.192097526753471,0.0344513450167198,7549045
 1.99137797628122,1.42198426157216,9184472
 0.198062399664129,0.0387090378926376,6305973
 0.621697271315762,0.396549768986993,9726679
 29.5222496247375,18.246379615, 10860858
 0.229250239525916,0.0557482067611005,8149691
 0.208981698303654,0.0375553180438224,6623842
 0.47474766302086,0.292583928601509,9838777
 0.339477790083925,0.101288409388438,9340212
 0.186448840141895,0.0327296517417626,7081410
 0.807598201207144,0.0139762289702332,6093531
 (osd 598 is op hotspot as well)

 This double confirmed that osd 598 was having some performance issues
 (it has around *30 seconds average op latency*!).
 sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
 latency difference is not as significant as we saw from osd perf.
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3

 Another disk at the same time for comparison (/dev/sdb).
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8

 Any idea why a couple of OSDs are so slow that impact the performance of
 the entire cluster?

You may want to use the dump_historic_ops command in the admin socket 
for the slow OSDs.  That will give you some clues regarding where the 
ops are hanging up in the OSD.  You can also crank the osd debugging way 
up on that node and search through the logs to see if there are any 
patterns or trends (consistent slowness, pauses, etc).  It may also be 
useful to look and see if that OSD is pegging CPU and if so attach 
strace or perf to it and see what it's doing.
[yguang] We have a job that collects dump_historic_ops output, but unfortunately it 
wasn't running at the time (my bad), and as we are using this as a pre-production system...
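
For reference, a minimal sketch of the commands Mark describes above (osd.598 is 
just an example id, and the admin-socket path assumes the default layout; adjust 
for your setup):

  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.598.asok dump_historic_ops
  ceph tell osd.598 injectargs '--debug-osd 20 --debug-ms 1'   # crank logging up temporarily
  sudo perf top -p $(pgrep -f 'ceph-osd -i 598')               # check whether the daemon is pegging CPU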

[ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-30 Thread Guang
Hi ceph-users and ceph-devel,
Merry Christmas and Happy New Year!

We have a Ceph cluster with radosgw; our customer uses the S3 API to access the 
cluster.

The basic information of the cluster is:
bash-4.1$ ceph -s
  cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
   health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
   monmap e1: 3 mons at 
{osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
 election epoch 40, quorum 0,1,2 osd151,osd152,osd153
   osdmap e129885: 787 osds: 758 up, 758 in
pgmap v1884502: 22203 pgs: 22125 active+clean, 1 active+clean+scrubbing, 1 
active+clean+inconsistent, 76 active+clean+scrubbing+deep; 96319 GB data, 302 
TB used, 762 TB / 1065 TB avail
   mdsmap e1: 0/0/1 up

#When the latency peak happened, there was no scrubbing, recovering or 
backfilling at the moment.#

While the performance of the cluster (WRITE traffic only) was stable until 
Dec 25th, our monitoring (of the radosgw access log) has since shown a significant 
increase in average latency and 99th-percentile latency.

I then picked one OSD host and grepped its logs for slow requests, and found that 
most of the slow requests were waiting for a subop; I take osd22 as an example.

osd[561-571] are hosted by osd22.
-bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | grep 
"2013-12-25 16" | grep osd_op | grep -oP '\d+,\d+' ; done > ~/slow_osd.txt
-bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort -nr
   3586 656,598
289 467,629
284 598,763
279 584,598
203 172,598
182 598,6
155 629,646
 83 631,598
 65 631,593
 21 616,629
 20 609,671
 20 609,390
 13 609,254
 12 702,629
 12 629,641
 11 665,613
 11 593,724
 11 361,591
 10 591,709
  9 681,609
  9 609,595
  9 591,772
  8 613,662
  8 575,591
  7 674,722
  7 609,603
  6 585,605
  5 613,691
  5 293,629
  4 774,591
  4 717,591
  4 613,776
  4 538,629
  4 485,629
  3 702,641
  3 608,629
  3 593,580
  3 591,676

It turns out most of the slow requests were waiting for osd 598 and 629; I ran the 
same procedure (as on osd22) against another host and got the same pattern.

Then I turned to the host holding osd.598 and dumped the perf counters to do a 
comparison.

-bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon 
/var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
op_latency,subop_latency,total_ops
0.192097526753471,0.0344513450167198,7549045
1.99137797628122,1.42198426157216,9184472
0.198062399664129,0.0387090378926376,6305973
0.621697271315762,0.396549768986993,9726679
29.5222496247375,18.246379615, 10860858
0.229250239525916,0.0557482067611005,8149691
0.208981698303654,0.0375553180438224,6623842
0.47474766302086,0.292583928601509,9838777
0.339477790083925,0.101288409388438,9340212
0.186448840141895,0.0327296517417626,7081410
0.807598201207144,0.0139762289702332,6093531
(osd 598 is op hotspot as well)

This double-confirmed that osd 598 was having performance issues (it has an average 
op latency of around *30 seconds*!).
sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the latency 
difference is not as significant as we saw from osd perf.
reads  kbread  writes  kbwrite  %busy  avgqu  await  svctm
37.3   459.9   89.8    4106.9   61.8   1.6    12.2   4.9
42.3   545.8   91.8    4296.3   69.7   2.4    17.6   5.2
42.0   483.8   93.1    4263.6   68.8   1.8    13.3   5.1
39.7   425.5   89.4    4327.0   68.5   1.8    14.0   5.3

Another disk at the same time for comparison (/dev/sdb).
reads  kbread  writes  kbwrite  %busy  avgqu  await  svctm
34.2   502.6   80.1    3524.3   53.4   1.3    11.8   4.7
35.3   560.9   83.7    3742.0   56.0   1.2    9.8    4.7
30.4   371.5   78.8    3631.4   52.2   1.7    15.8   4.8
33.0   389.4   78.8    3597.6   54.2   1.4    12.1   4.8

Any idea why a couple of OSDs are so slow that they impact the performance of the 
entire cluster?

Thanks,
Guang___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-30 Thread Guang Yang
Thanks Wido, my comments inline...

Date: Mon, 30 Dec 2013 14:04:35 +0100
From: Wido den Hollander w...@42on.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
    after running some time

On 12/30/2013 12:45 PM, Guang wrote:
 Hi ceph-users and ceph-devel,
 Merry Christmas and Happy New Year!

 We have a ceph cluster with radosgw, our customer is using S3 API to
 access the cluster.

 The basic information of the cluster is:
 bash-4.1$ ceph -s
    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    monmap e1: 3 mons at
 {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
 election epoch 40, quorum 0,1,2 osd151,osd152,osd153
    osdmap e129885: 787 osds: 758 up, 758 in
      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
 active+clean+scrubbing, 1 active+clean+inconsistent, 76
 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
 TB avail
    mdsmap e1: 0/0/1 up

 #When the latency peak happened, there was no scrubbing, recovering or
 backfilling at the moment.#

 While the performance of the cluster (only with WRITE traffic) is stable
 until Dec 25th, our monitoring (for radosgw access log) shows a
 significant increase of average latency and 99% latency.

 And then I chose one OSD and try to grep slow requests logs and find
 that most of the slow requests were waiting for subop, I take osd22 for
 example.

 osd[561-571] are hosted by osd22.
 -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
 grep "2013-12-25 16" | grep osd_op | grep -oP '\d+,\d+' ; done > ~/slow_osd.txt
 -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort -nr
    3586 656,598
      289 467,629
      284 598,763
      279 584,598
      203 172,598
      182 598,6
      155 629,646
      83 631,598
      65 631,593
      21 616,629
      20 609,671
      20 609,390
      13 609,254
      12 702,629
      12 629,641
      11 665,613
      11 593,724
      11 361,591
      10 591,709
        9 681,609
        9 609,595
        9 591,772
        8 613,662
        8 575,591
        7 674,722
        7 609,603
        6 585,605
        5 613,691
        5 293,629
        4 774,591
        4 717,591
        4 613,776
        4 538,629
        4 485,629
        3 702,641
        3 608,629
        3 593,580
        3 591,676

 It turns out most of the slow requests were waiting for osd 598, 629, I
 ran the procedure on another host osd22 and got the same pattern.

 Then I turned to the host having osd598 and dump the perf counter to do
 comparison.

 -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
 /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
 op_latency,subop_latency,total_ops
 0.192097526753471,0.0344513450167198,7549045
 1.99137797628122,1.42198426157216,9184472
 0.198062399664129,0.0387090378926376,6305973
 0.621697271315762,0.396549768986993,9726679
 29.5222496247375,18.246379615, 10860858
 0.229250239525916,0.0557482067611005,8149691
 0.208981698303654,0.0375553180438224,6623842
 0.47474766302086,0.292583928601509,9838777
 0.339477790083925,0.101288409388438,9340212
 0.186448840141895,0.0327296517417626,7081410
 0.807598201207144,0.0139762289702332,6093531
 (osd 598 is op hotspot as well)

 This double confirmed that osd 598 was having some performance issues
 (it has around *30 seconds average op latency*!).
 sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
 latency difference is not as significant as we saw from osd perf.
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3

 Another disk at the same time for comparison (/dev/sdb).
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8

 Any idea why a couple of OSDs are so slow that impact the performance of
 the entire cluster?


What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.

[yguang] We are running on XFS; the journal and data share the same disk on 
different partitions.

Wido

 Thanks,___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 'ceph osd reweight' VS 'ceph osd crush reweight'

2013-12-11 Thread Guang Yang
Hello ceph-users,
I am a little bit confused by these two options. I understand that crush reweight 
determines the weight of the OSD in the CRUSH map, so it impacts I/O and 
utilization; however, I am still confused by the osd reweight option: is it 
something that controls the I/O distribution across different OSDs on a single 
host?
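
For comparison, a minimal sketch of how the two commands are invoked (osd.598 is 
just an example id, and the weights are illustrative):

  ceph osd crush reweight osd.598 1.0   # CRUSH weight: capacity-based placement in the CRUSH map
  ceph osd reweight 598 0.8             # override weight: a value between 0 and 1 applied on top of CRUSH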

While looking at the code, I only found that if 'osd weight' is 1 (0x1), it 
means the osd is up and if it is 0, it means the osd is down.

Please advise...

Thanks,
Guang___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Expanding ceph cluster by adding more OSDs

2013-11-02 Thread Guang
Hi Kyle,
Thanks for your response. Though I haven't tested it, my gut feeling is the 
same: changing the PG number may result in re-shuffling of the data.

In terms of the strategy you mentioned to expand a cluster, I have a few 
questions:
  1. By adding a LITTLE more weight each time, my understanding is that the goal is 
to reduce the load on the OSD being added, is that right? If so, can we use the 
throttle settings to achieve the same goal?
  2. If I would like to expand the cluster by 30% capacity every quarter, with such 
an approach it might take a long time to add the new capacity, is my understanding 
correct?
  3. Is there any automated tool to do this, or will I need to closely monitor, 
dump the crush rule, edit it and push it back?

I am testing a scenario of adding one OSD at a time (I have 330 OSDs in total), 
with the default weight. There are a couple of observations: 1) the recovery 
starts quickly (several hundred MB/s) and then slows down to around 10MB/s; 
2) it impacts the online traffic quite a lot (from my observation, mainly for the 
recovering PGs).

I tried to search for best practices for expanding a cluster, with no luck; would 
anybody like to share their experience? Thanks very much.

Thanks,
Guang

Date: Thu, 10 Oct 2013 05:15:27 -0700
From: Kyle Bader kyle.ba...@gmail.com
To: ceph-users@lists.ceph.com ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Expanding ceph cluster by adding more OSDs
Message-ID:
cafmfnwq+hbgsezme3vwom_gqcwikd1393rxc+xb0xgt4nxq...@mail.gmail.com
Content-Type: text/plain; charset=utf-8

I've contracted and expanded clusters by up to a rack of 216 OSDs - 18
nodes, 12 drives each.  New disks are configured with a CRUSH weight of 0
and I slowly add weight (0.1 to 0.01 increments), wait for the cluster to
become active+clean and then add more weight. I was expanding after
contraction so my PG count didn't need to be corrected, I tend to be
liberal and opt for more PGs.  If I hadn't contracted the cluster prior to
expanding it I would probably add PGs after all the new OSDs have finished
being weighted into the cluster.
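
A minimal sketch of one iteration of the approach described above (assuming a newly 
added osd.330 that starts at CRUSH weight 0; the increments are illustrative):

  ceph osd crush reweight osd.330 0.1   # nudge the weight up a little
  ceph -w                               # watch until all PGs are back to active+clean
  ceph osd crush reweight osd.330 0.2   # then repeat until the target weight is reached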


On Wed, Oct 9, 2013 at 8:55 PM, Michael Lowe j.michael.l...@gmail.comwrote:

 I had those same questions, I think the answer I got was that it was
 better to have too few pg's than to have overloaded osd's.  So add osd's
 then add pg's.  I don't know the best increments to grow in, probably
 depends largely on the hardware in your osd's.
 
 Sent from my iPad
 
 On Oct 9, 2013, at 11:34 PM, Guang yguan...@yahoo.com wrote:
 
 Thanks Mike. I get your point.
 
 There are still a few things confusing me:
 1) We expand Ceph cluster by adding more OSDs, which will trigger
 re-balance PGs across the old  new OSDs, and likely it will break the
 optimized PG numbers for the cluster.
  2) We can add more PGs which will trigger re-balance objects across
 old  new PGs.
 
 So:
 1) What is the recommended way to expand the cluster by adding OSDs
 (and potentially adding PGs), should we do them at the same time?
 2) What is the recommended way to scale a cluster from like 1PB to 2PB,
 should we scale it to like 1.1PB to 1.2PB or move to 2PB directly?
 
 Thanks,
 Guang
 
 On Oct 10, 2013, at 11:10 AM, Michael Lowe wrote:
 
 There used to be, can't find it right now.  Something like 'ceph osd
 set pg_num num' then 'ceph osd set pgp_num num' to actually move your
 data into the new pg's.  I successfully did it several months ago, when
 bobtail was current.
 
 Sent from my iPad
 
 On Oct 9, 2013, at 10:30 PM, Guang yguan...@yahoo.com wrote:
 
 Thanks Mike.
 
 Is there any documentation for that?
 
 Thanks,
 Guang
 
 On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote:
 
 You can add PGs,  the process is called splitting.  I don't think PG
 merging, the reduction in the number of PGs, is ready yet.
 
 On Oct 8, 2013, at 11:58 PM, Guang yguan...@yahoo.com wrote:
 
 Hi ceph-users,
 Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas,
 per my understanding, the number of PGs for a pool should be fixed even we
 scale out / in the cluster by adding / removing OSDs, does that mean if we
 double the OSD numbers, the PG number for a pool is not optimal any more
 and there is no chance to correct it?
 
 
 Thanks,
 Guang
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Adding a new OSD crash the monitors (assertion failure)

2013-10-30 Thread Guang
Hi all,
Today I tried to add a new OSD into the cluster, and it immediately crashed the 
monitors.

Platform: RHEL6.4

Steps to add the new OSD:
  1. sudo ceph-disk zap /dev/sdh
  2. sudo ceph-disk activate /dev/sdh
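
For reference, the usual ceph-disk sequence also includes a prepare step between 
zap and activate; a minimal sketch, assuming the same device and the data 
partition it creates:

  sudo ceph-disk prepare /dev/sdh
  sudo ceph-disk activate /dev/sdh1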

Then the monitor got crashed with the following logs:



2013-10-30 02:17:14.252726 7f44395a9700  0 mon.ceph2@0(leader) e2 handle_command 
mon_command({"prefix": "osd crush create-or-move", "args": ["root=default", 
"host=ceph8"], "id": 24, "weight": 0.40998} v 0) v1
2013-10-30 02:17:14.252792 7f44395a9700  1 mon.ceph2@0(leader).paxos(paxos 
active c 322285..323030) is_readable now=2013-10-30 02:17:14.252794 
lease_expire=2013-10-30 02:17:19.063672 has v0 lc 323030
2013-10-30 02:17:14.252911 7f44395a9700  0 mon.ceph2@0(leader).osd e916 
create-or-move crush item name 'osd.24' initial_weight 0.41 at location 
{host=ceph8,root=default}
2013-10-30 02:17:14.255347 7f44395a9700 -1 crush/CrushWrapper.cc: In function 
'int CrushWrapper::insert_item(CephContext*, int, float, std::string, const 
std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, 
std::basic_string<char, std::char_traits<char>, std::allocator<char> >, 
std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, 
std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >, std::basic_string<char, std::char_traits<char>, 
std::allocator<char> > > > >&)' thread 7f44395a9700 
time 2013-10-30 02:17:14.253030
crush/CrushWrapper.cc: 413: FAILED assert(!r)

 ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
 1: (CrushWrapper::insert_item(CephContext*, int, float, std::string, 
std::map<std::string, std::string, std::less<std::string>, 
std::allocator<std::pair<std::string const, std::string> > > const&)+0x14b4) 
[0x6b9514]
 2: (CrushWrapper::create_or_move_item(CephContext*, int, float, std::string, 
std::map<std::string, std::string, std::less<std::string>, 
std::allocator<std::pair<std::string const, std::string> > > const&)+0x2d6) 
[0x6ba0f6]
 3: (OSDMonitor::prepare_command(MMonCommand*)+0x150a) [0x5aa89a]
 4: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x20b) [0x5b2e2b]
 5: (PaxosService::dispatch(PaxosServiceMessage*)+0xa20) [0x58bea0]
 6: (Monitor::handle_command(MMonCommand*)+0xdec) [0x557ddc]
 7: (Monitor::_ms_dispatch(Message*)+0xc2f) [0x5600af]
 8: (Monitor::handle_forward(MForward*)+0x990) [0x55f0c0]
 9: (Monitor::_ms_dispatch(Message*)+0xd53) [0x5601d3]
 10: (Monitor::ms_dispatch(Message*)+0x32) [0x578742]
 11: (DispatchQueue::entry()+0x5a2) [0x7bdcc2]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7b690d]
 13: /lib64/libpthread.so.0() [0x3208a07851]
 14: (clone()+0x6d) [0x32086e890d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.

Has anyone else come across the same issue? Or am I missing anything when adding a 
new OSD?

Thanks,
Guang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding a new OSD crash the monitors (assertion failure)

2013-10-30 Thread Guang
I just found the trick..

When I am using the default crush map, which uses the straw bucket type, things are 
good. However, for the error I posted below, it is using the tree bucket type.

Is it related?

Thanks,
Guang

On Oct 30, 2013, at 6:52 PM, Guang wrote:

 Hi all,
 Today I tried to add a new OSD into the cluster and immediately it get the 
 monitors crashed.
 
 Platform: RHEL6.4
 
 Steps to add new monitor:
   1. sudo ceph-disk zap /dev/sdh
   2. sudo ceph-disk activate /dev/sdh
 
 Then the monitor got crashed with the following logs:
 
 
 
 013-10-30 02:17:14.252726 7f44395a9700  0 mon.ceph2@0(leader) e2 
 handle_command mon_command({prefix: osd crush create-or-move, args: 
 [root=default, host=ceph8], id: 24, weight: 0.40998} v 0) 
 v1
 2013-10-30 02:17:14.252792 7f44395a9700  1 mon.ceph2@0(leader).paxos(paxos 
 active c 322285..323030) is_readable now=2013-10-30 02:17:14.252794 
 lease_expire=2013-10-30 02:17:19.063672 has v0 lc 323030
 2013-10-30 02:17:14.252911 7f44395a9700  0 mon.ceph2@0(leader).osd e916 
 create-or-move crush item name 'osd.24' initial_weight 0.41 at location 
 {host=ceph8,root=default}
 2013-10-30 02:17:14.255347 7f44395a9700 -1 crush/CrushWrapper.cc: In function 
 'int CrushWrapper::insert_item(CephContext*, int, float, std::string, const 
 std::mapstd::basic_stringchar, std::char_traitschar, std::allocatorchar 
 , std::basic_stringchar, std::char_traitschar, std::allocatorchar , 
 std::lessstd::basic_stringchar, std::char_traitschar, 
 std::allocatorchar  , std::allocatorstd::pairconst 
 std::basic_stringchar, std::char_traitschar, std::allocatorchar , 
 std::basic_stringchar, std::char_traitschar, std::allocatorchar
 )' thread 7f44395a9700 time 2013-10-30 02:17:14.253030
 crush/CrushWrapper.cc: 413: FAILED assert(!r)
 
  ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
  1: (CrushWrapper::insert_item(CephContext*, int, float, std::string, 
 std::mapstd::string, std::string, std::lessstd::string, 
 std::allocatorstd::pairstd::string const, std::string   const)+0x14b4) 
 [0x6b9514]
  2: (CrushWrapper::create_or_move_item(CephContext*, int, float, std::string, 
 std::mapstd::string, std::string, std::lessstd::string, 
 std::allocatorstd::pairstd::string const, std::string   const)+0x2d6) 
 [0x6ba0f6]
  3: (OSDMonitor::prepare_command(MMonCommand*)+0x150a) [0x5aa89a]
  4: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x20b) [0x5b2e2b]
  5: (PaxosService::dispatch(PaxosServiceMessage*)+0xa20) [0x58bea0]
  6: (Monitor::handle_command(MMonCommand*)+0xdec) [0x557ddc]
  7: (Monitor::_ms_dispatch(Message*)+0xc2f) [0x5600af]
  8: (Monitor::handle_forward(MForward*)+0x990) [0x55f0c0]
  9: (Monitor::_ms_dispatch(Message*)+0xd53) [0x5601d3]
  10: (Monitor::ms_dispatch(Message*)+0x32) [0x578742]
  11: (DispatchQueue::entry()+0x5a2) [0x7bdcc2]
  12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7b690d]
  13: /lib64/libpthread.so.0() [0x3208a07851]
  14: (clone()+0x6d) [0x32086e890d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.
 
 Anyone else came across the same issue? Or am I missing anything when add a 
 new OSD?
 
 Thanks,
 Guang
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-24 Thread Guang Yang
Hi Mark, Greg and Kyle,
Sorry for the late response, and thanks for providing the directions for me to 
look at.

We have exactly the same setup for the OSDs and pool replicas (and I even tried to 
create the same number of PGs within the small cluster); however, I can still 
reproduce this consistently.

This is the command I run:
$ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write

With 24 OSDs:
Average Latency: 0.00494123
Max latency: 0.511864
Min latency:  0.002198

With 330 OSDs:
Average Latency:0.00913806
Max latency: 0.021967
Min latency:  0.005456

In terms of the crush rule, we are using the default one. The small cluster has 3 
OSD hosts (11 + 11 + 2), while the large cluster has 30 OSD hosts (11 * 30).

I have a couple of questions:
 1. Is it possible that the latency is due to the fact that we only have a 
three-layer hierarchy, like root - host - OSD? As we are using the straw bucket 
type (the default), which has O(N) selection speed, the computation grows as the 
number of hosts increases. I suspect not, as the computation is on the order of 
microseconds per my understanding.

 2. Is it possible that because we have more OSDs, the cluster needs to maintain 
far more connections between OSDs, which potentially slows things down?

 3. Anything else I might have missed?

Thanks all for the constant help.

Guang  


On Oct 22, 2013, at 10:22 PM, Guang Yang yguan...@yahoo.com wrote:

 Hi Kyle and Greg,
 I will get back to you with more details tomorrow, thanks for the response.
 
 Thanks,
 Guang
 On Oct 22, 2013, at 9:37 AM, Kyle Bader kyle.ba...@gmail.com wrote:
 
 Besides what Mark and Greg said it could be due to additional hops through 
 network devices. What network devices are you using, what is the network  
 topology and does your CRUSH map reflect the network topology?
 
 On Oct 21, 2013 9:43 AM, Gregory Farnum g...@inktank.com wrote:
 On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang yguan...@yahoo.com wrote:
  Dear ceph-users,
  Recently I deployed a ceph cluster with RadosGW, from a small one (24 
  OSDs) to a much bigger one (330 OSDs).
 
  When using rados bench to test the small cluster (24 OSDs), it showed the 
  average latency was around 3ms (object size is 5K), while for the larger 
  one (330 OSDs), the average latency was around 7ms (object size 5K), twice 
  comparing the small cluster.
 
  The OSD within the two cluster have the same configuration, SAS disk,  and 
  two partitions for one disk, one for journal and the other for metadata.
 
  For PG numbers, the small cluster tested with the pool having 100 PGs, and 
  for the large cluster, the pool has 4 PGs (as I will to further scale 
  the cluster, so I choose a much large PG).
 
  Does my test result make sense? Like when the PG number and OSD increase, 
  the latency might drop?
 
 Besides what Mark said, can you describe your test in a little more
 detail? Writing/reading, length of time, number of objects, etc.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-24 Thread Guang Yang
Thanks Mark.

I cannot connect to my hosts right now; I will do the check and get back to you tomorrow.

Thanks,
Guang

On Oct 24, 2013, at 9:47 PM, Mark Nelson mark.nel...@inktank.com wrote:

 On 10/24/2013 08:31 AM, Guang Yang wrote:
 Hi Mark, Greg and Kyle,
 Sorry to response this late, and thanks for providing the directions for 
 me to look at.
 
 We have exact the same setup for OSD, pool replica (and even I tried to 
 create the same number of PGs within the small cluster), however, I can 
 still reproduce this constantly.
 
 This is the command I run:
 $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write
 
 With 24 OSDs:
 Average Latency: 0.00494123
 Max latency: 0.511864
 Min latency:  0.002198
 
 With 330 OSDs:
 Average Latency:0.00913806
 Max latency: 0.021967
 Min latency:  0.005456
 
 In terms of the crush rule, we are using the default one, for the small 
 cluster, it has 3 OSD hosts (11 + 11 + 2), for the large cluster, we 
 have 30 OSD hosts (11 * 30).
 
 I have a couple of questions:
  1. Is it possible that latency is due to that we have only three layer 
 hierarchy? like root - host - OSD, and as we are using the Straw (by 
 default) bucket type, which has O(N) speed, and if host number increase, 
 so that the computation actually increase. I suspect not as the 
 computation is in the order of microseconds per my understanding.
 
 I suspect this is very unlikely as well.
 
 
  2. Is it possible because we have more OSDs, the cluster will need to 
 maintain far more connections between OSDs which potentially slow things 
 down?
 
 One thing here that might be very interesting is this:
 
 After you run your tests, if you do something like:
 
 find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {}
 dump_historic_ops \; > foo
 
 on each OSD server, you will get a dump of the 10 slowest operations
 over the last 10 minutes for each OSD on each server, and it will tell
 you were in each OSD operations were backing up.  You can sort of search
 through these files by greping for duration first, looking for the
 long ones, and then going back and searching through the file for those
 long durations and looking at the associated latencies.
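
 A rough way to pull the worst offenders out of that file afterwards (a sketch; it
 assumes each dumped op carries a "duration" field, which may be quoted as a string
 on some releases):

   grep '"duration"' foo | tr -d '", ' | sort -t: -k2 -rn | head -20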
 
 Something I have been investigating recently is time spent waiting for
 osdmap propagation.  It's something I haven't had time to dig into
 meaningfully, but if we were to see that this was more significant on
 your larger cluster vs your smaller one, that would be very interesting
 news.
 
 
  3. Anything else i might miss?
 
 Thanks all for the constant help.
 
 Guang
 
 
 On Oct 22, 2013, at 10:22 PM, Guang Yang yguan...@yahoo.com 
 mailto:yguan...@yahoo.com wrote:
 
 Hi Kyle and Greg,
 I will get back to you with more details tomorrow, thanks for the 
 response.
 
 Thanks,
 Guang
 On Oct 22, 2013, at 9:37 AM, Kyle Bader kyle.ba...@gmail.com 
 mailto:kyle.ba...@gmail.com wrote:
 
 Besides what Mark and Greg said it could be due to additional hops 
 through network devices. What network devices are you using, what is 
 the network  topology and does your CRUSH map reflect the network 
 topology?
 
 On Oct 21, 2013 9:43 AM, Gregory Farnum g...@inktank.com 
 mailto:g...@inktank.com wrote:
 
On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang yguan...@yahoo.com
mailto:yguan...@yahoo.com wrote:
 Dear ceph-users,
 Recently I deployed a ceph cluster with RadosGW, from a small
one (24 OSDs) to a much bigger one (330 OSDs).
 
 When using rados bench to test the small cluster (24 OSDs), it
showed the average latency was around 3ms (object size is 5K),
while for the larger one (330 OSDs), the average latency was
around 7ms (object size 5K), twice comparing the small cluster.
 
 The OSD within the two cluster have the same configuration, SAS
disk,  and two partitions for one disk, one for journal and the
other for metadata.
 
 For PG numbers, the small cluster tested with the pool having
100 PGs, and for the large cluster, the pool has 4 PGs (as I
will to further scale the cluster, so I choose a much large PG).
 
 Does my test result make sense? Like when the PG number and OSD
increase, the latency might drop?
 
Besides what Mark said, can you describe your test in a little more
detail? Writing/reading, length of time, number of objects, etc.
-Greg
Software Engineer #42 @ http://inktank.com http://inktank.com/
| http://ceph.com http://ceph.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-22 Thread Guang Yang
Thanks Mark for the response. My comments inline...

From: Mark Nelson mark.nel...@inktank.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rados bench result when increasing OSDs
Message-ID: 52653b49.8090...@inktank.com
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 10/21/2013 09:13 AM, Guang Yang wrote:
 Dear ceph-users,

Hi!

 Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) 
 to a much bigger one (330 OSDs).
 
 When using rados bench to test the small cluster (24 OSDs), it showed the 
 average latency was around 3ms (object size is 5K), while for the larger one 
 (330 OSDs), the average latency was around 7ms (object size 5K), twice 
 comparing the small cluster.

Did you have the same number of concurrent requests going?
[yguang] Yes. I ran the test with 3 or 5 concurrent requests; that does not 
change the result.

 
 The OSD within the two cluster have the same configuration, SAS disk,  and 
 two partitions for one disk, one for journal and the other for metadata.
 
 For PG numbers, the small cluster tested with the pool having 100 PGs, and 
 for the large cluster, the pool has 4 PGs (as I will to further scale the 
 cluster, so I choose a much large PG).

Forgive me if this is a silly question, but were the pools using the 
same level of replication?
[yguang] Yes, both have 3 replicas.
 
 Does my test result make sense? Like when the PG number and OSD increase, the 
 latency might drop?

You wouldn't necessarily expect a larger cluster to show higher latency 
if the nodes, pools, etc were all configured exactly the same, 
especially if you were using the same amount of concurrency.  It's 
possible that you have some slow drives on the larger cluster that could 
be causing the average latency to increase.  If there are more disks per 
node, that could do it too.
[yguang] Glad to know this :) I will need to gather more information on whether 
there are any slow disks, and will get back on this.

Are there any other differences you can think of?
[yguang] Another difference is that for the large cluster, as we expect to scale it 
to more than a thousand OSDs, we have a large PG number (4) pre-created.

Thanks,
Guang___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-22 Thread Guang Yang
Hi Kyle and Greg,
I will get back to you with more details tomorrow, thanks for the response.

Thanks,
Guang
On Oct 22, 2013, at 9:37 AM, Kyle Bader kyle.ba...@gmail.com wrote:

 Besides what Mark and Greg said it could be due to additional hops through 
 network devices. What network devices are you using, what is the network  
 topology and does your CRUSH map reflect the network topology?
 
 On Oct 21, 2013 9:43 AM, Gregory Farnum g...@inktank.com wrote:
 On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang yguan...@yahoo.com wrote:
  Dear ceph-users,
  Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) 
  to a much bigger one (330 OSDs).
 
  When using rados bench to test the small cluster (24 OSDs), it showed the 
  average latency was around 3ms (object size is 5K), while for the larger 
  one (330 OSDs), the average latency was around 7ms (object size 5K), twice 
  comparing the small cluster.
 
  The OSD within the two cluster have the same configuration, SAS disk,  and 
  two partitions for one disk, one for journal and the other for metadata.
 
  For PG numbers, the small cluster tested with the pool having 100 PGs, and 
  for the large cluster, the pool has 4 PGs (as I will to further scale 
  the cluster, so I choose a much large PG).
 
  Does my test result make sense? Like when the PG number and OSD increase, 
  the latency might drop?
 
 Besides what Mark said, can you describe your test in a little more
 detail? Writing/reading, length of time, number of objects, etc.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rados bench result when increasing OSDs

2013-10-21 Thread Guang Yang
Dear ceph-users,
Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) to 
a much bigger one (330 OSDs).

When using rados bench to test the small cluster (24 OSDs), it showed that the 
average latency was around 3ms (object size is 5K), while for the larger one 
(330 OSDs) the average latency was around 7ms (object size 5K), twice that of 
the small cluster.

The OSDs within the two clusters have the same configuration: SAS disks, with two 
partitions per disk, one for the journal and the other for metadata.

For PG numbers, the small cluster was tested with a pool having 100 PGs, and for 
the large cluster the pool has 4 PGs (as I will further scale the cluster, I chose 
a much larger PG count).

Does my test result make sense? Like when the PG number and OSD increase, the 
latency might drop?

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy zap disk failure

2013-10-18 Thread Guang Yang
Thanks all for the recommendations. I worked around it by modifying ceph-deploy to 
use the full path for sgdisk.

Thanks,
Guang
On Oct 16, 2013, at 10:47 PM, Alfredo Deza alfredo.d...@inktank.com wrote:

 On Tue, Oct 15, 2013 at 9:19 PM, Guang yguan...@yahoo.com wrote:
 -bash-4.1$ which sgdisk
 /usr/sbin/sgdisk
 
 Which path does ceph-deploy use?
 
 That is unexpected... these are the paths that ceph-deploy uses:
 
 '/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin'
 
 So `/usr/sbin/` is there. I believe  this is a case where $PATH gets
 altered because of sudo (resetting the env variable).
 
 This should be fixed in the next release. In the meantime, you could
 set the $PATH for non-interactive sessions (which is what ceph-deploy
 does)
 for all users. I *think* that would be in `/etc/profile`
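 
 One way to do that on these RHEL boxes (a sketch; either approach should work,
 since sudo resetting the PATH is what usually drops /usr/sbin and /sbin here):
 
   echo 'export PATH=$PATH:/usr/sbin:/sbin' | sudo tee -a /etc/profile
   # or run `sudo visudo` and append :/usr/sbin:/sbin to the "Defaults secure_path" line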
 
 
 
 Thanks,
 Guang
 
 On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote:
 
 On Tue, Oct 15, 2013 at 10:52 AM, Guang yguan...@yahoo.com wrote:
 Hi ceph-users,
 I am trying with the new ceph-deploy utility on RHEL6.4 and I came across a
 new issue:
 
 -bash-4.1$ ceph-deploy --version
 1.2.7
 -bash-4.1$ ceph-deploy disk zap server:/dev/sdb
 [ceph_deploy.cli][INFO  ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap
 server:/dev/sdb
 [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server
 [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from
 remote host
 [ceph_deploy.osd][INFO  ] Distro info: Red Hat Enterprise Linux Server 6.4
 Santiago
 [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device
 [osd2.ceph.mobstor.bf1.yahoo.com][INFO  ] Running command: sudo sgdisk
 --zap-all --clear --mbrtogpt -- /dev/sdb
 [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found
 
 While I run disk zap on the host directly, it can work without issues.
 Anyone meet the same issue?
 
 Can you run `which sgdisk` on that host? I want to make sure this is
 not a $PATH problem.
 
 ceph-deploy tries to use the proper path remotely but it could be that
 this one is not there.
 
 
 
 Thanks,
 Guang
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy zap disk failure

2013-10-15 Thread Guang
-bash-4.1$ which sgdisk  
/usr/sbin/sgdisk

Which path does ceph-deploy use?

Thanks,
Guang

On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote:

 On Tue, Oct 15, 2013 at 10:52 AM, Guang yguan...@yahoo.com wrote:
 Hi ceph-users,
 I am trying with the new ceph-deploy utility on RHEL6.4 and I came across a
 new issue:
 
 -bash-4.1$ ceph-deploy --version
 1.2.7
 -bash-4.1$ ceph-deploy disk zap server:/dev/sdb
 [ceph_deploy.cli][INFO  ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap
 server:/dev/sdb
 [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server
 [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from
 remote host
 [ceph_deploy.osd][INFO  ] Distro info: Red Hat Enterprise Linux Server 6.4
 Santiago
 [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device
 [osd2.ceph.mobstor.bf1.yahoo.com][INFO  ] Running command: sudo sgdisk
 --zap-all --clear --mbrtogpt -- /dev/sdb
 [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found
 
 While I run disk zap on the host directly, it can work without issues.
 Anyone meet the same issue?
 
 Can you run `which sgdisk` on that host? I want to make sure this is
 not a $PATH problem.
 
 ceph-deploy tries to use the proper path remotely but it could be that
 this one is not there.
 
 
 
 Thanks,
 Guang
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph stats and monitoring

2013-10-09 Thread Guang
Hi,
Can someone share your experience with monitoring a Ceph cluster? How is it going 
with the work mentioned here: 
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/ceph_stats_and_monitoring_tools
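
For reference, a minimal sketch of built-in commands that can serve as a starting 
point (no web UI; the admin-socket path assumes the default layout and osd.0 is 
just an example):

  ceph health detail
  ceph -w        # follow cluster events as they happen
  ceph df        # per-pool storage usage
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump   # per-daemon counters, incl. op latencies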


Thanks,
Guang___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Expanding ceph cluster by adding more OSDs

2013-10-09 Thread Guang
Thanks Mike.

Is there any documentation for that?

Thanks,
Guang

On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote:

 You can add PGs,  the process is called splitting.  I don't think PG merging, 
 the reduction in the number of PGs, is ready yet.
 
 On Oct 8, 2013, at 11:58 PM, Guang yguan...@yahoo.com wrote:
 
 Hi ceph-users,
 Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas, per my 
 understanding, the number of PGs for a pool should be fixed even we scale 
 out / in the cluster by adding / removing OSDs, does that mean if we double 
 the OSD numbers, the PG number for a pool is not optimal any more and there 
 is no chance to correct it?
 
 
 Thanks,
 Guang
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Expanding ceph cluster by adding more OSDs

2013-10-09 Thread Guang
Thanks Mike. I get your point.

There are still a few things confusing me:
  1) We expand a Ceph cluster by adding more OSDs, which will trigger re-balancing 
of PGs across the old and new OSDs, and likely it will break the optimized PG 
number for the cluster.
   2) We can add more PGs, which will trigger re-balancing of objects across the 
old and new PGs.

So:
  1) What is the recommended way to expand the cluster by adding OSDs (and 
potentially adding PGs)? Should we do them at the same time?
  2) What is the recommended way to scale a cluster from, say, 1PB to 2PB? Should 
we scale it gradually, to 1.1PB then 1.2PB, or move to 2PB directly?

Thanks,
Guang

On Oct 10, 2013, at 11:10 AM, Michael Lowe wrote:

 There used to be, can't find it right now.  Something like 'ceph osd set 
 pg_num num' then 'ceph osd set pgp_num num' to actually move your data 
 into the new pg's.  I successfully did it several months ago, when bobtail 
 was current.
 
 Sent from my iPad
 
 On Oct 9, 2013, at 10:30 PM, Guang yguan...@yahoo.com wrote:
 
 Thanks Mike.
 
 Is there any documentation for that?
 
 Thanks,
 Guang
 
 On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote:
 
 You can add PGs,  the process is called splitting.  I don't think PG 
 merging, the reduction in the number of PGs, is ready yet.
 
 On Oct 8, 2013, at 11:58 PM, Guang yguan...@yahoo.com wrote:
 
 Hi ceph-users,
 Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas, per 
 my understanding, the number of PGs for a pool should be fixed even we 
 scale out / in the cluster by adding / removing OSDs, does that mean if we 
 double the OSD numbers, the PG number for a pool is not optimal any more 
 and there is no chance to correct it?
 
 
 Thanks,
 Guang
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph monitoring / stats and troubleshooting tools

2013-10-08 Thread Guang
Hi ceph-users,
After walking through the operations document, I still have several questions 
about operating / monitoring Ceph for which I need your help. Thanks!

1. Does Ceph provide a built-in monitoring mechanism for Rados and RadosGW? 
Taking Rados for example, is it possible to monitor the health / latency / 
storage on a regular basis, and ideally have a web UI?

2. One common troubleshooting requirement would be: given an object name, how do 
we locate the PG / OSD / physical file path for this object? Does Ceph provide 
such a utility?
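
For the second point, a minimal sketch of what can be done today (assuming a 
hypothetical pool named 'data' and object named 'myobject'; the on-disk path shown 
is the default FileStore layout):

  ceph osd map data myobject        # prints the PG id plus the up/acting OSD set
  # the object then lives under /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head/ on those OSDs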

Thanks,
Guang___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Expanding ceph cluster by adding more OSDs

2013-10-08 Thread Guang
Hi ceph-users,
Ceph recommends that the PG number of a pool be (100 * OSDs) / Replicas. Per my 
understanding, the number of PGs for a pool should stay fixed even when we scale 
the cluster out / in by adding / removing OSDs; does that mean that if we double 
the OSD count, the PG number for a pool is no longer optimal and there is no 
chance to correct it?
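
For a concrete example of the guideline, and of correcting pg_num later via PG 
splitting (a sketch; 'data' is a hypothetical pool name and the numbers are 
illustrative):

  # e.g. 330 OSDs * 100 / 3 replicas = 11000, usually rounded up to a power of two
  ceph osd pool set data pg_num 16384
  ceph osd pool set data pgp_num 16384   # pgp_num actually moves the data into the new PGs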


Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy issues on RHEL6.4

2013-09-27 Thread Guang
Hi ceph-users,
I recently deployed a ceph cluster with use of *ceph-deploy* utility, on 
RHEL6.4, during the time, I came across a couple of issues / questions which I 
would like to ask for your help.

1. ceph-deploy does not help to install the dependencies (snappy, leveldb, gdisk, 
python-argparse, gperftools-libs) on the target host, so I need to install those 
dependencies manually before performing 'ceph-deploy install {host_name}'. I am 
investigating how to deploy Ceph onto a hundred nodes, and it is time-consuming to 
install those dependencies by hand. Am I missing something here? I am thinking the 
dependency installation should be handled by *ceph-deploy* itself.
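
In the meantime, a one-liner that pre-installs the packages listed above on each 
RHEL target (a sketch; it assumes the required yum repositories are already 
configured on the host):

  sudo yum install -y snappy leveldb gdisk python-argparse gperftools-libs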

2. When performing 'ceph-deploy -v disk zap ceph.host.name:/dev/sdb', I have 
the following errors:
[ceph_deploy.osd][DEBUG ] zapping /dev/sdc on ceph.host.name
[ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo
Traceback (most recent call last):
  File /usr/bin/ceph-deploy, line 21, in module
sys.exit(main())
  File /usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py, line 
83, in newfunc
return f(*a, **kw)
  File /usr/lib/python2.6/site-packages/ceph_deploy/cli.py, line 147, in main
return args.func(args)
  File /usr/lib/python2.6/site-packages/ceph_deploy/osd.py, line 381, in disk
disk_zap(args)
  File /usr/lib/python2.6/site-packages/ceph_deploy/osd.py, line 317, in 
disk_zap
zap_r(disk)
  File /usr/lib/python2.6/site-packages/pushy/protocol/proxy.py, line 255, in 
lambda
(conn.operator(type_, self, args, kwargs))
  File /usr/lib/python2.6/site-packages/pushy/protocol/connection.py, line 
66, in operator
return self.send_request(type_, (object, args, kwargs))
  File /usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py, 
line 329, in send_request
return self.__handle(m)
  File /usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py, 
line 645, in __handle
raise e
pushy.protocol.proxy.ExceptionProxy: [Errno 2] No such file or directory 

And then I logon to the host to perform 'ceph-disk zap /dev/sdb' and it can be 
successful without any issues.

3. When performing 'ceph-deploy -v disk activate  ceph.host.name:/dev/sdb', I 
have the following errors:
ceph_deploy.osd][DEBUG ] Activating cluster ceph disks ceph.host.name:/dev/sdb:
[ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo
[ceph_deploy.osd][DEBUG ] Activating host ceph.host.name disk /dev/sdb
[ceph_deploy.osd][DEBUG ] Distro RedHatEnterpriseServer codename Santiago, will 
use sysvinit
Traceback (most recent call last):
  File /usr/bin/ceph-deploy, line 21, in module
sys.exit(main())
  File /usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py, line 
83, in newfunc
return f(*a, **kw)
  File /usr/lib/python2.6/site-packages/ceph_deploy/cli.py, line 147, in main
return args.func(args)
  File /usr/lib/python2.6/site-packages/ceph_deploy/osd.py, line 379, in disk
activate(args, cfg)
  File /usr/lib/python2.6/site-packages/ceph_deploy/osd.py, line 271, in 
activate
cmd=cmd, ret=ret, out=out, err=err)
NameError: global name 'ret' is not defined

Also, I logon to the host to perform 'ceph-disk activate /dev/sdb' and it is 
good.

Any help is appreciated.

Thanks,
Guang___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph deployment issue in physical hosts

2013-09-25 Thread Guang
Hi ceph-users,
I deployed a cluster successfully in VMs, and today I tried to deploy a cluster 
in physical nodes. However, I came across a problem when I started creating a 
monitor.

-bash-4.1$ ceph-deploy mon create x
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts 
[ceph_deploy.mon][DEBUG ] detecting platform for host web2 ...
[ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo
[ceph_deploy.mon][INFO  ] distro info: RedHatEnterpriseServer 6.4 Santiago
[web2][DEBUG ] determining if provided host has same hostname in remote
[web2][DEBUG ] deploying mon to web2
[web2][DEBUG ] remote hostname: web2
[web2][INFO  ] write cluster configuration to /etc/ceph/{cluster}.conf
[web2][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-web2/done
[web2][INFO  ] create a done file to avoid re-doing the mon deployment
[web2][INFO  ] create the init path if it does not exist
[web2][INFO  ] locating `service` executable...
[web2][INFO  ] found `service` executable: /sbin/service
ssh: Could not resolve hostname web2: Name or service not known
Traceback (most recent call last):
  File /usr/bin/ceph-deploy, line 21, in module
sys.exit(main())
  File /usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py, line 
83, in newfunc
return f(*a, **kw)
  File /usr/lib/python2.6/site-packages/ceph_deploy/cli.py, line 147, in main
return args.func(args)
  File /usr/lib/python2.6/site-packages/ceph_deploy/mon.py, line 246, in mon
mon_create(args)
  File /usr/lib/python2.6/site-packages/ceph_deploy/mon.py, line 105, in 
mon_create
distro.mon.create(distro, rlogger, args, monitor_keyring)
  File 
/usr/lib/python2.6/site-packages/ceph_deploy/hosts/centos/mon/create.py, line 
15, in create
rconn = get_connection(hostname, logger)
  File /usr/lib/python2.6/site-packages/ceph_deploy/connection.py, line 13, 
in get_connection
sudo=needs_sudo(),
  File /usr/lib/python2.6/site-packages/ceph_deploy/lib/remoto/connection.py, 
line 12, in __init__
self.gateway = execnet.makegateway('ssh=%s' % hostname)
  File 
/usr/lib/python2.6/site-packages/ceph_deploy/lib/remoto/lib/execnet/multi.py, 
line 89, in makegateway
gw = gateway_bootstrap.bootstrap(io, spec)
  File 
/usr/lib/python2.6/site-packages/ceph_deploy/lib/remoto/lib/execnet/gateway_bootstrap.py,
 line 70, in bootstrap
bootstrap_ssh(io, spec)
  File 
/usr/lib/python2.6/site-packages/ceph_deploy/lib/remoto/lib/execnet/gateway_bootstrap.py,
 line 42, in bootstrap_ssh
raise HostNotFound(io.remoteaddress)
execnet.gateway_bootstrap.HostNotFound: web2

Does anyone come across the same issue? Looks like I mis-configured the network 
environment?

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph deployment issue in physical hosts

2013-09-25 Thread Guang
Thanks Wolfgang.

-bash-4.1$ ping web2
PING web2 (10.193.244.209) 56(84) bytes of data.
64 bytes from web2 (10.193.244.209): icmp_seq=1 ttl=64 time=0.505 ms
64 bytes from web2 (10.193.244.209): icmp_seq=2 ttl=64 time=0.194 ms
...

[I omit part of the host name].

It can ping the host, and I actually used ceph-deploy to install ceph onto 
the web2 remote host…

Thanks,
Guang


Date: Wed, 25 Sep 2013 10:29:14 +0200
From: Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph deployment issue in physical hosts
Message-ID: 52429eda.8070...@risc-software.at
Content-Type: text/plain; charset=ISO-8859-1



On 09/25/2013 10:03 AM, Guang wrote:
 Hi ceph-users,
 I deployed a cluster successfully in VMs, and today I tried to deploy a 
 cluster in physical nodes. However, I came across a problem when I started 
 creating a monitor.
 
 -bash-4.1$ ceph-deploy mon create x

 ssh: Could not resolve hostname web2: Name or service not known
 Does anyone come across the same issue? Looks like I mis-configured the 
 network environment?

The machine you run ceph-deploy on doesn't know who web2 is. If this
command succeeds: ping web2 then ceph deploy will at least be able to
contact that host.

hint: look at your /etc/hosts file.
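
For example, a quick way to add the mapping on the admin host (using the address 
from the ping output above; adjust if your web2 resolves differently):

  echo "10.193.244.209  web2" | sudo tee -a /etc/hosts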

 Thanks,
 Guang

Wolfgang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph deployment issue in physical hosts

2013-09-25 Thread Guang
Thanks for the reply!

I don't know the root cause, but I worked around this issue by adding a new entry 
to /etc/hosts with something like 'web2   {ip_address_of_web2}', and it works.

I am not sure if that is due to some misconfiguration on my end of the deployment 
script; I will investigate further.

Thanks all for the help!

Guang

On Sep 25, 2013, at 8:38 PM, Alfredo Deza wrote:

 On Wed, Sep 25, 2013 at 5:08 AM, Guang yguan...@yahoo.com wrote:
 Thanks Wolfgang.
 
 -bash-4.1$ ping web2
 PING web2 (10.193.244.209) 56(84) bytes of data.
 64 bytes from web2 (10.193.244.209): icmp_seq=1 ttl=64 time=0.505 ms
 64 bytes from web2 (10.193.244.209): icmp_seq=2 ttl=64 time=0.194 ms
 ...
 
 [I omit part of the host name].
 
 It can ping to the host and I actually used ceph-deploy to install ceph onto
 the web2 remote host…
 
 
 This is very unexpected, it most definitely sounds like at some point
 web2 is not resolvable (as the
 error says) but you are also right in that you initiate the deployment
 correctly with ceph-deploy doing work
 on the remote end.
 
 Are you able to SSH directly to this host from where you are executing
 ceph-deploy? (same user/login)
 
 
 
 Thanks,
 Guang
 
 
 Date: Wed, 25 Sep 2013 10:29:14 +0200
 From: Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Ceph deployment issue in physical hosts
 Message-ID: 52429eda.8070...@risc-software.at
 Content-Type: text/plain; charset=ISO-8859-1
 
 
 
 
 On 09/25/2013 10:03 AM, Guang wrote:
 
 Hi ceph-users,
 
 I deployed a cluster successfully in VMs, and today I tried to deploy a
 cluster in physical nodes. However, I came across a problem when I started
 creating a monitor.
 
 
 -bash-4.1$ ceph-deploy mon create x
 
 
 
 ssh: Could not resolve hostname web2: Name or service not known
 
 Does anyone come across the same issue? Looks like I mis-configured the
 network environment?
 
 
 The machine you run ceph-deploy on doesn't know who web2 is. If this
 command succeeds: ping web2 then ceph deploy will at least be able to
 contact that host.
 
 hint: look at your /etc/hosts file.
 
 Thanks,
 
 Guang
 
 
 Wolfgang
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph / RadosGW deployment questions

2013-09-24 Thread Guang
Hi ceph-users,
I deployed a Ceph cluster (including RadosGW) with use of ceph-deploy on 
RHEL6.4, during the deployment, I have a couple of questions which need your 
help.

1. I followed the steps at http://ceph.com/docs/master/install/rpm/ to deploy the 
RadosGW node; however, after the deployment, all requests failed with 500 returned. 
With some hints from 
http://irclogs.ceph.widodh.nl/index.php?date=2013-01-25, I changed 
FastCgiExternalServer to FastCgiServer within rgw.conf. Is this change valid, or 
did I miss something elsewhere that led to the need for this change?

2. It still did not work, and httpd has the following error log:
[Mon Sep 23 07:34:32 2013] [crit] (98)Address already in use: FastCGI: 
can't create server /var/www/s3gw.fcgi: bind() failed [/tmp/radosgw.sock]
which indicates that radosgw was not started properly. So I manually ran 
radosgw --rgw-socket-path=/tmp/radosgw.sock -c /etc/ceph/ceph.conf -n 
client.radosgw.gateway to start a radosgw daemon, and then the gateway started 
working as expected.
Did I miss anything in this part?

3. When I was trying to run the ceph admin-daemon command on the radosGW host, it 
failed because the host does not have the corresponding asok file; however, I am 
able to run the command on the monitor host, and found that the radosGW's 
information can be retrieved there.

@monitor (monitor and gateway are deployed on different hosts).
[xxx@startbart ceph]$ sudo ceph --admin-daemon 
/var/run/ceph/ceph-mon.startbart.asok config show | grep rgw
  rgw: 1\/5,
  rgw_data: \/var\/lib\/ceph\/radosgw\/ceph-startbart,
  rgw_enable_apis: s3, swift, swift_auth, admin,
  rgw_cache_enabled: true,
  rgw_cache_lru_size: 1,
  rgw_socket_path: ,
  rgw_host: ,
  rgw_port: ,
  rgw_dns_name: ,
  rgw_script_uri: ,
  rgw_request_uri: ,
  rgw_swift_url: ,
  rgw_swift_url_prefix: swift,
  rgw_swift_auth_url: ,
  rgw_swift_auth_entry: auth,
  rgw_keystone_url: ,
  rgw_keystone_admin_token: ,
  rgw_keystone_accepted_roles: Member, admin,
  rgw_keystone_token_cache_size: 1,
  rgw_keystone_revocation_interval: 900,
  rgw_admin_entry: admin,
  rgw_enforce_swift_acls: true,
  rgw_swift_token_expiration: 86400,
  rgw_print_continue: true,
  rgw_remote_addr_param: REMOTE_ADDR,
  rgw_op_thread_timeout: 600,
  rgw_op_thread_suicide_timeout: 0,
  rgw_thread_pool_size: 100,
Is this expected?

4. cephx authentication. After reading through the cephx introduction, I got the 
feeling that cephx is for client-to-cluster authentication, so each librados user 
will need to create a new key. However, this page 
http://ceph.com/docs/master/rados/operations/authentication/#enabling-cephx got 
me confused: why should we create keys for the mon and osd daemons? And how does 
that fit into the authentication diagram? BTW, I found the keyrings under 
/var/lib/ceph/{role}/ for each role; are they used when the daemons talk to other 
roles?

Thanks,
Guang 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deploy a Ceph cluster to play around with

2013-09-16 Thread Guang
Hello ceph-users, ceph-devel,
Nice to meet you in the community!
Today I tried to deploy a Ceph cluster to play around with the API, and during 
the deployment I have a couple of questions which may need your help:
  1) How many hosts do I need if I want to deploy a cluster with RadosGW (so that 
I can try the S3 API)? Is it 3 OSD + 1 Mon + 1 GW = 5 hosts at minimum?

  2) I have a list of hardware; however, my host only has 1 disk with two 
partitions, one for boot and another for LVM members. Is it possible to deploy 
an OSD on such hardware (e.g. make a partition with ext4), or will I need 
another disk to do so? The disk listing is below, and a rough sketch of what I 
have in mind follows it.

-bash-4.1$ ceph-deploy disk list myserver.com
[ceph_deploy.osd][INFO  ] Distro info: RedHatEnterpriseServer 6.3 Santiago
[ceph_deploy.osd][DEBUG ] Listing disks on myserver.com...
[repl101.mobstor.gq1.yahoo.com][INFO  ] Running command: ceph-disk list
[repl101.mobstor.gq1.yahoo.com][INFO  ] /dev/sda :
[repl101.mobstor.gq1.yahoo.com][INFO  ]  /dev/sda1 other, ext4, mounted on /boot
[repl101.mobstor.gq1.yahoo.com][INFO  ]  /dev/sda2 other, LVM2_member
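If it is possible at all, I imagine it would look roughly like this with 
ceph-deploy, using a carved-out LVM volume mounted as a directory (vg0, the size 
and the paths are placeholders):

lvcreate -L 100G -n ceph-osd0 vg0        # assumes free space in the volume group
mkfs -t xfs /dev/vg0/ceph-osd0
mkdir -p /var/local/osd0
mount /dev/vg0/ceph-osd0 /var/local/osd0
ceph-deploy osd prepare myserver.com:/var/local/osd0
ceph-deploy osd activate myserver.com:/var/local/osd0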

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-20 Thread Guang Yang
Then that makes total sense to me.

Thanks,
Guang



 From: Mark Kirkwood mark.kirkw...@catalyst.net.nz
To: Guang Yang yguan...@yahoo.com 
Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com 
Sent: Tuesday, August 20, 2013 1:19 PM
Subject: Re: [ceph-users] Usage pattern and design of Ceph
 

On 20/08/13 13:27, Guang Yang wrote:
 Thanks Mark.

 What are the design considerations for breaking large files into 4M chunks
 rather than storing the large file directly?



Quoting Wolfgang from previous reply:

= which is a good thing in terms of replication and OSD usage
distribution


...which covers what I would have said quite well :-)

Cheers

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-20 Thread Guang Yang
Thanks Greg.

The typical case is going to depend quite a lot on your scale.
[Guang] I am thinking of a scale of billions of objects, with sizes from several 
KB to several MB; my concern is the cache efficiency for such a use case.

That said, I'm not sure why you'd want to use CephFS for a small-object store 
when you could just use raw RADOS, and avoid all the posix overheads. Perhaps 
I've misunderstood your use case?
[Guang] No, you haven't. That is my use case :) I am also thinking of using RADOS 
directly without the POSIX layer on top, but before that I want to consider 
each option we have and compare the pros / cons.
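For concreteness, the raw RADOS path I have in mind is simply something like the 
following (the pool and object names are made up):

rados mkpool photos
rados -p photos put img_0001.jpg ./img_0001.jpg
rados -p photos get img_0001.jpg /tmp/img_0001.jpg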

Thanks,
Guang



 From: Gregory Farnum g...@inktank.com
To: Guang Yang yguan...@yahoo.com 
Cc: Gregory Farnum g...@inktank.com; ceph-us...@ceph.com 
ceph-us...@ceph.com 
Sent: Tuesday, August 20, 2013 9:51 AM
Subject: Re: [ceph-users] Usage pattern and design of Ceph
 


On Monday, August 19, 2013, Guang Yang  wrote:

Thanks Greg.


Some comments inline...


On Sunday, August 18, 2013, Guang Yang  wrote:

Hi ceph-users,
This is Guang and I am pretty new to ceph, glad to meet you guys in the 
community!


After walking through some documents of Ceph, I have a couple of questions:
  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
to handle different work-loads (from KB to GB), with corresponding 
performance report?


Not really; any comparison would be highly biased depending on your Amazon 
ping and your Ceph cluster. We've got some internal benchmarks where Ceph 
looks good, but they're not anything we'd feel comfortable publishing.
 [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact 
on the comparison.
  2. Looking at some industry solutions for distributed storage, GFS / 
Haystack / HDFS all use a meta-server to keep the logical-to-physical mapping 
in memory and avoid disk I/O lookups when reading files; is that concern valid 
for Ceph (in terms of latency to read a file)?


These are very different systems. Thanks to CRUSH, RADOS doesn't need to do 
any IO to find object locations; CephFS only does IO if the inode you request 
has fallen out of the MDS cache (not terribly likely in general). This 
shouldn't be an issue...
[Guang] Regarding "CephFS only does IO if the inode you request has fallen out 
of the MDS cache": my understanding is that if we use CephFS, we will need to 
interact with RADOS twice, the first time to retrieve metadata (file 
attributes, owner, etc.) and the second time to load data, and both times will 
need disk I/O (for the inode and for the data). Is my understanding correct? 
The way some other storage systems handle this is to cache the file handle in 
memory, so that they can avoid the I/O to read the inode in.

In the worst case this can happen with CephFS, yes. However, the client is not 
accessing metadata directly; it's going through the MetaData Server, which 
caches (lots of) metadata on its own, and the client can get leases as well (so 
it doesn't need to go to the MDS for each access, and can cache information on 
its own). The typical case is going to depend quite a lot on your scale.
That said, I'm not sure why you'd want to use CephFS for a small-object store 
when you could just use raw RADOS, and avoid all the posix overheads. Perhaps 
I've misunderstood your use case?
-Greg

 
 
  3. Some industry research shows that one issue with file systems is the 
metadata-to-data ratio, in terms of both access and storage, and some 
techniques combine small files into large physical files to reduce the ratio 
(Haystack, for example). If we want to use Ceph to store photos, should this 
be a concern, given that Ceph uses one physical file per object?


...although this might be. The issue basically comes down to how many disk 
seeks are required to retrieve an item, and one way to reduce that number is 
to hack the filesystem by keeping a small number of very large files and 
calculating (or caching) where different objects are inside that file. Since 
Ceph is designed for MB-sized objects it doesn't go to these lengths to 
optimize that path like Haystack might (I'm not familiar with Haystack in 
particular).
That said, you need some pretty extreme latency requirements before this 
becomes an issue and if you're also looking at HDFS or S3 I can't imagine 
you're in that ballpark. You should be fine. :)
[Guang] Yep, that makes a lot sense.
-Greg

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com




-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Hi ceph-users,

This is Guang and I am pretty new to ceph, glad to meet you guys in the 
community!

After walking through some documents of Ceph, I have a couple of questions:
  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
to handle different work-loads (from KB to GB), with corresponding performance 
reports?
  2. Looking at some industry solutions for distributed storage, GFS / Haystack 
/ HDFS all use a meta-server to keep the logical-to-physical mapping in memory 
and avoid disk I/O lookups when reading files; is that concern valid for Ceph 
(in terms of latency to read a file)?
  3. Some industry research shows that one issue with file systems is the 
metadata-to-data ratio, in terms of both access and storage, and some 
techniques combine small files into large physical files to reduce the ratio 
(Haystack, for example). If we want to use Ceph to store photos, should this be 
a concern, given that Ceph uses one physical file per object?

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deploy Ceph on RHEL6.4

2013-08-19 Thread Guang Yang
Hi ceph-users,
I would like to check whether there is any manual / set of steps that would let 
me try deploying Ceph on RHEL.
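In case it helps frame the question, the rough flow I would expect, based on the 
generic quick start (the host names are placeholders, and the RHEL-specific 
repo/package setup is exactly the part I am unsure about):

ceph-deploy new mon1
ceph-deploy install mon1 osd1 osd2 osd3
ceph-deploy mon create mon1
ceph-deploy gatherkeys mon1
ceph-deploy osd create osd1:sdb osd2:sdb osd3:sdb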

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Thanks Mark.

What are the design considerations for breaking large files into 4M chunks 
rather than storing the large file directly?

Thanks,
Guang



 From: Mark Kirkwood mark.kirkw...@catalyst.net.nz
To: Guang Yang yguan...@yahoo.com 
Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com 
Sent: Monday, August 19, 2013 5:18 PM
Subject: Re: [ceph-users] Usage pattern and design of Ceph
 

On 19/08/13 18:17, Guang Yang wrote:

    3. Some industry research shows that one issue with file systems is the
 metadata-to-data ratio, in terms of both access and storage, and some
 techniques combine small files into large physical files to reduce the ratio
 (Haystack, for example). If we want to use Ceph to store photos, should this
 be a concern, given that Ceph uses one physical file per object?

If you use Ceph as a pure object store, and get and put data via the 
basic rados api then sure, one client data object will be stored in one 
Ceph 'object'. However if you use rados gateway (S3 or Swift look-alike 
api) then each client data object will be broken up into chunks at the 
rados level (typically 4M sized chunks).
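A quick, hedged way to see this chunking (the pool name below is the common 
default for the gateway's bucket data and may differ per install): upload a 
largish object through the gateway, then list the backing pool:

rados -p .rgw.buckets ls | head
# instead of one big rados object there should be several ~4M pieces
# (a head object plus shadow/multipart objects) for that single upload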


Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Thanks Greg.

Some comments inline...

On Sunday, August 18, 2013, Guang Yang  wrote:

Hi ceph-users,
This is Guang and I am pretty new to ceph, glad to meet you guys in the 
community!


After walking through some documents of Ceph, I have a couple of questions:
  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
to handle different work-loads (from KB to GB), with corresponding performance 
report?

Not really; any comparison would be highly biased depending on your Amazon ping 
and your Ceph cluster. We've got some internal benchmarks where Ceph looks 
good, but they're not anything we'd feel comfortable publishing.
 [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact 
on the comparison.
  2. Looking at some industry solutions for distributed storage, GFS / Haystack 
/ HDFS all use a meta-server to keep the logical-to-physical mapping in memory 
and avoid disk I/O lookups when reading files; is that concern valid for Ceph 
(in terms of latency to read a file)?

These are very different systems. Thanks to CRUSH, RADOS doesn't need to do any 
IO to find object locations; CephFS only does IO if the inode you request has 
fallen out of the MDS cache (not terribly likely in general). This shouldn't be 
an issue...
[Guang] Regarding "CephFS only does IO if the inode you request has fallen out 
of the MDS cache": my understanding is that if we use CephFS, we will need to 
interact with RADOS twice, the first time to retrieve metadata (file 
attributes, owner, etc.) and the second time to load data, and both times will 
need disk I/O (for the inode and for the data). Is my understanding correct? 
The way some other storage systems handle this is to cache the file handle in 
memory, so that they can avoid the I/O to read the inode in.
 
  3. Some industry research shows that one issue with file systems is the 
metadata-to-data ratio, in terms of both access and storage, and some techniques 
combine small files into large physical files to reduce the ratio (Haystack, for 
example). If we want to use Ceph to store photos, should this be a concern, 
given that Ceph uses one physical file per object?

...although this might be. The issue basically comes down to how many disk 
seeks are required to retrieve an item, and one way to reduce that number is to 
hack the filesystem by keeping a small number of very large files and 
calculating (or caching) where different objects are inside that file. Since 
Ceph is designed for MB-sized objects it doesn't go to these lengths to 
optimize that path like Haystack might (I'm not familiar with Haystack in 
particular).
That said, you need some pretty extreme latency requirements before this 
becomes an issue and if you're also looking at HDFS or S3 I can't imagine 
you're in that ballpark. You should be fine. :)
[Guang] Yep, that makes a lot sense.
-Greg

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usage pattern and design of Ceph

2013-08-18 Thread Guang Yang
Hi ceph-users,
This is Guang and I am pretty new to ceph, glad to meet you guys in the 
community!

After walking through some documents of Ceph, I have a couple of questions:
  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
to handle different work-loads (from KB to GB), with corresponding performance 
reports?
  2. Looking at some industry solutions for distributed storage, GFS / Haystack 
/ HDFS all use a meta-server to keep the logical-to-physical mapping in memory 
and avoid disk I/O lookups when reading files; is that concern valid for Ceph 
(in terms of latency to read a file)?
  3. Some industry research shows that one issue with file systems is the 
metadata-to-data ratio, in terms of both access and storage, and some 
techniques combine small files into large physical files to reduce the ratio 
(Haystack, for example). If we want to use Ceph to store photos, should this be 
a concern, given that Ceph uses one physical file per object?

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD shared between clients

2013-05-02 Thread Yudong Guang
Thank you, Gandalf and Igor. I intuitively feel that building one cluster on top 
of another is not appropriate. Maybe I should give RadosGW a try first.

On Thu, May 2, 2013 at 3:00 AM, Igor Laskovy igor.lask...@gmail.com wrote:

 Or maybe in case the hosting purposes easier implement RadosGW.




-- 
Yudong Guang
guangyudongb...@gmail.com
786-554-3993
+86-138-1174-5701
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD shared between clients

2013-05-01 Thread Yudong Guang
Hi,

I've been trying to use block device recently. I have a running cluster
with 2 machines and 3 OSDs.

On a client machine, let's say A, I created an rbd image using `rbd create`,
then formatted, mounted and wrote something in it, and everything was working
fine.

However, a problem occurred when I tried to use this image on the other
client, let's say B, on which I mapped the same image that was created on A. I
found that changes made on either client were not visible on the other, but if
I unmap the device and then map it again, the changes do show up.
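For reference, roughly the sequence of commands involved (a sketch with 
placeholder image and mount point names, not my exact invocation):

# on client A:
rbd create shared-img --size 10240    # 10 GB image in the default rbd pool
rbd map shared-img                    # shows up as /dev/rbd0 here
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt && echo hello > /mnt/test.txt

# on client B:
rbd map shared-img
mount /dev/rbd0 /mnt                  # changes made on A did not show up here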

I tested the same thing with CephFS, and there was no such problem: every
change made on one client is visible on the other client instantly.

I wonder whether this kind of behavior of RADOS block device is normal or
not. Is there any way that we can read and write on the same image on
multiple clients?

Any idea is appreciated.

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com