Re: [ceph-users] Long peering - throttle at FileStore::queue_transactions

2016-01-05 Thread Guang Yang
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil  wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I have a question regarding long PG peering.
>>
>> Over the last several days I have been looking into the *long peering*
>> problem when we start an OSD / OSD host. What I observed was that the
>> two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), thus the peering process was
>> dramatically slowed down.
>>
>> The first question that came to me was: what were the transactions in the
>> queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_map. This happens if the OSD had
>> been down for a while (in a large cluster), or when the cluster got
>> upgraded, which left the osd_map epoch the down OSD had far behind
>> the latest osd_map epoch. During OSD booting, it would need to
>> persist all those osd_maps and generate lots of filestore transactions
>> (linear in the epoch gap).
>> > As the PG was not involved in most of those epochs, could we only take and
>> > persist those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists.  There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot.  If you're
> seeing this in the real world, we probably just need to adjust that value
> way down to something small(er).
It would queue the transactions and then send out the MOSDBoot, thus
there is still a chance that it could have contention with the peering
OPs (especially on large clusters where there is a lot of activity
generating many osdmap epochs). Any chance we can change
*queue_transactions* to *apply_transactions*, so that we block there
waiting for the osdmap to be persisted? At least we may be able to
do that during OSD booting. The concern is that if the OSD is active,
apply_transaction would take longer while holding the osd_lock.
I can't find such a tuning, could you elaborate? Thanks!
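(For reference, a hedged way to see how far behind an OSD's maps are, to judge whether this epoch-gap case applies; the osd id and socket path below are only examples:)

$ ceph osd dump | grep ^epoch        # current cluster osdmap epoch
$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status   # oldest_map / newest_map known to this OSD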
>
> sage
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Guang Yang
Hi Cephers,
Happy New Year! I have a question regarding long PG peering.

Over the last several days I have been looking into the *long peering*
problem when we start an OSD / OSD host. What I observed was that the
two peering worker threads were throttled (stuck) when trying to
queue new transactions (writing the pg log), thus the peering process was
dramatically slowed down.

The first question that came to me was: what were the transactions in the
queue? The major ones, as I saw, included:

- The osd_map and incremental osd_map. This happens if the OSD had
been down for a while (in a large cluster), or when the cluster got
upgraded, which left the osd_map epoch the down OSD had far behind
the latest osd_map epoch. During OSD booting, it would need to
persist all those osd_maps and generate lots of filestore transactions
(linear in the epoch gap).
> As the PG was not involved in most of those epochs, could we only take and
> persist those osd_maps which matter to the PGs on the OSD?

- There are lots of deletion transactions: as the PG boots, it
needs to merge the PG log from its peers, and for each deletion PG log
entry, it would need to queue the deletion transaction immediately.
> Could we delay queuing those transactions until all PGs on the host are
> peered?
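(Side note: the throttle we hit in FileStore::queue_transactions is governed by the filestore queue settings; a hedged way to inspect them and bump them for an experiment follows, where the osd id and values are examples only:)

$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore_queue
$ ceph tell osd.0 injectargs '--filestore_queue_max_ops 500 --filestore_queue_max_bytes 1048576000'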

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD disk replacement best practise

2014-08-14 Thread Guang Yang
Hi cephers,
Most recently I have been drafting the run books for OSD disk replacement. I think the
rule of thumb is to reduce data migration (recovery/backfill), and I thought the
following procedure should achieve that purpose:
  1. ceph osd out osd.XXX (mark it out to trigger data migration)
  2. ceph osd rm osd.XXX
  3. ceph auth rm osd.XXX
  4. provision a new OSD which will take XXX as the OSD id and migrate data 
back.

With the above procedure, the crush weight of the host never changes, so that we
can limit the data migration to only what is necessary.
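(A hedged sketch of the same sequence for a concrete example osd.12; the daemon stop step depends on your init system and the re-provisioning step on your deployment tooling:)

$ ceph osd out osd.12
  # wait for recovery to finish, stop the ceph-osd daemon, then swap the disk
$ ceph osd rm osd.12
$ ceph auth del osd.12
  # keep the CRUSH entry (no 'ceph osd crush remove'), so the host weight stays the same
$ ceph osd create        # should hand back the lowest free id, i.e. 12 again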

Does it make sense?

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rgw geo-replication to another data store?

2014-07-17 Thread Guang Yang
Hi cephers,
We are investigating a backup solution for Ceph; in short, we would like a
solution to back up a Ceph cluster to another data store (not a Ceph cluster;
assume it has a SWIFT API). We would like to have both full backups and
incremental backups on top of the full backup.

After going through the geo-replication blueprint [1], I am thinking that we
can leverage that effort and, instead of replicating the data into another Ceph
cluster, make it replicate to another data store. At the same time, I have a
couple of questions which need your help:

1) How does the radosgw-agent scale to multiple hosts? Our first investigation
shows it only works on a single host, but I would like to confirm.
2) Can we configure the interval for incremental backups, like 1 hour / 1 day
/ 1 month?

[1] 
https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ask a performance question for the RGW

2014-06-30 Thread Guang Yang
On Jun 30, 2014, at 3:59 PM, baijia...@126.com wrote:

> Hello,
> thanks for answering the question.
> But even when there are fewer than 50 thousand objects, the latency is very big. I
> see, for the write ops on the bucket index object, that going from
> "journaled_completion_queue" to "op_commit" costs 3.6 seconds; this means that
> from "writing journal finished" to "op_commit" costs 3.6 seconds.
> So I can't understand this; what happened?
The operations updating the same bucket index object get serialized; one
possibility is that those operations were hanging there waiting for other ops
to finish their work.
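(For the application-level sharding workaround mentioned in the quoted reply below, a minimal hedged shell sketch of spreading keys across N pre-created buckets; the names and N are made up:)

  # pick one of N buckets by hashing the object key, so no single bucket index takes all the writes
  N=64
  key="user123/photo456.jpg"
  idx=$(( 0x$(printf '%s' "$key" | md5sum | cut -c1-7) % N ))
  bucket="images-shard-$idx"
  echo "$key -> $bucket"

The point is only that the write load on the bucket index objects gets spread roughly evenly; any stable hash works.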
>  
> thanks
> baijia...@126.com
>  
> From: Guang Yang
> Sent: 2014-06-30 14:57
> To: baijiaruo
> Cc: ceph-users
> Subject: Re: [ceph-users] Ask a performance question for the RGW
> Hello,
> There is a known limitation of bucket scalability, and there is a blueprint 
> tracking it - 
> https://wiki.ceph.com/Planning/Blueprints/Submissions/rgw%3A_bucket_index_scalability.
>  
> For the time being, I would recommend doing sharding at the application level
> (create multiple buckets) to work around this limitation.
>  
> Thanks,
> Guang
>  
> On Jun 30, 2014, at 2:54 PM, baijia...@126.com wrote:
>  
> > 
> > hello, everyone!
> > 
> > when I use rest-bench to test RGW performance, the cmd is:
> > ./rest-bench --access-key=ak --secret=sk --bucket=bucket_name --seconds=600 
> > -t 200 -b 524288 -no-cleanup write
> > 
> > test result:
> > Total time run: 362.962324
> > Total writes made: 48189
> > Write size: 524288
> > Bandwidth (MB/sec): 66.383
> > Stddev Bandwidth: 40.7776
> > Max bandwidth (MB/sec): 173
> > Min bandwidth (MB/sec): 0
> > Average Latency: 1.50435
> > Stddev Latency: 0.910731
> > Max latency: 9.12276
> > Min latency: 0.19867
> > 
> > my environment is 4 hosts and 40 disks (OSDs), but the test result is very bad:
> > the average latency is 1.5 seconds, and I find writing object metadata is very
> > slow. Because it puts so many objects into one bucket, and we know writing object
> > metadata calls the method "bucket_prepare_op", the test finds this op is very
> > slow. I found the OSD which contains the bucket index object and looked at
> > "bucket_prepare_op" via dump_historic_ops:
> > { "description": "osd_op(client.4742.0:87613 .dir.default.4243.3 [call 
> > rgw.bucket_prepare_op] 3.3670fe74 e317)",
> >   "received_at": "2014-06-30 13:35:55.409597",
> >   "age": "51.148026",
> >   "duration": "4.130137",
> >   "type_data": [
> > "commit sent; apply or cleanup",
> > { "client": "client.4742",
> >   "tid": 87613},
> > [
> > { "time": "2014-06-30 13:35:55.409660",
> >   "event": "waiting_for_osdmap"},
> > { "time": "2014-06-30 13:35:55.409669",
> >   "event": "queue op_wq"},
> > { "time": "2014-06-30 13:35:55.896766",
> >   "event": "reached_pg"},
> > { "time": "2014-06-30 13:35:55.896793",
> >   "event": "started"},
> > { "time": "2014-06-30 13:35:55.896796",
> >   "event": "started"},
> > { "time": "2014-06-30 13:35:55.899450",
> >   "event": "waiting for subops from [40,43]"},
> > { "time": "2014-06-30 13:35:55.899757",
> >   "event": "commit_queued_for_journal_write"},
> > { "time": "2014-06-30 13:35:55.899799",
> >   "event": "write_thread_in_journal_buffer"},
> > { "time": "2014-06-30 13:35:55.899910",
> >   "event": "journaled_completion_queued"},
> > { "time": "2014-06-30 13:35:55.899936",
> >   "event": "journal first callback"},
> > { "time": "2014-06-30 13:35:55.899944",
> >   "event&quo

Re: [ceph-users] Ask a performance question for the RGW

2014-06-29 Thread Guang Yang
Hello,
There is a known limitation of bucket scalability, and there is a blueprint 
tracking it - 
https://wiki.ceph.com/Planning/Blueprints/Submissions/rgw%3A_bucket_index_scalability.

For the time being, I would recommend doing sharding at the application level
(create multiple buckets) to work around this limitation.
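(A hedged way to find which OSDs serve the bucket index object named in the quoted dump below and pull their slow-op history; the exact pool is an assumption, pick whichever pool id 3 maps to:)

$ ceph osd lspools                                 # the op shows "3.3670fe74", i.e. pool id 3
$ ceph osd map <pool_name> .dir.default.4243.3     # lists the acting OSDs for the index object
$ ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_historic_ops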

Thanks,
Guang

On Jun 30, 2014, at 2:54 PM, baijia...@126.com wrote:

>  
> hello, everyone!
>  
> when I use rest-bench to test RGW performance, the cmd is:
> ./rest-bench --access-key=ak --secret=sk --bucket=bucket_name --seconds=600 
> -t 200 -b 524288 -no-cleanup write
>  
> test result:
> Total time run: 362.962324
> Total writes made: 48189
> Write size: 524288
> Bandwidth (MB/sec): 66.383
> Stddev Bandwidth: 40.7776
> Max bandwidth (MB/sec): 173
> Min bandwidth (MB/sec): 0
> Average Latency: 1.50435
> Stddev Latency: 0.910731
> Max latency: 9.12276
> Min latency: 0.19867
>  
> my environment is 4 hosts and 40 disks (OSDs), but the test result is very bad:
> the average latency is 1.5 seconds, and I find writing object metadata is very
> slow. Because it puts so many objects into one bucket, and we know writing object
> metadata calls the method "bucket_prepare_op", the test finds this op is very
> slow. I found the OSD which contains the bucket index object and looked at
> "bucket_prepare_op" via dump_historic_ops:
> { "description": "osd_op(client.4742.0:87613 .dir.default.4243.3 [call 
> rgw.bucket_prepare_op] 3.3670fe74 e317)",
>   "received_at": "2014-06-30 13:35:55.409597",
>   "age": "51.148026",
>   "duration": "4.130137",
>   "type_data": [
> "commit sent; apply or cleanup",
> { "client": "client.4742",
>   "tid": 87613},
> [
> { "time": "2014-06-30 13:35:55.409660",
>   "event": "waiting_for_osdmap"},
> { "time": "2014-06-30 13:35:55.409669",
>   "event": "queue op_wq"},
> { "time": "2014-06-30 13:35:55.896766",
>   "event": "reached_pg"},
> { "time": "2014-06-30 13:35:55.896793",
>   "event": "started"},
> { "time": "2014-06-30 13:35:55.896796",
>   "event": "started"},
> { "time": "2014-06-30 13:35:55.899450",
>   "event": "waiting for subops from [40,43]"},
> { "time": "2014-06-30 13:35:55.899757",
>   "event": "commit_queued_for_journal_write"},
> { "time": "2014-06-30 13:35:55.899799",
>   "event": "write_thread_in_journal_buffer"},
> { "time": "2014-06-30 13:35:55.899910",
>   "event": "journaled_completion_queued"},
> { "time": "2014-06-30 13:35:55.899936",
>   "event": "journal first callback"},
> { "time": "2014-06-30 13:35:55.899944",
>   "event": "queuing ondisk"},
> { "time": "2014-06-30 13:35:56.142104",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-06-30 13:35:56.176950",
>   "event": "sub_op_commit_rec"},
> { "time": "2014-06-30 13:35:59.535301",
>   "event": "op_commit"},
> { "time": "2014-06-30 13:35:59.535331",
>   "event": "commit_sent"},
> { "time": "2014-06-30 13:35:59.539723",
>   "event": "op_applied"},
> { "time": "2014-06-30 13:35:59.539734",
>   "event": "done"}]]},
>  
> so why is it so slow from "journaled_completion_queued" to "op_commit", and
> what happened?
> thanks
>  
> baijia...@126.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] XFS - number of files in a directory

2014-06-23 Thread Guang Yang
Hello Cephers,
We used to have a Ceph cluster with our data pool set up with 3 replicas; we
estimated the number of files (given disk size and object size) for each PG was
around 8K, and we disabled folder splitting, which means all files are located in the
root PG folder. Our testing showed good performance with such a setup.

Right now we are evaluating erasure coding, which splits each object into a
number of chunks and increases the number of files several times. Although XFS
claims good support for large directories [1], some testing also showed that
we may expect performance degradation for large directories.
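(A hedged way to eyeball the current state on a FileStore OSD; the osd id and data path are the usual defaults, adjust as needed:)

$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'filestore_(merge_threshold|split_multiple)'
$ for d in /var/lib/ceph/osd/ceph-0/current/*_head; do echo "$(find "$d" -type f | wc -l) $d"; done | sort -rn | head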

I would like to check on your experience with this in your Ceph cluster
if you are using XFS. Thanks.

[1] http://www.scs.stanford.edu/nyu/02fa/sched/xfs.pdf

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-29 Thread Guang Yang
On May 28, 2014, at 5:31 AM, Gregory Farnum  wrote:

> On Sun, May 25, 2014 at 6:24 PM, Guang Yang  wrote:
>> On May 21, 2014, at 1:33 AM, Gregory Farnum  wrote:
>> 
>>> This failure means the messenger subsystem is trying to create a
>>> thread and is getting an error code back — probably due to a process
>>> or system thread limit that you can turn up with ulimit.
>>> 
>>> This is happening because a replicated PG primary needs a connection
>>> to only its replicas (generally 1 or 2 connections), but with an
>>> erasure-coded PG the primary requires a connection to m+n-1 replicas
>>> (everybody who's in the erasure-coding set, including itself). Right
>>> now our messenger requires a thread for each connection, so kerblam.
>>> (And it actually requires a couple such connections because we have
>>> separate heartbeat, cluster data, and client data systems.)
>> Hi Greg,
>> Is there any plan to refactor the messenger component to reduce the number of
>> threads? For example, using an event-driven mode.
> 
> We've discussed it in very broad terms, but there are no concrete
> designs and it's not on the schedule yet. If anybody has conclusive
> evidence that it's causing them trouble they can't work around, that
> would be good to know…
Thanks for the response!

We used to have a cluster with each OSD host having 11 disks (daemons); on each
host there are around 15K threads. The system is stable, but when there is a
cluster-wide change (e.g. OSD down / out, recovery), we observed system load
increasing, though there was no cascading failure.

Most recently we have been evaluating Ceph against high density hardware with each
OSD host having 33 disks (daemons); on each host there are around 40K-50K
threads. With some OSD hosts down/out, we started seeing the load increase
sharply along with a large volume of thread creation/joining.
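(A hedged sketch of how we count threads and check the limits Greg mentions; nothing Ceph-specific here:)

$ ps -eLf | grep -c '[c]eph-osd'             # total threads across all ceph-osd daemons on the host
$ ulimit -u                                  # per-user process/thread limit
$ sysctl kernel.pid_max kernel.threads-max   # system-wide limits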

We don't have strong evidence that the messenger thread model is the problem,
or of how much an event-driven approach would help, but I think as we move to high
density hardware (for cost saving purposes), the issue could be amplified.

If there is any plan, it would be good to know, and we are very interested in getting involved.

Thanks,
Guang

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Expanding pg's of an erasure coded pool

2014-05-25 Thread Guang Yang
On May 21, 2014, at 1:33 AM, Gregory Farnum  wrote:

> This failure means the messenger subsystem is trying to create a
> thread and is getting an error code back — probably due to a process
> or system thread limit that you can turn up with ulimit.
> 
> This is happening because a replicated PG primary needs a connection
> to only its replicas (generally 1 or 2 connections), but with an
> erasure-coded PG the primary requires a connection to m+n-1 replicas
> (everybody who's in the erasure-coding set, including itself). Right
> now our messenger requires a thread for each connection, so kerblam.
> (And it actually requires a couple such connections because we have
> separate heartbeat, cluster data, and client data systems.)
Hi Greg,
Is there any plan to refactor the messenger component to reduce the number of
threads? For example, using an event-driven mode.
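(If the limit is indeed what is being hit, a hedged sketch of bumping it before restarting the OSDs; the numbers are arbitrary examples, not recommendations:)

$ sysctl -w kernel.pid_max=4194303
$ sysctl -w kernel.threads-max=2097152
  # and raise the per-user limit, e.g. in /etc/security/limits.conf, before the OSDs start:
  #   *  soft  nproc  1048576
  #   *  hard  nproc  1048576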

Thanks,
Guang
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman
>  wrote:
>> Hi,
>> 
>> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to
>> create an erasure coded pool with 4096 pgs, but this crashed the cluster.
>> I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num); when I
>> then try to expand to 4096 (not even quite enough) the cluster crashes
>> again. (Do we need fewer PGs with erasure coding?)
>> 
>> The crash starts with individual OSDs crashing, eventually bringing down the
>> mons (until there is no more quorum or too few osds)
>> 
>> Out of the logs:
>> 
>> 
>>   -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq:
>> 14301, time: 2014-05-20 10:31:55.545590, event: started, request:
>> pg_query(0.974 epoch 3315) v3
>>   -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 --
>> 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974
>> epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9
>> 026b40
>>   -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq:
>> 14301, time: 2014-05-20 10:31:55.545807, event: done, request:
>> pg_query(0.974 epoch 3315) v3
>>   -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 --
>> 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0
>> l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
>>   -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 --
>> 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0
>> l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
>>   -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 --
>> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 
>> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2  47+0+0
>> (855262282 0 0) 0xb6863c0 con 0x1255b9c0
>>   -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 --
>> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 
>> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2  47+0+0
>> (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
>>-9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 --
>> 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 
>> pg_query(7.3ffs6 epoch 3326) v3  144+0+0 (221596234 0 0) 0x10b994a0 con
>> 0x9383860
>>-8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631363, event: started, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631402, event: done, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>-1> 2014-05-20 10:31:55.631488 7fd427b41700  1 --
>> 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 --
>> pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
>> 0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In
>> function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20
>> 10:31:55.630937
>> common/Thread.cc: 110: FAILED assert(ret == 0)
>> 
>> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>

Re: [ceph-users] Firefly 0.80 rados bench cleanup / object removal broken?

2014-05-19 Thread Guang Yang
Hi Matt,
The problem you came across was due to a change made in rados bench along
with the Firefly release; it aimed to solve the problem that, when there were
multiple rados bench instances (for writing), we want to be able to do a rados
bench read for each run as well.

Unfortunately, that change broke your use case; here is my suggestion to solve
your problem:
1. Remove the pre-defined metadata file by
$ rados -p {pool_name} rm benchmark_last_metadata
2. Cleanup by prefix
$ sudo rados -p {pool_name} cleanup --prefix bench

Moving forward, you can use the new parameter ‘--run-name’ to name each
run and clean up on that basis; if you still want to do a slow linear search to
clean up, be sure to remove the benchmark_last_metadata object before you kick
off the cleanup.
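(A hedged example of the per-run flow; the pool and run name are placeholders, and on Firefly the exact cleanup flags are worth double-checking against 'rados --help':)

$ rados -p <pool_name> bench 60 write --run-name run1 --no-cleanup
$ rados -p <pool_name> cleanup --run-name run1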

Let me know if that helps.

Thanks,
Guang

On May 20, 2014, at 6:45 AM, matt.lat...@hgst.com wrote:

> 
> I was experimenting previously with 0.72 , and could easily cleanup pool
> objects from several previous rados bench (write) jobs with :
> 
> rados -p  cleanup bench  (would remove all objects starting
> with "bench")
> 
> I quickly realised when I moved to 0.80 that my script was broken and
> theoretically I now need:
> 
> rados -p  cleanup --prefix benchmark_data
> 
> But this only works sometimes, and sometimes partially. Issuing the command
> line twice seems to help a bit !  Also if I do "rados -p  ls"
> before hand, it seems to increase my chances of success, but often I am
> still left with benchmark objects undeleted. I also tried using the
> --run-name option to no avail.
> 
> The story gets more bizarre now I have set up a "hot SSD" cachepool in
> front of the backing OSD (SATA) pool. Objects won't delete from either pool
> with rados cleanup  I tried
> 
> "rados -p  cache-flush-evict-all"
> 
> which worked (rados df shows all objects now on the backing pool). Then
> bizarrely trying cleanup from the backing OSD pool just appears to copy
> them back into the cachepool, and they remain on the backing pool.
> 
> I can list individual object names with
> 
> rados -p  ls
> 
> but rados rm  will not remove individual objects stating "file
> or directory not found".
> 
> Are others seeing these things and any ways to work around or am I doing
> something wrong?  Are these commands now deprecated in which case what
> should I use?
> 
> Ubuntu 12.04, Kernel 3.14.0
> 
> Matt Latter
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] XFS tuning on OSD

2014-03-05 Thread Guang Yang
Hello all,
Recently I have been working on Ceph performance analysis on our cluster; our OSD
hardware looks like:
  11 SATA disks, 4TB each, 7200RPM
  48GB RAM

When breaking down the latency, we found that half of it (the average latency
is around 60 milliseconds via radosgw) comes from file lookup and open (there
could be a couple of disk seeks there). When looking at the file system cache
(slabtop), we found that around 5M dentries / inodes are cached; however, the
host has around 110 million files (and directories) in total.

I am wondering if there is any good experience within the community tuning for the
same workload, e.g. changing the inode size, or using the mkfs.xfs -n size=64k
option [1]?

[1] 
http://xfs.org/index.php/XFS_FAQ#Q:_Performance:_mkfs.xfs_-n_size.3D64k_option
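(For reference, a hedged example of the options being discussed; the device is a placeholder and the inode size value is just an example, not a recommendation:)

$ mkfs.xfs -f -i size=2048 -n size=64k /dev/sdX1   # larger inodes plus 64k directory blocks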

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph GET latency

2014-02-18 Thread Guang Yang
Hi ceph-users,
We are using Ceph (radosgw) to store user generated images; as GET latency is
critical for us, most recently I did some investigation over the GET path to
understand where the time is spent.

I first confirmed that the latency came from the OSD (read op), so we
instrumented the code to trace the GET request (read op at the OSD side; to be more
specific, each object with size [512K + 4M * x] is split into [1 + x]
chunks, and each chunk needs one read op). Each read op needs to go
through the following steps:
    1. Dispatch and take by a op thread to process (process not started).
             0   – 20 ms,    94%
             20 – 50 ms,    2%
             50 – 100 ms,  2%
              100ms+   ,         2%
         For those having 20ms+ latency, half of them are due to waiting for pg 
lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089), another 
half are yet to be investigated.

    2. Get the file xattr (‘-‘), which opens the file and populates the fd cache
(https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
              0   – 20 ms,  80%
              20 – 50 ms,   8%
              50 – 100 ms, 7%
              100ms+   ,      5%
         The latency comes from (in decreasing order): file path lookup
(https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file
open, or fd cache lookup/add.
         Currently objects are stored in level 6 or level 7 folders (due to
http://tracker.ceph.com/issues/7207, I stopped folder splitting).

    3. Get more xattrs, this is fast due to previous fd cache (rarely > 1ms).

    4. Read the data.
            0   – 20 ms,   84%
            20 – 50 ms, 10%
            50 – 100 ms, 4%
            100ms+        , 2%

I decreased vfs_cache_pressure from its default value 100 to 5 to make the VFS
favor the dentry/inode cache over the page cache; unfortunately it did not help.
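(What we tried, as a hedged reference; the values are simply what we used, not a recommendation:)

$ sysctl -w vm.vfs_cache_pressure=5
$ slabtop -o | grep -E 'dentry|xfs_inode'   # watch how many dentries / XFS inodes stay cached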

Long story short, most of the long-latency read ops come from file system calls
(for cold data); as our workload mainly stores objects smaller than 500KB, it
generates a large number of objects.

I would like to ask if people have experienced a similar issue, and if there is any
suggestion I can try to boost the GET performance. On the other hand, PUT
performance could be sacrificed.

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2014-02-10 Thread Guang Yang
Thanks all for the help.

We finally identified that the root cause of the issue was lock contention
happening during folder splitting; here is the tracking ticket (thanks Inktank for
the fix!): http://tracker.ceph.com/issues/7207

Thanks,
Guang


On Tuesday, December 31, 2013 8:22 AM, Guang Yang  wrote:
 
Thanks Wido, my comments inline...

>Date: Mon, 30 Dec 2013 14:04:35 +0100
>From: Wido den Hollander 
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
>    after running some time

>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> While the performance of the cluster (only with WRITE traffic) is stable
> until Dec 25th, our monitoring (for radosgw access log) shows a
> significant increase of average latency and 99% latency.
>
> And then I chose one OSD and try to grep slow requests logs and find
> that most of the slow requests were waiting for subop, I take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out most of the slow requests were waiting for osd 598, 629, I
> ran the procedure on another host osd22 and got the same pattern.
>
> Then I turned to the host having osd598 and dump the perf counter to do
> comparision.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.246379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This double confirmed that osd 598 was having some performance issues
> (it has around *30 seconds average op latency*!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
> latency difference is not as significant as we saw from osd perf.
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.

Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-31 Thread Guang Yang
Thanks Mark, my comments inline...

Date: Mon, 30 Dec 2013 07:36:56 -0600
From: Mark Nelson 
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
    after running some time

On 12/30/2013 05:45 AM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> While the performance of the cluster (only with WRITE traffic) is stable
> until Dec 25th, our monitoring (for radosgw access log) shows a
> significant increase of average latency and 99% latency.
>
> And then I chose one OSD and try to grep slow requests logs and find
> that most of the slow requests were waiting for subop, I take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out most of the slow requests were waiting for osd 598, 629, I
> ran the procedure on another host osd22 and got the same pattern.
>
> Then I turned to the host having osd598 and dump the perf counter to do
> comparision.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.246379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This double confirmed that osd 598 was having some performance issues
> (it has around *30 seconds average op latency*!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
> latency difference is not as significant as we saw from osd perf.
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
> 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8
>
> Any idea why a couple of OSDs are so slow that impact the performance of
> the entire cluster?

You may want to use the dump_historic_ops command in the admin socket 
for the slow OSDs.  That will give you some clues regarding where the 
ops are hanging up in the OSD.  You can also crank the osd debugging way 
up on that node and search through the logs to see if there are any 
patterns or trends (consistent slowness, pauses, etc).  It may also be 
useful to look and see if that OSD is pegging CPU and if so attach 
strace or perf to it and see what it's doing.
[yguang] We have a job dump_historic_ops but unfortunate

Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-31 Thread Guang Yang
Thanks Wido, my comments inline...

>Date: Mon, 30 Dec 2013 14:04:35 +0100
>From: Wido den Hollander 
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
>    after running some time

>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> While the performance of the cluster (only with WRITE traffic) is stable
> until Dec 25th, our monitoring (for radosgw access log) shows a
> significant increase of average latency and 99% latency.
>
> And then I chose one OSD and try to grep slow requests logs and find
> that most of the slow requests were waiting for subop, I take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out most of the slow requests were waiting for osd 598, 629, I
> ran the procedure on another host osd22 and got the same pattern.
>
> Then I turned to the host having osd598 and dump the perf counter to do
> comparision.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.246379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This double confirmed that osd 598 was having some performance issues
> (it has around *30 seconds average op latency*!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
> latency difference is not as significant as we saw from osd perf.
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
> 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8
>
> Any idea why a couple of OSDs are so slow that impact the performance of
> the entire cluster?
>

What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.

[yguang] We are running on xfs, journal and data share the same disk with 
different partitions.

Wido

> Thanks,
> Guang
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailin

Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-30 Thread Guang Yang
Thanks Wido, my comments inline...

>Date: Mon, 30 Dec 2013 14:04:35 +0100
>From: Wido den Hollander 
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
>    after running some time

>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> While the performance of the cluster (only with WRITE traffic) is stable
> until Dec 25th, our monitoring (for radosgw access log) shows a
> significant increase of average latency and 99% latency.
>
> And then I chose one OSD and try to grep slow requests logs and find
> that most of the slow requests were waiting for subop, I take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out most of the slow requests were waiting for osd 598, 629, I
> ran the procedure on another host osd22 and got the same pattern.
>
> Then I turned to the host having osd598 and dump the perf counter to do
> comparision.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.246379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This double confirmed that osd 598 was having some performance issues
> (it has around *30 seconds average op latency*!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
> latency difference is not as significant as we saw from osd perf.
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
> 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8
>
> Any idea why a couple of OSDs are so slow that impact the performance of
> the entire cluster?
>

What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.

[yguang] We are running on xfs, journal and data share the same disk with 
different partitions.

Wido

> Thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 'ceph osd reweight' VS 'ceph osd crush reweight'

2013-12-11 Thread Guang Yang
Hello ceph-users,
I am a little bit confused by these two options. I understand that crush reweight
determines the weight of the OSD in the crush map, so it impacts I/O and
utilization; however, I am a little bit confused by the osd reweight option. Is
that something that controls the I/O distribution across different OSDs on a
single host?

While looking at the code, I only found that if 'osd weight' is 1 (0x1), it 
means the osd is up and if it is 0, it means the osd is down.
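(For concreteness, a hedged example of the two commands; the osd id and values are made up:)

$ ceph osd crush reweight osd.12 3.64   # CRUSH weight, conventionally the disk size in TB; drives placement and utilization
$ ceph osd reweight osd.12 0.8          # override weight in [0, 1]; 1 means fully "in", 0 is equivalent to marking the OSD out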

Please advise...

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-24 Thread Guang Yang
Thanks Mark.

I cannot connect to my hosts, I will do the check and get back to you tomorrow.

Thanks,
Guang

On 2013-10-24, at 9:47 PM, Mark Nelson wrote:

> On 10/24/2013 08:31 AM, Guang Yang wrote:
>> Hi Mark, Greg and Kyle,
>> Sorry to response this late, and thanks for providing the directions for 
>> me to look at.
>> 
>> We have exact the same setup for OSD, pool replica (and even I tried to 
>> create the same number of PGs within the small cluster), however, I can 
>> still reproduce this constantly.
>> 
>> This is the command I run:
>> $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write
>> 
>> With 24 OSDs:
>> Average Latency: 0.00494123
>> Max latency: 0.511864
>> Min latency:  0.002198
>> 
>> With 330 OSDs:
>> Average Latency:0.00913806
>> Max latency: 0.021967
>> Min latency:  0.005456
>> 
>> In terms of the crush rule, we are using the default one, for the small 
>> cluster, it has 3 OSD hosts (11 + 11 + 2), for the large cluster, we 
>> have 30 OSD hosts (11 * 30).
>> 
>> I have a couple of questions:
>>  1. Is it possible that latency is due to that we have only three layer 
>> hierarchy? like root -> host -> OSD, and as we are using the Straw (by 
>> default) bucket type, which has O(N) speed, and if host number increase, 
>> so that the computation actually increase. I suspect not as the 
>> computation is in the order of microseconds per my understanding.
> 
> I suspect this is very unlikely as well.
> 
>> 
>>  2. Is it possible because we have more OSDs, the cluster will need to 
>> maintain far more connections between OSDs which potentially slow things 
>> down?
> 
> One thing here that might be very interesting is this:
> 
> After you run your tests, if you do something like:
> 
> find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {}
> dump_historic_ops \; > foo
> 
> on each OSD server, you will get a dump of the 10 slowest operations
> over the last 10 minutes for each OSD on each server, and it will tell
> you were in each OSD operations were backing up.  You can sort of search
> through these files by greping for "duration" first, looking for the
> long ones, and then going back and searching through the file for those
> long durations and looking at the associated latencies.
> 
> Something I have been investigating recently is time spent waiting for
> osdmap propagation.  It's something I haven't had time to dig into
> meaningfully, but if we were to see that this was more significant on
> your larger cluster vs your smaller one, that would be very interesting
> news.
> 
>> 
>>  3. Anything else i might miss?
>> 
>> Thanks all for the constant help.
>> 
>> Guang
>> 
>> 
>> On 2013-10-22, at 10:22 PM, Guang Yang <mailto:yguan...@yahoo.com> wrote:
>> 
>>> Hi Kyle and Greg,
>>> I will get back to you with more details tomorrow, thanks for the 
>>> response.
>>> 
>>> Thanks,
>>> Guang
>>> On 2013-10-22, at 9:37 AM, Kyle Bader <mailto:kyle.ba...@gmail.com> wrote:
>>> 
>>>> Besides what Mark and Greg said it could be due to additional hops 
>>>> through network devices. What network devices are you using, what is 
>>>> the network  topology and does your CRUSH map reflect the network 
>>>> topology?
>>>> 
>>>> On Oct 21, 2013 9:43 AM, "Gregory Farnum" >>> <mailto:g...@inktank.com>> wrote:
>>>> 
>>>>On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang >>><mailto:yguan...@yahoo.com>> wrote:
>>>>> Dear ceph-users,
>>>>> Recently I deployed a ceph cluster with RadosGW, from a small
>>>>one (24 OSDs) to a much bigger one (330 OSDs).
>>>>> 
>>>>> When using rados bench to test the small cluster (24 OSDs), it
>>>>showed the average latency was around 3ms (object size is 5K),
>>>>while for the larger one (330 OSDs), the average latency was
>>>>around 7ms (object size 5K), twice comparing the small cluster.
>>>>> 
>>>>> The OSD within the two cluster have the same configuration, SAS
>>>>disk,  and two partitions for one disk, one for journal and the
>>>>other for metadata.
>>>>> 
>>>>> For PG numbers, the small cluster tested with the pool having
>>>>100 PGs, and for the large cluster, the pool has 4 PGs (as I
&

Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-24 Thread Guang Yang
Hi Mark, Greg and Kyle,
Sorry to respond this late, and thanks for providing the directions for me to
look at.

We have exactly the same setup for OSDs and pool replicas (and I even tried to create
the same number of PGs within the small cluster); however, I can still
reproduce this constantly.

This is the command I run:
$ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write

With 24 OSDs:
Average Latency: 0.00494123
Max latency: 0.511864
Min latency:  0.002198

With 330 OSDs:
Average Latency:0.00913806
Max latency: 0.021967
Min latency:  0.005456

In terms of the crush rule, we are using the default one; the small
cluster has 3 OSD hosts (11 + 11 + 2), and the large cluster has 30
OSD hosts (11 * 30).

I have a couple of questions:
 1. Is it possible that the latency is due to us having only a three-layer
hierarchy, like root -> host -> OSD? As we are using the straw bucket type (by
default), which has O(N) speed, if the host number increases, the computation
actually increases. I suspect not, as the computation is on the order of
microseconds per my understanding.

 2. Is it possible that, because we have more OSDs, the cluster needs to maintain
far more connections between OSDs, which potentially slows things down? (See the
sketch after this list.)

 3. Anything else I might miss?
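(Regarding question 2, a hedged way to gauge the per-daemon connection count on a host is to count open file descriptors, which for an OSD are mostly sockets:)

$ for pid in $(pidof ceph-osd); do echo "osd pid $pid: $(sudo ls /proc/$pid/fd | wc -l) open fds"; done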

Thanks all for the constant help.

Guang  


On 2013-10-22, at 10:22 PM, Guang Yang wrote:

> Hi Kyle and Greg,
> I will get back to you with more details tomorrow, thanks for the response.
> 
> Thanks,
> Guang
On 2013-10-22, at 9:37 AM, Kyle Bader wrote:
> 
>> Besides what Mark and Greg said it could be due to additional hops through 
>> network devices. What network devices are you using, what is the network  
>> topology and does your CRUSH map reflect the network topology?
>> 
>> On Oct 21, 2013 9:43 AM, "Gregory Farnum"  wrote:
>> On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang  wrote:
>> > Dear ceph-users,
>> > Recently I deployed a ceph cluster with RadosGW, from a small one (24 
>> > OSDs) to a much bigger one (330 OSDs).
>> >
>> > When using rados bench to test the small cluster (24 OSDs), it showed the 
>> > average latency was around 3ms (object size is 5K), while for the larger 
>> > one (330 OSDs), the average latency was around 7ms (object size 5K), twice 
>> > comparing the small cluster.
>> >
>> > The OSD within the two cluster have the same configuration, SAS disk,  and 
>> > two partitions for one disk, one for journal and the other for metadata.
>> >
>> > For PG numbers, the small cluster tested with the pool having 100 PGs, and 
>> > for the large cluster, the pool has 4 PGs (as I will to further scale 
>> > the cluster, so I choose a much large PG).
>> >
>> > Does my test result make sense? Like when the PG number and OSD increase, 
>> > the latency might drop?
>> 
>> Besides what Mark said, can you describe your test in a little more
>> detail? Writing/reading, length of time, number of objects, etc.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-22 Thread Guang Yang
Hi Kyle and Greg,
I will get back to you with more details tomorrow, thanks for the response.

Thanks,
Guang
On 2013-10-22, at 9:37 AM, Kyle Bader wrote:

> Besides what Mark and Greg said it could be due to additional hops through 
> network devices. What network devices are you using, what is the network  
> topology and does your CRUSH map reflect the network topology?
> 
> On Oct 21, 2013 9:43 AM, "Gregory Farnum"  wrote:
> On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang  wrote:
> > Dear ceph-users,
> > Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) 
> > to a much bigger one (330 OSDs).
> >
> > When using rados bench to test the small cluster (24 OSDs), it showed the 
> > average latency was around 3ms (object size is 5K), while for the larger 
> > one (330 OSDs), the average latency was around 7ms (object size 5K), twice 
> > comparing the small cluster.
> >
> > The OSD within the two cluster have the same configuration, SAS disk,  and 
> > two partitions for one disk, one for journal and the other for metadata.
> >
> > For PG numbers, the small cluster tested with the pool having 100 PGs, and 
> > for the large cluster, the pool has 4 PGs (as I will to further scale 
> > the cluster, so I choose a much large PG).
> >
> > Does my test result make sense? Like when the PG number and OSD increase, 
> > the latency might drop?
> 
> Besides what Mark said, can you describe your test in a little more
> detail? Writing/reading, length of time, number of objects, etc.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-22 Thread Guang Yang
Thanks Mark for the response. My comments inline...

From: Mark Nelson 
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rados bench result when increasing OSDs
Message-ID: <52653b49.8090...@inktank.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 10/21/2013 09:13 AM, Guang Yang wrote:
> Dear ceph-users,

Hi!

> Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) 
> to a much bigger one (330 OSDs).
> 
> When using rados bench to test the small cluster (24 OSDs), it showed the 
> average latency was around 3ms (object size is 5K), while for the larger one 
> (330 OSDs), the average latency was around 7ms (object size 5K), twice that 
> of the small cluster.

Did you have the same number of concurrent requests going?
[yguang] Yes. I ran the test with 3 or 5 concurrent requests; that does not 
change the result.
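For reference, a write-latency run of roughly this shape can be reproduced with 
rados bench along these lines (a sketch; the pool name "testpool" and the exact 
numbers are placeholders, and flag support may vary by release):

  $ rados mkpool testpool
  $ rados -p testpool bench 60 write -b 5120 -t 5 --no-cleanup   # 5 KB objects, 5 in flight
  $ rados -p testpool bench 60 seq -t 5                          # read the same objects back

The average-latency figure in the bench summary is the number being compared here.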

> 
> The OSDs within the two clusters have the same configuration: SAS disks, with 
> two partitions per disk, one for the journal and the other for metadata.
> 
> For PG numbers, the small cluster was tested with a pool having 100 PGs, and 
> for the large cluster, the pool has 4 PGs (as I plan to further scale the 
> cluster, I chose a much larger PG count).

Forgive me if this is a silly question, but were the pools using the 
same level of replication?
[yguang] Yes, both have 3 replicas.
> 
> Does my test result make sense? That is, when the PG and OSD counts increase, 
> is it expected that the latency goes up?

You wouldn't necessarily expect a larger cluster to show higher latency 
if the nodes, pools, etc were all configured exactly the same, 
especially if you were using the same amount of concurrency.  It's 
possible that you have some slow drives on the larger cluster that could 
be causing the average latency to increase.  If there are more disks per 
node, that could do it too.
[yguang] Glad to know this :) I will need to gather more information on whether 
there are any slow disks, and will get back on this.
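One low-tech way to look for slow disks (a sketch, assuming the sysstat package 
is available on the OSD hosts) is to watch per-device service times while the 
benchmark runs, and map anything suspicious back to its OSD:

  $ iostat -x 5          # on each OSD host: watch await / %util per device
  $ ceph osd tree        # map hosts and devices back to OSD ids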

Are there any other differences you can think of?
[yguang] Another difference is that for the large cluster, as we expect to scale 
it to more than a thousand OSDs, we pre-created the pool with a large PG number (4).
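For what it's worth, pre-creating a pool with a chosen PG count is a one-liner 
(the name and the count below are placeholders; the usual rule of thumb is on 
the order of 100 PGs per OSD divided by the replica count):

  $ ceph osd pool create bigpool 8192 8192   # pg_num and pgp_num
  $ ceph osd pool get bigpool pg_num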

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rados bench result when increasing OSDs

2013-10-21 Thread Guang Yang
Dear ceph-users,
Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) to 
a much bigger one (330 OSDs).

When using rados bench to test the small cluster (24 OSDs), it showed the 
average latency was around 3ms (object size is 5K), while for the larger one 
(330 OSDs), the average latency was around 7ms (object size 5K), twice that of 
the small cluster.

The OSDs within the two clusters have the same configuration: SAS disks, with 
two partitions per disk, one for the journal and the other for metadata.

For PG numbers, the small cluster was tested with a pool having 100 PGs, and for 
the large cluster, the pool has 4 PGs (as I plan to further scale the cluster, 
I chose a much larger PG count).

Does my test result make sense? That is, when the PG and OSD counts increase, is 
it expected that the latency goes up?

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy zap disk failure

2013-10-18 Thread Guang Yang
Thanks all for the recommendations. I worked around it by modifying ceph-deploy 
to use the full path for sgdisk.
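An alternative to patching ceph-deploy itself (a sketch; paths and sudo policy 
differ per site) is to make sure /usr/sbin is on the PATH that sudo hands to 
non-interactive sessions, e.g. by appending to /etc/profile on the target hosts 
and/or checking sudo's secure_path:

  export PATH=$PATH:/usr/sbin:/sbin        # in /etc/profile (or a profile.d snippet)
  $ sudo grep secure_path /etc/sudoers     # see whether sudo resets PATH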

Thanks,
Guang
On Oct 16, 2013, at 10:47 PM, Alfredo Deza wrote:

> On Tue, Oct 15, 2013 at 9:19 PM, Guang  wrote:
>> -bash-4.1$ which sgdisk
>> /usr/sbin/sgdisk
>> 
>> Which path does ceph-deploy use?
> 
> That is unexpected... these are the paths that ceph-deploy uses:
> 
> '/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin'
> 
> So `/usr/sbin/` is there. I believe  this is a case where $PATH gets
> altered because of sudo (resetting the env variable).
> 
> This should be fixed in the next release. In the meantime, you could
> set the $PATH for non-interactive sessions (which is what ceph-deploy
> does)
> for all users. I *think* that would be in `/etc/profile`
> 
> 
>> 
>> Thanks,
>> Guang
>> 
>> On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote:
>> 
>>> On Tue, Oct 15, 2013 at 10:52 AM, Guang  wrote:
 Hi ceph-users,
 I am trying with the new ceph-deploy utility on RHEL6.4 and I came across a
 new issue:
 
 -bash-4.1$ ceph-deploy --version
 1.2.7
 -bash-4.1$ ceph-deploy disk zap server:/dev/sdb
 [ceph_deploy.cli][INFO  ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap
 server:/dev/sdb
 [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server
 [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from
 remote host
 [ceph_deploy.osd][INFO  ] Distro info: Red Hat Enterprise Linux Server 6.4
 Santiago
 [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device
 [osd2.ceph.mobstor.bf1.yahoo.com][INFO  ] Running command: sudo sgdisk
 --zap-all --clear --mbrtogpt -- /dev/sdb
 [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found
 
 While I run disk zap on the host directly, it can work without issues.
 Anyone meet the same issue?
>>> 
>>> Can you run `which sgdisk` on that host? I want to make sure this is
>>> not a $PATH problem.
>>> 
>>> ceph-deploy tries to use the proper path remotely but it could be that
>>> this one is not there.
>>> 
>>> 
 
 Thanks,
 Guang
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-20 Thread Guang Yang
Thanks Greg.

>>The typical case is going to depend quite a lot on your scale.
[Guang] I am thinking of a scale of billions of objects, with sizes from several 
KB to several MB; my concern is the cache efficiency for such a use case.

That said, I'm not sure why you'd want to use CephFS for a small-object store 
when you could just use raw RADOS, and avoid all the posix overheads. Perhaps 
I've misunderstood your use case?
[Guang] No, you haven't misunderstood; that is my use case :) I am also thinking 
of using RADOS directly, without the POSIX layer on top, but before that I want 
to consider each option we have and compare the pros and cons.
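As a concrete illustration of the raw-RADOS path (a sketch only; the pool and 
object names are made up), small objects can be stored and fetched with no POSIX 
layer at all, either via the rados CLI or the equivalent librados calls:

  $ rados mkpool photos
  $ rados -p photos put img-000001 ./img-000001.jpg   # one photo = one RADOS object
  $ rados -p photos stat img-000001                   # size / mtime, no inode involved
  $ rados -p photos get img-000001 /tmp/img-000001.jpg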

Thanks,
Guang



 From: Gregory Farnum 
To: Guang Yang  
Cc: Gregory Farnum ; "ceph-us...@ceph.com" 
 
Sent: Tuesday, August 20, 2013 9:51 AM
Subject: Re: [ceph-users] Usage pattern and design of Ceph
 


On Monday, August 19, 2013, Guang Yang  wrote:

Thanks Greg.
>
>
>Some comments inline...
>
>
>On Sunday, August 18, 2013, Guang Yang  wrote:
>
>Hi ceph-users,
>>This is Guang and I am pretty new to ceph, glad to meet you guys in the 
>>community!
>>
>>
>>After walking through some documents of Ceph, I have a couple of questions:
>>  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
>>to handle different work-loads (from KB to GB), with corresponding 
>>performance report?
>
>
>Not really; any comparison would be highly biased depending on your Amazon 
>ping and your Ceph cluster. We've got some internal benchmarks where Ceph 
>looks good, but they're not anything we'd feel comfortable publishing.
> [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact 
>on the comparison.
>  2. Looking at some industry solutions for distributed storage, GFS / 
>Haystack / HDFS all use meta-server to store the logical-to-physical mapping 
>within memory and avoid disk I/O lookup for file reading, is the concern valid 
>for Ceph (in terms of latency to read file)?
>
>
>These are very different systems. Thanks to CRUSH, RADOS doesn't need to do 
>any IO to find object locations; CephFS only does IO if the inode you request 
>has fallen out of the MDS cache (not terribly likely in general). This 
>shouldn't be an issue...
>[Guang] " CephFS only does IO if the inode you request has fallen out of the 
>MDS cache", my understanding is, if we use CephFS, we will need to interact 
>with Rados twice, the first time to retrieve meta-data (file attribute, owner, 
>etc.) and the second time to load data, and both times will need disk I/O in 
>terms of inode and data. Is my understanding correct? The way some other 
>storage system tried was to cache the file handle in memory, so that it can 
>avoid the I/O to read inode in.

In the worst case this can happen with CephFS, yes. However, the client is not 
accessing metadata directly; it's going through the MetaData Server, which 
caches (lots of) metadata on its own, and the client can get leases as well (so 
it doesn't need to go to the MDS for each access, and can cache information on 
its own). The typical case is going to depend quite a lot on your scale.
That said, I'm not sure why you'd want to use CephFS for a small-object store 
when you could just use raw RADOS, and avoid all the posix overheads. Perhaps 
I've misunderstood your use case?
-Greg

 
 
>  3. Some industry research shows that one issue of file systems is the 
>metadata-to-data ratio, in terms of both access and storage, and some techniques 
>combine small files into large physical files to reduce the ratio (Haystack, 
>for example). If we want to use Ceph to store photos, should this be a concern 
>as Ceph uses one physical file per object?
>
>
>...although this might be. The issue basically comes down to how many disk 
>seeks are required to retrieve an item, and one way to reduce that number is 
>to hack the filesystem by keeping a small number of very large files an 
>calculating (or caching) where different objects are inside that file. Since 
>Ceph is designed for MB-sized objects it doesn't go to these lengths to 
>optimize that path like Haystack might (I'm not familiar with Haystack in 
>particular).
>That said, you need some pretty extreme latency requirements before this 
>becomes an issue and if you're also looking at HDFS or S3 I can't imagine 
>you're in that ballpark. You should be fine. :)
>[Guang] Yep, that makes a lot of sense.
>-Greg
>
>-- 
>Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-20 Thread Guang Yang
Then that makes total sense to me.

Thanks,
Guang



 From: Mark Kirkwood 
To: Guang Yang  
Cc: "ceph-users@lists.ceph.com"  
Sent: Tuesday, August 20, 2013 1:19 PM
Subject: Re: [ceph-users] Usage pattern and design of Ceph
 

On 20/08/13 13:27, Guang Yang wrote:
> Thanks Mark.
>
> What are the design considerations for breaking large files into 4M chunks
> rather than storing the large file directly?
>
>

Quoting Wolfgang from previous reply:

=> which is a good thing in terms of replication and OSD usage
distribution


...which covers what I would have said quite well :-)

Cheers

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Thanks Greg.

Some comments inline...

On Sunday, August 18, 2013, Guang Yang  wrote:

Hi ceph-users,
>This is Guang and I am pretty new to ceph, glad to meet you guys in the 
>community!
>
>
>After walking through some documents of Ceph, I have a couple of questions:
>  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
>to handle different work-loads (from KB to GB), with corresponding performance 
>report?

Not really; any comparison would be highly biased depending on your Amazon ping 
and your Ceph cluster. We've got some internal benchmarks where Ceph looks 
good, but they're not anything we'd feel comfortable publishing.
 [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact 
on the comparison.
  2. Looking at some industry solutions for distributed storage, GFS / Haystack 
/ HDFS all use a meta-server to store the logical-to-physical mapping in 
memory and avoid disk I/O lookups for file reading; is that concern valid for 
Ceph (in terms of latency to read a file)?

These are very different systems. Thanks to CRUSH, RADOS doesn't need to do any 
IO to find object locations; CephFS only does IO if the inode you request has 
fallen out of the MDS cache (not terribly likely in general). This shouldn't be 
an issue...
[Guang] " CephFS only does IO if the inode you request has fallen out of the 
MDS cache", my understanding is, if we use CephFS, we will need to interact 
with Rados twice, the first time to retrieve meta-data (file attribute, owner, 
etc.) and the second time to load data, and both times will need disk I/O in 
terms of inode and data. Is my understanding correct? The way some other 
storage system tried was to cache the file handle in memory, so that it can 
avoid the I/O to read inode in.
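As an aside, the CRUSH computation itself can be inspected directly: "ceph osd 
map" prints the placement group and the OSD set for an arbitrary object name, 
with no per-object lookup table involved (pool and object name below are 
placeholders):

  $ ceph osd map photos img-000001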
 
  3. Some industry research shows that one issue of file systems is the 
metadata-to-data ratio, in terms of both access and storage, and some techniques 
combine small files into large physical files to reduce the ratio (Haystack, for 
example). If we want to use Ceph to store photos, should this be a concern as 
Ceph uses one physical file per object?

...although this might be. The issue basically comes down to how many disk 
seeks are required to retrieve an item, and one way to reduce that number is to 
hack the filesystem by keeping a small number of very large files and 
calculating (or caching) where different objects are inside that file. Since 
Ceph is designed for MB-sized objects it doesn't go to these lengths to 
optimize that path like Haystack might (I'm not familiar with Haystack in 
particular).
That said, you need some pretty extreme latency requirements before this 
becomes an issue and if you're also looking at HDFS or S3 I can't imagine 
you're in that ballpark. You should be fine. :)
[Guang] Yep, that makes a lot of sense.
-Greg

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Thanks Mark.

What are the design considerations for breaking large files into 4M chunks rather 
than storing the large file directly?

Thanks,
Guang



 From: Mark Kirkwood 
To: Guang Yang  
Cc: "ceph-users@lists.ceph.com"  
Sent: Monday, August 19, 2013 5:18 PM
Subject: Re: [ceph-users] Usage pattern and design of Ceph
 

On 19/08/13 18:17, Guang Yang wrote:

>    3. Some industry research shows that one issue of file systems is the
> metadata-to-data ratio, in terms of both access and storage, and some
> techniques combine small files into large physical files to reduce the
> ratio (Haystack, for example). If we want to use Ceph to store photos,
> should this be a concern as Ceph uses one physical file per object?

If you use Ceph as a pure object store, and get and put data via the 
basic rados api then sure, one client data object will be stored in one 
Ceph 'object'. However if you use rados gateway (S3 or Swift look-alike 
api) then each client data object will be broken up into chunks at the 
rados level (typically 4M sized chunks).
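A rough way to see this chunking from the RADOS side is to list and stat the 
backing objects of a large uploaded S3 object (a sketch; ".rgw.buckets" is the 
default radosgw data pool of this era and may well differ on a given install):

  $ rados -p .rgw.buckets ls | head            # the striped ~4 MB pieces show up here
  $ rados -p .rgw.buckets stat <object-name>   # each piece is at most ~4 MB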


Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deploy Ceph on RHEL6.4

2013-08-19 Thread Guang Yang
Hi ceph-users,
I would like to check whether there is a manual / set of steps that would let me 
try deploying Ceph on RHEL.
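In case a sketch of the overall shape helps (hedged: sub-commands and repository 
setup vary by ceph-deploy version, and the hostnames/devices below are 
placeholders), a ceph-deploy based install usually boils down to something like:

  $ ceph-deploy new mon1                        # write the initial ceph.conf / mon list
  $ ceph-deploy install mon1 osd1 osd2          # install the ceph packages on the nodes
  $ ceph-deploy mon create mon1
  $ ceph-deploy gatherkeys mon1
  $ ceph-deploy osd prepare osd1:sdb osd2:sdb
  $ ceph-deploy osd activate osd1:/dev/sdb1 osd2:/dev/sdb1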

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usage pattern and design of Ceph

2013-08-18 Thread Guang Yang
Hi ceph-users,

This is Guang and I am pretty new to ceph, glad to meet you guys in the 
community!

After walking through some documents of Ceph, I have a couple of questions:
  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
to handle different workloads (from KB to GB), with a corresponding performance 
report?
  2. Looking at some industry solutions for distributed storage, GFS / Haystack 
/ HDFS all use a meta-server to store the logical-to-physical mapping in 
memory and avoid disk I/O lookups for file reading; is that concern valid for 
Ceph (in terms of latency to read a file)?
  3. Some industry research shows that one issue of file systems is the 
metadata-to-data ratio, in terms of both access and storage, and some techniques 
combine small files into large physical files to reduce the ratio (Haystack, for 
example). If we want to use Ceph to store photos, should this be a concern as 
Ceph uses one physical file per object?

Thanks,
Guang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

