Re: [ceph-users] Hammer reduce recovery impact

2015-09-11 Thread GuangYang
If we are talking about requests being blocked for 60+ seconds, those tunings might 
not help (they do help a lot with average latency during recovery/backfilling).
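
For reference, these are the knobs I usually mean, applied at runtime (the values 
are only an example; persist them in ceph.conf if they help):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
  ceph tell osd.* injectargs '--osd-recovery-op-priority 1'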

It would be interesting to see the logs for those blocked requests on the OSD side 
(they are logged at level 0); a pattern to search for might be "slow requests \d+ seconds old".
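
For example, something like this against the default log location should turn 
them up (the path is the standard one; adjust as needed):

  grep -E "slow request.* seconds old" /var/log/ceph/ceph-osd.*.log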

I once had a problem where, for an object that was a recovery candidate, all updates 
to that object would be stuck until it was recovered, which could take an extremely 
long time if there were a large number of PGs and objects to recover. But I think 
Sam resolved that in Hammer by allowing writes to degraded objects.


> Date: Thu, 10 Sep 2015 14:56:12 -0600
> From: rob...@leblancnet.us
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are trying to add some additional OSDs to our cluster, but the
> impact of the backfilling has been very disruptive to client I/O and
> we have been trying to figure out how to reduce the impact. We have
> seen some client I/O blocked for more than 60 seconds. There has been
> CPU and RAM head room on the OSD nodes, network has been fine, disks
> have been busy, but not terrible.
>
> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
> (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
> S51G-1UL.
>
> Clients are QEMU VMs.
>
> [ulhglive-root@ceph5 current]# ceph --version
> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3
>
> [ulhglive-root@ceph5 current]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> health HEALTH_WARN
> 3 pgs backfill
> 1 pgs backfilling
> 4 pgs stuck unclean
> recovery 2382/33044847 objects degraded (0.007%)
> recovery 50872/33044847 objects misplaced (0.154%)
> noscrub,nodeep-scrub flag(s) set
> monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
> pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> 128 TB used, 322 TB / 450 TB avail
> 2382/33044847 objects degraded (0.007%)
> 50872/33044847 objects misplaced (0.154%)
> 2300 active+clean
> 3 active+remapped+wait_backfill
> 1 active+remapped+backfilling
> recovery io 70401 kB/s, 16 objects/s
> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed
> after setting up our pools, so our PGs are really out of whack. Our
> most active pool has only 256 PGs and each PG is about 120 GB in size.
> We are trying to clear out a pool that has way too many PGs so that we
> can split the PGs in that pool. I think these large PGs are part of our
> issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> the max latency, sometimes up to 3000 ms, down to a max of 500-700 ms.
> It has also reduced the huge swings in latency, but has also reduced
> throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the
> OSD process gives the recovery threads a different disk priority or if
> changing the scheduler without restarting the OSD allows the OSD to
> use disk priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and
> peer before starting the backfill. This caused more problems than
> solved as we had blocked I/O (over 200 seconds) until we set the new
> OSDs to in.
>
> Even adding one OSD disk into the cluster is causing these slow I/O
> messages. We still have 5 more disks to add from this server and four
> more servers to add.
>
> In addition to trying to minimize these impacts, would it be better to
> split the PGs and then add the rest of the servers, or add the servers
> and then do the PG split? I'm thinking splitting first would be better,
> but I'd like to get other opinions.
>
> No spindle stays at high utilization for long and the await drops
> below 20 ms usually within 10 seconds so I/O should be serviced
> "pretty quick". My next guess is that the journals are getting full
> and blocking while waiting for flushes, but I'm not exactly sure how
> to identify that. We are using the defaults for the journal except for
> size (10G). We'd like to have journals large to handle bursts, but if
> they are getting filled with backfill traffic, it may be counter
> productive. Can/does backfill/recovery bypass the journal?
>
> Thanks,
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

Re: [ceph-users] Hammer reduce recovery impact

2015-09-11 Thread Paweł Sadowski
On 09/10/2015 10:56 PM, Robert LeBlanc wrote:
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> the max latency, sometimes up to 3000 ms, down to a max of 500-700 ms.
> It has also reduced the huge swings in latency, but has also reduced
> throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the
> OSD process gives the recovery threads a different disk priority or if
> changing the scheduler without restarting the OSD allows the OSD to
> use disk priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and
> peer before starting the backfill. This caused more problems than
> solved as we had blocked I/O (over 200 seconds) until we set the new
> OSDs to in.

You can also try lowering these settings (from their defaults):

  "osd_backfill_scan_min": "64",
  "osd_backfill_scan_max": "512",

In our case we've set them to 1 and 8. It helps a lot, but recovery
will take more time.
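
If you want to try the same values at runtime first, something along these lines 
should work (then persist them in the [osd] section of ceph.conf):

  ceph tell osd.* injectargs '--osd-backfill-scan-min 1 --osd-backfill-scan-max 8'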

-- 
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-11 Thread Bill Sanders
Is there a thread on the mailing list (or LKML?) with some background about
tcp_low_latency and TCP_NODELAY?

Bill

On Fri, Sep 11, 2015 at 2:30 AM, Jan Schermer  wrote:

> Can you try
>
> echo 1 > /proc/sys/net/ipv4/tcp_low_latency
>
> And see if it improves things? I remember there being an option to disable
> nagle completely, but it's gone apparently.
>
> Jan
>
> > On 11 Sep 2015, at 10:43, Nick Fisk  wrote:
> >
> >
> >
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of
> >> Somnath Roy
> >> Sent: 11 September 2015 06:23
> >> To: Rafael Lopez 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
> >>
> >> That’s probably because the krbd version you are using doesn’t have the
> >> TCP_NODELAY patch. We have submitted it (and you can build it from
> latest
> >> rbd source) , but, I am not sure when it will be in linux mainline.
> >
> > From memory it landed in 3.19, but there are also several issues with
> max IO size, max nr_requests and readahead. I would suggest for testing,
> try one of these:-
> >
> >
> http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/ra-bring-back/
> >
> >
> >>
> >> Thanks & Regards
> >> Somnath
> >>
> >> From: Rafael Lopez [mailto:rafael.lo...@monash.edu]
> >> Sent: Thursday, September 10, 2015 10:12 PM
> >> To: Somnath Roy
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
> >>
> >> Ok I ran the two tests again with direct=1, smaller block size (4k) and
> smaller
> >> total io (100m), disabled cache at ceph.conf side on client by adding:
> >>
> >> [client]
> >> rbd cache = false
> >> rbd cache max dirty = 0
> >> rbd cache size = 0
> >> rbd cache target dirty = 0
> >>
> >>
> >> The result seems to have swapped around, now the librbd job is running
> >> ~50% faster than the krbd job!
> >>
> >> ### krbd job:
> >>
> >> [root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
> >> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
> >> fio-2.2.8
> >> Starting 1 process
> >> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops]
> [eta
> >> 00m:00s]
> >> job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
> >>  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
> >>clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
> >> lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
> >>clat percentiles (usec):
> >> |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[
> 5536],
> >> | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[
> 6624],
> >> | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[
> 7968],
> >> | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808],
> 99.95th=[17536],
> >> | 99.99th=[19328]
> >>bw (KB  /s): min=  506, max= 1171, per=100.00%, avg=632.22,
> stdev=104.77
> >>lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
> >>  cpu  : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
> >>  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >>> =64=0.0%
> >> submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >>> =64=0.0%
> >> complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >>> =64=0.0%
> >> issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0,
> >> drop=r=0/w=0/d=0
> >> latency   : target=0, window=0, percentile=100.00%, depth=16
> >>
> >> Run status group 0 (all jobs):
> >>  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s,
> >> mint=162033msec, maxt=162033msec
> >>
> >> Disk stats (read/write):
> >>  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745,
> >> util=99.11%
> >> [root@rcprsdc1r72-01-ac rafaell]#
> >>
> >> ## librb job:
> >>
> >> [root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
> >> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
> >> fio-2.2.8
> >> Starting 1 process
> >> rbd engine: RBD version: 0.1.9
> >> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops]
> [eta
> >> 00m:00s]
> >> job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
> >>  write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
> >>slat (usec): min=70, max=992, avg=115.05, stdev=30.07
> >>clat (msec): min=13, max=117, avg=67.91, stdev=24.93
> >> lat (msec): min=13, max=117, avg=68.03, stdev=24.93
> >>clat percentiles (msec):
> >> |  1.00th=[   19],  5.00th=[   26], 10.00th=[   38], 20.00th=[
>  40],
> >> | 30.00th=[   46], 40.00th=[   62], 50.00th=[   77], 60.00th=[
>  85],
> >> | 70.00th=[   88], 80.00th=[   91], 90.00th=[   95], 95.00th=[
>  99],
> >> | 99.00th=[  105], 99.50th=[  110], 99.90th=[  116], 99.95th=[
> 117],
> >> | 99.99th=[  118]
> >>bw (KB  /s): min=  565, max= 3174, per=100.00%, avg=935.74,
> stdev=407.67
> >>

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-11 Thread Somnath Roy
Check this..

http://www.spinics.net/lists/ceph-users/msg16294.html

http://tracker.ceph.com/issues/9344

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bill 
Sanders
Sent: Friday, September 11, 2015 11:17 AM
To: Jan Schermer
Cc: Rafael Lopez; ceph-users@lists.ceph.com; Nick Fisk
Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO

Is there a thread on the mailing list (or LKML?) with some background about 
tcp_low_latency and TCP_NODELAY?
Bill

On Fri, Sep 11, 2015 at 2:30 AM, Jan Schermer 
> wrote:
Can you try

echo 1 > /proc/sys/net/ipv4/tcp_low_latency

And see if it improves things? I remember there being an option to disable 
nagle completely, but it's gone apparently.

Jan

> On 11 Sep 2015, at 10:43, Nick Fisk > 
> wrote:
>
>
>
>
>
>> -Original Message-
>> From: ceph-users 
>> [mailto:ceph-users-boun...@lists.ceph.com]
>>  On Behalf Of
>> Somnath Roy
>> Sent: 11 September 2015 06:23
>> To: Rafael Lopez >
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
>>
>> That’s probably because the krbd version you are using doesn’t have the
>> TCP_NODELAY patch. We have submitted it (and you can build it from latest
>> rbd source) , but, I am not sure when it will be in linux mainline.
>
> From memory it landed in 3.19, but there are also several issues with max IO 
> size, max nr_requests and readahead. I would suggest for testing, try one of 
> these:-
>
> http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/ra-bring-back/
>
>
>>
>> Thanks & Regards
>> Somnath
>>
>> From: Rafael Lopez 
>> [mailto:rafael.lo...@monash.edu]
>> Sent: Thursday, September 10, 2015 10:12 PM
>> To: Somnath Roy
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
>>
>> Ok I ran the two tests again with direct=1, smaller block size (4k) and 
>> smaller
>> total io (100m), disabled cache at ceph.conf side on client by adding:
>>
>> [client]
>> rbd cache = false
>> rbd cache max dirty = 0
>> rbd cache size = 0
>> rbd cache target dirty = 0
>>
>>
>> The result seems to have swapped around, now the librbd job is running
>> ~50% faster than the krbd job!
>>
>> ### krbd job:
>>
>> [root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
>> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
>> fio-2.2.8
>> Starting 1 process
>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta
>> 00m:00s]
>> job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
>>  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
>>clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
>> lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
>>clat percentiles (usec):
>> |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
>> | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
>> | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
>> | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
>> | 99.99th=[19328]
>>bw (KB  /s): min=  506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
>>lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
>>  cpu  : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
>>  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> =64=0.0%
>> submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>> complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>> issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0,
>> drop=r=0/w=0/d=0
>> latency   : target=0, window=0, percentile=100.00%, depth=16
>>
>> Run status group 0 (all jobs):
>>  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s,
>> mint=162033msec, maxt=162033msec
>>
>> Disk stats (read/write):
>>  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745,
>> util=99.11%
>> [root@rcprsdc1r72-01-ac rafaell]#
>>
>> ## librb job:
>>
>> [root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
>> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
>> fio-2.2.8
>> Starting 1 process
>> rbd engine: RBD version: 0.1.9
>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops] [eta
>> 00m:00s]
>> job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
>>  write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
>>slat (usec): min=70, max=992, avg=115.05, stdev=30.07
>>clat (msec): min=13, max=117, avg=67.91, stdev=24.93
>> lat (msec): min=13, max=117, avg=68.03, stdev=24.93
>>clat percentiles (msec):
>> |  1.00th=[   19],  5.00th=[ 

[ceph-users] 5Tb useful space based on Erasure Coded Pool

2015-09-11 Thread Mike
Hello Cephers!
I have an interesting task from one of our clients.
The client has 3000+ video cameras (monitoring streets, porches, entrances,
etc.), and we need to store the data from these cameras for 30 days.

Each camera generates 1.3 TB of data over 30 days; the total bandwidth is
14 Gbit/s. In total we need (1.3 TB x 3000) ~4 PB+ of storage, plus 20% for
recovery if one JBOD fails.

The number of cameras may increase over time.

Another thing to keep in mind is keeping the storage cheap.

My plan:
* Pair a Ceph server with a fat JBOD
* Build ~15 such pairs
* On the JBODs, create an erasure-coded pool with a reasonable failure domain
* On the Ceph servers, add a read-only cache tier, because an erasure-coded
pool can't be accessed directly by clients.

Hardware:
Ceph server
* 2 x e5-2690v3 Xeon (may be 2697)
* 256Gb RAM
* some Intel SSD DCS36xxx series
* 2 x Dualport 10Gbit/s NIC (may be 1 x dualport 10Gbit plus 1 x
Dualport 40Gbit/s for storage network)
* 2 x 4 SAS external port HBA SAS controllers

JBOD
* DATAon DNS-2670/DNS-2684 each can carry 70 or 84 drives or Supermicro
946ED-R2KJBOD that can carry 90 drives.

Ceph settings
* Use lrc plugin (?), with k=6, m=3, l=3, ruleset-failure-domain=host,
ruleset-locality=rack
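
As a concrete sketch of what I have in mind (untested; the profile/pool names and
the PG count are just placeholders):

  ceph osd erasure-code-profile set camprofile \
      plugin=lrc k=6 m=3 l=3 \
      ruleset-failure-domain=host ruleset-locality=rack
  ceph osd erasure-code-profile get camprofile
  ceph osd pool create camdata 4096 4096 erasure camprofile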

I have not yet learned much about the differences between the erasure plugins,
their performance, or low-level configuration.

Do you have any advice about this? Can it work at all? Can erasure coding,
as implemented in Ceph, solve this task?

Thanks for any advice.

--
Mike, yes.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 9 PGs stay incomplete

2015-09-11 Thread Brad Hubbard
- Original Message -
> From: "Wido den Hollander" 
> To: "ceph-users" 
> Sent: Friday, 11 September, 2015 6:46:11 AM
> Subject: [ceph-users] 9 PGs stay incomplete
> 
> Hi,
> 
> I'm running into a issue with Ceph 0.94.2/3 where after doing a recovery
> test 9 PGs stay incomplete:
> 
> osdmap e78770: 2294 osds: 2294 up, 2294 in
> pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>755 TB used, 14468 TB / 15224 TB avail
>   51831 active+clean
>   9 incomplete
> 
> As you can see, all 2294 OSDs are online and about all PGs became
> active+clean again, except for 9.
> 
> I found out that these PGs are the problem:
> 
> 10.3762
> 7.309e
> 7.29a2
> 10.2289
> 7.17dd
> 10.165a
> 7.1050
> 7.c65
> 10.abf
> 
> Digging further, all the PGs map back to a OSD which is running on the
> same host. 'ceph-stg-01' in this case.
> 
> $ ceph pg 10.3762 query
> 
> Looking at the recovery state, this is shown:
> 
> {
> "first": 65286,
> "last": 67355,
> "maybe_went_rw": 0,
> "up": [
> 1420,
> 854,
> 1105

Anything interesting in the OSD logs for these OSDs?
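
If nothing shows up at the default level, it might be worth temporarily bumping
the debug level on the acting primary and re-checking (osd id taken from the
query output above; remember to revert it afterwards):

  ceph tell osd.1420 injectargs '--debug_osd 20 --debug_ms 1'
  grep -i -E 'incomplete|peering' /var/log/ceph/ceph-osd.1420.log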

> ],
> "acting": [
> 1420
> ],
> "primary": 1420,
> "up_primary": 1420
> },
> 
> osd.1420 is online. I tried restarting it, but nothing happens, these 9
> PGs stay incomplete.
> 
> Under 'peer_info' info I see both osd.854 and osd.1105 reporting about
> the PG with identical numbers.
> 
> I restarted both 854 and 1105, without result.
> 
> The output of PG query can be found here: http://pastebin.com/qQL699zC
> 
> The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
> 3.13 kernel. XFS is being used as the backing filesystem.
> 
> Any suggestions to fix this issue? There is no valuable data in these
> pools, so I can remove them, but I'd rather fix the root-cause.
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD with iSCSI

2015-09-11 Thread Nick Fisk
It’s a long shot, but check if librados is installed.
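
A quick way to check which librados the rbd binary actually links against
(paths assume a standard package install):

  ldd $(which rbd) | grep librados
  dpkg -l | grep librados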

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Daleep 
Bais
Sent: 11 September 2015 10:18
To: Jake Young ; p...@daystrom.com
Cc: Ceph-User 
Subject: Re: [ceph-users] RBD with iSCSI

 

Hi Jake, Hello Paul,

 

I was able to mount the iscsi target to another initiator. However, after 
installing the tgt and tgt-rbd, my rbd was not working. Getting error message :

 

root@ceph-node1:~# rbd ls test1

rbd: symbol lookup error: rbd: undefined symbol: _ZTIN8librados9WatchCtx

 

I am using this node as target for iscsi initiator ( Ref : 
http://tracker.ceph.com/issues/12563) and using other node in cluster to create 
pools and images.

 

root@ceph-node1:~# tgtadm --version

1.0.51

 

Paul, I will also check the option you have suggested.  Appreciate the 
suggestion!

 

Thanks.

 

Daleep Singh Bais

 

On Thu, Sep 10, 2015 at 7:57 PM, Jake Young  > wrote:

 

 

On Wed, Sep 9, 2015 at 8:13 AM, Daleep Bais  > wrote:

Hi,

 

I am following steps from URL 
http://www.sebastien-han.fr/blog/2014/07/07/start-with-the-rbd-support-for-tgt/ 
  to create a RBD pool  and share to another initiator.

 

I am not able to get rbd in the backstore list. Please suggest.

 

below is the output of tgtadm command:

 

tgtadm --lld iscsi --op show --mode system   

System:

State: ready

debug: off

LLDs:

iscsi: ready

iser: error

Backing stores:

sheepdog

bsg

sg

null

ssc

smc (bsoflags sync:direct)

mmc (bsoflags sync:direct)

rdwr (bsoflags sync:direct)

Device types:

disk

cd/dvd

osd

controller

changer

tape

passthrough

iSNS:

iSNS=Off

iSNSServerIP=

iSNSServerPort=3205

iSNSAccessControl=Off

 

 

I have installed tgt and tgt-rbd packages till now. Working on Debian GNU/Linux 
8.1 (jessie)

 

Thanks.

 

Daleep Singh Bais

 

___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

Hey Daleep,

 

The tgt you have installed does not support Ceph rbd.  See the output from my 
system using a more recent tgt that supports rbd.

 

tgtadm --lld iscsi --mode system --op show

System:

State: ready

debug: off

LLDs:

iscsi: ready

iser: error

Backing stores:

rbd (bsoflags sync:direct)

sheepdog

bsg

sg

null

ssc

rdwr (bsoflags sync:direct)

Device types:

disk

cd/dvd

osd

controller

changer

tape

passthrough

iSNS:

iSNS=Off

iSNSServerIP=

iSNSServerPort=3205

iSNSAccessControl=Off

 

 

You will need a new version of tgt.  I think the earliest version that supports 
rbd is 1.0.42

 

https://github.com/fujita/tgt

 

 

 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-11 Thread Mariusz Gronczewski
Well, if you plan for an OSD to use 2GB per daemon and it suddenly eats
4x as much RAM, you might get the cluster into an unrecoverable state if you
can't just increase the amount of RAM at will. I managed to recover it
because I had only 4 OSDs per machine, but I can't imagine what would
happen on a 36-OSD machine...

Of course, the failure on my cluster was an extreme case (flapping caused by
a NIC driver basically made most of the PGs end up on the "wrong" OSDs, causing
excessive memory usage), but if one bad driver can get the cluster into such a
hard-to-recover state, it is pretty bad.

And swap won't help in that case, as it will only make the daemons
time out.

It would be preferable to have slower recovery and not have kernel
OOM-killing processes every few minutes.



On Wed, 9 Sep 2015 16:36:44 +0200, Jan Schermer
 wrote:

> You can sort of simulate it:
> 
>  * E.g. if you do something silly like "ceph osd crush reweight osd.1 
>  1" you will see the RSS of osd.28 skyrocket. Reweighting it back 
>  down will not release the memory until you do "heap release".
> 
> But this is expected, methinks.
> 
> Jan
> 
> 
> > On 09 Sep 2015, at 15:51, Mark Nelson  wrote:
> > 
> > Yes, under no circumstances is it really ok for an OSD to consume 8GB of 
> > RSS! :)  It'd be really swell if we could replicate that kind of memory 
> > growth in-house on demand.
> > 
> > Mark
> > 
> > On 09/09/2015 05:56 AM, Jan Schermer wrote:
> >> Sorry if I wasn't clear.
> >> Going from 2GB to 8GB is not normal, although some slight bloating is 
> >> expected. In your case it just got much worse than usual for reasons yet 
> >> unknown.
> >> 
> >> Jan
> >> 
> >> 
> >>> On 09 Sep 2015, at 12:40, Mariusz Gronczewski 
> >>>  wrote:
> >>> 
> >>> 
> >>> well I was going by
> >>> http://ceph.com/docs/master/start/hardware-recommendations/ and planning 
> >>> for 2GB per OSD so that was a suprise maybe there should be warning 
> >>> somewhere ?
> >>> 
> >>> 
> >>> On Wed, 9 Sep 2015 12:21:15 +0200, Jan Schermer  wrote:
> >>> 
>  The memory gets used for additional PGs on the OSD.
>  If you were to "swap" PGs between two OSDs, you'll get memory wasted on 
>  both of them because tcmalloc doesn't release it.*
>  It usually gets stable after few days even during backfills, so it does 
>  get reused if needed.
>  If for some reason your OSDs get to 8GB RSS then I recommend you just 
>  get more memory, or try disabling tcmalloc which can either help or make 
>  it even worse :-)
>  
>  * E.g. if you do something silly like "ceph osd crush reweight osd.1 
>  1" you will see the RSS of osd.28 skyrocket. Reweighting it back 
>  down will not release the memory until you do "heap release".
>  
>  Jan
>  
>  
> > On 09 Sep 2015, at 12:05, Mariusz Gronczewski 
> >  wrote:
> > 
> > On Tue, 08 Sep 2015 16:14:15 -0500, Chad William Seys
> >  wrote:
> > 
> >> Does 'ceph tell osd.* heap release' help with OSD RAM usage?
> >> 
> >> From
> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003932.html
> >> 
> >> Chad.
> > 
> > it did help now, but cluster is in clean state at the moment. But I
> > didnt know that one, thanks.
> > 
> > High memory usage stopped once cluster rebuilt, but I've planned
> > cluster to have 2GB per OSD so I needed to add ram to even get to the
> > point of ceph starting to rebuild, as some OSD ate up to 8 GBs during
> > recover
> > 
> > --
> > Mariusz Gronczewski, Administrator
> > 
> > Efigence S. A.
> > ul. Wołoska 9a, 02-583 Warszawa
> > T: [+48] 22 380 13 13
> > F: [+48] 22 380 13 14
> > E: mariusz.gronczew...@efigence.com
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  
> >>> 
> >>> 
> >>> 
> >>> --
> >>> Mariusz Gronczewski, Administrator
> >>> 
> >>> Efigence S. A.
> >>> ul. Wołoska 9a, 02-583 Warszawa
> >>> T: [+48] 22 380 13 13
> >>> F: [+48] 22 380 13 14
> >>> E: mariusz.gronczew...@efigence.com
> >>> 
> >> 
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. 

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-11 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Somnath Roy
> Sent: 11 September 2015 06:23
> To: Rafael Lopez 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
> 
> That’s probably because the krbd version you are using doesn’t have the
> TCP_NODELAY patch. We have submitted it (and you can build it from latest
> rbd source) , but, I am not sure when it will be in linux mainline.

From memory it landed in 3.19, but there are also several issues with max IO 
size, max nr_requests, and readahead. For testing, I would suggest trying one of 
these:

http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/ra-bring-back/
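
Also, for a quick look at the readahead / nr_requests side on the krbd device
itself, something along these lines may be worth trying (device name and values
are just examples):

  blockdev --getra /dev/rbd0
  blockdev --setra 4096 /dev/rbd0
  cat /sys/block/rbd0/queue/nr_requests
  cat /sys/block/rbd0/queue/read_ahead_kb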


> 
> Thanks & Regards
> Somnath
> 
> From: Rafael Lopez [mailto:rafael.lo...@monash.edu]
> Sent: Thursday, September 10, 2015 10:12 PM
> To: Somnath Roy
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
> 
> Ok I ran the two tests again with direct=1, smaller block size (4k) and 
> smaller
> total io (100m), disabled cache at ceph.conf side on client by adding:
> 
> [client]
> rbd cache = false
> rbd cache max dirty = 0
> rbd cache size = 0
> rbd cache target dirty = 0
> 
> 
> The result seems to have swapped around, now the librbd job is running
> ~50% faster than the krbd job!
> 
> ### krbd job:
> 
> [root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
> fio-2.2.8
> Starting 1 process
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta
> 00m:00s]
> job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
>   write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
> clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
>  lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
> clat percentiles (usec):
>  |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
>  | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
>  | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
>  | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
>  | 99.99th=[19328]
> bw (KB  /s): min=  506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
> lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
>   cpu  : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>  issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
>  latency   : target=0, window=0, percentile=100.00%, depth=16
> 
> Run status group 0 (all jobs):
>   WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s,
> mint=162033msec, maxt=162033msec
> 
> Disk stats (read/write):
>   rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745,
> util=99.11%
> [root@rcprsdc1r72-01-ac rafaell]#
> 
> ## librb job:
> 
> [root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
> fio-2.2.8
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops] [eta
> 00m:00s]
> job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
>   write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
> slat (usec): min=70, max=992, avg=115.05, stdev=30.07
> clat (msec): min=13, max=117, avg=67.91, stdev=24.93
>  lat (msec): min=13, max=117, avg=68.03, stdev=24.93
> clat percentiles (msec):
>  |  1.00th=[   19],  5.00th=[   26], 10.00th=[   38], 20.00th=[   40],
>  | 30.00th=[   46], 40.00th=[   62], 50.00th=[   77], 60.00th=[   85],
>  | 70.00th=[   88], 80.00th=[   91], 90.00th=[   95], 95.00th=[   99],
>  | 99.00th=[  105], 99.50th=[  110], 99.90th=[  116], 99.95th=[  117],
>  | 99.99th=[  118]
> bw (KB  /s): min=  565, max= 3174, per=100.00%, avg=935.74, stdev=407.67
> lat (msec) : 20=2.41%, 50=29.85%, 100=64.46%, 250=3.29%
>   cpu  : usr=2.43%, sys=0.29%, ctx=7847, majf=0, minf=2750
>   IO depths: 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%,
> >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>  complete  : 0=0.0%, 4=94.1%, 8=0.0%, 16=5.9%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>  issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
>  latency   : target=0, window=0, percentile=100.00%, depth=16
> 
> Run status group 0 (all jobs):
>   WRITE: io=102400KB, aggrb=927KB/s, minb=927KB/s, maxb=927KB/s,
> mint=110360msec, maxt=110360msec
> 
> Disk stats (read/write):
> dm-1: ios=240/369, merge=0/0, ticks=742/40, in_queue=782, 

Re: [ceph-users] higher read iop/s for single thread

2015-09-11 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 10 September 2015 16:20
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] higher read iop/s for single thread
> 
> I'm not sure you will be able to get there with firefly.  I've gotten
close to 1ms
> after lots of tuning on hammer, but 0.5ms is probably not likely to happen
> without all of the new work that Sandisk/Fujitsu/Intel/Others have been
> doing to improve the data path.

Hi Mark, is that for 1 or 2+ copies? Fast SSDs, I assume?

What's the best you can get with HDDs + SSD journals?

Just out of interest, I tried switching a small test cluster to use jemalloc
last night; it's only 4 HDD OSDs with SSD journals, but I didn't see any
improvement over tcmalloc at 4kb IO, which I guess is expected at this end of
the performance spectrum. However, what I did notice is that at 64kb IO size
jemalloc was around 10% slower than tcmalloc. I can do a full sweep of IO sizes
to double-check this if it would be handy. It might need to be considered if
jemalloc will be the default going forwards.
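
For reference, one way to test this without rebuilding is to preload jemalloc
when starting an OSD in the foreground, roughly like this (the library path is
distro-specific; this one is from Ubuntu's libjemalloc1 package):

  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd -i 2 -f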


> 
> Your best bet is probably going to be a combination of:
> 
> 1) switch to jemalloc (and make sure you have enough RAM to deal with it)
> 2) disabled ceph auth
> 3) disable all logging
> 4) throw a high clock speed CPU at the OSDs and keep the number of OSDs
> per server lowish (will need to be tested to see where the sweet spot is).
> 5) potentially implement some kind of scheme to make sure OSD threads
> stay pinned to specific cores.
> 6) lots of investigation to make sure the kernel/tcp stack/vm/etc isn't
getting
> in the way.
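
For reference, items 2 and 3 would look roughly like this in ceph.conf (sketch
only; double-check the option names against your release):

  [global]
      auth_cluster_required = none
      auth_service_required = none
      auth_client_required = none
      debug_osd = 0/0
      debug_ms = 0/0
      debug_filestore = 0/0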
> 
> Mark
> 
> On 09/10/2015 08:34 AM, Stefan Priebe - Profihost AG wrote:
> > Hi,
> >
> > while we're happy running ceph firefly in production and also reach
> > enough 4k read iop/s for multithreaded apps (around 23 000) with qemu
> 2.2.1.
> >
> > We've now a customer having a single threaded application needing
> > around
> > 2000 iop/s but we don't go above 600 iop/s in this case.
> >
> > Any tuning hints for this case?
> >
> > Stefan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD with iSCSI

2015-09-11 Thread Daleep Bais
Hi Jake, Hello Paul,

I was able to mount the iSCSI target on another initiator. However, after
installing tgt and tgt-rbd, rbd stopped working. I'm getting this error
message:

*root@ceph-node1:~# rbd ls test1*
*rbd: symbol lookup error: rbd: undefined symbol: _ZTIN8librados9WatchCtx*

I am using this node as target for iscsi initiator ( Ref :
http://tracker.ceph.com/issues/12563) and using other node in cluster to
create pools and images.

root@ceph-node1:~# tgtadm --version
1.0.51

Paul, I will also check the option you have suggested.  Appreciate the
suggestion!

Thanks.

Daleep Singh Bais

On Thu, Sep 10, 2015 at 7:57 PM, Jake Young  wrote:

>
>
> On Wed, Sep 9, 2015 at 8:13 AM, Daleep Bais  wrote:
>
>> Hi,
>>
>> I am following steps from URL 
>> *http://www.sebastien-han.fr/blog/2014/07/07/start-with-the-rbd-support-for-tgt/
>> *
>>   to create a RBD pool  and share to another initiator.
>>
>> I am not able to get rbd in the backstore list. Please suggest.
>>
>> below is the output of tgtadm command:
>>
>> tgtadm --lld iscsi --op show --mode system
>> System:
>> State: ready
>> debug: off
>> LLDs:
>> iscsi: ready
>> iser: error
>> Backing stores:
>> sheepdog
>> bsg
>> sg
>> null
>> ssc
>> smc (bsoflags sync:direct)
>> mmc (bsoflags sync:direct)
>> rdwr (bsoflags sync:direct)
>> Device types:
>> disk
>> cd/dvd
>> osd
>> controller
>> changer
>> tape
>> passthrough
>> iSNS:
>> iSNS=Off
>> iSNSServerIP=
>> iSNSServerPort=3205
>> iSNSAccessControl=Off
>>
>>
>> I have installed tgt and tgt-rbd packages till now. Working on Debian
>> GNU/Linux 8.1 (jessie)
>>
>> Thanks.
>>
>> Daleep Singh Bais
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> Hey Daleep,
>
> The tgt you have installed does not support Ceph rbd.  See the output from
> my system using a more recent tgt that supports rbd.
>
> tgtadm --lld iscsi --mode system --op show
> System:
> State: ready
> debug: off
> LLDs:
> iscsi: ready
> iser: error
> Backing stores:
> *rbd (bsoflags sync:direct)*
> sheepdog
> bsg
> sg
> null
> ssc
> rdwr (bsoflags sync:direct)
> Device types:
> disk
> cd/dvd
> osd
> controller
> changer
> tape
> passthrough
> iSNS:
> iSNS=Off
> iSNSServerIP=
> iSNSServerPort=3205
> iSNSAccessControl=Off
>
>
> You will need a new version of tgt.  I think the earliest version that
> supports rbd is 1.0.42
>
> https://github.com/fujita/tgt
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-11 Thread Jan Schermer
Can you try

echo 1 > /proc/sys/net/ipv4/tcp_low_latency

And see if it improves things? I remember there being an option to disable 
Nagle completely, but it's gone apparently.
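
If it does help and you want to make it persistent across reboots, something
like this should do (standard sysctl handling, nothing Ceph-specific):

  echo 'net.ipv4.tcp_low_latency = 1' >> /etc/sysctl.conf
  sysctl -p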

Jan

> On 11 Sep 2015, at 10:43, Nick Fisk  wrote:
> 
> 
> 
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Somnath Roy
>> Sent: 11 September 2015 06:23
>> To: Rafael Lopez 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
>> 
>> That’s probably because the krbd version you are using doesn’t have the
>> TCP_NODELAY patch. We have submitted it (and you can build it from latest
>> rbd source) , but, I am not sure when it will be in linux mainline.
> 
> From memory it landed in 3.19, but there are also several issues with max IO 
> size, max nr_requests and readahead. I would suggest for testing, try one of 
> these:-
> 
> http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/ra-bring-back/
> 
> 
>> 
>> Thanks & Regards
>> Somnath
>> 
>> From: Rafael Lopez [mailto:rafael.lo...@monash.edu]
>> Sent: Thursday, September 10, 2015 10:12 PM
>> To: Somnath Roy
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
>> 
>> Ok I ran the two tests again with direct=1, smaller block size (4k) and 
>> smaller
>> total io (100m), disabled cache at ceph.conf side on client by adding:
>> 
>> [client]
>> rbd cache = false
>> rbd cache max dirty = 0
>> rbd cache size = 0
>> rbd cache target dirty = 0
>> 
>> 
>> The result seems to have swapped around, now the librbd job is running
>> ~50% faster than the krbd job!
>> 
>> ### krbd job:
>> 
>> [root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
>> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
>> fio-2.2.8
>> Starting 1 process
>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta
>> 00m:00s]
>> job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
>>  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
>>clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
>> lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
>>clat percentiles (usec):
>> |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
>> | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
>> | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
>> | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
>> | 99.99th=[19328]
>>bw (KB  /s): min=  506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
>>lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
>>  cpu  : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
>>  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> =64=0.0%
>> submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>> complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>> issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0,
>> drop=r=0/w=0/d=0
>> latency   : target=0, window=0, percentile=100.00%, depth=16
>> 
>> Run status group 0 (all jobs):
>>  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s,
>> mint=162033msec, maxt=162033msec
>> 
>> Disk stats (read/write):
>>  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745,
>> util=99.11%
>> [root@rcprsdc1r72-01-ac rafaell]#
>> 
>> ## librb job:
>> 
>> [root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
>> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
>> fio-2.2.8
>> Starting 1 process
>> rbd engine: RBD version: 0.1.9
>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops] [eta
>> 00m:00s]
>> job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
>>  write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
>>slat (usec): min=70, max=992, avg=115.05, stdev=30.07
>>clat (msec): min=13, max=117, avg=67.91, stdev=24.93
>> lat (msec): min=13, max=117, avg=68.03, stdev=24.93
>>clat percentiles (msec):
>> |  1.00th=[   19],  5.00th=[   26], 10.00th=[   38], 20.00th=[   40],
>> | 30.00th=[   46], 40.00th=[   62], 50.00th=[   77], 60.00th=[   85],
>> | 70.00th=[   88], 80.00th=[   91], 90.00th=[   95], 95.00th=[   99],
>> | 99.00th=[  105], 99.50th=[  110], 99.90th=[  116], 99.95th=[  117],
>> | 99.99th=[  118]
>>bw (KB  /s): min=  565, max= 3174, per=100.00%, avg=935.74, stdev=407.67
>>lat (msec) : 20=2.41%, 50=29.85%, 100=64.46%, 250=3.29%
>>  cpu  : usr=2.43%, sys=0.29%, ctx=7847, majf=0, minf=2750
>>  IO depths: 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%,
>>> =64=0.0%
>> submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>> complete  : 0=0.0%, 4=94.1%, 8=0.0%, 16=5.9%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>> issued: total=r=0/w=25600/d=0, 

Re: [ceph-users] RadosGW not working after upgrade to Hammer

2015-09-11 Thread James Page
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Hi Arnoud

On 26/05/15 16:53, Arnoud de Jonge wrote:
> Hi,
[...]
> 
> 2015-05-26 17:43:37.352569 7f0fce0ff840  0 ceph version 0.94.1
> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process radosgw, pid
> 4259 2015-05-26 17:43:37.435921 7f0f8a4f2700  0 ERROR: can't get
> key: ret=-2 2015-05-26 17:43:37.435932 7f0f8a4f2700  0 ERROR:
> sync_all_users() returned ret=-2 2015-05-26 17:43:37.436179
> 7f0fce0ff840  0 framework: fastcgi 2015-05-26 17:43:37.436191
> 7f0fce0ff840  0 framework: civetweb 2015-05-26 17:43:37.436198
> 7f0fce0ff840  0 framework conf key: port, val: 7480 2015-05-26
> 17:43:37.436208 7f0fce0ff840  0 starting handler: civetweb 
> 2015-05-26 17:43:37.453013 7f0fce0ff840  0 starting handler:
> fastcgi 2015-05-26 17:43:41.347403 7f0d487a0700  1 == starting
> new request req=0x7f0d7c011590 = 2015-05-26 17:43:41.498851
> 7f0d487a0700  0 validated token: openstack:admin expires:
> 1432658621 2015-05-26 17:43:41.509877 7f0d487a0700  0 ERROR: could
> not get stats for buckets 2015-05-26 17:43:41.509943 7f0d487a0700
> 1 == req done req=0x7f0d7c011590 http_status=400 == 
> 2015-05-26 17:43:44.140730 7f0d897fa700  1 == starting new
> request req=0x7f0d7c0162e0 = 2015-05-26 17:43:44.285039
> 7f0d897fa700  0 validated token: openstack:admin expires:
> 1432658624 2015-05-26 17:43:44.290175 7f0d897fa700  0 ERROR: could
> not get stats for buckets 2015-05-26 17:43:44.290222 7f0d897fa700
> 1 == req done req=0x7f0d7c0162e0 http_status=400 == 
> 2015-05-26 17:43:46.972310 7f0d77fff700  1 == starting new
> request req=0x7f0d7c016270 = 2015-05-26 17:43:47.121784
> 7f0d77fff700  0 validated token: openstack:admin expires:
> 1432658626 2015-05-26 17:43:47.125191 7f0d77fff700  0 ERROR: could
> not get stats for buckets 2015-05-26 17:43:47.125241 7f0d77fff700
> 1 == req done req=0x7f0d7c016270 http_status=400 ==

I just debugged what I think is the same problem in one of our
clusters; the Gateway was working fine for a bit, but then started
throwing this error.

Turned out the back-end mon and osd daemons had not been restarted, so they
were still running firefly code - if you're running on Ubuntu, the
packaging won't restart daemons automatically; it has to be done
manually post-upgrade (so you can coordinate it across your clusters).

Cheers

James

- -- 
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org

-BEGIN PGP SIGNATURE-
Version: GnuPG v2

iQIcBAEBCAAGBQJV8qWSAAoJEL/srsug59jDHasP+weu5D7P5DR4zbRxdvGXw8DR
RWkI2oo4FSB5QYeVBR5NbCOBKQivUoDKs8wcldD7rRLBwNuxxkacrpoBviKHJkZF
XYya8ZGufIX4RTseF9F3qGoJnA5rxCrTPojcB8KGRlXMFutswnv5sgS11J3OlSZn
QG4bB/oW9Cmdw3slxc/Poe3UdxQxgTfyXIoj9eiUWNcCZtbElcggF+EFiN47EneG
mA8kRIZY7ofhl3Lr1AGvXffK/4tjgpeWmPnCPIZlzuigmxbTWvwBjDk2H3gt0C2f
l3vUi1fT2NRslR4v+4MJuI9JtxhKNCaX3QjU5vNVJkRV11nd2RYw+a+YQNBPqEbQ
TstCLDBg2rAnPHpPduqsb+tujx8+p7SHNcgMHPJpxkeZ96XPts0Qxhr8M3i++MlM
blZvRYyRf3KDsWRG7kq65msFGPW9H4eWP4gmCP02Uy3VJ1LtawpypAX0Elq+SC8a
wWRV4GyVEv1tpNyP7gSqBIGhHRthODXQFii6TYJNvN2YpRWiO1XTMUvooNaH/kg5
/P42QUifQ7K2XjVaB9yFQLH4TkQkYjiHY3ro/DxARpVtb41M3yJinCSJaHlgWArc
pDlJm4mz6lRVIo6evTJsqOLoe0cew/+m+atlfuucTM3+SVONSCay9H56wB/Z3/VG
ZTlm19gjHjp6Rpwi401i
=Nfh4
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-11 Thread Mariusz Gronczewski
On Wed, 09 Sep 2015 08:59:53 -0500, Chad William Seys
 wrote:

> 
> > Going from 2GB to 8GB is not normal, although some slight bloating is
> > expected. 
> 
> If I recall correctly, Mariusz's cluster had a period of flapping OSDs?

The NIC got packet loss under traffic, which caused heartbeats to
periodically fail, which caused more traffic, which caused more
failures. So basically the worst possible scenario, where most of the PGs
needed to be recovered. Let's just say that after getting all the nodes up
there was not a single PG in the active+clean state.


> I experienced a  a similar situation using hammer. My OSDs went from 10GB in 
> RAM in a Healthy state to 24GB RAM + 10GB swap in a recovering state.  I also 
> could not re-add a node b/c every time I tried OOM killer would kill an OSD 
> daemon somewhere before the cluster could become healthy again.
> 
> Therefore I propose we begin expecting bloating under these circumstances.  
> :) 
> 
> > In your case it just got much worse than usual for reasons yet
> > unknown.
> 
> Not really unknown: B/c 'ceph tell osd.* heap release' freed RAM for Mariusz, 
> I think we know the reason for so much RAM use is b/c of tcmalloc not freeing 
> unused memory.   Right?

note that I've only did it after most of pg were recovered


> Here is a related "urgent" and "won't fix" bug to which applies 
> http://tracker.ceph.com/issues/12681 .  Sage suggests making the heap release 
> command a cron job .   :)
> 
> Have fun!
> Chad.
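
For reference, the cron approach is about as trivial as it sounds; a sketch
(schedule is just an example, adjust to taste):

  # /etc/crontab entry, run hourly:
  0 * * * *  root  ceph tell osd.* heap release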



-- 
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com



pgpoXc7HkqCyk.pgp
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] maximum object size

2015-09-11 Thread Ilya Dryomov
On Wed, Sep 9, 2015 at 11:22 AM, HEWLETT, Paul (Paul)
 wrote:
> By setting a parameter osd_max_write_size to 2047Š
> This normally defaults to 90
>
> Setting to 2048 exposes a bug in Ceph where signed overflow occurs...
>
> Part of the problem is my expectations. Ilya pointed out that one can use
> libradosstriper to stripe a large object over many OSD¹s. I expected this
> to happen automatically for any object > osd_max_write_size (=90MB) but it
> does not. Instead one has to set special attributes to trigger striping.
>
> Additionally interaction with erasure coding is unclear - apparently the
> error is reached when the total file size exceeds the limit - if EC is
> enabled then maybe a better solution would be to test the size of the
> chunk written to the OSD which will be only part of the total file size.
> Or do I have that wrong?

That limit is from before EC times and it's a request size limit, as in
"reject any write requests bigger than that with EMSGSIZE".  The
primary OSD sees the entire write, so the fact that the request may be
split into EC chunks later down the road is irrelevant.

>
> If EC is being used then would the individual chunks after splitting the
> file then be erasure coded ? I.e if we decide to split a large file into 5
> striped chunks does ceph then EC the individual chunks?

What libradosstriper is doing is slicing your huge object into smaller
objects.  Such an object is then treated as any other "normal" object
would be - if there is EC involved, it'll get get split into chunks,
additional erasure chunks will be computed, etc.  EC is handled in
librados, and libradosstriper is just some code that sits on top of
librados.  So I think the answer is yes.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-11 Thread Shinobu Kinjo
If you really want to improve the performance of a *distributed* filesystem
like Ceph, Lustre, or GPFS,
you have to consider the networking side of the Linux kernel.

 L5: Socket
 L4: TCP
 L3: IP
 L2: Queuing

In this discussion, the problem could be in L2, which is queueing at the
descriptor level. We may have to take a closer look at the qdisc and whether
the qlen is large enough or not.
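
A few standard commands to inspect that layer (the interface name is just an
example):

  tc -s qdisc show dev eth0
  ip -s -s link show eth0
  ip link show eth0 | grep qlen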

But this case:

> 399 16 32445 32429 325.054 84 0.0233839 0.193655
 to
> 400 16 32445 32429 324.241 0 - 0.193655

is probably a different story -;

> needless to say, very strange. 

Yes, it is quite strange like my English...

Shinobu

- Original Message -
From: "Vickey Singh" 
To: "Jan Schermer" 
Cc: ceph-users@lists.ceph.com
Sent: Thursday, September 10, 2015 2:22:22 AM
Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops   
are blocked

Hello Jan 

On Wed, Sep 9, 2015 at 11:59 AM, Jan Schermer < j...@schermer.cz > wrote: 


Just to recapitulate - the nodes are doing "nothing" when it drops to zero? Not 
flushing something to drives (iostat)? Not cleaning pagecache (kswapd and 
similiar)? Not out of any type of memory (slab, min_free_kbytes)? Not network 
link errors, no bad checksums (those are hard to spot, though)? 

Unless you find something I suggest you try disabling offloads on the NICs and 
see if the problem goes away. 

Could you please elaborate on this point: how do you disable offloads on the NIC? 
What does it mean, how do I do it, and how is it going to help? 

Sorry i don't know about this. 

- Vickey - 




Jan 

> On 08 Sep 2015, at 18:26, Lincoln Bryant < linco...@uchicago.edu > wrote: 
> 
> For whatever it’s worth, my problem has returned and is very similar to 
> yours. Still trying to figure out what’s going on over here. 
> 
> Performance is nice for a few seconds, then goes to 0. This is a similar 
> setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) 
> 
> 384 16 29520 29504 307.287 1188 0.0492006 0.208259 
> 385 16 29813 29797 309.532 1172 0.0469708 0.206731 
> 386 16 30105 30089 311.756 1168 0.0375764 0.205189 
> 387 16 30401 30385 314.009 1184 0.036142 0.203791 
> 388 16 30695 30679 316.231 1176 0.0372316 0.202355 
> 389 16 30987 30971 318.42 1168 0.0660476 0.200962 
> 390 16 31282 31266 320.628 1180 0.0358611 0.199548 
> 391 16 31568 31552 322.734 1144 0.0405166 0.198132 
> 392 16 31857 31841 324.859 1156 0.0360826 0.196679 
> 393 16 32090 32074 326.404 932 0.0416869 0.19549 
> 394 16 32205 32189 326.743 460 0.0251877 0.194896 
> 395 16 32302 32286 326.897 388 0.0280574 0.194395 
> 396 16 32348 32332 326.537 184 0.0256821 0.194157 
> 397 16 32385 32369 326.087 148 0.0254342 0.193965 
> 398 16 32424 32408 325.659 156 0.0263006 0.193763 
> 399 16 32445 32429 325.054 84 0.0233839 0.193655 
> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 
> 0.193655 
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
> 400 16 32445 32429 324.241 0 - 0.193655 
> 401 16 32445 32429 323.433 0 - 0.193655 
> 402 16 32445 32429 322.628 0 - 0.193655 
> 403 16 32445 32429 321.828 0 - 0.193655 
> 404 16 32445 32429 321.031 0 - 0.193655 
> 405 16 32445 32429 320.238 0 - 0.193655 
> 406 16 32445 32429 319.45 0 - 0.193655 
> 407 16 32445 32429 318.665 0 - 0.193655 
> 
> needless to say, very strange. 
> 
> —Lincoln 
> 
> 
>> On Sep 7, 2015, at 3:35 PM, Vickey Singh < vickey.singh22...@gmail.com > 
>> wrote: 
>> 
>> Adding ceph-users. 
>> 
>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh < vickey.singh22...@gmail.com 
>> > wrote: 
>> 
>> 
>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke < ulem...@polarzone.de > wrote: 
>> Hi Vickey, 
>> Thanks for your time in replying to my problem. 
>> 
>> I had the same rados bench output after changing the motherboard of the 
>> monitor node with the lowest IP... 
>> Due to the new mainboard, I assume the hw-clock was wrong during startup. 
>> Ceph health show no errors, but all VMs aren't able to do IO (very high load 
>> on the VMs - but no traffic). 
>> I stopped the mon, but this don't changed anything. I had to restart all 
>> other mons to get IO again. After that I started the first mon also (with 
>> the right time now) and all worked fine again... 
>> 
>> Thanks i will try to restart all OSD / MONS and report back , if it solves 
>> my problem 
>> 
>> Another posibility: 
>> Do you use journal on SSDs? Perhaps the SSDs can't write to garbage 
>> collection? 
>> 
>> No i don't have journals on SSD , they are on the same OSD disk. 
>> 
>> 
>> 
>> Udo 
>> 
>> 
>> On 07.09.2015 16:36, Vickey Singh wrote: 
>>> Dear Experts 
>>> 
>>> Can someone please help me , why my cluster is not able write data. 
>>> 
>>> See the below output cur MB/S is 0 and Avg MB/s is decreasing. 
>>> 
>>> 
>>> Ceph Hammer 0.94.2 
>>> CentOS 6 (3.10.69-1) 
>>> 
>>> The Ceph status says OPS are blocked , i have tried checking , what all i 
>>> know 
>>> 
>>> - System resources ( CPU , net, disk , memory ) -- All normal 
>>> - 10G 

Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant

2015-09-11 Thread Chad William Seys
> note that I've only did it after most of pg were recovered

My guess / hope is that heap free would also help during the recovery process.  
Recovery causing failures does not seem like the best outcome.  :)

C.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse auto down

2015-09-11 Thread Shinobu Kinjo
There should be some complaints in /var/log/messages.
Can you attach them?

Shinobu

- Original Message -
From: "谷枫" 
To: "ceph-users" 
Sent: Saturday, September 12, 2015 1:30:49 PM
Subject: [ceph-users] ceph-fuse auto down

Hi, all 
My cephfs cluster is deployed on three nodes with Ceph Hammer 0.94.3 on Ubuntu 14.04; 
the kernel version is 3.19.0. 

I mount cephfs with ceph-fuse on 9 clients, but some of them (the ceph-fuse 
process) go down by themselves sometimes and I can't find the reason; there seem to 
be no other logs to be found except /var/log/ceph/ceph-client.admin.log, 
which has no useful messages for me. 

When ceph-fuse goes down, the mount is gone. 
How can I find the cause of this problem? Can someone give me some good ideas? 

Regards 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse auto down

2015-09-11 Thread Shinobu Kinjo
Ah, you are using ubuntu, sorry for that.
How about:

  /var/log/dmesg

I believe you can attach the file rather than pasting it.
Pasting a bunch of logs would not be good for me -;

And when did you notice that cephfs was hung?

Shinobu

- Original Message -
From: "谷枫" 
To: "Shinobu Kinjo" 
Cc: "ceph-users" 
Sent: Saturday, September 12, 2015 1:50:05 PM
Subject: Re: [ceph-users] ceph-fuse auto down

hi, Shinobu
There is no /var/log/messages on my system, but I looked at /var/log/syslog
and no useful messages were found.
I discovered /var/crash/_usr_bin_ceph-fuse.0.crash by grepping for "fuse"
on the system.
Below is the content:
ProcStatus:
 Name:  ceph-fuse
 State: D (disk sleep)
 Tgid:  2903
 Ngid:  0
 Pid:   2903
 PPid:  1
 TracerPid: 0
 Uid:   0   0   0   0
 Gid:   0   0   0   0
 FDSize:64
 Groups:0
 VmPeak: 7428552 kB
 VmSize: 6838728 kB
 VmLck:0 kB
 VmPin:0 kB
 VmHWM:  1175864 kB
 VmRSS:   343116 kB
 VmData: 6786232 kB
 VmStk:  136 kB
 VmExe: 5628 kB
 VmLib: 7456 kB
 VmPTE: 3404 kB
 VmSwap:   0 kB
 Threads:   37
 SigQ:  1/64103
 SigPnd:
 ShdPnd:
 SigBlk:1000
 SigIgn:1000
 SigCgt:0001c18040eb
 CapInh:
 CapPrm:003f
 CapEff:003f
 CapBnd:003f
 Seccomp:   0
 Cpus_allowed:  
 Cpus_allowed_list: 0-15
 Mems_allowed:  ,0001
 Mems_allowed_list: 0
 voluntary_ctxt_switches:   25
 nonvoluntary_ctxt_switches:2
Signal: 11
Uname: Linux 3.19.0-28-generic x86_64
UserGroups:
CoreDump: base64

Is this useful information?


2015-09-12 12:33 GMT+08:00 Shinobu Kinjo :

> There should be some complaints in /var/log/messages.
> Can you attach it?
>
> Shinobu
>
> - Original Message -
> From: "谷枫" 
> To: "ceph-users" 
> Sent: Saturday, September 12, 2015 1:30:49 PM
> Subject: [ceph-users] ceph-fuse auto down
>
> Hi all,
> My CephFS cluster is deployed on three nodes with Ceph Hammer 0.94.3 on Ubuntu
> 14.04; the kernel version is 3.19.0.
>
> I mount CephFS with ceph-fuse on 9 clients, but some of the ceph-fuse
> processes go down by themselves sometimes and I can't find the reason. There
> seem to be no logs other than
> /var/log/ceph/ceph-client.admin.log, and that file has no useful messages for me.
>
> When ceph-fuse goes down, the mount point is gone.
> How can I find the cause of this problem? Can anyone give me good
> ideas?
>
> Regards
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and caching

2015-09-11 Thread Ilya Dryomov
On Wed, Sep 9, 2015 at 5:34 PM, Gregory Farnum  wrote:
> On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson  wrote:
>> We are using Hammer - latest released version. How do I check if it's
>> getting promoted into the cache?
>
> Umm...that's a good question. You can run rados ls on the cache pool,
> but that's not exactly scalable; you can turn up logging and dig into
> them to see if redirects are happening, or watch the OSD operations
> happening via the admin socket. But I don't know if there's a good
> interface for users to just query the cache state of a single object.
> :/
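
A few rough ways to poke at this (the pool, object and OSD names below are
placeholders, not taken from this thread):

  # spot-check whether an object currently sits in the cache pool (not scalable)
  rados -p cache-pool ls | grep rbd_data.abc123
  # watch in-flight and recent operations on one OSD via the admin socket
  ceph daemon osd.0 dump_ops_in_flight
  ceph daemon osd.0 dump_historic_ops
  # temporarily raise OSD logging to look for promotions/redirects, then lower it
  ceph tell osd.* injectargs '--debug-osd 10'
  ceph tell osd.* injectargs '--debug-osd 0/5'
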
>
>>
>> We're using the latest ceph kernel client. Where do I poke at readahead
>> settings there?
>
> Just the standard kernel readahead settings; I'm not actually familiar
> with how to configure those but I don't believe Ceph's are in any way
> special. What do you mean by "latest ceph kernel client"; are you
> running one of the developer testing kernels or something? I think
> Ilya might have mentioned some issues with readahead being
> artificially blocked, but that might have only been with RBD.

That's a system wide issue - it affects md folks and generally
everybody who can benefit from a large readahead window.  Notes from
the perf etherpad:

- kernel page cache readahead is capped at 2M since 3.15, that patch
was backported so older kernels (e.g. 3.14.21+) are affected as well
- Red Hat and Oracle seem to be shipping a patch that effectively
reverts this, but someone needs to check that
- a patch that gets rid of this limit is in the works, Linus seems to
be happy with it, expect it in 4.3 or 4.4?
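
As a concrete illustration (device names and values are examples only): block
device readahead can be checked and raised with the standard tools, and recent
kernels expose a similar per-mount knob for the CephFS client through its bdi
entry:

  # readahead for a block device, in 512-byte sectors
  blockdev --getra /dev/rbd0
  blockdev --setra 8192 /dev/rbd0
  # the same setting in KB via sysfs
  echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
  # CephFS kernel client: look for a ceph-* entry under /sys/class/bdi/
  # (entry naming varies by kernel and may not exist everywhere)
  cat /sys/class/bdi/ceph-*/read_ahead_kb

Note that on the affected kernels mentioned above, the effective window is
still capped regardless of these settings.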

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-fuse auto down

2015-09-11 Thread 谷枫
Hi all,
My CephFS cluster is deployed on three nodes with Ceph Hammer 0.94.3 on Ubuntu
14.04; the kernel version is 3.19.0.

I mount CephFS with ceph-fuse on 9 clients, but some of the ceph-fuse
processes go down by themselves sometimes and I can't find the reason. There
seem to be no logs other than
/var/log/ceph/ceph-client.admin.log, and that file has no useful messages for me.

When ceph-fuse goes down, the mount point is gone.
How can I find the cause of this problem? Can anyone give me good
ideas?

Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on cephfs recovery tools

2015-09-11 Thread Shinobu Kinjo
> In your procedure, the umount problems have nothing to do with
> corruption.  It's (sometimes) hanging because the MDS is offline.  If

How did you notice that the MDS was offline?
Is it just because the Ceph client could not unmount the filesystem, or something else?

I would like to see the MDS and OSD logs, because there should be some complaints
from the daemons.

> the client has dirty metadata, it may not be able to flush it until
> the MDS is online -- there's no general way to "abort" this without
> breaking userspace semantics.  Similar case:
> http://tracker.ceph.com/issues/9477
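
For what it's worth, a quick way to confirm the MDS state and to get rid of a
mount that hangs because the MDS is away (the mount point is an example):

  # is an MDS up and active?
  ceph mds stat
  ceph -s
  # lazy unmount of a hung CephFS mount
  umount -l /mnt/cephfs        # kernel client
  fusermount -uz /mnt/cephfs   # ceph-fuse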
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse auto down

2015-09-11 Thread 谷枫
Hi Shinobu,
There is no /var/log/messages on my system, but I looked at /var/log/syslog
and found no useful messages.
I discovered /var/crash/_usr_bin_ceph-fuse.0.crash by grepping for "fuse"
on the system.
Below is the message in it:
ProcStatus:
 Name:  ceph-fuse
 State: D (disk sleep)
 Tgid:  2903
 Ngid:  0
 Pid:   2903
 PPid:  1
 TracerPid: 0
 Uid:   0   0   0   0
 Gid:   0   0   0   0
 FDSize:64
 Groups:0
 VmPeak: 7428552 kB
 VmSize: 6838728 kB
 VmLck:0 kB
 VmPin:0 kB
 VmHWM:  1175864 kB
 VmRSS:   343116 kB
 VmData: 6786232 kB
 VmStk:  136 kB
 VmExe: 5628 kB
 VmLib: 7456 kB
 VmPTE: 3404 kB
 VmSwap:   0 kB
 Threads:   37
 SigQ:  1/64103
 SigPnd:
 ShdPnd:
 SigBlk:1000
 SigIgn:1000
 SigCgt:0001c18040eb
 CapInh:
 CapPrm:003f
 CapEff:003f
 CapBnd:003f
 Seccomp:   0
 Cpus_allowed:  
 Cpus_allowed_list: 0-15
 Mems_allowed:  ,0001
 Mems_allowed_list: 0
 voluntary_ctxt_switches:   25
 nonvoluntary_ctxt_switches:2
Signal: 11
Uname: Linux 3.19.0-28-generic x86_64
UserGroups:
CoreDump: base64

Is this useful information?


2015-09-12 12:33 GMT+08:00 Shinobu Kinjo :

> There should be some complaints in /var/log/messages.
> Can you attach it?
>
> Shinobu
>
> - Original Message -
> From: "谷枫" 
> To: "ceph-users" 
> Sent: Saturday, September 12, 2015 1:30:49 PM
> Subject: [ceph-users] ceph-fuse auto down
>
> Hi all,
> My CephFS cluster is deployed on three nodes with Ceph Hammer 0.94.3 on Ubuntu
> 14.04; the kernel version is 3.19.0.
>
> I mount CephFS with ceph-fuse on 9 clients, but some of the ceph-fuse
> processes go down by themselves sometimes and I can't find the reason. There
> seem to be no logs other than
> /var/log/ceph/ceph-client.admin.log, and that file has no useful messages for me.
>
> When ceph-fuse goes down, the mount point is gone.
> How can I find the cause of this problem? Can anyone give me good
> ideas?
>
> Regards
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] higher read iop/s for single thread

2015-09-11 Thread Gregory Farnum
On Fri, Sep 11, 2015 at 9:52 AM, Nick Fisk  wrote:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Mark Nelson
>> Sent: 10 September 2015 16:20
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] higher read iop/s for single thread
>>
>> I'm not sure you will be able to get there with firefly.  I've gotten
> close to 1ms
>> after lots of tuning on hammer, but 0.5ms is probably not likely to happen
>> without all of the new work that Sandisk/Fujitsu/Intel/Others have been
>> doing to improve the data path.
>
> Hi Mark, is that for 1 or 2+ copies? Fast SSD's I assume?
>
> What's the best you can get with HDD's + SSD Journals?
>
> Just out of interest I tried switching a small test cluster to use jemalloc
> last night; it's only 4 HDD OSDs with SSD journals. But I didn't see any
> improvement over tcmalloc at 4kb IO, but I guess this is expected at this
> end of the performance spectrum. However what I did notice is that at 64kb
> IO size jemalloc was around 10% slower than tcmalloc. I can do a full sweep
> of IO sizes to double check this if it would be handy? Might need to be
> considered if jemalloc will be default going forwards.

Mark, have you run any tests like this on more standard hardware? I
haven't heard anything like this but if jemalloc is also *slower* on
more standard systems then that'll definitely put the kibosh on
switching to it.
-Greg "fighting the good fight" ;)

>
>
>>
>> Your best bet is probably going to be a combination of:
>>
>> 1) switch to jemalloc (and make sure you have enough RAM to deal with it)
>> 2) disabled ceph auth
>> 3) disable all logging
>> 4) throw a high clock speed CPU at the OSDs and keep the number of OSDs
>> per server lowish (will need to be tested to see where the sweet spot is).
>> 5) potentially implement some kind of scheme to make sure OSD threads
>> stay pinned to specific cores.
>> 6) lots of investigation to make sure the kernel/tcp stack/vm/etc isn't
> getting
>> in the way.
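
A rough sketch of points 2, 3 and 5 above as ceph.conf and shell fragments
(option names are the standard ones, values are examples only, and disabling
cephx is a real security trade-off):

  [global]
      auth_cluster_required = none
      auth_service_required = none
      auth_client_required = none
      debug_osd = 0/0
      debug_ms = 0/0
      debug_filestore = 0/0
      debug_journal = 0/0

  # pin all threads of a running OSD to a few cores (pid and core list are examples)
  taskset -a -c -p 2-5 12345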
>>
>> Mark
>>
>> On 09/10/2015 08:34 AM, Stefan Priebe - Profihost AG wrote:
>> > Hi,
>> >
>> > while we're happy running ceph firefly in production and also reach
>> > enough 4k read iop/s for multithreaded apps (around 23 000) with qemu
>> 2.2.1.
>> >
>> > We now have a customer with a single-threaded application needing
>> > around
>> > 2000 iop/s, but we don't get above 600 iop/s in this case.
>> >
>> > Any tuning hints for this case?
>> >
>> > Stefan
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 9 PGs stay incomplete

2015-09-11 Thread Gregory Farnum
On Thu, Sep 10, 2015 at 9:46 PM, Wido den Hollander  wrote:
> Hi,
>
> I'm running into an issue with Ceph 0.94.2/3 where after doing a recovery
> test 9 PGs stay incomplete:
>
> osdmap e78770: 2294 osds: 2294 up, 2294 in
> pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>755 TB used, 14468 TB / 15224 TB avail
>   51831 active+clean
>   9 incomplete
>
> As you can see, all 2294 OSDs are online and about all PGs became
> active+clean again, except for 9.
>
> I found out that these PGs are the problem:
>
> 10.3762
> 7.309e
> 7.29a2
> 10.2289
> 7.17dd
> 10.165a
> 7.1050
> 7.c65
> 10.abf
>
> Digging further, all the PGs map back to an OSD which is running on the
> same host. 'ceph-stg-01' in this case.
>
> $ ceph pg 10.3762 query
>
> Looking at the recovery state, this is shown:
>
> {
> "first": 65286,
> "last": 67355,
> "maybe_went_rw": 0,
> "up": [
> 1420,
> 854,
> 1105
> ],
> "acting": [
> 1420
> ],
> "primary": 1420,
> "up_primary": 1420
> },
>
> osd.1420 is online. I tried restarting it, but nothing happens, these 9
> PGs stay incomplete.
>
> Under 'peer_info' info I see both osd.854 and osd.1105 reporting about
> the PG with identical numbers.
>
> I restarted both 854 and 1105, without result.
>
> The output of PG query can be found here: http://pastebin.com/qQL699zC

Hmm. The pg query results from each peer aren't quite the same but
look largely consistent to me. I think somebody from the RADOS team
will need to check it out. I do see that the log tail on the primary
hasn't advanced as far as the other peers have, but I'm not sure if
that's the OSD being responsible or evidence of the root cause...
-Greg

>
> The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
> 3.13 kernel. XFS is being used as the backing filesystem.
>
> Any suggestions to fix this issue? There is no valuable data in these
> pools, so I can remove them, but I'd rather fix the root-cause.
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hi all Very new to ceph

2015-09-11 Thread John Spray
On Fri, Sep 11, 2015 at 11:57 AM, M.Tarkeshwar Rao
 wrote:
> Hi all,
>
> We have a product which is written in C++ on Red Hat.
>
> In production our customers use our product with Veritas Cluster File
> System for HA and EMC as shared storage.
>
> Initially this product ran on only a single node. In our last release we
> made it scalable (more than one node).
>
> Due to excessive locking (CFS) we are not getting the performance we need.
> Can you please advise whether Ceph will resolve our problem, as it is a
> distributed filesystem.

You're asking whether an unknown application will run faster on CephFS
compared to another (proprietary) filesystem?  It's impossible to say
- you will have to benchmark it yourself.

More generally: in the past I've often found that "the filesystem is slow"
complaints really mean "we're trying to use the filesystem for
something other than storing data".  Look at the places your
application is thrashing the filesystem, and consider whether you
could be using something more appropriate for synchronisation, like a
message bus.

Cheers,
John




> Can we use it in production? Please suggest.
>
>
> If not, please suggest any other file system.
>
> Regards
> Tarkeshwar
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hi all Very new to ceph

2015-09-11 Thread Nick Fisk
Hi Tarkeshwar,

 

CephFS is not currently considered ready for production use, mainly because there 
is no fsck tool. There are people using it, so YMMV.

 

However, if this app is written in-house, is there any chance you could change 
it to write objects directly into the RADOS layer? The RADOS layer is very 
stable and ready for production use.
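
As a low-risk way to get a feel for the RADOS layer before committing to a
librados port, the rados CLI can exercise plain object I/O (pool and object
names below are examples):

  ceph osd pool create testpool 128
  rados -p testpool put myobject /tmp/somefile
  rados -p testpool ls
  rados -p testpool get myobject /tmp/out
  rados -p testpool rm myobject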

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
M.Tarkeshwar Rao
Sent: 11 September 2015 11:58
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Hi all Very new to ceph

 

Hi all,

 

We have a product which is written in C++ on Red Hat.

In production our customers use our product with Veritas Cluster File System 
for HA and EMC as shared storage.

Initially this product ran on only a single node. In our last release we made 
it scalable (more than one node).

Due to excessive locking (CFS) we are not getting the performance we need. Can 
you please advise whether Ceph will resolve our problem, as it is a distributed 
filesystem.

Can we use it in production? Please suggest.

If not, please suggest any other file system.

 

Regards

Tarkeshwar




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster NO read / write performance :: Ops are blocked

2015-09-11 Thread Shinobu Kinjo
Dropwatch.stp would help us see who dropped packets and 
where the packets were dropped.

To investigate the networking side further, 
I always check:

  /sys/class/net/<interface>/statistics/*

The tc command is also quite useful.

Have we already checked whether there is any bo (blocks written out) 
using vmstat?

Using vmstat and tcpdump together with tc would give you more 
information, gathered concurrently, to solve the problem.
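
For example (the interface name is a placeholder):

  # per-interface drop/error counters
  cat /sys/class/net/eth0/statistics/rx_dropped
  cat /sys/class/net/eth0/statistics/tx_dropped
  ip -s link show eth0
  # qdisc statistics: look at drops/overlimits and the configured qlen
  tc -s qdisc show dev eth0
  # blocks written out (bo) and context switches, one sample per second
  vmstat 1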

Shinobu

- Original Message -
From: "Shinobu Kinjo" 
To: "Vickey Singh" 
Cc: ceph-users@lists.ceph.com
Sent: Friday, September 11, 2015 10:32:27 PM
Subject: Re: [ceph-users] Ceph cluster NO read / write performance ::   Ops 
are blocked

If you really want to improve the performance of a *distributed* filesystem
like Ceph, Lustre, or GPFS,
you must also consider the networking layers of the Linux kernel:

 L5: Socket
 L4: TCP
 L3: IP
 L2: Queuing

In this discussion, the problem could be in L2, which is queuing at the descriptor level.
We may have to take a closer look at the qdisc, and whether qlen is large enough.

But this case:

> 399 16 32445 32429 325.054 84 0.0233839 0.193655
 to
> 400 16 32445 32429 324.241 0 - 0.193655

is probably a different story -;

> needless to say, very strange. 

Yes, it is quite strange like my English...

Shinobu

- Original Message -
From: "Vickey Singh" 
To: "Jan Schermer" 
Cc: ceph-users@lists.ceph.com
Sent: Thursday, September 10, 2015 2:22:22 AM
Subject: Re: [ceph-users] Ceph cluster NO read / write performance :: Ops   
are blocked

Hello Jan 

On Wed, Sep 9, 2015 at 11:59 AM, Jan Schermer < j...@schermer.cz > wrote: 


Just to recapitulate - the nodes are doing "nothing" when it drops to zero? Not 
flushing something to drives (iostat)? Not cleaning pagecache (kswapd and 
similar)? Not out of any type of memory (slab, min_free_kbytes)? Not network 
link errors, no bad checksums (those are hard to spot, though)? 

Unless you find something I suggest you try disabling offloads on the NICs and 
see if the problem goes away. 

Could you please elaborate on this point: how do you disable offloads on the NIC? 
What does it mean, how do I do it, and how is it going to help? 

Sorry, I don't know about this. 
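
For reference, offloads can usually be inspected and toggled with ethtool; a
minimal sketch assuming the interface is eth0 (the change does not persist
across reboots):

  # show the current offload settings
  ethtool -k eth0
  # disable the common offloads to rule them out
  ethtool -K eth0 tso off gso off gro off lro off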

- Vickey - 




Jan 

> On 08 Sep 2015, at 18:26, Lincoln Bryant < linco...@uchicago.edu > wrote: 
> 
> For whatever it’s worth, my problem has returned and is very similar to 
> yours. Still trying to figure out what’s going on over here. 
> 
> Performance is nice for a few seconds, then goes to 0. This is a similar 
> setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) 
> 
> 384 16 29520 29504 307.287 1188 0.0492006 0.208259 
> 385 16 29813 29797 309.532 1172 0.0469708 0.206731 
> 386 16 30105 30089 311.756 1168 0.0375764 0.205189 
> 387 16 30401 30385 314.009 1184 0.036142 0.203791 
> 388 16 30695 30679 316.231 1176 0.0372316 0.202355 
> 389 16 30987 30971 318.42 1168 0.0660476 0.200962 
> 390 16 31282 31266 320.628 1180 0.0358611 0.199548 
> 391 16 31568 31552 322.734 1144 0.0405166 0.198132 
> 392 16 31857 31841 324.859 1156 0.0360826 0.196679 
> 393 16 32090 32074 326.404 932 0.0416869 0.19549 
> 394 16 32205 32189 326.743 460 0.0251877 0.194896 
> 395 16 32302 32286 326.897 388 0.0280574 0.194395 
> 396 16 32348 32332 326.537 184 0.0256821 0.194157 
> 397 16 32385 32369 326.087 148 0.0254342 0.193965 
> 398 16 32424 32408 325.659 156 0.0263006 0.193763 
> 399 16 32445 32429 325.054 84 0.0233839 0.193655 
> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 
> 0.193655 
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
> 400 16 32445 32429 324.241 0 - 0.193655 
> 401 16 32445 32429 323.433 0 - 0.193655 
> 402 16 32445 32429 322.628 0 - 0.193655 
> 403 16 32445 32429 321.828 0 - 0.193655 
> 404 16 32445 32429 321.031 0 - 0.193655 
> 405 16 32445 32429 320.238 0 - 0.193655 
> 406 16 32445 32429 319.45 0 - 0.193655 
> 407 16 32445 32429 318.665 0 - 0.193655 
> 
> needless to say, very strange. 
> 
> —Lincoln 
> 
> 
>> On Sep 7, 2015, at 3:35 PM, Vickey Singh < vickey.singh22...@gmail.com > 
>> wrote: 
>> 
>> Adding ceph-users. 
>> 
>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh < vickey.singh22...@gmail.com 
>> > wrote: 
>> 
>> 
>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke < ulem...@polarzone.de > wrote: 
>> Hi Vickey, 
>> Thanks for your time in replying to my problem. 
>> 
>> I had the same rados bench output after changing the motherboard of the 
>> monitor node with the lowest IP... 
>> Due to the new mainboard, I assume the hw-clock was wrong during startup. 
>> Ceph health showed no errors, but the VMs weren't able to do IO (very high load 
>> on the VMs - but no traffic). 
>> I stopped the mon, but this didn't change anything. I had to restart all 
>> other mons to get IO again. After that I started the first mon also (with 
>> the right time now) and all worked fine again... 
>> 
>> Thanks, I will try to restart all OSDs / MONs and report back if it solves 

Re: [ceph-users] 9 PGs stay incomplete

2015-09-11 Thread Wido den Hollander


On 11-09-15 12:22, Gregory Farnum wrote:
> On Thu, Sep 10, 2015 at 9:46 PM, Wido den Hollander  wrote:
>> Hi,
>>
>> I'm running into an issue with Ceph 0.94.2/3 where after doing a recovery
>> test 9 PGs stay incomplete:
>>
>> osdmap e78770: 2294 osds: 2294 up, 2294 in
>> pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>>755 TB used, 14468 TB / 15224 TB avail
>>   51831 active+clean
>>   9 incomplete
>>
>> As you can see, all 2294 OSDs are online and about all PGs became
>> active+clean again, except for 9.
>>
>> I found out that these PGs are the problem:
>>
>> 10.3762
>> 7.309e
>> 7.29a2
>> 10.2289
>> 7.17dd
>> 10.165a
>> 7.1050
>> 7.c65
>> 10.abf
>>
>> Digging further, all the PGs map back to an OSD which is running on the
>> same host. 'ceph-stg-01' in this case.
>>
>> $ ceph pg 10.3762 query
>>
>> Looking at the recovery state, this is shown:
>>
>> {
>> "first": 65286,
>> "last": 67355,
>> "maybe_went_rw": 0,
>> "up": [
>> 1420,
>> 854,
>> 1105
>> ],
>> "acting": [
>> 1420
>> ],
>> "primary": 1420,
>> "up_primary": 1420
>> },
>>
>> osd.1420 is online. I tried restarting it, but nothing happens, these 9
>> PGs stay incomplete.
>>
>> Under 'peer_info' info I see both osd.854 and osd.1105 reporting about
>> the PG with identical numbers.
>>
>> I restarted both 854 and 1105, without result.
>>
>> The output of PG query can be found here: http://pastebin.com/qQL699zC
> 
> Hmm. The pg query results from each peer aren't quite the same but
> look largely consistent to me. I think somebody from the RADOS team
> will need to check it out. I do see that the log tail on the primary
> hasn't advanced as far as the other peers have, but I'm not sure if
> that's the OSD being responsible or evidence of the root cause...
> -Greg
> 

That's what I noticed as well. I ran osd.1420 with debug osd/filestore =
20 and the output is here:
http://ceph.o.auroraobjects.eu/tmp/txc1-osd.1420.log.gz

I can't tell what is going on, I don't see any 'errors', but that's
probably me not being able to diagnose the logs properly.
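
For anyone trying to reproduce this kind of triage, the commands involved are
roughly the following (the pg and osd ids are the ones from this thread):

  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg 10.3762 query
  # raise logging on one OSD at runtime, then lower it again afterwards
  ceph tell osd.1420 injectargs '--debug-osd 20 --debug-filestore 20'
  ceph tell osd.1420 injectargs '--debug-osd 0/5 --debug-filestore 0/5'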

>>
>> The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
>> 3.13 kernel. XFS is being used as the backing filesystem.
>>
>> Any suggestions to fix this issue? There is no valuable data in these
>> pools, so I can remove them, but I'd rather fix the root-cause.
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com