Re: [ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Christian Eichelmann
Hi Christian, Hi Robert,

thank you for your replies!
I was already expecting something like this, but I am seriously worried
about it!

Just assume that this happens at night. Our on-call shift does not
necessarily have enough knowledge to perform all the steps in Sebastien's
article. And if we always have to do that when a scrub error appears, we
are putting several hours per week into fixing such problems.

It is also very misleading that a command called "ceph pg repair" might
do quite the opposite and overwrite the "good" data in your cluster with
the corrupt copy. I don't know much about the internals of Ceph, but if the
cluster can already recognize that the checksums differ, why can't
it just build a quorum from the existing replicas where possible?
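
For illustration, the kind of cross-check I mean could be hand-rolled like
this (a sketch only; hostnames, the PG id and the object name are made up,
and it assumes a filestore layout):

  # For an inconsistent PG (say 0.6), checksum the object on every replica
  # and see which copy disagrees with the majority.
  for host in osd-host1 osd-host2 osd-host3 osd-host4; do
      echo "== $host =="
      ssh "$host" 'md5sum /var/lib/ceph/osd/ceph-*/current/0.6_head/*myobject*'
  done
  # With 4 replicas, three matching checksums make the odd one out obvious.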

And again the question:
Are these placement groups (scrub error, inconsistent) blocking
read/write requests? Because if so, we have a serious problem here...

Regards,
Christian

Am 12.05.2015 um 08:20 schrieb Christian Balzer:
> 
> Hello,
> 
> I can only nod emphatically to what Robert said, don't issue repairs
> unless you 
> a) don't care about the data or 
> b) have verified that your primary OSD is good.
> 
> See this for some details on how to establish which replica(s) are actually
> good or not:
> http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> 
> Of course if you somehow wind up with more subtle data corruption and are
> faced with 3 slightly differing data sets, you may have to resort to
> rolling a die after all.
> 
> A word from the devs about the state of checksums and automatic repairs we
> can trust would be appreciated.
> 
> Christian
> 
> On Mon, 11 May 2015 10:19:08 -0600 Robert LeBlanc wrote:
> 
>> Personally I would not just run this command automatically because as you
>> stated, it only copies the primary PGs to the replicas and if the primary
>> is corrupt, you will corrupt your secondaries. I think the monitor log
>> shows which OSD has the problem so if it is not your primary, then just
>> issue the repair command.
>>
>> There was talk, and I believe work towards, Ceph storing a hash of the
>> object so that it can be smarter about which replica has the correct data
>> and automatically replicate the good data no matter where it is. I think
>> the first part, creating the hash and storing it, has been included in
>> Hammer. I'm not an authority on this so take it with a grain of salt.
>>
>> Right now our procedure is to find the PG files on the OSDs, perform an
>> MD5 on all of them, and overwrite the one that doesn't match, either by
>> issuing the PG repair command, or removing the bad PG files, rsyncing
>> them with the -X argument and then instructing a deep-scrub on the PG to
>> clear it up in Ceph.
>>
>> I've only tested this on an idle cluster, so I don't know how well it
>> will work on an active cluster. Since we issue a deep-scrub, if the PGs
>> of the replicas change during the rsync, it should come up with an
>> error. The idea is to keep rsyncing until the deep-scrub is clean. Be
>> warned that you may be aiming your gun at your foot with this!
>>
>> 
>> Robert LeBlanc
>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>> On Mon, May 11, 2015 at 2:09 AM, Christian Eichelmann <
>> christian.eichelm...@1und1.de> wrote:
>>
>>> Hi all!
>>>
>>> We are experiencing approximately 1 scrub error / inconsistent pg every
>>> two days. As far as I know, to fix this you can issue a "ceph pg
>>> repair", which works fine for us. I have a few qestions regarding the
>>> behavior of the ceph cluster in such a case:
>>>
>>> 1. After ceph detects the scrub error, the pg is marked as
>>> inconsistent. Does that mean that any IO to this pg is blocked until
>>> it is repaired?
>>>
>>> 2. Is this amount of scrub errors normal? We currently have only 150TB
>>> in our cluster, distributed over 720 2TB disks.
>>>
>>> 3. As far as I know, a "ceph pg repair" just copies the content of the
>>> primary pg to all replicas. Is this still the case? What if the primary
>>> copy is the one having errors? We have a 4x replication level, and it
>>> would be cool if ceph would recover from a copy whose checksum
>>> matches the majority of the replicas.
>>>
>>> 4. Some of these errors happen at night. Since ceph reports this
>>> as a critical error, our shift is called and woken up, just to issue a
>>> single command. Do you see any problem with triggering this command
>>> automatically via a monitoring event? Is there a reason why ceph isn't
>>> resolving these errors itself when it has enough replicas to do so?
>>>
>>> Regards,
>>> Christian
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
> 
> 


-- 
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721

Re: [ceph-users] EC backend benchmark

2015-05-11 Thread Christian Balzer

Hello,

Could you do another EC run with differing block sizes, as described
here:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
and look for write amplification?
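
Something along these lines would do (a sketch; "ecpool" is a placeholder,
and it assumes comparing pool usage before/after each run is a good enough
proxy for amplification):

  for bs in 4096 65536 1048576 4194304; do
      echo "=== block size $bs ==="
      ceph df | grep ecpool                   # pool usage before
      rados bench -p ecpool 60 write -b "$bs" -t 16 --no-cleanup
      ceph df | grep ecpool                   # usage after; compare the delta
                                              # with what rados bench reports
  done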

I'd suspect that, by the very nature of EC and the additional local checksums
it (potentially) writes, write amplification will be worse than with replication.

Which is something very much to consider with SSDs.

Christian

On Mon, 11 May 2015 21:23:40 + Somnath Roy wrote:

> Hi Loic and community,
> 
> I have gathered the following data on EC backend (all flash). I have
> decided to use Jerasure since space saving is the utmost priority.
> 
> Setup:
> 
> 41 OSDs (each on 8 TB flash), 5 node Ceph cluster. 48 core HT enabled
> cpu/64 GB RAM. Tested with Rados Bench clients.
> 
> Result:
> -
> 
> It is attached in the doc.
> 
> Summary :
> -
> 
> 1. It is doing pretty well in Reads, and 4 Rados Bench clients are
> saturating the 40 GbE network. With more physical servers, it is scaling
> almost linearly and saturating 40 GbE on both hosts.
> 
> 2. As suspected with Ceph, the problem is again with writes. Throughput-wise
> it is beating replicated pools by a significant margin. But it is not
> scaling with multiple clients and not saturating anything.
> 
> So, my question is the following.
> 
> 1. Probably, nothing to do with EC backend, we are suffering because of
> filestore inefficiencies. Do you think any tunable like EC stripe size
> (or anything else) will help here ?
> 
> 2. I couldn't make the fault domain 'host', because of a HW limitation. Do
> you think that will play a role in performance for bigger k values ?
> 
> 3. Even though it is not saturating 40 GbE for writes, do you think
> separating out public/private networks will help in terms of performance ?
> 
> Any feedback on this is much appreciated.
> 
> Thanks & Regards
> Somnath
> 
> 
> 
> 
> 
> PLEASE NOTE: The information contained in this electronic mail message
> is intended only for the use of the designated recipient(s) named above.
> If the reader of this message is not the intended recipient, you are
> hereby notified that you have received this message in error and that
> any review, dissemination, distribution, or copying of this message is
> strictly prohibited. If you have received this communication in error,
> please notify the sender by telephone or e-mail (as shown above)
> immediately and destroy any and all copies of this message in your
> possession (whether hard copies or electronically stored copies).
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Christian Balzer

Hello,

I can only nod emphatically to what Robert said, don't issue repairs
unless you 
a) don't care about the data or 
b) have verified that your primary OSD is good.

See this for some details on how to establish which replica(s) are actually
good or not:
http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
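
Condensed, the procedure from that article comes down to something like the
sketch below (made-up PG/OSD/object names and filestore paths; verify each
step against the article before running anything):

  # 1. Find the inconsistent PG and the object the OSD complained about.
  ceph health detail | grep inconsistent
  grep ERR /var/log/ceph/ceph-osd.14.log

  # 2. Checksum that object's file on every replica (run on each OSD host).
  md5sum /var/lib/ceph/osd/ceph-*/current/2.5cb_head/*myobject*

  # 3. Stop the OSD holding the bad copy, move the bad file out of the way,
  #    restart the OSD, and only then let repair copy the good data back.
  stop ceph-osd id=14        # or your init system's equivalent
  mv /var/lib/ceph/osd/ceph-14/current/2.5cb_head/myobject* /root/backup/
  start ceph-osd id=14
  ceph pg repair 2.5cb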

Of course if you somehow wind up with more subtle data corruption and are
faced with 3 slightly differing data sets, you may have to resort to
rolling a die after all.

A word from the devs about the state of checksums and automatic repairs we
can trust would be appreciated.

Christian

On Mon, 11 May 2015 10:19:08 -0600 Robert LeBlanc wrote:

> Personally I would not just run this command automatically because as you
> stated, it only copies the primary PGs to the replicas and if the primary
> is corrupt, you will corrupt your secondaries. I think the monitor log
> shows which OSD has the problem so if it is not your primary, then just
> issue the repair command.
> 
> There was talk, and I believe work towards, Ceph storing a hash of the
> object so that it can be smarter about which replica has the correct data
> and automatically replicate the good data no matter where it is. I think
> the first part, creating the hash and storing it, has been included in
> Hammer. I'm not an authority on this so take it with a grain of salt.
> 
> Right now our procedure is to find the PG files on the OSDs, perform an
> MD5 on all of them, and overwrite the one that doesn't match, either by
> issuing the PG repair command, or removing the bad PG files, rsyncing
> them with the -X argument and then instructing a deep-scrub on the PG to
> clear it up in Ceph.
> 
> I've only tested this on an idle cluster, so I don't know how well it
> will work on an active cluster. Since we issue a deep-scrub, if the PGs
> of the replicas change during the rsync, it should come up with an
> error. The idea is to keep rsyncing until the deep-scrub is clean. Be
> warned that you may be aiming your gun at your foot with this!
> 
> 
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> On Mon, May 11, 2015 at 2:09 AM, Christian Eichelmann <
> christian.eichelm...@1und1.de> wrote:
> 
> > Hi all!
> >
> > We are experiencing approximately 1 scrub error / inconsistent pg every
> > two days. As far as I know, to fix this you can issue a "ceph pg
> > repair", which works fine for us. I have a few qestions regarding the
> > behavior of the ceph cluster in such a case:
> >
> > 1. After ceph detects the scrub error, the pg is marked as
> > inconsistent. Does that mean that any IO to this pg is blocked until
> > it is repaired?
> >
> > 2. Is this amount of scrub errors normal? We currently have only 150TB
> > in our cluster, distributed over 720 2TB disks.
> >
> > 3. As far as I know, a "ceph pg repair" just copies the content of the
> > primary pg to all replicas. Is this still the case? What if the primary
> > copy is the one having errors? We have a 4x replication level, and it
> > would be cool if ceph would recover from a copy whose checksum
> > matches the majority of the replicas.
> >
> > 4. Some of these errors happen at night. Since ceph reports this
> > as a critical error, our shift is called and woken up, just to issue a
> > single command. Do you see any problem with triggering this command
> > automatically via a monitoring event? Is there a reason why ceph isn't
> > resolving these errors itself when it has enough replicas to do so?
> >
> > Regards,
> > Christian
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 6 requests are blocked

2015-05-11 Thread Irek Fasikhov
Patrick,
At the moment, you do not have any problems related to slow requests.
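
For reference, this is the usual way to check (osd.14 below is only an
example id):

  ceph health detail                    # names the OSDs with blocked requests
  # on the host of a blamed OSD, see what the stuck ops are waiting for:
  ceph daemon osd.14 dump_ops_in_flight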

2015-05-12 8:56 GMT+03:00 Patrik Plank :

>  So OK, understood.
>
> But what can I do if the scrubbing process has been hanging on one PG since
> last night:
>
>
> root@ceph01:~# ceph health detail
> HEALTH_OK
>
> root@ceph01:~# ceph pg dump | grep scrub
> pg_stat: 2.5cb   objects: 101   mip: 0   degr: 0   misp: 0   unf: 0
> bytes: 423620608   log: 324   disklog: 324
> state: active+clean+scrubbing+deep   state_stamp: 2015-05-11 23:01:37.056747
> v: 4749'324   reported: 4749:6524   up: [14,10]   up_primary: 14
> acting: [14,10]   acting_primary: 14
> last_scrub: 4749'318   scrub_stamp: 2015-05-10 22:05:29.252876
> last_deep_scrub: 3423'309   deep_scrub_stamp: 2015-05-04 21:44:46.609791
>
> Perhaps an idea?
>
>
> best regards
>
>
>  -Original message-
> *From:* Irek Fasikhov 
> *Sent:* Tuesday 12th May 2015 7:49
> *To:* Patrik Plank ; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] HEALTH_WARN 6 requests are blocked
>
> Scrubbing greatly affects I/O and can cause slow requests on OSDs. For more
> information, look at 'ceph health detail' and 'ceph pg dump | grep
> scrub'
>
> 2015-05-12 8:42 GMT+03:00 Patrik Plank :
>
>>  Hi,
>>
>>
>> is that the reason for the Health Warn or the scrubbing notification?
>>
>>
>>
>> thanks
>>
>> regards
>>
>>
>>  -Original message-
>> *From:* Irek Fasikhov 
>> *Sent:* Tuesday 12th May 2015 7:33
>> *To:* Patrik Plank 
>> *Cc:* ceph-users@lists.ceph.com
>> *Subject:* Re: [ceph-users] HEALTH_WARN 6 requests are blocked
>>
>> Hi, Patrik.
>>
>> You must configure the I/O priority for scrubbing.
>>
>> http://dachary.org/?p=3268
>>
>>
>>
>> 2015-05-12 8:03 GMT+03:00 Patrik Plank :
>>
>>>  Hi,
>>>
>>>
>>> the ceph cluster always shows the scrubbing notification, although it is
>>> not scrubbing.
>>>
>>> And what does the "Health Warn" mean?
>>>
>>> Does anybody have an idea why the warning is displayed?
>>>
>>> How can I solve this?
>>>
>>>
>>>  cluster 78227661-3a1b-4e56-addc-c2a272933ac2
>>>  health HEALTH_WARN 6 requests are blocked > 32 sec
>>>  monmap e3: 3 mons at {ceph01=
>>> 10.0.0.20:6789/0,ceph02=10.0.0.21:6789/0,ceph03=10.0.0.22:6789/0},
>>> election epoch 92, quorum 0,1,2 ceph01,ceph02,ceph03
>>>  osdmap e4749: 30 osds: 30 up, 30 in
>>>   pgmap v2321129: 4608 pgs, 2 pools, 1712 GB data, 440 kobjects
>>> 3425 GB used, 6708 GB / 10134 GB avail
>>>1 active+clean+scrubbing+deep
>>> 4607 active+clean
>>>   client io 3282 kB/s rd, 10742 kB/s wr, 182 op/s
>>>
>>>
>>> thanks
>>>
>>> best regards
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>>
>> --
>>  С уважением, Фасихов Ирек Нургаязович
>> Моб.: +79229045757
>>
>>
>
>
> --
>  С уважением, Фасихов Ирек Нургаязович
> Моб.: +79229045757
>
>


-- 
С уважением, Фасихов Ирек Нургаязович
Моб.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 6 requests are blocked

2015-05-11 Thread Patrik Plank
So OK, understood.

But what can I do if the scrubbing process has been hanging on one PG since last night:



root@ceph01:~# ceph health detail
HEALTH_OK


root@ceph01:~# ceph pg dump | grep scrub
pg_stat: 2.5cb   objects: 101   mip: 0   degr: 0   misp: 0   unf: 0
bytes: 423620608   log: 324   disklog: 324
state: active+clean+scrubbing+deep   state_stamp: 2015-05-11 23:01:37.056747
v: 4749'324   reported: 4749:6524   up: [14,10]   up_primary: 14
acting: [14,10]   acting_primary: 14
last_scrub: 4749'318   scrub_stamp: 2015-05-10 22:05:29.252876
last_deep_scrub: 3423'309   deep_scrub_stamp: 2015-05-04 21:44:46.609791


Perhaps an idea?
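
What I am tempted to try, unless someone warns me off (a sketch; osd.14 is
the primary of 2.5cb according to the dump above, and the restart syntax
depends on the init system):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  restart ceph-osd id=14                  # bounce the primary of pg 2.5cb
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub
  ceph pg 2.5cb query | grep -i scrub     # check the scrub state afterwards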



best regards



-Original message-
From: Irek Fasikhov 
Sent: Tuesday 12th May 2015 7:49
To: Patrik Plank ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 6 requests are blocked

Scrubbing greatly affects I/O and can cause slow requests on OSDs. For more
information, look at 'ceph health detail' and 'ceph pg dump | grep scrub'

2015-05-12 8:42 GMT+03:00 Patrik Plank <pat...@plank.me>:
Hi,



is that the reason for the Health Warn or the scrubbing notification?





thanks 

regards



-Original message-
From: Irek Fasikhov <malm...@gmail.com>
Sent: Tuesday 12th May 2015 7:33
To: Patrik Plank <pat...@plank.me>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] HEALTH_WARN 6 requests are blocked

Hi, Patrik.

You must configure the I/O priority for scrubbing.

http://dachary.org/?p=3268



2015-05-12 8:03 GMT+03:00 Patrik Plank <pat...@plank.me>:
Hi,



the ceph cluster always shows the scrubbing notification, although it is not
scrubbing.

And what does the "Health Warn" mean?

Does anybody have an idea why the warning is displayed?

How can I solve this?



 cluster 78227661-3a1b-4e56-addc-c2a272933ac2
  health HEALTH_WARN 6 requests are blocked > 32 sec
  monmap e3: 3 mons at {ceph01=10.0.0.20:6789/0,ceph02=10.0.0.21:6789/0,ceph03=10.0.0.22:6789/0},
         election epoch 92, quorum 0,1,2 ceph01,ceph02,ceph03
  osdmap e4749: 30 osds: 30 up, 30 in
   pgmap v2321129: 4608 pgs, 2 pools, 1712 GB data, 440 kobjects
         3425 GB used, 6708 GB / 10134 GB avail
            1 active+clean+scrubbing+deep
         4607 active+clean
  client io 3282 kB/s rd, 10742 kB/s wr, 182 op/s




thanks

best regards


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
С уважением, Фасихов Ирек Нургаязович
Моб.: +79229045757



-- 
С уважением, Фасихов Ирек Нургаязович
Моб.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 6 requests are blocked

2015-05-11 Thread Irek Fasikhov
Scrubbing greatly affects I/O and can cause slow requests on OSDs. For more
information, look at 'ceph health detail' and 'ceph pg dump | grep
scrub'
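
From memory, the gist of that post is the following (verify against the
link; the options require the CFQ disk scheduler on the OSD data disks):

  # drop the disk-thread I/O priority so scrubbing yields to client I/O
  ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
  ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'

  # and persist it in ceph.conf:
  # [osd]
  # osd disk thread ioprio class = idle
  # osd disk thread ioprio priority = 7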

2015-05-12 8:42 GMT+03:00 Patrik Plank :

>  Hi,
>
>
> is that the reason for the Health Warn or the scrubbing notification?
>
>
>
> thanks
>
> regards
>
>
>  -Original message-
> *From:* Irek Fasikhov 
> *Sent:* Tuesday 12th May 2015 7:33
> *To:* Patrik Plank 
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] HEALTH_WARN 6 requests are blocked
>
> Hi, Patrik.
>
> You must configure the I/O priority for scrubbing.
>
> http://dachary.org/?p=3268
>
>
>
> 2015-05-12 8:03 GMT+03:00 Patrik Plank :
>
>>  Hi,
>>
>>
>> the ceph cluster always shows the scrubbing notification, although it is
>> not scrubbing.
>>
>> And what does the "Health Warn" mean?
>>
>> Does anybody have an idea why the warning is displayed?
>>
>> How can I solve this?
>>
>>
>>  cluster 78227661-3a1b-4e56-addc-c2a272933ac2
>>  health HEALTH_WARN 6 requests are blocked > 32 sec
>>  monmap e3: 3 mons at {ceph01=
>> 10.0.0.20:6789/0,ceph02=10.0.0.21:6789/0,ceph03=10.0.0.22:6789/0},
>> election epoch 92, quorum 0,1,2 ceph01,ceph02,ceph03
>>  osdmap e4749: 30 osds: 30 up, 30 in
>>   pgmap v2321129: 4608 pgs, 2 pools, 1712 GB data, 440 kobjects
>> 3425 GB used, 6708 GB / 10134 GB avail
>>1 active+clean+scrubbing+deep
>> 4607 active+clean
>>   client io 3282 kB/s rd, 10742 kB/s wr, 182 op/s
>>
>>
>> thanks
>>
>> best regards
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
>  С уважением, Фасихов Ирек Нургаязович
> Моб.: +79229045757
>
>


-- 
С уважением, Фасихов Ирек Нургаязович
Моб.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 6 requests are blocked

2015-05-11 Thread Irek Fasikhov
Hi, Patrik.

You must configure the I/O priority for scrubbing.

http://dachary.org/?p=3268



2015-05-12 8:03 GMT+03:00 Patrik Plank :

>  Hi,
>
>
> the ceph cluster always shows the scrubbing notification, although it is
> not scrubbing.
>
> And what does the "Health Warn" mean?
>
> Does anybody have an idea why the warning is displayed?
>
> How can I solve this?
>
>
>  cluster 78227661-3a1b-4e56-addc-c2a272933ac2
>  health HEALTH_WARN 6 requests are blocked > 32 sec
>  monmap e3: 3 mons at {ceph01=
> 10.0.0.20:6789/0,ceph02=10.0.0.21:6789/0,ceph03=10.0.0.22:6789/0},
> election epoch 92, quorum 0,1,2 ceph01,ceph02,ceph03
>  osdmap e4749: 30 osds: 30 up, 30 in
>   pgmap v2321129: 4608 pgs, 2 pools, 1712 GB data, 440 kobjects
> 3425 GB used, 6708 GB / 10134 GB avail
>1 active+clean+scrubbing+deep
> 4607 active+clean
>   client io 3282 kB/s rd, 10742 kB/s wr, 182 op/s
>
>
> thanks
>
> best regards
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
С уважением, Фасихов Ирек Нургаязович
Моб.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] HEALTH_WARN 6 requests are blocked

2015-05-11 Thread Patrik Plank
Hi,



the ceph cluster always shows the scrubbing notification, although it is not
scrubbing.

And what does the "Health Warn" mean?

Does anybody have an idea why the warning is displayed?

How can I solve this?



 cluster 78227661-3a1b-4e56-addc-c2a272933ac2
  health HEALTH_WARN 6 requests are blocked > 32 sec
  monmap e3: 3 mons at {ceph01=10.0.0.20:6789/0,ceph02=10.0.0.21:6789/0,ceph03=10.0.0.22:6789/0},
         election epoch 92, quorum 0,1,2 ceph01,ceph02,ceph03
  osdmap e4749: 30 osds: 30 up, 30 in
   pgmap v2321129: 4608 pgs, 2 pools, 1712 GB data, 440 kobjects
         3425 GB used, 6708 GB / 10134 GB avail
            1 active+clean+scrubbing+deep
         4607 active+clean
  client io 3282 kB/s rd, 10742 kB/s wr, 182 op/s




thanks

best regards

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replicas handling

2015-05-11 Thread Anthony Levesque
Greetings,

We have been testing a full SSD Ceph cluster for a few weeks now and are still
testing. One of the outcomes (we will post a full report on our tests soon, but
for now this email is only about replicas) is that as soon as you keep more
than 1 copy in the cluster, it kills performance by at least 2.5x.

I'm curious if someone can confirm my theory on how the replication is handled:

Here is a scenarios:

3 Nodes
Each Nodes has 1 Journal (SSD) and 2 OSD (SSD)
Replica count = 3

-A new object/file is written to node1.journal, which then writes onto node1.osd1
-For the second copy: node1.journal writes the file to node2.journal, and
node2.journal then writes onto node2.osd1
-For the third copy: node1.journal writes the file to node3.journal, and
node3.journal then writes onto node3.osd1

Is this how ceph would handle the replication?

P.S.  I understand that the CRUSH algorithm will probably not place them in this
order, but my question is more to confirm that, to replicate, the data needs to
be written to the second and third journals before it can be written onto those
2nd and 3rd OSDs.
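
As a side note for anyone reproducing this: you can at least see which OSDs
(and hence which journals) a given object will touch. The pool and object
names here are placeholders:

  # The first OSD in the acting set is the primary; it receives the client
  # write and fans it out to the other replicas.
  ceph osd map rbd some-test-object
  # e.g. ... up ([14,10,3], p14) acting ([14,10,3], p14)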

Many thanks

Anthony


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow Files

2015-05-11 Thread Yehuda Sadeh-Weinraub
It's the wip-rgw-orphans branch.
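
If you want to build it, something like the usual 2015-era source build
should work (an untested sketch):

  git clone https://github.com/ceph/ceph.git
  cd ceph
  git checkout wip-rgw-orphans
  ./autogen.sh && ./configure && make     # autotools build of that era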

- Original Message -
> From: "Daniel Hoffman" 
> To: "Yehuda Sadeh-Weinraub" 
> Cc: "Ben" , "David Zafman" , 
> "ceph-users" 
> Sent: Monday, May 11, 2015 4:30:11 PM
> Subject: Re: [ceph-users] Shadow Files
> 
> Thanks.
> 
> Can you please let me know the suitable/best git version/tree to be pulling
> to compile and use this feature/patch?
> 
> Thanks
> 
> On Tue, May 12, 2015 at 4:38 AM, Yehuda Sadeh-Weinraub < yeh...@redhat.com >
> wrote:
> 
> 
> From: "Daniel Hoffman" < daniel.hoff...@13andrew.com >
> To: "Yehuda Sadeh-Weinraub" < yeh...@redhat.com >
> Cc: "Ben" , "ceph-users" < ceph-us...@ceph.com >
> Sent: Sunday, May 10, 2015 5:03:22 PM
> Subject: Re: [ceph-users] Shadow Files
> 
> Any updates on when this is going to be released?
> 
> Daniel
> 
> On Wed, May 6, 2015 at 3:51 AM, Yehuda Sadeh-Weinraub < yeh...@redhat.com >
> wrote:
> 
> 
> Yes, so it seems. The librados::nobjects_begin() call expects at least a
> Hammer (0.94) backend. Probably need to add a try/catch there to catch this
> issue, and maybe see if using a different API would be more compatible
> with older backends.
> 
> Yehuda
> I cleaned up the commits a bit, but it needs to be reviewed, and it'll be
> nice to get some more testing on it before it goes into an official release.
> There's still the issue of running it against a firefly backend. I looked at
> backporting it to firefly, but it's not going to be trivial work, so I
> think a better use of time would be to get the hammer one to work against a
> firefly backend. There are some librados api quirks that we need to flush
> out first.
> 
> Yehuda
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow Files

2015-05-11 Thread Daniel Hoffman
Thanks.

Can you please let me know the suitable/best git version/tree to be pulling
to compile and use this feature/patch?

Thanks

On Tue, May 12, 2015 at 4:38 AM, Yehuda Sadeh-Weinraub 
wrote:

>
>
> --
>
> *From: *"Daniel Hoffman" 
> *To: *"Yehuda Sadeh-Weinraub" 
> *Cc: *"Ben" , "ceph-users" 
> *Sent: *Sunday, May 10, 2015 5:03:22 PM
> *Subject: *Re: [ceph-users] Shadow Files
>
> Any updates on when this is going to be released?
>
> Daniel
>
> On Wed, May 6, 2015 at 3:51 AM, Yehuda Sadeh-Weinraub 
> wrote:
>
>> Yes, so it seems. The librados::nobjects_begin() call expects at least a
>> Hammer (0.94) backend. Probably need to add a try/catch there to catch this
>> issue, and maybe see if using a different API would be more compatible
>> with older backends.
>>
>> Yehuda
>>
> I cleaned up the commits a bit, but it needs to be reviewed, and it'll be
> nice to get some more testing on it before it goes into an official release.
> There's still the issue of running it against a firefly backend. I looked
> at backporting it to firefly, but it's not going to be trivial work, so I
> think a better use of time would be to get the hammer one to work against
> a firefly backend. There are some librados api quirks that we need to flush
> out first.
>
> Yehuda
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Anthony D'Atri


Agree that 99+% of the inconsistent PGs I see correlate directly to disk flern.

Check /var/log/kern.log*, /var/log/messages*, etc., and I'll bet you find
correlating errors.
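
e.g. (adjust log paths and device names to your distro):

  grep -iE 'medium error|i/o error|sector' /var/log/kern.log* /var/log/messages*
  # and ask the drives themselves:
  for dev in /dev/sd?; do echo "$dev"; smartctl -A "$dev" | grep -iE 'realloc|pending'; done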

-- Anthony

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [cephfs][ceph-fuse] cache size or memory leak?

2015-05-11 Thread Gregory Farnum
On Fri, May 8, 2015 at 1:34 AM, Yan, Zheng  wrote:
> On Fri, May 8, 2015 at 11:15 AM, Dexter Xiong  wrote:
>> I tried "echo 3 > /proc/sys/vm/drop_caches" and dentry_pinned_count dropped.
>>
>> Thanks for your help.
>>
>
> could you please try the attached patch

I haven't followed the whole conversation; is this patch something we
need to put upstream?
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse options: writeback cache

2015-05-11 Thread Gregory Farnum
On Mon, May 11, 2015 at 1:57 AM, Kenneth Waegeman
 wrote:
> Hi all,
>
> I have a few questions about ceph-fuse options:
> - Is the fuse writeback cache being used? How can we see this? Can it be
> turned on with allow_wbcache somehow?

I'm not quite sure what you mean here. ceph-fuse does maintain an
internal writeback cache which you can control with the
"client_oc_size" and related config values. It is enabled by default.

>
> - What is the default of the big_writes option? (as seen in
> /usr/bin/ceph-fuse  --help) . Where can we see this?

This just enables the FUSE big-writes option. According to FUSE this
will "enable larger than 4kB writes" — that is, if FUSE has a bunch of
data to write out to a file, the call into the userspace code will
share it in larger sizes. It's a CPU and request optimization.

> If we run ceph fuse as this: ceph-fuse /mnt/ceph -o
> max_write=$((1024*1024*64)),big_writes
> we don't see any of this in the output of mount:
> ceph-fuse on /mnt/ceph type fuse.ceph-fuse
> (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
>
> Can we see this somewhere else?

What are you trying to see? There's a bunch of logging in whatever
file you've pointed ceph-fuse at (by default,
/var/log/ceph/ceph-client..log or similar).
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Inconsistent PGs because 0 copies of objects...

2015-05-11 Thread Aaron Ten Clay
Fellow Cephers,

I'm scratching my head on this one. Somehow a bunch of objects were lost in
my cluster, which is currently ceph version 0.87.1
(283c2e7cfa2457799f534744d7d549f83ea1335e).

The symptoms are that "ceph -s" reports a bunch of inconsistent PGs:

cluster 8a2c9e43-9f17-42e0-92fd-88a40152303d
 health HEALTH_ERR 13 pgs inconsistent; 123 scrub errors; mds0: Client
sabnzbd:storage failing to respond to cache pressure; noout flag(s) set
 monmap e9: 3 mons at {guinan=
10.42.6.48:6789/0,tuvok=10.42.6.33:6789/0,yar=10.42.6.43:6789/0}, election
epoch 1252, quorum 0,1,2 tuvok,yar,guinan
 mdsmap e698: 1/1/1 up {0=pulaski=up:active}
 osdmap e41375: 29 osds: 29 up, 29 in
flags noout
  pgmap v22573849: 1088 pgs, 3 pools, 32175 GB data, 9529 kobjects
96663 GB used, 41779 GB / 135 TB avail
1072 active+clean
   3 active+clean+scrubbing+deep
  13 active+clean+inconsistent
  client io 1004 kB/s rd, 2 op/s


I say the objects were "lost", because grepping the logs for the OSDs
holding the affected PGs, I see lines like:

2015-05-10 06:27:34.720648 7f2df27fc700  0
filestore(/var/lib/ceph/osd/ceph-11) write couldn't open
0.176_head/adb9ff76/10006ecde46./head//0: (61) No data available
2015-05-10 15:44:34.723479 7f2df2ffd700 -1
filestore(/var/lib/ceph/osd/ceph-11) error creating
9be4ff76/10006ee7848./head//0
(/var/lib/ceph/osd/ceph-11/current/0.176_head/DIR_6/DIR_7/DIR_F/DIR_F/10006ee7848.__head_9BE4FF76__0)
in index: (61) No data available

All the affected PGs are in pool 0, which is the data pool for CephFS. The
replication setting for pool 0 is 3: "ceph osd dump | head -n 9":
epoch 41375
fsid 8a2c9e43-9f17-42e0-92fd-88a40152303d
created 2014-04-06 21:16:19.449590
modified 2015-05-10 13:57:21.376468
flags noout
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 17399 flags hashpspool
crash_replay_interval 45 min_read_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 4 min_size 3 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 18915 flags hashpspool
min_read_recency_for_promote 1 stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool
min_read_recency_for_promote 1 stripe_width 0
max_osd 29


I'm a bit fuzzy on the timeline about when missing objects started
appearing. It's a tad alarming and I'd like any pointers for getting a
better understanding of the situation.

To make matters worse, I'm running CephFS and a lot of the missing objects
are stripe 0 of a file, which leaves me with no idea how to find out what
the affected file was so I can delete it and restore from backups. Pointers
here would be useful as well. (My current method for mapping an object to
CephFS file is to read the xattrs on the 0th stripe object and pick out the
strings.)
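
(For anyone else in this spot: the object name itself encodes the inode, so
a sketch like this should also work, assuming CephFS is mounted at
/mnt/cephfs:)

  # Data objects are named <inode-hex>.<stripe-index>, e.g. 10006ecde46.00000000.
  # Convert the hex inode to decimal and look it up in the mounted filesystem:
  find /mnt/cephfs -inum $(printf '%d' 0x10006ecde46)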

Thanks in advance for any suggestions/pointers!

-- 
Aaron Ten Clay
http://www.aarontc.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC backend benchmark

2015-05-11 Thread Somnath Roy
Thanks Loic..
<< inline

Regards
Somnath
-Original Message-
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Monday, May 11, 2015 3:02 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com; Ceph Development
Subject: Re: EC backend benchmark

Hi,
[Sorry I missed the body of your questions, here is my answer ;-]

On 11/05/2015 23:13, Somnath Roy wrote:
> Summary :
>
> -
>
>
>
> 1. It is doing pretty well in Reads, and 4 Rados Bench clients are saturating
> the 40 GbE network. With more physical servers, it is scaling almost linearly
> and saturating 40 GbE on both hosts.
>
>
>
> 2. As suspected with Ceph, the problem is again with writes. Throughput-wise
> it is beating replicated pools by a significant margin. But it is not scaling
> with multiple clients and not saturating anything.
>
>
>
>
>
> So, my question is the following.
>
>
>
> 1. Probably, nothing to do with EC backend, we are suffering because of 
> filestore inefficiencies. Do you think any tunable like EC stripe size (or
> anything else) will help here ?

I think Mark Nelson would be in a better position than me to answer, as he has
conducted many experiments with erasure coded pools.

[Somnath] Sure, Mark, any insight  ? :-)

> 2. I couldn't make the fault domain 'host', because of a HW limitation. Do you
> think that will play a role in performance for bigger k values ?

I don't see a reason why there would be a direct relationship between the 
failure domain and the values of k. Do you have a specific example in mind ?

[Somnath] Nope, other than more network hops. If the failure domain is OSD, more
than one chunk could be within a host. But, since I have 40GbE and am not
saturating network BW (for a bigger cluster that probability is less), IMO it
shouldn't matter. I thought of checking with you.

> 3. Even though it is not saturating 40 GbE for writes, do you think 
> separating out public/private network will help in terms of performance ?

I don't think so. What is the bottleneck ? CPU or disk I/O ?

[Somnath] For writes, no resource (cpu/network/disk) is saturated.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre





PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC backend benchmark

2015-05-11 Thread Loic Dachary
Hi,
[Sorry I missed the body of your questions, here is my answer ;-]

On 11/05/2015 23:13, Somnath Roy wrote:
> Summary :
> 
> -
> 
>  
> 
> 1. It is doing pretty well in Reads, and 4 Rados Bench clients are saturating
> the 40 GbE network. With more physical servers, it is scaling almost linearly
> and saturating 40 GbE on both hosts.
> 
>  
> 
> 2. As suspected with Ceph, the problem is again with writes. Throughput-wise
> it is beating replicated pools by a significant margin. But it is not scaling
> with multiple clients and not saturating anything.
> 
>  
> 
>  
> 
> So, my question is the following.
> 
>  
> 
> 1. Probably, nothing to do with EC backend, we are suffering because of 
> filestore inefficiencies. Do you think any tunable like EC stripe size (or
> anything else) will help here ?

I think Mark Nelson would be in a better position than me to answer, as he has
conducted many experiments with erasure coded pools.


> 2. I couldn’t make the fault domain ‘host’, because of a HW limitation. Do you
> think that will play a role in performance for bigger k values ?

I don't see a reason why there would be a direct relationship between the 
failure domain and the values of k. Do you have a specific example in mind ?

> 3. Even though it is not saturating 40 GbE for writes, do you think 
> separating out public/private network will help in terms of performance ?

I don't think so. What is the bottleneck ? CPU or disk I/O ?

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New Calamari server

2015-05-11 Thread Michael Kuriger
I had an issue with my calamari server, so I built a new one from scratch.
I've been struggling trying to get the new server to start up and see my
ceph cluster.  I went so far as to remove salt and diamond from my ceph
nodes and reinstalled again.  On my calamari server, it sees the hosts
connected but doesn't detect a cluster.  What am I missing?  I've set up
many calamari servers on different ceph clusters, but this is the first
time I've tried to build a new calamari server.

Here¹s what I see on my calamari GUI:

New Calamari Installation

This appears to be the first time you have started Calamari and there are
no clusters currently configured.

33 Ceph servers are connected to Calamari, but no Ceph cluster has been
created yet. Please use ceph-deploy to create a cluster; please see the
Inktank Ceph Enterprise documentation for more details.
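
For reference, this is what I have been checking so far (I believe the
heartbeat call is what Calamari uses for cluster discovery, but I may be
wrong about that):

  salt-key -L                             # are all minions' keys accepted?
  salt '*' test.ping                      # do the minions respond?
  salt '*' ceph.get_heartbeats            # does the ceph module report mon data?
  tail -f /var/log/calamari/cthulhu.log   # calamari's discovery/backend log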

Thanks!
Mike Kuriger




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC backend benchmark

2015-05-11 Thread Somnath Roy
Loic,
I thought this one didn't go through!
I have sent another mail with the attached doc.
This is the data with rados bench.
In case you missed it, could you please share your thoughts on the questions I
posted way below in the mail (not sure how so much whitespace came along!!)?

Thanks & Regards
Somnath

-Original Message-
From: Loic Dachary [mailto:l...@dachary.org] 
Sent: Monday, May 11, 2015 2:24 PM
To: Somnath Roy
Cc: Ceph Development; ceph-users
Subject: Re: EC backend benchmark

Hi,

Thanks for sharing :-) Have you published the tools that you used to gather 
these results ? It would be great to have a way to reproduce the same measures 
in different contexts.

Cheers

On 11/05/2015 23:13, Somnath Roy wrote:
>  
> 
> Hi Loic and community,
> 
>  
> 
> I have gathered the following data on EC backend (all flash). I have decided 
> to use Jerasure since space saving is the utmost priority.
> 
>  
> 
> Setup:
> 
> 
> 
>  
> 
> 41 OSDs (each on 8 TB flash), 5 node Ceph cluster. 48 core HT enabled cpu/64 
> GB RAM. Tested with Rados Bench clients.
> 
>  
> 
>  
> 
> (QD is per client; latency is avg/max in seconds; CPU is per-node usage)
> 
> EC plug-in | EC ratio (k,m)    | Fault domain | Op  | Clients | Hosts | Runtime (s) | QD | Latency    | Aggr. BW  | Obj size | Node CPU | BW/HT core
> Jerasure   | 9,3               | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.2    | 1786 MB/s | 4M       | 28%      | 132 MB/s
> Jerasure   | 9,3               | OSD          | PUT | 8       | 2     | 100         | 64 | 0.9/2.1    | 2174 MB/s | 4M       | 35%      | 129 MB/s
> Jerasure   | 4,1               | Host         | PUT | 4       | 1     | 100         | 64 | 0.5/2.3    | 1737 MB/s | 4M       | 14%      | 258 MB/s
> Jerasure   | 4,1               | Host         | PUT | 8       | 2     | 100         | 64 | 1.0/25 (!) | 1783 MB/s | 4M       | 14%      | 265 MB/s
> Jerasure   | 15,3              | OSD          | PUT | 4       | 1     | 100         | 64 | 0.6/1.4    | 1530 MB/s | 4M       | 40%      | 79 MB/s
> Jerasure   | 15,3              | OSD          | PUT | 8       | 2     | 100         | 64 | 1.0/4.7    | 1886 MB/s | 4M       | 45%      | 87 MB/s
> Jerasure   | 6,2               | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.2    | 1917 MB/s | 4M       | 24%      | 166 MB/s
> Jerasure   | 6,2               | OSD          | PUT | 8       | 2     | 100         | 64 | 0.8/2.2    | 2281 MB/s | 4M       | 28%      | 170 MB/s
> Jerasure   | 6,2 (RS_r6_op)    | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.2    | 1876 MB/s | 4M       | 25%      | 156 MB/s
> Jerasure   | 6,2 (RS_r6_op)    | OSD          | PUT | 8       | 2     | 100         | 64 | 0.8/1.9    | 2292 MB/s | 4M       | 31%      | 154 MB/s
> Jerasure   | 6,2 (cauchy_orig) | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.1    | 2025 MB/s | 4M       | 18%      | 234 MB/s
> Jerasure   | 6,2 (cauchy_orig) | OSD          | PUT | 8       | 2     | 100         | 64 | 0.8/1.9    | 2497 MB/s | 4M       | 21%      | 247 MB/s
> Jerasure   | 6,2 (cauchy_good) | OSD          | [rest of this row and message truncated in the archive]

Re: [ceph-users] EC backend benchmark

2015-05-11 Thread Loic Dachary
Hi,

Thanks for sharing :-) Have you published the tools that you used to gather 
these results ? It would be great to have a way to reproduce the same measures 
in different contexts.

Cheers

On 11/05/2015 23:13, Somnath Roy wrote:
>  
> 
> Hi Loic and community,
> 
>  
> 
> I have gathered the following data on EC backend (all flash). I have decided 
> to use Jerasure since space saving is the utmost priority.
> 
>  
> 
> Setup:
> 
> 
> 
>  
> 
> 41 OSDs (each on 8 TB flash), 5 node Ceph cluster. 48 core HT enabled cpu/64 
> GB RAM. Tested with Rados Bench clients.
> 
>  
> 
>  
> 
> (QD is per client; latency is avg/max in seconds; CPU is per-node usage)
> 
> EC plug-in | EC ratio (k,m)    | Fault domain | Op  | Clients | Hosts | Runtime (s) | QD | Latency    | Aggr. BW  | Obj size | Node CPU | BW/HT core
> Jerasure   | 9,3               | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.2    | 1786 MB/s | 4M       | 28%      | 132 MB/s
> Jerasure   | 9,3               | OSD          | PUT | 8       | 2     | 100         | 64 | 0.9/2.1    | 2174 MB/s | 4M       | 35%      | 129 MB/s
> Jerasure   | 4,1               | Host         | PUT | 4       | 1     | 100         | 64 | 0.5/2.3    | 1737 MB/s | 4M       | 14%      | 258 MB/s
> Jerasure   | 4,1               | Host         | PUT | 8       | 2     | 100         | 64 | 1.0/25 (!) | 1783 MB/s | 4M       | 14%      | 265 MB/s
> Jerasure   | 15,3              | OSD          | PUT | 4       | 1     | 100         | 64 | 0.6/1.4    | 1530 MB/s | 4M       | 40%      | 79 MB/s
> Jerasure   | 15,3              | OSD          | PUT | 8       | 2     | 100         | 64 | 1.0/4.7    | 1886 MB/s | 4M       | 45%      | 87 MB/s
> Jerasure   | 6,2               | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.2    | 1917 MB/s | 4M       | 24%      | 166 MB/s
> Jerasure   | 6,2               | OSD          | PUT | 8       | 2     | 100         | 64 | 0.8/2.2    | 2281 MB/s | 4M       | 28%      | 170 MB/s
> Jerasure   | 6,2 (RS_r6_op)    | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.2    | 1876 MB/s | 4M       | 25%      | 156 MB/s
> Jerasure   | 6,2 (RS_r6_op)    | OSD          | PUT | 8       | 2     | 100         | 64 | 0.8/1.9    | 2292 MB/s | 4M       | 31%      | 154 MB/s
> Jerasure   | 6,2 (cauchy_orig) | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.1    | 2025 MB/s | 4M       | 18%      | 234 MB/s
> Jerasure   | 6,2 (cauchy_orig) | OSD          | PUT | 8       | 2     | 100         | 64 | 0.8/1.9    | 2497 MB/s | 4M       | 21%      | 247 MB/s
> Jerasure   | 6,2 (cauchy_good) | OSD          | PUT | 4       | 1     | 100         | 64 | 0.5/1.3    | 1947 MB/s | 4M       | 18%      | 225 MB/s
> Jerasure   | 6,2 (cauchy_good) | OSD          | PUT | 8       | 2     | 100         | 64 | 0.9/8.5    | 2336 MB/s | 4M       | 21%      | 231 MB/s
> Jeras [further rows truncated in the archive]

[ceph-users] EC backend benchmark

2015-05-11 Thread Somnath Roy
Hi Loic and community,

I have gathered the following data on EC backend (all flash). I have decided to 
use Jerasure since space saving is the utmost priority.

Setup:

41 OSDs (each on 8 TB flash), 5 node Ceph cluster. 48 core HT enabled cpu/64 GB 
RAM. Tested with Rados Bench clients.
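
(For reproducibility: the rados bench runs behind the attached numbers would
look roughly like this; the pool name is a placeholder, and the flags match
the 100 s runtime, QD 64 and 4M object size reported below:)

  rados bench -p <ec-pool> 100 write -b 4194304 -t 64 --no-cleanup   # PUT
  rados bench -p <ec-pool> 100 seq -t 64                             # reads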

Result:
-

It is attached in the doc.

Summary :
-

1. It is doing pretty well in Reads, and 4 Rados Bench clients are saturating the
40 GbE network. With more physical servers, it is scaling almost linearly and
saturating 40 GbE on both hosts.

2. As suspected with Ceph, the problem is again with writes. Throughput-wise it
is beating replicated pools by a significant margin. But it is not scaling with
multiple clients and not saturating anything.

So, my question is the following.

1. Probably, nothing to do with EC backend, we are suffering because of 
filestore inefficiencies. Do you think any tunable like EC stripe size (or
anything else) will help here ?

2. I couldn't make the fault domain 'host', because of a HW limitation. Do you
think that will play a role in performance for bigger k values ?

3. Even though it is not saturating 40 GbE for writes, do you think separating 
out public/private networks will help in terms of performance ?

Any feedback on this is much appreciated.

Thanks & Regards
Somnath





PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).



EC_benchmark.docx
Description: EC_benchmark.docx
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is CephFS ready for production?

2015-05-11 Thread Neil Levine
We are still laying the foundations for eventual VMware integration and
indeed the Red Hat acquisition has made this more real now.

The first step is iSCSI support and work is ongoing in the kernel to get HA
iSCSI working with LIO and kRBD. See the blueprint and CDS sessions with
Mike Christie for an update. Love it or hate it, iSCSI is still the
standard protocol supported by ESX etc and this will be the initial QA
burden.

Neil


On Tue, May 5, 2015 at 9:10 AM, Michal Kozanecki 
wrote:

> This is what I found from 2014 - slide 7
>
>
> https://www.openstack.org/assets/presentation-media/inktank-demo-theater.pptx
>
> Cheers,
>
> Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Ray Sun
> Sent: April-24-15 10:44 PM
> To: Gregory Farnum
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Is CephFS ready for production?
>
> I think this is what I saw at the 2013 Hong Kong summit. At least in the Ceph
> Enterprise version.
>
>
> Best Regards
> -- Ray
>
> On Sat, Apr 25, 2015 at 12:36 AM, Gregory Farnum  wrote:
> I think the VMWare plugin was going to be contracted out by the business
> people, and it was never going to be upstream anyway -- I've not heard
> anything since then but you'd need to ask them I think.
> -Greg
>
> On Fri, Apr 24, 2015 at 7:17 AM Marc  wrote:
> On 22/04/2015 16:04, Gregory Farnum wrote:
> > On Tue, Apr 21, 2015 at 9:53 PM, Mohamed Pakkeer 
> wrote:
> >> Hi sage,
> >>
> >> When can we expect the fully functional fsck for cephfs? Can we get it
> >> at the next major release? Is there any roadmap or time frame for the
> >> fully functional fsck release?
> > We're working on it as fast as we can, and it'll be done when it's
> > done. ;) More seriously, I'm still holding out a waning hope that
> > we'll have the "forward scrub" portion ready for Infernalis and then
> > we'll see how long it takes to assemble a working repair tool from
> > that.
> >
> > On Wed, Apr 22, 2015 at 2:20 AM, Marc  wrote:
> >> Hi everyone,
> >>
> >> I am curious about the current state of the roadmap as well. Alongside
> >> the already asked question re vmware support, where are we at with cephfs'
> >> multiMDS stability and dynamic subtree partitioning?
> > Zheng has fixed a ton of bugs in these areas over the last year, but
> > both features are farther down the roadmap since we don't think we
> > need them for the earliest production users.
> > -Greg
>
> Thanks for letting us know! Due to the RedHat acquisition the ICE
> roadmap seems to have disappeared. Is a vmware driver still being worked
> on? With vmware being closed source and all, I imagine this lies mostly
> within the domain of VMware Inc., correct? Having iSCSI proxies as
> mediators is rather clunky...
>
> (And yes I am actively working on trying to get the interested parties
> to strongly look into KVM, but they have become very comfortable with
> VMware vsphere enterprise plus...)
>
>
> Thanks and have a nice weekend!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow Files

2015-05-11 Thread Yehuda Sadeh-Weinraub
- Original Message -

> From: "Daniel Hoffman" 
> To: "Yehuda Sadeh-Weinraub" 
> Cc: "Ben" , "ceph-users" 
> Sent: Sunday, May 10, 2015 5:03:22 PM
> Subject: Re: [ceph-users] Shadow Files

> Any updates on when this is going to be released?

> Daniel

> On Wed, May 6, 2015 at 3:51 AM, Yehuda Sadeh-Weinraub < yeh...@redhat.com >
> wrote:

> > Yes, so it seems. The librados::nobjects_begin() call expects at least a
> > Hammer (0.94) backend. Probably need to add a try/catch there to catch this
> > issue, and maybe see if using a different API would be more compatible
> > with older backends.
> 

> > Yehuda
> 

I cleaned up the commits a bit, but it needs to be reviewed, and it'll be nice
to get some more testing on it before it goes into an official release. There's
still the issue of running it against a firefly backend. I looked at 
backporting it to firefly, but it's not going to be trivial work, so I think
the better time usage would be to get the hammer one to work against a firefly 
backend. There are some librados api quirks that we need to flush out first. 

Yehuda 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "too many PGs per OSD" in Hammer

2015-05-11 Thread Chris Armstrong
Thanks for the help! We've lowered the number of PGs per pool to 64, so
with 12 pools and a replica count of 3, all 3 OSDs have a full 768 PGs.
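
(For anyone checking the math and the knob involved; the 300 default below
is my understanding of the warning threshold:)

  echo $(( 12 * 64 * 3 / 3 ))     # pools x PGs x replicas / OSDs = 768 per OSD
  # inspect the warning threshold on a monitor:
  ceph daemon mon.$(hostname -s) config get mon_pg_warn_max_per_osd
  # raise it if the hardware copes, rather than shrinking pools:
  ceph tell mon.* injectargs '--mon_pg_warn_max_per_osd 1000'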

If anyone has any concerns or objections (particularly folks from the
Ceph/Redhat team), please let me know.

Thanks again!

On Fri, May 8, 2015 at 1:21 PM, Somnath Roy  wrote:

>  There are 2 parameters in a pool, pg_num and pgp_num.
>
> Pg_num you can’t decrease, but, pgp_num you can. This is the total number
> of PG for placement purpose. If you reduce that, you will see rebalancing
> will start and things should settle down after it is done.
>
>
>
> But, I am not aware of any other impact of this. Generally, it is
> recommended to keep pg_num and pgp_num same.
>
>
>
> Thanks & Regards
>
> Somnath
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Daniel Hoffman
> *Sent:* Friday, May 08, 2015 4:49 AM
> *Cc:* ceph-users@lists.ceph.com
>
> *Subject:* Re: [ceph-users] "too many PGs per OSD" in Hammer
>
>
>
> Is there a way to shrink/merge PG's on a pool without removing it?
>
> I have a pool with some data in it but the PG's were miscalculated and
> just wondering the best way to resolve it.
>
>
>
> On Fri, May 8, 2015 at 4:49 PM, Somnath Roy 
> wrote:
>
> Sorry, I didn’t read through all..It seems you have 6 OSDs, so, I would
> say 128 PGs per pool is not bad !
>
> But, if you keep on adding pools, you need to lower this number, generally
> ~64 PGs per pool should achieve good parallelism with lower number of
> OSDs..If you grow your cluster , create pools with more PGs..
>
> Again, the warning number is a ballpark number, if you have more powerful
> compute and fast disk , you can safely ignore this warning.
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* Somnath Roy
> *Sent:* Thursday, May 07, 2015 11:44 PM
> *To:* 'Chris Armstrong'
> *Cc:* Stuart Longland; ceph-users@lists.ceph.com
> *Subject:* RE: [ceph-users] "too many PGs per OSD" in Hammer
>
>
>
> Nope, 16 seems way too less for performance.
>
> How many OSDs you have ? And how many pools are you planning to create ?
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* Chris Armstrong [mailto:carmstr...@engineyard.com
> ]
> *Sent:* Thursday, May 07, 2015 11:34 PM
> *To:* Somnath Roy
> *Cc:* Stuart Longland; ceph-users@lists.ceph.com
>
>
> *Subject:* Re: [ceph-users] "too many PGs per OSD" in Hammer
>
>
>
> Thanks for the details, Somnath.
>
>
>
> So it definitely sounds like 128 pgs per pool is way too many? I lowered
> ours to 16 on a new deploy and the warning is gone. I'm not sure if this
> number is sufficient, though...
>
>
>
> On Wed, May 6, 2015 at 4:10 PM, Somnath Roy 
> wrote:
>
> Just checking, are you aware of this ?
>
> http://ceph.com/pgcalc/
>
> FYI, the warning is given based on the following logic.
>
> int per = sum_pg_up / num_in;
> if (per > g_conf->mon_pg_warn_max_per_osd) {
>   // raise warning...
> }
>
> This does not consider any resources; it depends solely on the number of
> in OSDs and the total number of PGs in the cluster. The default
> mon_pg_warn_max_per_osd = 300, so it seems each OSD in your cluster is serving > 300
> PGs.
> It would be good to assign PGs to your pools keeping the above
> calculation in mind, i.e. no more than 300 PGs per OSD.
> But if you feel your OSDs are on fast disks and the box has plenty of compute power,
> you may want to try a higher number of PGs per OSD. In this case, raise
> mon_pg_warn_max_per_osd to something bigger and the warning should go away.
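>
> As a sketch (the mon id "a" is hypothetical), the threshold can be raised at
> runtime and persisted in ceph.conf:
>
> ceph tell mon.a injectargs '--mon_pg_warn_max_per_osd 400'
>
> [mon]
> mon pg warn max per osd = 400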
>
> Hope this helps,
>
> Thanks & Regards
> Somnath
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Stuart Longland
> Sent: Wednesday, May 06, 2015 3:48 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] "too many PGs per OSD" in Hammer
>
> On 07/05/15 07:53, Chris Armstrong wrote:
> > Thanks for the feedback. That language is confusing to me, then, since
> > the first paragraph seems to suggest using a pg_num of 128 in cases
> > where we have less than 5 OSDs, as we do here.
> >
> > The warning below that is: "As the number of OSDs increases, choosing
> > the right value for pg_num becomes more important because it has a
> > significant influence on the behavior of the cluster as well as the
> > durability of the data when something goes wrong (i.e. the probability
> > that a catastrophic event leads to data loss).", which suggests that
> > this could be an issue with more OSDs, which doesn't apply here.
> >
> > Do we know if this warning is calculated based on the resources of the
> > host? If I try with larger machines, will this warning change?
>
> I'd be interested in an answer here too.  I just did an update from Giant
> to Hammer and struck the same dreaded error message.
>
> When I initially deployed Ceph (with Emperor), I worked out according to
> the formula given on the site:
>
> > # We have: 3 OSD nodes with 2 OSDs each
> > # giving us 6 OSDs total.
> > # There are 3 replicas, 

Re: [ceph-users] civetweb lockups

2015-05-11 Thread Yehuda Sadeh-Weinraub
- Original Message -

> From: "Daniel Hoffman" 
> To: "ceph-users" 
> Sent: Sunday, May 10, 2015 10:54:21 PM
> Subject: [ceph-users] civetweb lockups

> Hi All.

> We have a weird issue where civetweb just locks up: it fails to respond
> to HTTP, and a restart resolves the problem. This happens anywhere from every
> 60 seconds to every 4 hours, with no obvious pattern behind it.

> We have run the gateway in full debug mode and there is nothing there that
> seems to be an issue.

> We run 2 gateways on 6-core machines; there is no load, cpu- or memory-wise,
> and the machines seem fine. They are load balanced behind HAProxy. We run 12
> data nodes at the moment with ~170 disks.

> We see around 40-60MB/s into the array. Is this just too much for
> civetweb to handle? Should we look at running virtual machines on the
> hardware nodes?

> [client.radosgw.ceph-obj02]
> host = ceph-obj02
> keyring = /etc/ceph/keyring.radosgw.ceph-obj02
> rgw socket path = /tmp/radosgw.sock
> log file = /var/log/ceph/radosgw.log
> rgw data = /var/lib/ceph/radosgw/ceph-obj02
> rgw thread pool size = 1024
> rgw print continue = False
> debug rgw = 0
> debug ms = 0
> rgw enable ops log = False
> log to stderr = False
> rgw enable usage log = False

> Advice appreciated.

Not sure what the issue would be. I'd look at the number of threads; maybe try 
reducing it and see if it makes any difference. Also, try to see how many open fds 
there are when it hangs (see the sketch below). 
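A rough sketch of checking that on a gateway host when it hangs, assuming the 
process is named radosgw:

pid=$(pidof radosgw)
ls /proc/$pid/fd | wc -l    # count of open file descriptors
ps -o nlwp= -p $pid         # number of threads in the process
# if the thread count looks excessive, lower "rgw thread pool size" in
# ceph.conf and restart the gateway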

Yehuda 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd does not start when object store is set to "newstore"

2015-05-11 Thread Srikanth Madugundi
Did not work.

$ ls -l /usr/lib64/|grep liburcu-bp
lrwxrwxrwx  1 root root   19 May 10 05:27 liburcu-bp.so ->
liburcu-bp.so.2.0.0
lrwxrwxrwx  1 root root   19 May 10 05:26 liburcu-bp.so.2 ->
liburcu-bp.so.2.0.0
-rwxr-xr-x  1 root root32112 Feb 25 20:27 liburcu-bp.so.2.0.0

Can you point me to the package that has the /usr/lib64/liburcu-bp.la libtool
file?

Regards
Srikanth


On Mon, May 11, 2015 at 3:30 AM, Alexandre DERUMIER 
wrote:

> >>I tried searching on the internet and could not find an el7 package with the
> liburcu-bp.la file; let me know which rpm package has this libtool
> archive.
>
> Hi, maybe you can try
>
> ./install-deps.sh
>
> to install the needed dependencies.
>
>
>
> - Mail original -
> De: "Srikanth Madugundi" 
> À: "Somnath Roy" 
> Cc: "ceph-users" 
> Envoyé: Dimanche 10 Mai 2015 08:21:06
> Objet: Re: [ceph-users] osd does not start when object store is set to
> "newstore"
>
> Hi,
> Thanks a lot Somnath for the help. I tried to change "./autogen.sh" to
> "./do_autogen.sh -r" but see this error during building process. I tried
> searching
>
>
> CC libosd_tp_la-osd.lo
> CC libosd_tp_la-pg.lo
> CC librbd_tp_la-librbd.lo
> CC librados_tp_la-librados.lo
> CC libos_tp_la-objectstore.lo
> CCLD libosd_tp.la
> /usr/bin/grep: /usr/lib64/liburcu-bp.la: No such file or directory
> /usr/bin/sed: can't read /usr/lib64/liburcu-bp.la: No such file or
> directory
> libtool: link: `/usr/lib64/liburcu-bp.la' is not a valid libtool archive
> make[5]: *** [libosd_tp.la] Error 1
> make[5]: *** Waiting for unfinished jobs
> make[5]: Leaving directory
> `/home/srikanth/rpmbuild/BUILD/ceph-0.93/src/tracing'
> make[4]: *** [all] Error 2
> make[4]: Leaving directory
> `/home/srikanth/rpmbuild/BUILD/ceph-0.93/src/tracing'
> make[3]: *** [all-recursive] Error 1
> make[3]: Leaving directory `/home/srikanth/rpmbuild/BUILD/ceph-0.93/src'
> make[2]: *** [all] Error 2
> make[2]: Leaving directory `/home/srikanth/rpmbuild/BUILD/ceph-0.93/src'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/home/srikanth/rpmbuild/BUILD/ceph-0.93'
> error: Bad exit status from /var/tmp/rpm-tmp.WEQTYW (%build)
>
> I have the following package installed.
>
> $ rpm -qa |grep userspace
> userspace-rcu-devel-0.8.6-1.fc23.x86_64
> userspace-rcu-0.8.6-1.fc23.x86_64
>
> I tried searching on the internet and could not find an el7 package with the
> liburcu-bp.la file; let me know which rpm package has this libtool
> archive.
>
> Regards
> Srikanth
>
>
> On Fri, May 8, 2015 at 10:41 AM, Somnath Roy < somnath@sandisk.com >
> wrote:
>
>
>
>
>
> I think you need to build code with rocksdb enabled if you are not already
> doing this.
>
>
>
> Go to root folder and try this..
>
>
>
> ./do_autogen.sh –r
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> From: Srikanth Madugundi [mailto: srikanth.madugu...@gmail.com ]
> Sent: Friday, May 08, 2015 10:33 AM
> To: Somnath Roy
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] osd does not start when object store is set to
> "newstore"
>
>
>
>
>
> I tried adding "enable experimental unrecoverable data corrupting features
> = newstore rocksdb" but no luck.
>
>
>
>
>
> Here is the config I am using.
>
>
>
>
>
> [global]
>
>
> .
>
>
> .
>
>
> .
>
>
> osd objectstore = newstore
>
> newstore backend = rocksdb
>
> enable experimental unrecoverable data corrupting features = newstore
> rocksdb
>
>
>
> Regards
>
> -Srikanth
>
>
>
>
>
> On Thu, May 7, 2015 at 10:59 PM, Somnath Roy < somnath@sandisk.com >
> wrote:
>
>
> I think you need to add the following..
>
>
>
> enable experimental unrecoverable data corrupting features = newstore
> rocksdb
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
>
>
> From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On Behalf
> Of Srikanth Madugundi
> Sent: Thursday, May 07, 2015 10:56 PM
> To: ceph-us...@ceph.com
> Subject: [ceph-users] osd does not start when object store is set to
> "newstore"
>
>
>
>
>
> Hi,
>
>
>
>
>
> I built and installed ceph source from (wip-newstore) branch and could not
> start osd with "newstore" as osd objectstore.
>
>
>
>
>
> $ sudo /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c
> /etc/ceph/ceph.conf --cluster ceph -f
>
>
> 2015-05-08 05:49:16.130073 7f286be01880 -1 unable to create object store
>
>
> $
>
>
>
>
>
>  ceph.config ( I have the following settings in ceph.conf)
>
>
>
>
>
> [global]
>
>
> osd objectstore = newstore
>
>
> newstore backend = rocksdb
>
>
>
>
>
> enable experimental unrecoverable data corrupting features = newstore
>
>
>
>
>
> The logs does not show much details.
>
>
>
>
>
> $ tail -f /var/log/ceph/ceph-osd.0.log
>
>
> 2015-05-08 00:01:54.331136 7fb00e07c880 0 ceph version (), process
> ceph-osd, pid 23514
>
>
> 2015-05-08 00:01:54.331202 7fb00e07c880 -1 unable to create object store
>
>
>
>
>
> Am I missing something?
>
>
>
>
>
> Regards
>
>
> Srikanth
>
>
>
>
>
>
>

Re: [ceph-users] very different performance on two volumes in the same pool #2

2015-05-11 Thread Alexandre DERUMIER
Hi,
I'm currently doing benchmarks too, and I don't see this behavior.

>>I get very nice performance of up to 200k IOPS. However once the volume is
>>written to (ie when I map it using rbd map and dd whole volume with some 
>>random data),
>>and repeat the benchmark, random performance drops to ~23k IOPS.

I can reach 200k iops with 1 osd, with data inside the osd and the data in 
the osd's buffers.

osd cpu: 60% of 2x10 cores @ 3.1GHz
fio-rbd cpu: 40% of 2x10 cores @ 3.1GHz

(So I'm not sure about the performance with only one quad core.)

When data is read from the osd, I can reach around 60k iops per ssd on intel s3500
(with readahead disabled:
echo 0 > /sys/class/block/sdX/queue/read_ahead_kb)



here my ceph.conf
-
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
osd_pool_default_min_size = 1
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_journaler = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
osd_op_threads = 5
filestore_op_threads = 4
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 10
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
ms_nocrc = true
ms_dispatch_throttle_bytes = 0
cephx_sign_messages = false
cephx_require_signatures = false
throttler_perf_counter = false
ms_crc_header = false
ms_crc_data = false

[osd]
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false

[client]
rbd_cache = false



- Mail original -
De: "Nikola Ciprich" 
À: "ceph-users" 
Cc: n...@linuxbox.cz
Envoyé: Lundi 11 Mai 2015 06:43:04
Objet: [ceph-users] very different performance on two volumes in the same   
pool #2

Hello ceph developers and users, 

some time ago, I posted here a question regarding very different 
performance for two volumes in one pool (backed by SSD drives). 

After some examination, I probably got to the root of the problem.. 

When I create fresh volume (ie rbd create --image-format 2 --size 51200 
ssd/test) 
and run random io fio benchmark 

fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test 
--pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64 
--readwrite=randread 

I get very nice performance of up to 200k IOPS. However once the volume is 
written to (ie when I map it using rbd map and dd whole volume with some random 
data), 
and repeat the benchmark, random performance drops to ~23k IOPS. 

This leads me to conjecture that for unwritten (sparse) volumes, read 
is just a noop, simply returning zeroes without really having to read 
data from physical storage, and thus showing nice performance, but once 
the volume is written, performance drops due to need to physically read the 
data, right? 
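
One way to test this conjecture is to precondition the volume with a sequential 
write pass and then repeat the random-read job; a sketch using the same fio rbd 
engine and names as above:

fio --ioengine=rbd --pool=ssd3r --rbdname=test --direct=1 \
    --bs=4M --iodepth=16 --rw=write --name=prefill
# afterwards, the randread numbers reflect reads that hit physical storage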

However, I'm a bit unhappy about the performance drop: the pool is backed 
by 3 SSD drives (each having random io performance of 100k iops) on three 
nodes, and the pool size (replica count) is set to 3. The cluster is completely idle; nodes 
are quad core Xeons E3-1220 v3 @ 3.10GHz, 32GB RAM each, centos 6, kernel 
3.18.12, 
ceph 0.94.1. I'm using libtcmalloc (I even tried upgrading gperftools-libs to 
2.4) 
Nodes are connected using 10gb ethernet, with jumbo frames enabled. 


I tried tuning following values: 

osd_op_threads = 5 
filestore_op_threads = 4 
osd_op_num_threads_per_shard = 1 
osd_op_num_shards = 25 
filestore_fd_cache_size = 64 
filestore_fd_cache_shards = 32 

I don't see anything special in perf: 

5.43% [kernel] [k] acpi_processor_ffh_cstate_enter 
2.93% libtcmalloc.so.4.2.6 [.] 0x00017d2c 
2.45% libpthread-2.12.so [.] pthread_mutex_lock 
2.37% libpthread-2.12.so [.] pthread_mutex_unlock 
2.33% [kernel] [k] do_raw_spin_lock 
2.00% libsoftokn3.so [.] 0x0001f455 
1.96% [kernel] [k] __switch_to 
1.32% [kernel] [k] __schedule 
1.24% libstdc++.so.6.0.13 [.] std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) 
0.93% ceph-osd [.] crush_hash32_3 
0.85% libc-2.12.so [.] vfprintf 
0.84% libc-2.12.so [.] __strlen_sse42 
0.80% [kernel] [k] get_futex_key_refs 
0.80% libpthread-2.12.so [.] pthread_mutex_trylock 
0.78% libtcmalloc.so.4.2.6 [.] 
tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, 
unsigned long, int) 
0.71% libstdc++.so.6.0.13 [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) 
0.68% ceph-osd [.] ceph::log::Log::flush() 
0.66% libtcmalloc.so.4.2.6 [.] tc_free 
0.63% [kernel] [k] resched_curr 
0.63% [kernel] [k] page_fault 
0.62% libstdc++.so.6.0.13 [.] std::string::reserve(unsigned long) 

I'm running benchmark directly on one of nodes, which I know is not optimal, 
but it's still able to give those 200k iops for empty volume, so I guess it 
shouldn

Re: [ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Robert LeBlanc
Personally I would not just run this command automatically because as you
stated, it only copies the primary PGs to the replicas and if the primary
is corrupt, you will corrupt your secondaries. I think the monitor log shows
which OSD has the problem so if it is not your primary, then just issue the
repair command.

There was talk, and I believe work towards, Ceph storing a hash of the
object so that it can be smarter about which replica has the correct data
and automatically replicate the good data no matter where it is. I think
the first part, creating the hash and storing it, has been included in
Hammer. I'm not an authority on this so take it with a grain of salt.

Right now our procedure is to find the PG files on the OSDs, perform a MD5
on all of them and the one that doesn't match, overwrite, either by issuing
the PG repair command, or removing the bad PG files, rsyncing them with the
-X argument and then instructing a deep-scrub on the PG to clear it up in
Ceph.

I've only tested this on an idle cluster, so I don't know how well it will
work on an active cluster. Since we issue a deep-scrub, if the PGs of the
replicas change during the rsync, it should come up with an error. The idea
is to keep rsyncing until the deep-scrub is clean. Be warned that you may
be aiming your gun at your foot with this!
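
A rough sketch of that checksum step, assuming filestore OSDs and a 
hypothetical inconsistent pg 3.1f (run on each host holding a replica, then 
diff the resulting lists):

cd /var/lib/ceph/osd/ceph-4/current/3.1f_head
find . -type f -print0 | xargs -0 md5sum | sort -k2 > /tmp/pg3.1f.osd4.md5
# after overwriting or rsyncing the replica that disagrees:
ceph pg deep-scrub 3.1f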


Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Mon, May 11, 2015 at 2:09 AM, Christian Eichelmann <
christian.eichelm...@1und1.de> wrote:

> Hi all!
>
> We are experiencing approximately 1 scrub error / inconsistent pg every
> two days. As far as I know, to fix this you can issue a "ceph pg
> repair", which works fine for us. I have a few qestions regarding the
> behavior of the ceph cluster in such a case:
>
> 1. After ceph detects the scrub error, the pg is marked as inconsistent.
> Does that mean that any IO to this pg is blocked until it is repaired?
>
> 2. Is this amount of scrub errors normal? We currently have only 150TB
> in our cluster, distributed over 720 2TB disks.
>
> 3. As far as I know, a "ceph pg repair" just copies the content of the
> primary pg to all replicas. Is this still the case? What if the primary
> copy is the one having errors? We have a 4x replication level and it
> would be cool if ceph would use one of the pg for recovery which has the
> same checksum as the majority of pgs.
>
> 4. Some of this errors are happening at night. Since ceph reports this
> as a critical error, our shift is called and wake up, just to issue a
> single command. Do you see any problems in triggering this command
> automatically via monitoring event? Is there a reason why ceph isn't
> resolving these errors itself when it has enough replicas to do so?
>
> Regards,
> Christian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xfs corruption, data disaster!

2015-05-11 Thread Ric Wheeler

On 05/05/2015 04:13 AM, Yujian Peng wrote:

Emmanuel Florac  writes:


Le Mon, 4 May 2015 07:00:32 + (UTC)
Yujian Peng  126.com> écrivait:


I'm encountering a data disaster. I have a ceph cluster with 145 osd.
The data center had a power problem yesterday, and all of the ceph
nodes were down. But now I find that 6 disks(xfs) in 4 nodes have
data corruption. Some disks are unable to mount, and some disks have
IO errors in syslog. mount: Structure needs cleaning
xfs_log_force: error 5 returned
I tried to repair one with xfs_repair -L /dev/sdx1, but the ceph-osd
reported a leveldb error:
Error initializing leveldb: Corruption: checksum mismatch
I cannot start the 6 osds and 22 pgs is down.
This is really a tragedy for me. Can you give me some ideas on how to
recover the xfs? Thanks very much!

For XFS problems, ask the XFS ML: xfs  oss.sgi.com

You didn't give enough details, by far. What version of kernel and
distro are you running? If there were errors, please post extensive
logs. If you have IO errors on some disks, you probably MUST replace
them before going any further.

Why did you run xfs_repair -L ? Did you try xfs_repair without options
first? Were you running the very very latest version of xfs_repair
(3.2.2) ?


The OS is ubuntu 12.04.5 with kernel 3.13.0
uname -a
Linux ceph19 3.13.0-32-generic #57~precise1-Ubuntu SMP Tue Jul 15 03:51:20
UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/issue
Ubuntu 12.04.5 LTS \n \l
xfs_repair -V
xfs_repair version 3.1.7
I've tried xfs_repair without options, but it showed me some errors, so I
used the -L option.
Thanks for your reply!



Responding quickly to a couple of things:

* xfs_repair -L wipes out the XFS log, not normally a good thing to do

* replacing disks with IO errors is not a great idea if you still need that 
data. You might want to copy the data from that disk to a new disk (same or 
greater size) and then try to repair that new disk (see the sketch below). A 
lot depends on the type of IO error you see - you might have cable issues, HBA 
issues, or fairly normal read issues (which are not worth replacing a disk for).
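
As a sketch of that copy step, GNU ddrescue retries around bad sectors and 
keeps a map of what it has recovered (device names hypothetical; xfs_repair 
is then run against the copy, not the original):

ddrescue -f /dev/sdx /dev/sdy /root/sdx.map       # first pass, skips bad areas quickly
ddrescue -f -r3 /dev/sdx /dev/sdy /root/sdx.map   # retry bad sectors up to 3 times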


You should work with your vendor's support team if you have a support contract 
or post to the XFS devel list (copied above) for help.


Good luck!

Ric



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Find out the location of OSD Journal

2015-05-11 Thread Sebastien Han
Under the OSD directory, you can look at where the symlink points. It is 
generally called ‘journal’ and should point to a device.
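
A sketch from the shell (the OSD id is hypothetical):

ls -l /var/lib/ceph/osd/ceph-0/journal        # shows the symlink and its target
readlink -f /var/lib/ceph/osd/ceph-*/journal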

> On 06 May 2015, at 06:54, Patrik Plank  wrote:
> 
> Hi,
> 
> I can't remember on which drive I installed which OSD journal :-||
> Is there any command to show this?
> 
> 
> thanks
> regards
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Cheers.

Sébastien Han
Cloud Architect

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien@enovance.com
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD in ceph.conf

2015-05-11 Thread Robert LeBlanc
If you use ceph-disk (and I believe ceph-deploy) to create your OSDs, or
you go through the manual steps to set up the partition UUIDs, then yes,
udev and the init script will do all the magic. Your disks can be moved to
another box without problems. I've moved disks to different ports on
controllers and it all worked just fine. I will be swapping the disks
between two boxes today to try to get to the bottom of some problems we
have been having; if it doesn't work I'll let you know.

The automagic of ceph OSDs has been refreshing for me because I was worried
about having to manage so many disks and mount points, but it is much
easier than I anticipated once I used ceph-disk.
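
As a sketch, the partition-type GUIDs that drive this can be inspected directly 
(the device name is hypothetical):

ceph-disk list                # shows which partitions are ceph data/journal
sgdisk --info=1 /dev/sdb      # "Partition GUID code" is what udev matches on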

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On May 11, 2015 5:32 AM, "Georgios Dimitrakakis" 
wrote:

> Hi Robert,
>
> just to make sure I got it correctly:
>
> Do you mean that the /etc/mtab entries are completely ignored and, no
> matter what the order
> of the /dev/sdX devices is, Ceph will just mount the osd/ceph-X correctly by
> default?
>
> In addition, assuming that an OSD node fails for a reason other than a
> disk problem (e.g. mobo/ram)
> if I put its disks on another OSD node (all disks have their journals
> with) will Ceph be able to mount
> them correctly and continue its operation?
>
> Regards,
>
> George
>
>  I have not used ceph-deploy, but it should use ceph-disk for the OSD
>> preparation.  Ceph-disk creates GPT partitions with specific
>> partition UUIDs for data and journals. When udev or init starts the
>> OSD, it mounts it to a temp location, reads the whoami file and the
>> journal, then remounts it in the correct location. There is no need
>> for fstab entries or the like. This allows you to easily move OSD
>> disks between servers (if you take the journals with it). It's magic!
>> But I think I just gave away the secret.
>>
>> Robert LeBlanc
>>
>> Sent from a mobile device please excuse any typos.
>> On May 7, 2015 5:16 AM, "Georgios Dimitrakakis"  wrote:
>>
>>  Indeed it is not necessary to have any OSD entries in the Ceph.conf
>>> file
>>> but what happens in the event of a disk failure resulting in
>>> changing the mount device?
>>>
>>> For what I can see is that OSDs are mounted from entries in
>>> /etc/mtab (I am on CentOS 6.6)
>>> like this:
>>>
>>> /dev/sdj1 /var/lib/ceph/osd/ceph-8 xfs rw,noatime,inode64 0 0
>>> /dev/sdh1 /var/lib/ceph/osd/ceph-6 xfs rw,noatime,inode64 0 0
>>> /dev/sdg1 /var/lib/ceph/osd/ceph-5 xfs rw,noatime,inode64 0 0
>>> /dev/sde1 /var/lib/ceph/osd/ceph-3 xfs rw,noatime,inode64 0 0
>>> /dev/sdi1 /var/lib/ceph/osd/ceph-7 xfs rw,noatime,inode64 0 0
>>> /dev/sdf1 /var/lib/ceph/osd/ceph-4 xfs rw,noatime,inode64 0 0
>>> /dev/sdd1 /var/lib/ceph/osd/ceph-2 xfs rw,noatime,inode64 0 0
>>> /dev/sdk1 /var/lib/ceph/osd/ceph-9 xfs rw,noatime,inode64 0 0
>>> /dev/sdb1 /var/lib/ceph/osd/ceph-0 xfs rw,noatime,inode64 0 0
>>> /dev/sdc1 /var/lib/ceph/osd/ceph-1 xfs rw,noatime,inode64 0 0
>>>
>>> So in the event of a disk failure (e.g. disk SDH fails), the next one in
>>> order will take its place, meaning that
>>> SDI will be seen as SDH upon the next reboot; thus it will be mounted as
>>> CEPH-6 instead of CEPH-7, and so on, resulting in a problematic
>>> configuration (I guess that lots of data will start moving
>>> around, PGs will be misplaced, etc.)
>>>
>>> Correct me if I am wrong but the proper way to mount them would be
>>> by using the UUID of the partition.
>>>
>>> Is it OK if I change the entries in /etc/mtab using the UUID=xx
>>> instead of /dev/sdX1??
>>>
>>> Does CEPH try to mount them using a different config file and
>>> perhaps exports the entries at boot in /etc/mtab (in the latter case
>>> no modification in /etc/mtab will be taken into account)??
>>>
>>> I have deployed the Ceph cluster using only the "ceph-deploy"
>>> command. Is there a parameter that I ve missed that must be used
>>> during deployment in order to specify the mount points using the
>>> UUIDs instead of the device names?
>>>
>>> Regards,
>>>
>>> George
>>>
>>> On Wed, 6 May 2015 22:36:14 -0600, Robert LeBlanc wrote:
>>>
>>>  We dont have OSD entries in our Ceph config. They are not needed
 if
 you dont have specific configs for different OSDs.

 Robert LeBlanc

 Sent from a mobile device please excuse any typos.
 On May 6, 2015 7:18 PM, "Florent MONTHEL"  wrote:

  Hi teqm,
>
> Is it necessary to indicate in ceph.conf all OSD that we have
> in the
> cluster ?
> we have today reboot a cluster (5 nodes RHEL 6.5) and some OSD
> seem
> to have change ID so crush map not mapped with the reality
> Thanks
>
> FLORENT MONTHEL
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com [1] [1]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2] [2]
>

 Links:
 --
 [1] mailto:ceph-users@lists.ceph.com [3]
 [2] http://lists.ceph

Re: [ceph-users] very different performance on two volumes in the same pool #2

2015-05-11 Thread Mason, Michael
I had the same problem when doing benchmarks with small block sizes (<8k) to 
RBDs. These settings seemed to fix the problem for me.

sudo ceph tell osd.* injectargs '--filestore_merge_threshold 40'
sudo ceph tell osd.* injectargs '--filestore_split_multiple 8'

After you apply the settings, give it a few minutes to shuffle the data around.
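
To keep the same values across restarts, the equivalent lines can go into 
ceph.conf; a sketch with the same values as above:

[osd]
filestore merge threshold = 40
filestore split multiple = 8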

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, May 11, 2015 3:21 AM
To: Nikola Ciprich
Cc: ceph-users; n...@linuxbox.cz
Subject: Re: [ceph-users] very different performance on two volumes in the same 
pool #2

Nik,
If you increase num_jobs beyond 4, does it help further? Try 8 or so.
Yeah, libsoft* is definitely consuming some cpu cycles, but I don't know how 
to resolve that.
Also, acpi_processor_ffh_cstate_enter popped up and is consuming a lot of cpu. Try 
disabling c-states and running the cpu in maximum performance mode (see the sketch 
below); this may give you some boost.
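
A sketch of the performance-mode part (exact mechanisms vary by distro and CPU; 
the boot parameters are the usual way to pin c-states):

cpupower frequency-set -g performance
# or add kernel boot parameters, e.g.:
#   intel_idle.max_cstate=0 processor.max_cstate=1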

Thanks & Regards
Somnath

-Original Message-
From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz] 
Sent: Sunday, May 10, 2015 11:32 PM
To: Somnath Roy
Cc: ceph-users; n...@linuxbox.cz
Subject: Re: [ceph-users] very different performance on two volumes in the same 
pool #2

On Mon, May 11, 2015 at 06:07:21AM +, Somnath Roy wrote:
> Yes, you need to run fio clients on a separate box, it will take quite a bit 
> of cpu.
> If you stop OSDs on other nodes, rebalancing will start. Have you waited for the 
> cluster to reach the active+clean state? If you are running while rebalancing is 
> going on, the performance will be impacted.
I set noout, so there was no rebalancing, I forgot to mention that..


> 
> ~110%  cpu util seems pretty low. Try to run fio_rbd with more num_jobs (say 
> 3 or 4 or more), io_depth =64 is fine and see if it improves performance or 
> not.
ok, increasing jobs to 4 seems to squeeze a bit more from the cluster, about 
43.3K iops..

OSD cpu util jumps to ~300% on both surviving nodes, so there still seems to be 
a bit of headroom. 

> Also, since you have 3 OSDs (3 nodes?), I would suggest to tweak the 
> following settings
> 
> osd_op_num_threads_per_shard
> osd_op_num_shards
> 
> May be (1,10 / 1,15 / 2, 10 ?).

tried all those combinations, but it doesn't make almost any difference..

do you think I could get more then those 43k?

one more thing that makes me wonder a bit is this line I can see in perf:
  2.21%  libsoftokn3.so [.] 0x0001ebb2

I suppose this has something to do with resolving, 2.2% seems quite a lot to 
me..
Should I be worried about it? Does it make sense to enable kernel DNS resolving 
support in ceph?

thanks for your time Somnath!

nik



> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz]
> Sent: Sunday, May 10, 2015 10:33 PM
> To: Somnath Roy
> Cc: ceph-users; n...@linuxbox.cz
> Subject: Re: [ceph-users] very different performance on two volumes in 
> the same pool #2
> 
> 
> On Mon, May 11, 2015 at 05:20:25AM +, Somnath Roy wrote:
> > Two things..
> > 
> > 1. You should always use SSD drives for benchmarking after preconditioning 
> > it.
> well, I don't really understand... ?
> 
> > 
> > 2. After creating and mapping rbd lun, you need to write data first 
> > to read it afterword otherwise fio output will be misleading. In 
> > fact, I think you will see IO is not even hitting cluster (check 
> > with ceph -s)
> yes, so this approves my conjecture. ok.
> 
> 
> > 
> > Now, if you are saying it's a 3 OSD setup, yes, ~23K is pretty low. Check 
> > the following.
> > 
> > 1. Check client or OSd node cpu is saturating or not.
> On OSD nodes, I can see cpeh-osd CPU utilisation of ~110%. On client node 
> (which is one of OSD nodes as well), I can see fio eating quite lot of CPU 
> cycles.. I tried stopping ceph-osd on this node (thus only two nodes are 
> serving data) and performance got a bit higher, to ~33k IOPS. But still I 
> think it's not very good..
> 
> 
> > 
> > 2. With 4K, hope network BW is fine
> I think it's ok..
> 
> 
> > 
> > 3. Number of PGs/pool should be ~128 or so.
> I'm using pg_num 128
> 
> 
> > 
> > 4. If you are using krbd, you might want to try latest krbd module where 
> > TCP_NODELAY problem is fixed. If you don't want that complexity, try with 
> > fio-rbd.
> I'm not using RBD (only for writing data to volume), for benchmarking, I'm 
> using fio-rbd.
> 
> anything else I could check?
> 
> 
> > 
> > Hope this helps,
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
> > Behalf Of Nikola Ciprich
> > Sent: Sunday, May 10, 2015 9:43 PM
> > To: ceph-users
> > Cc: n...@linuxbox.cz
> > Subject: [ceph-users] very different performance on two volumes in 
> > the same pool #2
> > 
> > Hello ceph developers and users,
> > 
> > some time ago, I posted here a question regarding very different 
> > perform

Re: [ceph-users] OSD in ceph.conf

2015-05-11 Thread Georgios Dimitrakakis

Hi Robert,

just to make sure I got it correctly:

Do you mean that the /etc/mtab entries are completely ignored and, no 
matter what the order
of the /dev/sdX devices is, Ceph will just mount the osd/ceph-X correctly 
by default?


In addition, assuming that an OSD node fails for a reason other than a 
disk problem (e.g. mobo/ram)
if I put its disks on another OSD node (all disks have their journals 
with) will Ceph be able to mount

them correctly and continue its operation?

Regards,

George


I have not used ceph-deploy, but it should use ceph-disk for the OSD
preparation.  Ceph-disk creates GPT partitions with specific
partition UUIDs for data and journals. When udev or init starts the
OSD, it mounts it to a temp location, reads the whoami file and the
journal, then remounts it in the correct location. There is no need
for fstab entries or the like. This allows you to easily move OSD
disks between servers (if you take the journals with it). It's magic! 
But I think I just gave away the secret.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On May 7, 2015 5:16 AM, "Georgios Dimitrakakis"  wrote:


Indeed it is not necessary to have any OSD entries in the Ceph.conf
file
but what happens in the event of a disk failure resulting in
changing the mount device?

For what I can see is that OSDs are mounted from entries in
/etc/mtab (I am on CentOS 6.6)
like this:

/dev/sdj1 /var/lib/ceph/osd/ceph-8 xfs rw,noatime,inode64 0 0
/dev/sdh1 /var/lib/ceph/osd/ceph-6 xfs rw,noatime,inode64 0 0
/dev/sdg1 /var/lib/ceph/osd/ceph-5 xfs rw,noatime,inode64 0 0
/dev/sde1 /var/lib/ceph/osd/ceph-3 xfs rw,noatime,inode64 0 0
/dev/sdi1 /var/lib/ceph/osd/ceph-7 xfs rw,noatime,inode64 0 0
/dev/sdf1 /var/lib/ceph/osd/ceph-4 xfs rw,noatime,inode64 0 0
/dev/sdd1 /var/lib/ceph/osd/ceph-2 xfs rw,noatime,inode64 0 0
/dev/sdk1 /var/lib/ceph/osd/ceph-9 xfs rw,noatime,inode64 0 0
/dev/sdb1 /var/lib/ceph/osd/ceph-0 xfs rw,noatime,inode64 0 0
/dev/sdc1 /var/lib/ceph/osd/ceph-1 xfs rw,noatime,inode64 0 0

So in the event of a disk failure (e.g. disk SDH fails), the next one in
order will take its place, meaning that
SDI will be seen as SDH upon the next reboot; thus it will be mounted as
CEPH-6 instead of CEPH-7, and so on, resulting in a problematic
configuration (I guess that lots of data will start moving
around, PGs will be misplaced, etc.)

Correct me if I am wrong but the proper way to mount them would be
by using the UUID of the partition.

Is it OK if I change the entries in /etc/mtab using the UUID=xx
instead of /dev/sdX1??

Does CEPH try to mount them using a different config file and
perhaps exports the entries at boot in /etc/mtab (in the latter case
no modification in /etc/mtab will be taken into account)??

I have deployed the Ceph cluster using only the "ceph-deploy"
command. Is there a parameter that I ve missed that must be used
during deployment in order to specify the mount points using the
UUIDs instead of the device names?

Regards,

George

On Wed, 6 May 2015 22:36:14 -0600, Robert LeBlanc wrote:


We dont have OSD entries in our Ceph config. They are not needed
if
you dont have specific configs for different OSDs.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On May 6, 2015 7:18 PM, "Florent MONTHEL"  wrote:


Hi teqm,

Is it necessary to indicate in ceph.conf all OSD that we have
in the
cluster ?
we have today reboot a cluster (5 nodes RHEL 6.5) and some OSD
seem
to have change ID so crush map not mapped with the reality
Thanks

FLORENT MONTHEL
___
ceph-users mailing list
ceph-users@lists.ceph.com [1] [1]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2] [2]


Links:
--
[1] mailto:ceph-users@lists.ceph.com [3]
[2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [4]
[3] mailto:florent.mont...@flox-arts.net [5]


___
ceph-users mailing list
ceph-users@lists.ceph.com [6]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [7]



Links:
--
[1] mailto:ceph-users@lists.ceph.com
[2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[3] mailto:ceph-users@lists.ceph.com
[4] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[5] mailto:florent.mont...@flox-arts.net
[6] mailto:ceph-users@lists.ceph.com
[7] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[8] mailto:gior...@acmac.uoc.gr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RFC: Deprecating ceph-tool commands

2015-05-11 Thread John Spray



On 09/05/2015 00:55, Joao Eduardo Luis wrote:

A command being DEPRECATED must be:

  - clearly marked as DEPRECATED in usage;
  - kept around for at least 2 major releases;
  - kept compatible for the duration of the deprecation period.

Once two major releases go by, the command will then enter the OBSOLETE
period.  This would be one major release, during which the command would
no longer work although still acknowledged.  A simple message down the
lines of 'This command is now obsolete; please check the docs' would
suffice to inform the user.

The command would no longer exist in the next major release.

This approach gives a lifespan of roughly 3 releases (at current rate,
roughly 1.5 years) before being completely dropped.  This should give
enough time to people to realize what has happened and adjust any
scripts they may have.


+1, this seems like a reasonable timescale, but I think the important 
thing is that it's deprecated in at least one LTS release before it's 
actually removed.  So maybe we should just define it like that, and say 
"two stable releases or one LTS release, whichever is longer".  But I 
guess definition of LTS is a per-downstream-vendor thing, so maybe 
harder to define -- maybe the downstream part could be a guideline to 
downstream packagers, that will require no work from them as long as 
they are generating LTS releases on at least every other stable release.


An additional thought: it might be useful to have a "strict" flag for 
processes sending commands, so that e.g. management tools in QA can set 
that and fail out when they use something deprecated. Otherwise, 
automated things would tend not to notice messages about deprecation.


Cheers,
John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd does not start when object store is set to "newstore"

2015-05-11 Thread Alexandre DERUMIER
>>I tried searching on the internet and could not find an el7 package with the 
>>liburcu-bp.la file; let me know which rpm package has this libtool archive. 

Hi, maybe you can try

./install-deps.sh

to install the needed dependencies.
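
As a sketch, on an RPM-based system the package database can also be asked 
which package (if any) ships the missing libtool archive; many distros 
deliberately strip .la files from packages, so this may come up empty:

yum provides '*/liburcu-bp.la'
rpm -ql userspace-rcu-devel | grep liburcu-bp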



- Mail original -
De: "Srikanth Madugundi" 
À: "Somnath Roy" 
Cc: "ceph-users" 
Envoyé: Dimanche 10 Mai 2015 08:21:06
Objet: Re: [ceph-users] osd does not start when object store is set to  
"newstore"

Hi, 
Thanks a lot Somnath for the help. I tried to change "./autogen.sh" to 
"./do_autogen.sh -r" but see this error during building process. I tried 
searching 


CC libosd_tp_la-osd.lo 
CC libosd_tp_la-pg.lo 
CC librbd_tp_la-librbd.lo 
CC librados_tp_la-librados.lo 
CC libos_tp_la-objectstore.lo 
CCLD libosd_tp.la 
/usr/bin/grep: /usr/lib64/liburcu-bp.la: No such file or directory 
/usr/bin/sed: can't read /usr/lib64/liburcu-bp.la: No such file or directory 
libtool: link: `/usr/lib64/liburcu-bp.la' is not a valid libtool archive 
make[5]: *** [libosd_tp.la] Error 1 
make[5]: *** Waiting for unfinished jobs 
make[5]: Leaving directory 
`/home/srikanth/rpmbuild/BUILD/ceph-0.93/src/tracing' 
make[4]: *** [all] Error 2 
make[4]: Leaving directory 
`/home/srikanth/rpmbuild/BUILD/ceph-0.93/src/tracing' 
make[3]: *** [all-recursive] Error 1 
make[3]: Leaving directory `/home/srikanth/rpmbuild/BUILD/ceph-0.93/src' 
make[2]: *** [all] Error 2 
make[2]: Leaving directory `/home/srikanth/rpmbuild/BUILD/ceph-0.93/src' 
make[1]: *** [all-recursive] Error 1 
make[1]: Leaving directory `/home/srikanth/rpmbuild/BUILD/ceph-0.93' 
error: Bad exit status from /var/tmp/rpm-tmp.WEQTYW (%build) 

I have the following package installed. 

$ rpm -qa |grep userspace 
userspace-rcu-devel-0.8.6-1.fc23.x86_64 
userspace-rcu-0.8.6-1.fc23.x86_64 

I tried searching on the internet and could not find an el7 package with the 
liburcu-bp.la file; let me know which rpm package has this libtool archive. 

Regards 
Srikanth 


On Fri, May 8, 2015 at 10:41 AM, Somnath Roy < somnath@sandisk.com > wrote: 





I think you need to build code with rocksdb enabled if you are not already 
doing this. 



Go to root folder and try this.. 



./do_autogen.sh –r 



Thanks & Regards 

Somnath 



From: Srikanth Madugundi [mailto: srikanth.madugu...@gmail.com ] 
Sent: Friday, May 08, 2015 10:33 AM 
To: Somnath Roy 
Cc: ceph-us...@ceph.com 
Subject: Re: [ceph-users] osd does not start when object store is set to 
"newstore" 





I tried adding "enable experimental unrecoverable data corrupting features = 
newstore rocksdb" but no luck. 





Here is the config I am using. 





[global] 


. 


. 


. 


osd objectstore = newstore 

newstore backend = rocksdb 

enable experimental unrecoverable data corrupting features = newstore rocksdb 



Regards 

-Srikanth 





On Thu, May 7, 2015 at 10:59 PM, Somnath Roy < somnath@sandisk.com > wrote: 


I think you need to add the following.. 



enable experimental unrecoverable data corrupting features = newstore rocksdb 



Thanks & Regards 

Somnath 





From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On Behalf Of 
Srikanth Madugundi 
Sent: Thursday, May 07, 2015 10:56 PM 
To: ceph-us...@ceph.com 
Subject: [ceph-users] osd does not start when object store is set to "newstore" 





Hi, 





I built and installed ceph source from (wip-newstore) branch and could not 
start osd with "newstore" as osd objectstore. 





$ sudo /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c 
/etc/ceph/ceph.conf --cluster ceph -f 


2015-05-08 05:49:16.130073 7f286be01880 -1 unable to create object store 


$ 





 ceph.config ( I have the following settings in ceph.conf) 





[global] 


osd objectstore = newstore 


newstore backend = rocksdb 





enable experimental unrecoverable data corrupting features = newstore 





The logs does not show much details. 





$ tail -f /var/log/ceph/ceph-osd.0.log 


2015-05-08 00:01:54.331136 7fb00e07c880 0 ceph version (), process ceph-osd, 
pid 23514 


2015-05-08 00:01:54.331202 7fb00e07c880 -1 unable to create object store 





Am I missing something? 





Regards 


Srikanth 















___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

Re: [ceph-users] Crush rule freeze cluster

2015-05-11 Thread Georgios Dimitrakakis

Oops... too fast to answer...

G.

On Mon, 11 May 2015 12:13:48 +0300, Timofey Titovets wrote:

Hey! I catch it again. Its a kernel bug. Kernel crushed if i try to
map rbd device with map like above!
Hooray!

2015-05-11 12:11 GMT+03:00 Timofey Titovets :

FYI and history
Rule:
# rules
rule replicated_ruleset {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step choose firstn 0 type room
  step choose firstn 0 type rack
  step choose firstn 0 type host
  step chooseleaf firstn 0 type osd
  step emit
}

And after reset node, i can't find any usable info. Cluster works 
fine

and data just rebalanced by osd disks.
syslog:
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Starting Network 
Time

Synchronization...
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Started Network 
Time

Synchronization.
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May  9 19:30:02 srv-lab-ceph-node-01 CRON[1731]: (CRON) info (No MTA
installed, discarding output)
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: [origin
software="rsyslogd" swVersion="7.4.4" x-pid="689"
x-info="http://www.rsyslog.com";] start
May 11 11:54:56 srv-lab-ceph-node-01 rsyslogd: rsyslogd's groupid 
changed to 103
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: rsyslogd's userid 
changed to 100


Sorry for noise, guys. Georgios, in any way, thanks for helping.

2015-05-10 12:44 GMT+03:00 Georgios Dimitrakakis 
:

Timofey,

may be your best chance is to connect directly at the server and 
see what is

going on.
Then you can try debug why the problem occurred. If you don't want 
to wait

until tomorrow
you may try to see what is going on using the server's direct 
remote console

access.
The majority of the servers provide you with that just with a 
different name
each (DELL calls it iDRAC, Fujitsu iRMC, etc.) so if you have it up 
and

running you can use that.

I think this should be your starting point and you can take it on 
from

there.

I am sorry I cannot help you further with the Crush rules and the 
reason why

it crashed since I am far from being an expert in the field :-(

Regards,

George


Georgios, oh, sorry for my poor english _-_, may be I poor 
expressed

what i want =]

i know how to write simple Crush rule and how use it, i want 
several

things things:
1. Understand why, after inject bad map, my test node make 
offline.

This is unexpected.
2. May be somebody can explain what and why happens with this map.
3. This is not a problem to write several crushmap or/and switch 
it

while cluster running.
But, in production, we have several nfs servers, i think about 
moving

it to ceph, but i can't disable more then 1 server for maintenance
simultaneously. I want avoid data disaster while setup and moving 
data

to ceph, case like "Use local data replication, if only one node
exist" looks usable as temporally solution, while i not add second
node _-_.
4. May be some one also have test cluster and can test that happen
with clients, if crushmap like it was injected.

2015-05-10 8:23 GMT+03:00 Georgios Dimitrakakis 
:


Hi Timofey,

assuming that you have more than one OSD hosts and that the 
replicator
factor is equal (or less) to the number of the hosts why don't 
you just

change the crushmap to host replication?

You just need to change the default CRUSHmap rule from

step chooseleaf firstn 0 type osd

to

step chooseleaf firstn 0 type host

I believe that this is the easiest way to do have replication 
across OSD

nodes unless you have a much more "sophisticated" setup.

Regards,

George




Hi list,
i had experiments with crush maps, and I've try to get raid1 
like
behaviour (if cluster have 1 working osd node, duplicate data 
across
local disk, for avoiding data lose in case local disk failure 
and

allow client working, because this is not a degraded state)
(
  in best case, i want dynamic rule, like:
  if has only one host -> spread data over local disks;
  else if host count > 1 -> spread over hosts (rack o something 
else);

)

i write rule, like below:

rule test {
  ruleset 0
  type replicated
  min_size 0
  max_size 10
  step take default
  step choose firstn 0 type host
  step chooseleaf firstn 0 type osd
  step emit
}

I've inject it in cluster and client node, now looks like have 
get
kernel panic, I've lost my connection with it. No ssh, no ping, 
this

is remote node and i can't see what happens until Monday.
Yes, it looks like I've shoot in my foot.
This is just a test setup and cluster destruction, not a 
problem, but
i think, what broken rules, must not crush something else and in 
worst

case, must be just ignored by cluster/crushtool compiler.

May be someone can explain, how this rule can crush system? May 
be

this is a crazy mistake somewhere?




--
___
ceph-users mailing list
ceph-u

Re: [ceph-users] Crush rule freeze cluster

2015-05-11 Thread Georgios Dimitrakakis

Timofey,

glad that you've managed to get it working :-)

Best,

George


FYI and history
Rule:
# rules
rule replicated_ruleset {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step choose firstn 0 type room
  step choose firstn 0 type rack
  step choose firstn 0 type host
  step chooseleaf firstn 0 type osd
  step emit
}

And after reset node, i can't find any usable info. Cluster works 
fine

and data just rebalanced by osd disks.
syslog:
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Starting Network 
Time

Synchronization...
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Started Network Time
Synchronization.
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May  9 19:30:02 srv-lab-ceph-node-01 CRON[1731]: (CRON) info (No MTA
installed, discarding output)
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: [origin
software="rsyslogd" swVersion="7.4.4" x-pid="689"
x-info="http://www.rsyslog.com";] start
May 11 11:54:56 srv-lab-ceph-node-01 rsyslogd: rsyslogd's groupid
changed to 103
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: rsyslogd's userid
changed to 100

Sorry for noise, guys. Georgios, in any way, thanks for helping.

2015-05-10 12:44 GMT+03:00 Georgios Dimitrakakis 
:

Timofey,

may be your best chance is to connect directly at the server and see 
what is

going on.
Then you can try debug why the problem occurred. If you don't want 
to wait

until tomorrow
you may try to see what is going on using the server's direct remote 
console

access.
The majority of the servers provide you with that just with a 
different name
each (DELL calls it iDRAC, Fujitsu iRMC, etc.) so if you have it up 
and

running you can use that.

I think this should be your starting point and you can take it on 
from

there.

I am sorry I cannot help you further with the Crush rules and the 
reason why

it crashed since I am far from being an expert in the field :-(

Regards,

George


Georgios, oh, sorry for my poor english _-_, may be I poor 
expressed

what i want =]

i know how to write simple Crush rule and how use it, i want 
several

things things:
1. Understand why, after inject bad map, my test node make offline.
This is unexpected.
2. May be somebody can explain what and why happens with this map.
3. This is not a problem to write several crushmap or/and switch it
while cluster running.
But, in production, we have several nfs servers, i think about 
moving

it to ceph, but i can't disable more then 1 server for maintenance
simultaneously. I want avoid data disaster while setup and moving 
data

to ceph, case like "Use local data replication, if only one node
exist" looks usable as temporally solution, while i not add second
node _-_.
4. May be some one also have test cluster and can test that happen
with clients, if crushmap like it was injected.

2015-05-10 8:23 GMT+03:00 Georgios Dimitrakakis 
:


Hi Timofey,

assuming that you have more than one OSD hosts and that the 
replicator
factor is equal (or less) to the number of the hosts why don't you 
just

change the crushmap to host replication?

You just need to change the default CRUSHmap rule from

step chooseleaf firstn 0 type osd

to

step chooseleaf firstn 0 type host

I believe that this is the easiest way to do have replication 
across OSD

nodes unless you have a much more "sophisticated" setup.

Regards,

George




Hi list,
i had experiments with crush maps, and I've try to get raid1 like
behaviour (if cluster have 1 working osd node, duplicate data 
across

local disk, for avoiding data lose in case local disk failure and
allow client working, because this is not a degraded state)
(
  in best case, i want dynamic rule, like:
  if has only one host -> spread data over local disks;
  else if host count > 1 -> spread over hosts (rack o something 
else);

)

i write rule, like below:

rule test {
  ruleset 0
  type replicated
  min_size 0
  max_size 10
  step take default
  step choose firstn 0 type host
  step chooseleaf firstn 0 type osd
  step emit
}

I've inject it in cluster and client node, now looks like have 
get
kernel panic, I've lost my connection with it. No ssh, no ping, 
this

is remote node and i can't see what happens until Monday.
Yes, it looks like I've shoot in my foot.
This is just a test setup and cluster destruction, not a problem, 
but
i think, what broken rules, must not crush something else and in 
worst

case, must be just ignored by cluster/crushtool compiler.

May be someone can explain, how this rule can crush system? May 
be

this is a crazy mistake somewhere?




--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists

Re: [ceph-users] Crush rule freeze cluster

2015-05-11 Thread Timofey Titovets
Hey! I caught it again. It's a kernel bug: the kernel crashed when I tried to
map an rbd device with a map like the one above!
Hooray!

2015-05-11 12:11 GMT+03:00 Timofey Titovets :
> FYI and history
> Rule:
> # rules
> rule replicated_ruleset {
>   ruleset 0
>   type replicated
>   min_size 1
>   max_size 10
>   step take default
>   step choose firstn 0 type room
>   step choose firstn 0 type rack
>   step choose firstn 0 type host
>   step chooseleaf firstn 0 type osd
>   step emit
> }
>
> And after reset node, i can't find any usable info. Cluster works fine
> and data just rebalanced by osd disks.
> syslog:
> May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
> May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Starting Network Time
> Synchronization...
> May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Started Network Time
> Synchronization.
> May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
> May  9 19:30:02 srv-lab-ceph-node-01 CRON[1731]: (CRON) info (No MTA
> installed, discarding output)
> May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: [origin
> software="rsyslogd" swVersion="7.4.4" x-pid="689"
> x-info="http://www.rsyslog.com";] start
> May 11 11:54:56 srv-lab-ceph-node-01 rsyslogd: rsyslogd's groupid changed to 
> 103
> May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: rsyslogd's userid changed to 
> 100
>
> Sorry for noise, guys. Georgios, in any way, thanks for helping.
>
> 2015-05-10 12:44 GMT+03:00 Georgios Dimitrakakis :
>> Timofey,
>>
>> may be your best chance is to connect directly at the server and see what is
>> going on.
>> Then you can try debug why the problem occurred. If you don't want to wait
>> until tomorrow
>> you may try to see what is going on using the server's direct remote console
>> access.
>> The majority of the servers provide you with that just with a different name
>> each (DELL calls it iDRAC, Fujitsu iRMC, etc.) so if you have it up and
>> running you can use that.
>>
>> I think this should be your starting point and you can take it on from
>> there.
>>
>> I am sorry I cannot help you further with the Crush rules and the reason why
>> it crashed since I am far from being an expert in the field :-(
>>
>> Regards,
>>
>> George
>>
>>
>>> Georgios, oh, sorry for my poor english _-_, may be I poor expressed
>>> what i want =]
>>>
>>> i know how to write simple Crush rule and how use it, i want several
>>> things things:
>>> 1. Understand why, after inject bad map, my test node make offline.
>>> This is unexpected.
>>> 2. May be somebody can explain what and why happens with this map.
>>> 3. This is not a problem to write several crushmap or/and switch it
>>> while cluster running.
>>> But, in production, we have several nfs servers, i think about moving
>>> it to ceph, but i can't disable more then 1 server for maintenance
>>> simultaneously. I want avoid data disaster while setup and moving data
>>> to ceph, case like "Use local data replication, if only one node
>>> exist" looks usable as temporally solution, while i not add second
>>> node _-_.
>>> 4. May be some one also have test cluster and can test that happen
>>> with clients, if crushmap like it was injected.
>>>
>>> 2015-05-10 8:23 GMT+03:00 Georgios Dimitrakakis :

 Hi Timofey,

 assuming that you have more than one OSD hosts and that the replicator
 factor is equal (or less) to the number of the hosts why don't you just
 change the crushmap to host replication?

 You just need to change the default CRUSHmap rule from

 step chooseleaf firstn 0 type osd

 to

 step chooseleaf firstn 0 type host

 I believe that this is the easiest way to do have replication across OSD
 nodes unless you have a much more "sophisticated" setup.

 Regards,

 George



> Hi list,
> i had experiments with crush maps, and I've try to get raid1 like
> behaviour (if cluster have 1 working osd node, duplicate data across
> local disk, for avoiding data lose in case local disk failure and
> allow client working, because this is not a degraded state)
> (
>   in best case, i want dynamic rule, like:
>   if has only one host -> spread data over local disks;
>   else if host count > 1 -> spread over hosts (rack o something else);
> )
>
> i write rule, like below:
>
> rule test {
>   ruleset 0
>   type replicated
>   min_size 0
>   max_size 10
>   step take default
>   step choose firstn 0 type host
>   step chooseleaf firstn 0 type osd
>   step emit
> }
>
> I've inject it in cluster and client node, now looks like have get
> kernel panic, I've lost my connection with it. No ssh, no ping, this
> is remote node and i can't see what happens until Monday.
> Yes, it looks like I've shoot in my foot.
> This is just a test setup and cluster destructio

Re: [ceph-users] Crush rule freeze cluster

2015-05-11 Thread Timofey Titovets
FYI and history
Rule:
# rules
rule replicated_ruleset {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step choose firstn 0 type room
  step choose firstn 0 type rack
  step choose firstn 0 type host
  step chooseleaf firstn 0 type osd
  step emit
}
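
A map like this can be checked offline with crushtool before it is injected, 
which shows how a rule maps (or fails to map) PGs without touching the cluster; 
a sketch assuming the decompiled map is in map.txt and the rule is ruleset 0:

crushtool -c map.txt -o map.bin
crushtool -i map.bin --test --rule 0 --num-rep 2 --show-statistics
crushtool -i map.bin --test --rule 0 --num-rep 2 --show-mappings | head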

And after resetting the node, I can't find any usable info. The cluster works fine
and the data was just rebalanced across the osd disks.
syslog:
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Starting Network Time
Synchronization...
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Started Network Time
Synchronization.
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May  9 19:30:02 srv-lab-ceph-node-01 CRON[1731]: (CRON) info (No MTA
installed, discarding output)
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: [origin
software="rsyslogd" swVersion="7.4.4" x-pid="689"
x-info="http://www.rsyslog.com"] start
May 11 11:54:56 srv-lab-ceph-node-01 rsyslogd: rsyslogd's groupid changed to 103
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: rsyslogd's userid changed to 100

Sorry for the noise, guys. Georgios, thanks anyway for your help.

2015-05-10 12:44 GMT+03:00 Georgios Dimitrakakis :
> Timofey,
>
> maybe your best chance is to connect directly to the server and see what is
> going on.
> Then you can try to debug why the problem occurred. If you don't want to wait
> until tomorrow,
> you may try to see what is going on using the server's direct remote console
> access.
> Most servers provide this, each under a different name (Dell calls it iDRAC,
> Fujitsu iRMC, etc.), so if you have it up and running you can use that.
>
> I think this should be your starting point and you can take it on from
> there.
>
> I am sorry I cannot help you further with the CRUSH rules and the reason why
> it crashed, since I am far from being an expert in the field :-(
>
> Regards,
>
> George
>
>
>> Georgios, oh, sorry for my poor English _-_, maybe I expressed what
>> I want poorly =]
>>
>> I know how to write a simple CRUSH rule and how to use it; I want several
>> things:
>> 1. To understand why my test node went offline after I injected the bad
>> map. This was unexpected.
>> 2. Maybe somebody can explain what happens with this map, and why.
>> 3. Writing several crushmaps and/or switching between them while the
>> cluster is running is not a problem.
>> But in production we have several NFS servers that I am thinking about
>> moving to Ceph, and I can't take more than one server down for maintenance
>> at a time. I want to avoid a data disaster while setting up and moving data
>> to Ceph; a rule like "use local data replication if only one node
>> exists" looks usable as a temporary solution until I add a second
>> node _-_.
>> 4. Maybe someone else with a test cluster can check what happens to
>> clients when a crushmap like this one is injected.
>>
>> 2015-05-10 8:23 GMT+03:00 Georgios Dimitrakakis :
>>>
>>> Hi Timofey,
>>>
>>> assuming that you have more than one OSD host and that the replication
>>> factor is equal to (or less than) the number of hosts, why don't you just
>>> change the crushmap to host replication?
>>>
>>> You just need to change the default CRUSHmap rule from
>>>
>>> step chooseleaf firstn 0 type osd
>>>
>>> to
>>>
>>> step chooseleaf firstn 0 type host
>>>
>>> I believe that this is the easiest way to have replication across OSD
>>> nodes unless you have a much more "sophisticated" setup.
>>>
>>> Regards,
>>>
>>> George
>>>
>>>
>>>
 Hi list,
 I have been experimenting with CRUSH maps, trying to get RAID1-like
 behaviour (if the cluster has only one working OSD node, duplicate the data
 across its local disks, to avoid data loss on a local disk failure and
 keep clients working, because this would not be a degraded state)
 (
   in the best case, I want a dynamic rule, like:
   if there is only one host -> spread data over the local disks;
   else if the host count > 1 -> spread over hosts (racks, or something else);
 )

 I wrote a rule like the one below:

 rule test {
   ruleset 0
   type replicated
   min_size 0
   max_size 10
   step take default
   step choose firstn 0 type host
   step chooseleaf firstn 0 type osd
   step emit
 }

 I injected it into the cluster and the client node, and now it looks like I
 got a kernel panic; I've lost my connection to it. No ssh, no ping; this
 is a remote node and I can't see what happened until Monday.
 Yes, it looks like I've shot myself in the foot.
 This is just a test setup and cluster destruction is not a problem, but I
 think a broken rule must not crash anything else; in the worst case it
 should just be ignored by the cluster/crushtool compiler.

 Maybe someone can explain how this rule could crash the system? Or maybe
 there is a crazy mistake somewhere?
>>>
>>>
>>>
>>> --
>>> _

Re: [ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Chris Hoy Poy
Hi Christian

In my experience, inconsistent PGs are almost always related back to a bad 
drive somewhere. They are going to keep happening, and with that many drives 
you still need to be diligent/aggressive in dropping bad drives and replacing 
them. 

If a drive returns an incorrect read, it can't be trusted from that point. Deep 
scrubs just serve to churn your bits and make sure you catch these errors early 
on. 

/Chris 
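
A rough triage along those lines, tying an inconsistent PG back to a suspect
drive (the PG id, OSD id and device below are made-up placeholders):

ceph health detail | grep inconsistent
# e.g.: pg 3.f1 is active+clean+inconsistent, acting [21,5,27]

# on each host in the acting set, look for the scrub error in the OSD log
grep -H 'ERR' /var/log/ceph/ceph-osd.21.log

# then check the physical disk behind the OSD that reported the error
smartctl -a /dev/sdX | egrep -i 'reallocated|pending|uncorrect'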


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Eichelmann
Sent: Monday, 11 May 2015 4:10 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Scrub Error / How does ceph pg repair work?

Hi all!

We are experiencing approximately 1 scrub error / inconsistent pg every two 
days. As far as I know, to fix this you can issue a "ceph pg repair", which 
works fine for us. I have a few questions regarding the behavior of the ceph 
cluster in such a case:

1. After ceph detects the scrub error, the pg is marked as inconsistent.
Does that mean that any IO to this pg is blocked until it is repaired?

2. Is this amount of scrub errors normal? We currently have only 150TB in our 
cluster, distributed over 720 2TB disks.

3. As far as I know, a "ceph pg repair" just copies the content of the primary 
pg to all replicas. Is this still the case? What if the primary copy is the one 
with the errors? We have a 4x replication level, and it would be cool if ceph 
recovered from one of the copies whose checksum matches the majority.

4. Some of these errors happen at night. Since ceph reports this as a 
critical error, our shift is called and woken up, just to issue a single 
command. Do you see any problems in triggering this command automatically via a 
monitoring event? Is there a reason why ceph isn't resolving these errors 
itself when it has enough replicas to do so?

Regards,
Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-fuse options: writeback cache

2015-05-11 Thread Kenneth Waegeman

Hi all,

I have a few questions about ceph-fuse options:
- Is the fuse writeback cache being used? How can we see this? Can it be 
turned on with allow_wbcache somehow?


- What is the default of the big_writes option (as seen in 
/usr/bin/ceph-fuse --help)? Where can we see this?
If we run ceph-fuse like this: ceph-fuse /mnt/ceph -o 
max_write=$((1024*1024*64)),big_writes

we don't see any of this in the output of mount:
ceph-fuse on /mnt/ceph type fuse.ceph-fuse 
(rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)


Can we see this somewhere else?

Many thanks!!

Kenneth
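
One place to see what a running ceph-fuse actually picked up is its admin
socket, assuming it is enabled; the socket path below is a typical default and
varies with the client name and pid:

ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config show | grep -i fuse

# performance counters (including cache behaviour) are exposed the same way
ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok perf dump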
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Christian Eichelmann
Hi all!

We are experiencing approximately 1 scrub error / inconsistent pg every
two days. As far as I know, to fix this you can issue a "ceph pg
repair", which works fine for us. I have a few qestions regarding the
behavior of the ceph cluster in such a case:

1. After ceph detects the scrub error, the pg is marked as inconsistent.
Does that mean that any IO to this pg is blocked until it is repaired?

2. Is this amount of scrub errors normal? We currently have only 150TB
in our cluster, distributed over 720 2TB disks.

3. As far as I know, a "ceph pg repair" just copies the content of the
primary pg to all replicas. Is this still the case? What if the primary
copy is the one with the errors? We have a 4x replication level, and it
would be cool if ceph recovered from one of the copies whose checksum
matches the majority.

4. Some of these errors happen at night. Since ceph reports this
as a critical error, our shift is called and woken up, just to issue a
single command. Do you see any problems in triggering this command
automatically via a monitoring event? Is there a reason why ceph isn't
resolving these errors itself when it has enough replicas to do so?

Regards,
Christian
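
For reference, the manual workflow discussed in this thread boils down to a
handful of commands (the PG id is a made-up example; as stressed elsewhere in
the thread, verify which replica is good before repairing):

ceph health detail | grep inconsistent
# e.g.: pg 3.f1 is active+clean+inconsistent, acting [12,5,27]

ceph pg deep-scrub 3.f1   # re-check the PG after investigating the replicas
ceph pg repair 3.f1       # only once the primary is known to hold good data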
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] very different performance on two volumes in the same pool #2

2015-05-11 Thread Somnath Roy
Nik,
If you increase num_jobs beyond 4, does it help further? Try 8 or so.
Yeah, libsoft* is definitely consuming some CPU cycles, but I don't know how 
to resolve that.
Also, acpi_processor_ffh_cstate_enter popped up and is consuming a lot of CPU. 
Try disabling C-states and running the CPU in maximum performance mode; this 
may give you some boost.

Thanks & Regards
Somnath

-Original Message-
From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz] 
Sent: Sunday, May 10, 2015 11:32 PM
To: Somnath Roy
Cc: ceph-users; n...@linuxbox.cz
Subject: Re: [ceph-users] very different performance on two volumes in the same 
pool #2

On Mon, May 11, 2015 at 06:07:21AM +, Somnath Roy wrote:
> Yes, you need to run the fio clients on a separate box; they will take quite 
> a bit of CPU.
> If you stop OSDs on other nodes, rebalancing will start. Have you waited for 
> the cluster to return to the active+clean state? If you run while rebalancing 
> is going on, the performance will be impacted.
I set noout, so there was no rebalancing; I forgot to mention that..


> 
> ~110% CPU utilisation seems pretty low. Try running fio-rbd with more 
> num_jobs (say 3 or 4, or more); iodepth=64 is fine. See if it improves 
> performance or not.
OK, increasing the jobs to 4 seems to squeeze a bit more out of the cluster, 
about 43.3K IOPS..

OSD CPU utilisation jumps to ~300% on both surviving nodes, so there still 
seems to be some headroom.. 
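
A 4-job invocation along those lines might look like this (a sketch reusing
the pool name from the original post; the image name is a placeholder, and
--group_reporting merges the per-job numbers into one result):

fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 \
    --name=test --pool=ssd3r --rbdname=test --invalidate=1 \
    --bs=4k --iodepth=64 --numjobs=4 --group_reporting --readwrite=randread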

> Also, since you have 3 OSDs (3 nodes?), I would suggest tweaking the 
> following settings:
> 
> osd_op_num_threads_per_shard
> osd_op_num_shards
> 
> May be (1,10 / 1,15 / 2, 10 ?).
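
Those are startup options, so they belong in ceph.conf on the OSD nodes and
take effect after an OSD restart; a sketch using one of the suggested
combinations (the Hammer defaults are reportedly 5 shards x 2 threads):

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
osd_op_num_shards = 10
osd_op_num_threads_per_shard = 1
EOF
# then restart the OSDs on that node and re-run the benchmark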

I tried all those combinations, but they make almost no difference..

Do you think I could get more than those 43k?

one more thing that makes me wonder a bit is this line I can see in perf:
  2.21%  libsoftokn3.so [.] 0x0001ebb2

I suppose this has something to do with name resolution; 2.2% seems quite a 
lot to me..
Should I be worried about it? Does it make sense to enable kernel DNS resolving 
support in ceph?

thanks for your time Somnath!

nik



> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz]
> Sent: Sunday, May 10, 2015 10:33 PM
> To: Somnath Roy
> Cc: ceph-users; n...@linuxbox.cz
> Subject: Re: [ceph-users] very different performance on two volumes in 
> the same pool #2
> 
> 
> On Mon, May 11, 2015 at 05:20:25AM +, Somnath Roy wrote:
> > Two things..
> > 
> > 1. You should always precondition SSD drives before using them for 
> > benchmarking.
> well, I don't really understand... ?
> 
> > 
> > 2. After creating and mapping an rbd LUN, you need to write data to it 
> > before reading it back, otherwise the fio output will be misleading. In 
> > fact, I think you will see the IO is not even hitting the cluster (check 
> > with ceph -s)
> yes, so this confirms my conjecture. OK.
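
A fill pass along those lines, so that later random reads exercise real data
instead of unallocated extents (pool and image names reuse the earlier
example; a sketch, not a tuned job, and the rbd engine picks up the image size
by itself):

fio --name=fill --ioengine=rbd --pool=ssd3r --rbdname=test \
    --rw=write --bs=1M --iodepth=32 --direct=1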
> 
> 
> > 
> > Now, if you are saying it's a 3 OSD setup, yes, ~23K is pretty low. Check 
> > the following.
> > 
> > 1. Check whether the client or OSD node CPU is saturating or not.
> On the OSD nodes, I can see ceph-osd CPU utilisation of ~110%. On the client 
> node (which is one of the OSD nodes as well), I can see fio eating quite a 
> lot of CPU cycles.. I tried stopping ceph-osd on this node (so only two nodes 
> are serving data) and performance got a bit higher, to ~33k IOPS. But I still 
> think it's not very good..
> 
> 
> > 
> > 2. With 4K IOs, hopefully the network bandwidth is fine
> I think it's ok..
> 
> 
> > 
> > 3. Number of PGs/pool should be ~128 or so.
> I'm using pg_num 128
> 
> 
> > 
> > 4. If you are using krbd, you might want to try the latest krbd module, 
> > where the TCP_NODELAY problem is fixed. If you don't want that complexity, 
> > try fio-rbd.
> I'm not using krbd (only for writing data to the volume); for benchmarking 
> I'm using fio-rbd.
> 
> anything else I could check?
> 
> 
> > 
> > Hope this helps,
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
> > Behalf Of Nikola Ciprich
> > Sent: Sunday, May 10, 2015 9:43 PM
> > To: ceph-users
> > Cc: n...@linuxbox.cz
> > Subject: [ceph-users] very different performance on two volumes in 
> > the same pool #2
> > 
> > Hello ceph developers and users,
> > 
> > some time ago, I posted here a question regarding very different 
> > performance for two volumes in one pool (backed by SSD drives).
> > 
> > After some examination, I probably got to the root of the problem..
> > 
> > When I create a fresh volume (i.e. rbd create --image-format 2 --size
> > 51200 ssd/test) and run a random-IO fio benchmark
> > 
> > fio  --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 
> > --name=test --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k
> > --iodepth=64 --readwrite=randread
> > 
> > I get very nice performance of up to 200k IOPS. However, once the volume is 
> > written to (i.e. when I map it using rbd map and dd the whole volume with 
> > some random data), and repe