Re: [ceph-users] [Jewel] Crash Osd with void Hit_set_trim

2017-10-23 Thread Brad Hubbard
On Mon, Oct 23, 2017 at 4:51 PM, pascal.pu...@pci-conseil.net <
pascal.pu...@pci-conseil.net> wrote:

> Hello,
> On 23/10/2017 at 02:05, Brad Hubbard wrote:
>
> 2017-10-22 17:32:56.031086 7f3acaff5700  1 osd.14 pg_epoch: 72024
> pg[37.1c( v 71593'41657 (60849'38594,71593'41657] local-les=72023 n=13
> ec=7037 les/c/f 72023/72023/66447 72022/72022/72022) [14,1,41] r=0
> lpr=72022 crt=71593'41657 lcod 0'
> 0 mlcod 0'0 active+clean] hit_set_trim 
> 37:3800:.ceph-internal::hit_set_37.1c_archive_2017-08-31
> 01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head not found
> 2017-10-22 17:32:56.033936 7f3acaff5700 -1 osd/ReplicatedPG.cc: In
> function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
> unsigned int)' thread 7f3acaff5700 time 2017-10-22 17:32:56.031105
> osd/ReplicatedPG.cc: 11782: FAILED assert(obc)
>
> It appears to be looking for (and failing to find) a hitset object with a
> timestamp from August? Does that sound right to you? Of course, it appears
> an object for that timestamp does not exist.
>
> How is it possible? How can I fix it? I am sure that if I run a lot of reads,
> other objects like this will crash other OSDs.
> (The cluster is OK now; I will probably destroy OSD 14 and recreate it.)
> How can I find this object?
>

You should be able to do a find on the OSDs' filestores and grep the output
for 'hit_set_37.1c_archive_2017-08-31'. I'd start with the OSDs responsible
for pg 37.1c and then move on to the others if it's feasible.
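
Something along these lines should work. This is a sketch assuming the default
filestore path and that osd.14 holds pg 37.1c (adjust the OSD id and PG directory
to your layout; filestore escapes underscores in on-disk names, e.g. "hit\uset\u...",
so grep for a looser pattern):

    find /var/lib/ceph/osd/ceph-14/current/37.1c_head/ -type f \
        | grep '37.1c.*archive.*2017-08-31'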

Let us know the results.


> For information: all Ceph servers are NTP time-synchronized.
>
> What are the settings for this cache tier?
>
>
> Just a tier in "writeback" mode on an erasure 2+1 pool.
>
> # ceph osd pool get cache-nvme-data all
> size: 3
> min_size: 2
> crash_replay_interval: 0
> pg_num: 512
> pgp_num: 512
> crush_ruleset: 10
> hashpspool: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> hit_set_type: bloom
> hit_set_period: 14400
> hit_set_count: 12
> hit_set_fpp: 0.05
> use_gmt_hitset: 1
> auid: 0
> target_max_objects: 100
> target_max_bytes: 1000
> cache_target_dirty_ratio: 0.4
> cache_target_dirty_high_ratio: 0.6
> cache_target_full_ratio: 0.8
> cache_min_flush_age: 600
> cache_min_evict_age: 1800
> min_read_recency_for_promote: 1
> min_write_recency_for_promote: 1
> fast_read: 0
> hit_set_grade_decay_rate: 0
> hit_set_search_last_n: 0
>
> #  ceph osd pool get raid-2-1-data all
> size: 3
> min_size: 2
> crash_replay_interval: 0
> pg_num: 1024
> pgp_num: 1024
> crush_ruleset: 8
> hashpspool: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> auid: 0
> erasure_code_profile: raid-2-1
> min_write_recency_for_promote: 0
> fast_read: 0
>
> # ceph osd erasure-code-profile get raid-2-1
> jerasure-per-chunk-alignment=false
> k=2
> m=1
> plugin=jerasure
> ruleset-failure-domain=host
> ruleset-root=default
> technique=reed_sol_van
> w=8
>
> Could you check your logs for any errors from the 'agent_load_hit_sets'
> function?
>
>
> Attached log: #  pdsh -R exec -w ceph-osd-01,ceph-osd-02,ceph-osd-03,ceph-osd-04
> ssh -x  %h 'zgrep -B10 -A10 agent_load_hit_sets
> /var/log/ceph/ceph-osd.*gz'|less > log_agent_load_hit_sets.log
>
> On 19 October, in the morning, I restarted OSD 14.
>
> thanks for your help.
>
> regards,
>
>
> On Mon, Oct 23, 2017 at 2:41 AM, pascal.pu...@pci-conseil.net <
> pascal.pu...@pci-conseil.net> wrote:
>
>> Hello,
>>
>> Today I ran a lot of read IO with a simple rsync... and again, an OSD
>> crashed:
>>
>> But as before, I can't restart the OSD. It keeps crashing. So the OSD is
>> out and the cluster is recovering.
>>
>> I just had time to increase the OSD log level.
>>
>> # ceph tell osd.14 injectargs --debug-osd 5/5
>>
>> Attached log:
>>
>> # grep -B100 -100 objdump /var/log/ceph/ceph-osd.14.log
>>
>> If I run another read, another OSD will probably crash.
>>
>> Any idea?
>>
>> I will probably plan to move data from the erasure pool to a 3x replicated
>> pool. It's becoming unstable without any change.
>>
>> Regards,
>>
>> PS: Last Sunday, I lost an RBD header during removal of the cache tier... a lot
>> of thanks to http://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/
>> for helping me recreate it and resurrect the RBD disk :)
>> On 19/10/2017 at 00:19, Brad Hubbard wrote:
>>
>> On Wed, Oct 18, 2017 at 11:16 PM, 
>> pascal.pu...@pci-conseil.net 
>>  wrote:
>>
>> Hello,
>>
>> For 2 weeks, I have occasionally been losing some OSDs.
>> Here is the trace:
>>
>> 0> 2017-10-18 05:16:40.873511 7f7c1e497700 -1 osd/ReplicatedPG.cc: In
>> function '*void ReplicatedPG::hit_set_trim(*ReplicatedPG::OpContextUPtr&,
>> unsigned int)' thread 7f7c1e497700 time 2017-10-18 05:16:40.869962
>> osd/ReplicatedPG.cc: 11782: FAILED assert(obc)
>>
>> Can you try to capture a log with debug_osd set to 10 or greater as
>> per 

Re: [ceph-users] Inconsistent PG won't repair

2017-10-23 Thread Richard Bade
What I'm thinking about trying is using the ceph-objectstore-tool to
remove the offending clone metadata. From the help the syntax is this:
ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid>
i.e. something like this for my object and expected clone from the log message:
ceph-objectstore-tool rbd_data.19cdf512ae8944a.0001bb56
remove-clone-metadata 148d2
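
For reference, a fuller sketch of such an invocation (the OSD id and paths are
assumptions based on the filestore path mentioned below; the OSD has to be stopped
first, and on some versions the object may need to be given in the JSON form
printed by '--op list'):

    stop ceph-osd id=23        # or: systemctl stop ceph-osd@23
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 \
        --journal-path /var/lib/ceph/osd/ceph-23/journal \
        rbd_data.19cdf512ae8944a.0001bb56 remove-clone-metadata 148d2
    start ceph-osd id=23       # or: systemctl start ceph-osd@23
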
Anyone had experience with this? I'm not 100% sure if this will
resolve the issue or cause much the same situation (since it's already
expecting a clone that's not there currently).

Rich

On 21 October 2017 at 14:13, Brad Hubbard  wrote:
> On Sat, Oct 21, 2017 at 1:59 AM, Richard Bade  wrote:
>> Hi Lincoln,
>> Yes the object is 0-bytes on all OSD's. Has the same filesystem
>> date/time too. Before I removed the rbd image (migrated disk to
>> different pool) it was 4MB on all the OSD's and md5 checksum was the
>> same on all so it seems that only metadata is inconsistent.
>> Thanks for your suggestion, I just looked into this as I thought maybe
>> I can delete the object (since it's empty anyway). But I just get file
>> not found:
>> ~$ rados stat rbd_data.19cdf512ae8944a.0001bb56 --pool=tier3-rbd-3X
>>  error stat-ing
>> tier3-rbd-3X/rbd_data.19cdf512ae8944a.0001bb56: (2) No such
>> file or directory
>
> Maybe try downing the osds involved?
>
>>
>> Regards,
>> Rich
>>
>> On 21 October 2017 at 04:32, Lincoln Bryant  wrote:
>>> Hi Rich,
>>>
>>> Is the object inconsistent and 0-bytes on all OSDs?
>>>
>>> We ran into a similar issue on Jewel, where an object was empty across the 
>>> board but had inconsistent metadata. Ultimately it was resolved by doing a 
>>> "rados get" and then a "rados put" on the object. *However* that was a last 
>>> ditch effort after I couldn't get any other repair option to work, and I 
>>> have no idea if that will cause any issues down the road :)
>>>
>>> --Lincoln
>>>
 On Oct 20, 2017, at 10:16 AM, Richard Bade  wrote:

 Hi Everyone,
 In our cluster running 0.94.10 we had a pg pop up as inconsistent
 during scrub. Previously when this has happened running ceph pg repair
 [pg_num] has resolved the problem. This time the repair runs but it
 remains inconsistent.
 ~$ ceph health detail
 HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
 pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
 1 scrub errors

 The error in the logs is:
 cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir
 expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2

 Now, I've tried several things to resolve this. I've tried stopping
 each of the osd's in turn and running a repair. I've located the rbd
 image and removed it to empty out the object. The object is now zero
 bytes but still inconsistent. I've tried stopping each osd, removing
 the object and starting the osd again. It correctly identifies the
 object as missing and repair works to fix this but it still remains
 inconsistent.
 I've run out of ideas.
 The object is now zero bytes:
 ~$ find /var/lib/ceph/osd/ceph-23/current/3.f05_head/ -name
 "*19cdf512ae8944a.0001bb56*" -ls
 537598582  0 -rw-r--r--   1 root root0 Oct 21
 03:54 
 /var/lib/ceph/osd/ceph-23/current/3.f05_head/DIR_5/DIR_0/DIR_F/DIR_5/DIR_B/rbd\\udata.19cdf512ae8944a.0001bb56__snapdir_68AB5F05__3

 How can I resolve this? Is there some way to remove the empty object
 completely? I saw reference to ceph-objectstore-tool which has some
 options to remove-clone-metadata but I don't know how to use this.
 Will using this to remove the mentioned 148d2 expected clone resolve
 this? Or would this do the opposite as it would seem that it can't
 find that clone?
 Documentation on this tool is sparse.

 Any help here would be appreciated.

 Regards,
 Rich
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Retrieve progress of volume flattening using RBD python library

2017-10-23 Thread Xavier Trilla
Hi Jason,

Thanks for your reply.

Ok, well,  we’ll look into it then ;)

Thanks,
Xavier


On 23 Oct 2017, at 17:23, Jason Dillaman
> wrote:

The current RBD python API does not expose callbacks from the wrapped
C API so it is not currently possible to retrieve the flatten, remove,
etc progress indications. Improvements to the API are always welcomed.

On Mon, Oct 23, 2017 at 11:06 AM, Xavier Trilla
> wrote:
Hi guys,



No ideas about how to do that? Does anybody know where we could ask about
librbd python library usage?



Thanks!

Xavier.



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of
Xavier Trilla
Sent: Tuesday, 17 October 2017 11:55
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Retrieve progress of volume flattening using RBD python
library



Hi,



Does anybody know if there is a way to inspect the progress of a volume
flattening while using the python rbd library?



I mean, using the CLI is it possible to see the progress of the flattening,
but when calling volume.flatten() it just blocks until it’s done.



Is there any way to infer the progress?



Hope somebody may help.



Thanks!

Xavier




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profile

2017-10-23 Thread Jorge Pinilla López
If you use an OSD failure domain and a node goes down, you can lose your
data and the cluster won't be able to work.

If you restart the OSDs it might recover, but you could even lose your data
if your cluster can't rebuild itself.

You can try to work out where the CRUSH rule is going to place your data, but I
wouldn't risk so much.

If you have 8 nodes, maybe you could use K=8 and M=2, divided across the nodes
so you would have 6 nodes with 1 chunk and 2 nodes with 2 chunks; that way, if
you are unlucky and lose a 2-chunk node, you can still rebuild the data.
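
A minimal sketch of creating such a profile (the profile name here is made up;
Jewel-era releases use ruleset-failure-domain, Luminous uses crush-failure-domain):

    ceph osd erasure-code-profile set ec-8-2 k=8 m=2 ruleset-failure-domain=host
    ceph osd erasure-code-profile get ec-8-2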


On 23/10/2017 at 21:53, David Turner wrote:
> This can be changed to a failure domain of OSD in which case it could
> satisfy the criteria.  The problem with a failure domain of OSD, is
> that all of your data could reside on a single host and you could lose
> access to your data after restarting a single host.
>
> On Mon, Oct 23, 2017 at 3:23 PM LOPEZ Jean-Charles  > wrote:
>
> Hi,
>
> the default failure domain if not specified on the CLI at the
> moment you create your EC profile is set to HOST. So you need 14
> OSDs spread across 14 different nodes by default. And you only
> have 8 different nodes.
>
> Regards
> JC
>
>> On 23 Oct 2017, at 21:13, Karun Josy > > wrote:
>>
>> Thank you for the reply.
>>
>> There are 8 OSD nodes with 23 OSDs in total. (However, they are
>> not distributed equally on all nodes)
>>
>> So it satisfies that criteria, right?
>>
>>
>>
>> Karun Josy
>>
>> On Tue, Oct 24, 2017 at 12:30 AM, LOPEZ Jean-Charles
>> > wrote:
>>
>> Hi,
>>
>> yes you need as many OSDs that k+m is equal to. In your
>> example you need a minimum of 14 OSDs for each PG to become
>> active+clean.
>>
>> Regards
>> JC
>>
>>> On 23 Oct 2017, at 20:29, Karun Josy >> > wrote:
>>>
>>> Hi,
>>>
>>> While creating a pool with erasure code profile k=10, m=4, I
>>> get PG status as
>>> "200 creating+incomplete"
>>>
>>> While creating pool with profile k=5, m=3 it works fine.
>>>
>>> Cluster has 8 OSDs with total 23 disks.
>>>
>>> Is there any requirements for setting the first profile ?
>>>
>>> Karun 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 

*Jorge Pinilla López*
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profile

2017-10-23 Thread David Turner
This can be changed to a failure domain of OSD in which case it could
satisfy the criteria.  The problem with a failure domain of OSD, is that
all of your data could reside on a single host and you could lose access to
your data after restarting a single host.
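
For reference, a quick sketch of switching the failure domain to OSD (profile and
pool names are made up; Jewel uses ruleset-failure-domain, Luminous uses
crush-failure-domain):

    ceph osd erasure-code-profile set ec-10-4-osd k=10 m=4 ruleset-failure-domain=osd
    ceph osd pool create ecpool 128 128 erasure ec-10-4-osd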

On Mon, Oct 23, 2017 at 3:23 PM LOPEZ Jean-Charles 
wrote:

> Hi,
>
> the default failure domain if not specified on the CLI at the moment you
> create your EC profile is set to HOST. So you need 14 OSDs spread across 14
> different nodes by default. And you only have 8 different nodes.
>
> Regards
> JC
>
> On 23 Oct 2017, at 21:13, Karun Josy  wrote:
>
> Thank you for the reply.
>
> There are 8 OSD nodes with 23 OSDs in total. (However, they are not
> distributed equally on all nodes)
>
> So it satisfies that criteria, right?
>
>
>
> Karun Josy
>
> On Tue, Oct 24, 2017 at 12:30 AM, LOPEZ Jean-Charles 
> wrote:
>
>> Hi,
>>
>> yes you need as many OSDs that k+m is equal to. In your example you need
>> a minimum of 14 OSDs for each PG to become active+clean.
>>
>> Regards
>> JC
>>
>> On 23 Oct 2017, at 20:29, Karun Josy  wrote:
>>
>> Hi,
>>
>> While creating a pool with erasure code profile k=10, m=4, I get PG
>> status as
>> "200 creating+incomplete"
>>
>> While creating pool with profile k=5, m=3 it works fine.
>>
>> Cluster has 8 OSDs with total 23 disks.
>>
>> Is there any requirements for setting the first profile ?
>>
>> Karun
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profile

2017-10-23 Thread LOPEZ Jean-Charles
Hi,

the default failure domain if not specified on the CLI at the moment you create 
your EC profile is set to HOST. So you need 14 OSDs spread across 14 different 
nodes by default. And you only have 8 different nodes.

Regards
JC

> On 23 Oct 2017, at 21:13, Karun Josy  wrote:
> 
> Thank you for the reply.
> 
> There are 8 OSD nodes with 23 OSDs in total. (However, they are not 
> distributed equally on all nodes)
> 
> So it satisfies that criteria, right?
> 
> 
> 
> Karun Josy
> 
> On Tue, Oct 24, 2017 at 12:30 AM, LOPEZ Jean-Charles  > wrote:
> Hi,
> 
> yes you need as many OSDs that k+m is equal to. In your example you need a 
> minimum of 14 OSDs for each PG to become active+clean.
> 
> Regards
> JC
> 
>> On 23 Oct 2017, at 20:29, Karun Josy > > wrote:
>> 
>> Hi,
>> 
>> While creating a pool with erasure code profile k=10, m=4, I get PG status as
>> "200 creating+incomplete"
>> 
>> While creating pool with profile k=5, m=3 it works fine.
>> 
>> Cluster has 8 OSDs with total 23 disks.
>> 
>> Is there any requirements for setting the first profile ?
>> 
>> Karun 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profile

2017-10-23 Thread Karun Josy
Thank you for the reply.

There are 8 OSD nodes with 23 OSDs in total. (However, they are not
distributed equally on all nodes)

So it satisfies that criteria, right?



Karun Josy

On Tue, Oct 24, 2017 at 12:30 AM, LOPEZ Jean-Charles 
wrote:

> Hi,
>
> yes you need as many OSDs that k+m is equal to. In your example you need a
> minimum of 14 OSDs for each PG to become active+clean.
>
> Regards
> JC
>
> On 23 Oct 2017, at 20:29, Karun Josy  wrote:
>
> Hi,
>
> While creating a pool with erasure code profile k=10, m=4, I get PG status
> as
> "200 creating+incomplete"
>
> While creating pool with profile k=5, m=3 it works fine.
>
> Cluster has 8 OSDs with total 23 disks.
>
> Is there any requirements for setting the first profile ?
>
> Karun
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profile

2017-10-23 Thread Jorge Pinilla López
I have one question: what can and can't a cluster do while working in degraded
mode?

With K=10 + M=4, if one of my OSD nodes fails it will start working in
degraded mode, but can I still do writes and reads from that pool?


On 23/10/2017 at 21:01, Ronny Aasen wrote:
> On 23.10.2017 20:29, Karun Josy wrote:
>> Hi,
>>
>> While creating a pool with erasure code profile k=10, m=4, I get PG
>> status as
>> "200 creating+incomplete"
>>
>> While creating pool with profile k=5, m=3 it works fine.
>>
>> Cluster has 8 OSDs with total 23 disks.
>>
>> Is there any requirements for setting the first profile ?
>
>
> you need K+M+X OSD nodes. K and M come from the profile; X is how
> many nodes you want to be able to tolerate the failure of without
> becoming degraded (how many failed nodes Ceph should be able to
> automatically heal).
>
> so with K=10 + M=4 you need a minimum of 14 nodes and you have 0 fault
> tolerance (a single failure = a degraded cluster), so you have to
> scramble to replace the node to get HEALTH_OK again.  If you have 15
> nodes you can lose 1 node and Ceph will automatically rebalance onto
> the 14 needed nodes, and you can replace the lost node at your leisure.
>
> kind regards
> Ronny Aasen
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 

*Jorge Pinilla López*
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profile

2017-10-23 Thread Ronny Aasen

On 23.10.2017 20:29, Karun Josy wrote:

Hi,

While creating a pool with erasure code profile k=10, m=4, I get PG 
status as

"200 creating+incomplete"

While creating pool with profile k=5, m=3 it works fine.

Cluster has 8 OSDs with total 23 disks.

Is there any requirements for setting the first profile ?



you need K+M+X OSD nodes. K and M come from the profile; X is how many 
nodes you want to be able to tolerate the failure of without becoming 
degraded (how many failed nodes Ceph should be able to automatically heal).


so with K=10 + M=4 you need a minimum of 14 nodes and you have 0 fault 
tolerance (a single failure = a degraded cluster), so you have to 
scramble to replace the node to get HEALTH_OK again.  If you have 15 
nodes you can lose 1 node and Ceph will automatically rebalance onto the 
14 needed nodes, and you can replace the lost node at your leisure.


kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profile

2017-10-23 Thread LOPEZ Jean-Charles
Hi,

Yes, you need at least as many OSDs as k+m adds up to. In your example you need a 
minimum of 14 OSDs for each PG to become active+clean.

Regards
JC

> On 23 Oct 2017, at 20:29, Karun Josy  wrote:
> 
> Hi,
> 
> While creating a pool with erasure code profile k=10, m=4, I get PG status as
> "200 creating+incomplete"
> 
> While creating pool with profile k=5, m=3 it works fine.
> 
> Cluster has 8 OSDs with total 23 disks.
> 
> Is there any requirements for setting the first profile ?
> 
> Karun 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure code profile

2017-10-23 Thread Karun Josy
Hi,

While creating a pool with erasure code profile k=10, m=4, I get PG status
as
"200 creating+incomplete"

While creating pool with profile k=5, m=3 it works fine.

Cluster has 8 OSDs with total 23 disks.

Is there any requirements for setting the first profile ?

Karun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Qs on caches, and cephfs

2017-10-23 Thread John Spray
On Mon, Oct 23, 2017 at 7:50 AM, Jeff  wrote:
> Hey everyone,
>
> Long time listener first time caller.
> Thank you to everyone who works on Ceph, docs and code, I'm loving Ceph.
> I've been playing with Ceph for awhile and have a few Qs.
>
> Ceph cache tiers, can you have multiple tiered caches?
>
> Also with cache tiers, can you have one cache pool for multiple backing
> storage pools? The docs seem to be very careful about specifying one
> pool so I suspect I know the answer already.
>
> For CephFS, how do you execute a manual install and manual removal for MDS?
>
> The docs explain how to use ceph-deploy for MDS installs, but I'm trying
> to do everything manually right now to get a better understanding of it
> all.

You can pretty much look at what ceph-deploy has done and replicate
it.  An MDS just needs its keyring file in the correct /var/lib/ceph
subdirectory.

MDS removal is simply a case of removing its /var/lib/ceph directory.
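
A minimal sketch of that, assuming a systemd host and using the host's short name
as the daemon name (adjust the caps and paths to taste):

    name=$(hostname -s)
    mkdir -p /var/lib/ceph/mds/ceph-$name
    ceph auth get-or-create mds.$name mon 'allow profile mds' osd 'allow rwx' mds 'allow' \
        -o /var/lib/ceph/mds/ceph-$name/keyring
    chown -R ceph:ceph /var/lib/ceph/mds/ceph-$name
    systemctl enable --now ceph-mds@$name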

> The ceph docs seem to be version controlled but I can't seem to find the
> repo to update, if you can point me to it I'd be happy to submit patches
> to it.

https://github.com/ceph/ceph/tree/master/doc

John

>
> Thnx in advance!
> Jeff.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Continuous error: "libceph: monX session lost, hunting for new mon" on one host

2017-10-23 Thread Marco Baldini - H.S. Amiata

Hi

Thanks for the reply, but my servers have various networks, so I think I have 
to tell Ceph which network it should use.




On 23/10/2017 18:10, Denes Dolhay wrote:

Hi,

I only have a virtual PoC cluster created by ceph-deploy, using only 
one network, same as you.


I just checked, it's configuration does not contain either public nor 
cluster network. I guess when there is only one there is no point...



Denes.

On 10/23/2017 05:52 PM, Marco Baldini - H.S. Amiata wrote:

Hi

I used the tool pveceph provided with Proxmox to initialize ceph, I 
can change but in that case should I put only public network or only 
cluster network in ceph.conf?


Thanks





On 23/10/2017 17:33, Denes Dolhay wrote:

Hi,

So, you are running both the public and the cluster on the same 
network, this is supported, but in this case you do not have to 
specify any of the networks in the configuration. It is just a wild 
guess, but maybe this is the cause of your problem!



Denes.


On 10/23/2017 04:26 PM, Alwin Antreich wrote:

Hi Marco,

On Mon, Oct 23, 2017 at 04:10:34PM +0200, Marco Baldini - H.S. 
Amiata wrote:

Thanks for reply

My ceph.conf:

    [global]
  auth client required = none
  auth cluster required = none
  auth service required = none
  bluestore_block_db_size = 64424509440
  *cluster network = 10.10.10.0/24*
  fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
  keyring = /etc/pve/priv/$cluster.$name.keyring
  mon allow pool delete = true
  osd journal size = 5120
  osd pool default min size = 2
  osd pool default size = 3
  *public network = 10.10.10.0/24*

    [client]
  rbd cache = true
  rbd cache max dirty = 134217728
  rbd cache max dirty age = 2
  rbd cache size = 268435456
  rbd cache target dirty = 67108864
  rbd cache writethrough until flush = true

    [osd]
  keyring = /var/lib/ceph/osd/ceph-$id/keyring

    [mon.pve-hs-3]
  host = pve-hs-3
  mon addr = 10.10.10.253:6789

    [mon.pve-hs-main]
  host = pve-hs-main
  mon addr = 10.10.10.251:6789

    [mon.pve-hs-2]
  host = pve-hs-2
  mon addr = 10.10.10.252:6789


Each node has two ethernet cards in LACP bond on network 10.10.10.x

auto bond0
iface bond0 inet static
 address  10.10.10.252
 netmask  255.255.255.0
 slaves enp4s0 enp4s1
 bond_miimon 100
 bond_mode 802.3ad
 bond_xmit_hash_policy layer3+4
#CLUSTER BOND


The LAG on switch (TPLink TL-SG2008) is enabled, I see from "show 
run"


#
interface gigabitEthernet 1/0/1

   channel-group 4 mode active
#
interface gigabitEthernet 1/0/2

   channel-group 4 mode active
#
interface gigabitEthernet 1/0/3

   channel-group 2 mode active
#
interface gigabitEthernet 1/0/4

   channel-group 2 mode active
#
interface gigabitEthernet 1/0/5

   channel-group 3 mode active
#
interface gigabitEthernet 1/0/6

   channel-group 3 mode active
#
interface gigabitEthernet 1/0/7

#
interface gigabitEthernet 1/0/8


Node 1 is on port 1 and 2, node 2 on port 3 and 4, node 3 on port 
5 and 6



Routing table, show with "ip -4 route show  table all"

default via 192.168.2.1 dev vmbr0 onlink
*10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252*
192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 
linkdown

192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
*broadcast 10.10.10.0 dev bond0 table local proto kernel scope 
link src

10.10.10.252*
*local 10.10.10.252 dev bond0 table local proto kernel scope host src
10.10.10.252*
*broadcast 10.10.10.255 dev bond0 table local proto kernel scope 
link src

10.10.10.252*
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 
127.0.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 
127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 
127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope 
link src 127.0.0.1
broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope 
link src 192.168.1.252 linkdown
local 192.168.1.252 dev vmbr1 table local proto kernel scope host 
src 192.168.1.252
broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope 
link src 192.168.1.252 linkdown
broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope 
link src 192.168.2.252
local 192.168.2.252 dev vmbr0 table local proto kernel scope host 
src 192.168.2.252
broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope 
link src 192.168.2.252



Network configuration

*$ ip -4 a*
1: lo:  mtu 65536 qdisc noqueue state 
UNKNOWN group default qlen 1000

 inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
6: vmbr1:  mtu 1500 qdisc 
noqueue 

Re: [ceph-users] Continuous error: "libceph: monX session lost, hunting for new mon" on one host

2017-10-23 Thread Denes Dolhay

Hi,

I only have a virtual PoC cluster created by ceph-deploy, using only one 
network, same as you.


I just checked: its configuration does not contain either a public or a 
cluster network. I guess when there is only one there is no point...



Denes.

On 10/23/2017 05:52 PM, Marco Baldini - H.S. Amiata wrote:

Hi

I used the tool pveceph provided with Proxmox to initialize ceph, I 
can change but in that case should I put only public network or only 
cluster network in ceph.conf?


Thanks





On 23/10/2017 17:33, Denes Dolhay wrote:

Hi,

So, you are running both the public and the cluster on the same 
network, this is supported, but in this case you do not have to 
specify any of the networks in the configuration. It is just a wild 
guess, but maybe this is the cause of your problem!



Denes.


On 10/23/2017 04:26 PM, Alwin Antreich wrote:

Hi Marco,

On Mon, Oct 23, 2017 at 04:10:34PM +0200, Marco Baldini - H.S. 
Amiata wrote:

Thanks for reply

My ceph.conf:

    [global]
  auth client required = none
  auth cluster required = none
  auth service required = none
  bluestore_block_db_size = 64424509440
  *cluster network = 10.10.10.0/24*
  fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
  keyring = /etc/pve/priv/$cluster.$name.keyring
  mon allow pool delete = true
  osd journal size = 5120
  osd pool default min size = 2
  osd pool default size = 3
  *public network = 10.10.10.0/24*

    [client]
  rbd cache = true
  rbd cache max dirty = 134217728
  rbd cache max dirty age = 2
  rbd cache size = 268435456
  rbd cache target dirty = 67108864
  rbd cache writethrough until flush = true

    [osd]
  keyring = /var/lib/ceph/osd/ceph-$id/keyring

    [mon.pve-hs-3]
  host = pve-hs-3
  mon addr = 10.10.10.253:6789

    [mon.pve-hs-main]
  host = pve-hs-main
  mon addr = 10.10.10.251:6789

    [mon.pve-hs-2]
  host = pve-hs-2
  mon addr = 10.10.10.252:6789


Each node has two ethernet cards in LACP bond on network 10.10.10.x

auto bond0
iface bond0 inet static
 address  10.10.10.252
 netmask  255.255.255.0
 slaves enp4s0 enp4s1
 bond_miimon 100
 bond_mode 802.3ad
 bond_xmit_hash_policy layer3+4
#CLUSTER BOND


The LAG on switch (TPLink TL-SG2008) is enabled, I see from "show run"

#
interface gigabitEthernet 1/0/1

   channel-group 4 mode active
#
interface gigabitEthernet 1/0/2

   channel-group 4 mode active
#
interface gigabitEthernet 1/0/3

   channel-group 2 mode active
#
interface gigabitEthernet 1/0/4

   channel-group 2 mode active
#
interface gigabitEthernet 1/0/5

   channel-group 3 mode active
#
interface gigabitEthernet 1/0/6

   channel-group 3 mode active
#
interface gigabitEthernet 1/0/7

#
interface gigabitEthernet 1/0/8


Node 1 is on port 1 and 2, node 2 on port 3 and 4, node 3 on port 5 
and 6



Routing table, show with "ip -4 route show  table all"

default via 192.168.2.1 dev vmbr0 onlink
*10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252*
192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 
linkdown

192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
*broadcast 10.10.10.0 dev bond0 table local proto kernel scope link 
src

10.10.10.252*
*local 10.10.10.252 dev bond0 table local proto kernel scope host src
10.10.10.252*
*broadcast 10.10.10.255 dev bond0 table local proto kernel scope 
link src

10.10.10.252*
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 
127.0.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 
127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 
127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope 
link src 127.0.0.1
broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope link 
src 192.168.1.252 linkdown
local 192.168.1.252 dev vmbr1 table local proto kernel scope host 
src 192.168.1.252
broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope 
link src 192.168.1.252 linkdown
broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope link 
src 192.168.2.252
local 192.168.2.252 dev vmbr0 table local proto kernel scope host 
src 192.168.2.252
broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope 
link src 192.168.2.252



Network configuration

*$ ip -4 a*
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN 
group default qlen 1000

 inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
6: vmbr1:  mtu 1500 qdisc 
noqueue state DOWN group default qlen 1000

 inet 192.168.1.252/24 brd 192.168.1.255 scope global vmbr1
    valid_lft forever preferred_lft forever
*7: bond0: 

Re: [ceph-users] Qs on caches, and cephfs

2017-10-23 Thread David Turner
Multiple cache tiers? 2 tiers to 1 pool, or a cache tier on top of a cache tier?
Neither is discussed or mentioned anywhere. At best it might work, but it
isn't tested for new releases.

One cache to multiple pools? Same as above.

The luminous docs for cache tiering was updated with "A Word of Caution"
where they try to explain which use cases do and don't make sense for cache
tiering.  There are definitely some use cases that benefit from cache
tiering, but most will benefit more from other osd stack tweaks.

To configure an MDS manually, I pieced the following together from around
the net a couple of months back and reconstructed it from my shell history.  It
assumes that `hostname -s` is the name you plan to give the MDS daemon and that
you use systemd; it first runs the daemon in your terminal so you can watch it
and troubleshoot anything if you need to, and ultimately starts it as a
service.

  host=$(hostname -s)
  mkdir -p /var/lib/ceph/mds/ceph-$host
  ceph-authtool --create-keyring /var/lib/ceph/mds/ceph-$host/keyring --gen-key -n mds.$host
  chmod 640 /var/lib/ceph/mds/ceph-$host/keyring
  touch /var/lib/ceph/mds/ceph-$host/systemd
  ceph auth add mds.$host osd "allow rwx" mds "allow" mon "allow profile mds" -i /var/lib/ceph/mds/ceph-$host/keyring
  chown -R ceph:ceph /var/lib/ceph/mds/
  ceph-mds -i $host --setuser ceph --setgroup ceph
  # ctrl+c to exit the daemon so you can start it as a service
  systemctl enable ceph-mds@$host.service
  systemctl start ceph-mds@$host.service


On Mon, Oct 23, 2017 at 2:00 AM Jeff  wrote:

> Hey everyone,
>
> Long time listener first time caller.
> Thank you to everyone who works on Ceph, docs and code, I'm loving Ceph.
> I've been playing with Ceph for awhile and have a few Qs.
>
> Ceph cache tiers, can you have multiple tiered caches?
>
> Also with cache tiers, can you have one cache pool for multiple backing
> storage pools? The docs seem to be very careful about specifying one
> pool so I suspect I know the answer already.
>
> For CephFS, how do you execute a manual install and manual removal for MDS?
>
> The docs explain how to use ceph-deploy for MDS installs, but I'm trying
> to do everything manually right now to get a better understanding of it
> all.
>
> The ceph docs seem to be version controlled but I can't seem to find the
> repo to update, if you can point me to it I'd be happy to submit patches
> to it.
>
> Thnx in advance!
> Jeff.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Continuous error: "libceph: monX session lost, hunting for new mon" on one host

2017-10-23 Thread Marco Baldini - H.S. Amiata

Hello

The ceph-mon services do not restart on any node; yesterday I manually 
restarted ceph-mon and ceph-mgr on every node and since then they have 
not restarted.


*pve-hs-2$ systemctl status ceph-mon@pve-hs-2.service*
 ceph-mon@pve-hs-2.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
preset: enabled)
  Drop-In: /lib/systemd/system/ceph-mon@.service.d
   └─ceph-after-pve-cluster.conf
   Active:*active (running) since Sun 2017-10-22 12:04:22 CEST; 1 day 5h ago*
 Main PID: 24825 (ceph-mon)
Tasks: 23
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@pve-hs-2.service
   └─24825 /usr/bin/ceph-mon -f --cluster ceph --id pve-hs-2 --setuser 
ceph --setgroup ceph

Oct 22 12:04:22 pve-hs-2 systemd[1]: Stopped Ceph cluster monitor daemon.
Oct 22 12:04:22 pve-hs-2 systemd[1]: Started Ceph cluster monitor daemon.

*pve-hs-main$ systemctl status ceph-mon@pve-hs-main.service*
 ceph-mon@pve-hs-main.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
preset: enabled)
  Drop-In: /lib/systemd/system/ceph-mon@.service.d
   └─ceph-after-pve-cluster.conf
   Active:*active (running) since Sun 2017-10-22 12:08:59 CEST; 1 day 5h ago*
 Main PID: 24857 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@pve-hs-main.service
   └─24857 /usr/bin/ceph-mon -f --cluster ceph --id pve-hs-main 
--setuser ceph --setgroup ceph

Oct 22 12:08:59 pve-hs-main systemd[1]: Started Ceph cluster monitor daemon.

*pve-hs-3$ systemctl status ceph-mon@pve-hs-3.service*
 ceph-mon@pve-hs-3.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
preset: enabled)
  Drop-In: /lib/systemd/system/ceph-mon@.service.d
   └─ceph-after-pve-cluster.conf
   Active:*active (running) since Sun 2017-10-22 12:07:43 CEST; 1 day 5h ago*
 Main PID: 13077 (ceph-mon)
Tasks: 23
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@pve-hs-3.service
   └─13077 /usr/bin/ceph-mon -f --cluster ceph --id pve-hs-3 --setuser 
ceph --setgroup ceph


At 17:38 I have this in the syslog / journal of pve-hs-2

Oct 23 17:38:47 pve-hs-2 kernel: [255282.309979] libceph: mon1 
10.10.10.252:6789 session lost, hunting for new mon

On same node, my ceph-mon.pve-hs-2.log at 17:38 is 
https://pastebin.com/8BCUm5Mr


Thanks




On 23/10/2017 16:26, Alwin Antreich wrote:

Does the ceph-mon services restart when the session is lost?
What do you see in the ceph-mon.log on the failing mon node?

--
Cheers,
Alwin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for help with debugging cephfs snapshots

2017-10-23 Thread David Turner
purged_snaps is persistent indefinitely.  If the list gets too large it
abbreviates it a bit, but it can cause your osd-map to get a fair bit
larger because it keeps track of them.

On Sun, Oct 22, 2017 at 10:39 PM Eric Eastman 
wrote:

> On Sun, Oct 22, 2017 at 8:05 PM, Yan, Zheng  wrote:
>
>> On Mon, Oct 23, 2017 at 9:35 AM, Eric Eastman
>>  wrote:
>> > With help from the list we recently recovered one of our Jewel based
>> > clusters that started failing when we got to about 4800 cephfs
>> snapshots.
>> > We understand that cephfs snapshots are still marked experimental.   We
>> are
>> > running a single active MDS with 2 standby MDS. We only have a single
>> file
>> > system, we are only taking snapshots from the top level directory, and
>> we
>> > are now planning on limiting snapshots to a few hundred. Currently we
>> have
>> > removed all snapshots from this system, using rmdir on each snapshot
>> > directory, and the system is reporting that it is healthy:
>> >
>> > ceph -s
>> > cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
>> >  health HEALTH_OK
>> >  monmap e1: 3 mons at
>> > {mon01=
>> 10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
>> > election epoch 202, quorum 0,1,2 mon01,mon02,mon03
>> >   fsmap e18283: 1/1/1 up {0=mds01=up:active}, 2 up:standby
>> >  osdmap e342543: 93 osds: 93 up, 93 in
>> > flags sortbitwise,require_jewel_osds
>> >   pgmap v38759308: 11336 pgs, 9 pools, 23107 GB data, 12086 kobjects
>> > 73956 GB used, 209 TB / 281 TB avail
>> >11336 active+clean
>> >   client io 509 kB/s rd, 2548 B/s wr, 0 op/s rd, 1 op/s wr
>> >
>> > The snapshots were removed several days ago, but just as an experiment I
>> > decided to query a few PGs in the cephfs data  storage pool, and I am
>> seeing
>> > they are all listing:
>> >
>> > “purged_snaps": "[2~12cd,12d0~12c9]",
>>
>> purged_snaps holds the IDs of snapshots whose data have been completely purged.
>> Currently the purged_snaps set is append-only; the OSD never removes IDs from it.
>
>
>
> Thank you for the quick reply.
> So it is normal to have "purged_snaps" listed on a system that all
> snapshots have been deleted.
> Eric
>
>>
>>
>>
> >
>> > Here is an example:
>> >
>> > ceph pg 1.72 query
>> > {
>> > "state": "active+clean",
>> > "snap_trimq": "[]",
>> > "epoch": 342540,
>> > "up": [
>> > 75,
>> > 77,
>> > 82
>> > ],
>> > "acting": [
>> > 75,
>> > 77,
>> > 82
>> > ],
>> > "actingbackfill": [
>> > "75",
>> > "77",
>> > "82"
>> > ],
>> > "info": {
>> > "pgid": "1.72",
>> > "last_update": "342540'261039",
>> > "last_complete": "342540'261039",
>> > "log_tail": "341080'260697",
>> > "last_user_version": 261039,
>> > "last_backfill": "MAX",
>> > "last_backfill_bitwise": 1,
>> > "purged_snaps": "[2~12cd,12d0~12c9]",
>> > …
>> >
>> > Is this an issue?
>> > I am not seeing any recent trim activity.
>> > Are there any procedures documented for looking at snapshots to see if
>> there
>> > are any issues?
>> >
>> > Before posting this, I have reread the cephfs and snapshot pages in at:
>> > http://docs.ceph.com/docs/master/cephfs/
>> > http://docs.ceph.com/docs/master/dev/cephfs-snapshots/
>> >
>> > Looked at the slides:
>> >
>> http://events.linuxfoundation.org/sites/events/files/slides/2017-03-23%20Vault%20Snapshots.pdf
>> >
>> > Watched the video “Ceph Snapshots for Fun and Profit” given at the last
>> > OpenStack conference.
>> >
>> > And I still can’t find much on info on debugging snapshots.
>> >
>> > Here is some addition information on the cluster:
>> >
>> > ceph df
>> > GLOBAL:
>> > SIZE AVAIL RAW USED %RAW USED
>> > 281T  209T   73955G 25.62
>> > POOLS:
>> > NAMEID USED   %USED MAX AVAIL
>>  OBJECTS
>> > rbd 0  16 056326G
>>   3
>> > cephfs_data 1  22922G 28.9256326G
>>  12279871
>> > cephfs_metadata 2  89260k 056326G
>> 45232
>> > cinder  9147G  0.2656326G
>> 41420
>> > glance  10  0 056326G
>>   0
>> > cinder-backup   11  0 056326G
>>   0
>> > cinder-ssltest  23  1362M 056326G
>> 431
>> > IDMT-dfgw02 27  2552M 056326G
>> 758
>> > dfbackup28 33987M  0.0656326G
>>  8670
>> >
>> >
>> > Recent tickets and posts on problems with this cluster
>> > http://tracker.ceph.com/issues/21761
>> > http://tracker.ceph.com/issues/21412
>> > https://www.spinics.net/lists/ceph-devel/msg38203.html
>> >
>> > ceph -v
>> > ceph version 10.2.10 

Re: [ceph-users] Retrieve progress of volume flattening using RBD python library

2017-10-23 Thread Jason Dillaman
The current RBD python API does not expose callbacks from the wrapped
C API so it is not currently possible to retrieve the flatten, remove,
etc progress indications. Improvements to the API are always welcomed.
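
In the meantime, shelling out to the CLI (which does report progress) is one
workaround. A rough sketch, with a made-up pool/image name:

    rbd flatten mypool/myimage
    # prints something like "Image flatten: 37% complete..." until it finishes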

On Mon, Oct 23, 2017 at 11:06 AM, Xavier Trilla
 wrote:
> Hi guys,
>
>
>
> No ideas about how to do that? Does anybody know where we could ask about
> librbd python library usage?
>
>
>
> Thanks!
>
> Xavier.
>
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of
> Xavier Trilla
> Sent: Tuesday, 17 October 2017 11:55
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Retrieve progress of volume flattening using RBD python
> library
>
>
>
> Hi,
>
>
>
> Does anybody know if there is a way to inspect the progress of a volume
> flattening while using the python rbd library?
>
>
>
> I mean, using the CLI is it possible to see the progress of the flattening,
> but when calling volume.flatten() it just blocks until it’s done.
>
>
>
> Is there any way to infer the progress?
>
>
>
> Hope somebody may help.
>
>
>
> Thanks!
>
> Xavier
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-23 Thread David Turner
We recently deleted a bucket that was no longer needed and had 400TB of
data in it, to help as our cluster is getting quite full.  That should free
up about 30% of our cluster's used space, but in the last week we haven't
seen anywhere near that amount freed up yet.  I left `radosgw-admin
--rgw-realm=local gc process` running on the cluster over the weekend to try
to help, but it didn't seem to put a dent into it.  Our regular ingestion is
faster than the garbage collection is cleaning stuff up, but our regular
ingestion is less than 2% growth at its maximum.

As of yesterday our gc list was over 350GB when dumped into a file (I had
to stop it as the disk I was redirecting the output to was almost full).
In the future I will use the --bypass-gc option to avoid the cleanup, but
is there a way to speed up the gc once you're in this position?  There were
about 8M objects that were deleted from this bucket.  I've come across a
few references to the rgw-gc settings in the config, but nothing that
explained the times well enough for me to feel comfortable doing anything
with them.
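
For reference, here is a hedged example of the knobs involved (the section name and
values are illustrative only; the option names come from the list quoted further
down, and the RGW daemons need a restart to pick up changes):

    [client.rgw.gateway1]
    rgw gc max objs = 1024
    rgw gc obj min wait = 300
    rgw gc processor max time = 3600
    rgw gc processor period = 3600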

On Tue, Jul 25, 2017 at 4:01 PM Bryan Stillwell 
wrote:

> Excellent, thank you!  It does exist in 0.94.10!  :)
>
>
>
> Bryan
>
>
>
> *From: *Pavan Rallabhandi 
> *Date: *Tuesday, July 25, 2017 at 11:21 AM
>
>
> *To: *Bryan Stillwell , "ceph-users@lists.ceph.com"
> 
> *Subject: *Re: [ceph-users] Speeding up garbage collection in RGW
>
>
>
> I’ve just realized that the option is present in Hammer (0.94.10) as well,
> you should try that.
>
>
>
> *From: *Bryan Stillwell 
> *Date: *Tuesday, 25 July 2017 at 9:45 PM
> *To: *Pavan Rallabhandi , "
> ceph-users@lists.ceph.com" 
> *Subject: *EXT: Re: [ceph-users] Speeding up garbage collection in RGW
>
>
>
> Unfortunately, we're on hammer still (0.94.10).  That option looks like it
> would work better, so maybe it's time to move the upgrade up in the
> schedule.
>
>
>
> I've been playing with the various gc options and I haven't seen any
> speedups like we would need to remove them in a reasonable amount of time.
>
>
>
> Thanks,
>
> Bryan
>
>
>
> *From: *Pavan Rallabhandi 
> *Date: *Tuesday, July 25, 2017 at 3:00 AM
> *To: *Bryan Stillwell , "ceph-users@lists.ceph.com"
> 
> *Subject: *Re: [ceph-users] Speeding up garbage collection in RGW
>
>
>
> If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in
> radosgw-admin, which would remove the tails objects as well without marking
> them to be GCed.
>
>
>
> Thanks,
>
>
>
> On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" <
> ceph-users-boun...@lists.ceph.com on behalf of bstillw...@godaddy.com>
> wrote:
>
>
>
> I'm in the process of cleaning up a test that an internal customer did
> on our production cluster that produced over a billion objects spread
> across 6000 buckets.  So far I've been removing the buckets like this:
>
>
>
> printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin
> bucket rm --bucket={} --purge-objects
>
>
>
> However, the disk usage doesn't seem to be getting reduced at the same
> rate the objects are being removed.  From what I can tell a large number of
> the objects are waiting for garbage collection.
>
>
>
> When I first read the docs it sounded like the garbage collector would
> only remove 32 objects every hour, but after looking through the logs I'm
> seeing about 55,000 objects removed every hour.  That's about 1.3 million a
> day, so at this rate it'll take a couple years to clean up the rest!  For
> comparison, the purge-objects command above is removing (but not GC'ing)
> about 30 million objects a day, so a much more manageable 33 days to finish.
>
>
>
> I've done some digging and it appears like I should be changing these
> configuration options:
>
>
>
> rgw gc max objs (default: 32)
>
> rgw gc obj min wait (default: 7200)
>
> rgw gc processor max time (default: 3600)
>
> rgw gc processor period (default: 3600)
>
>
>
> A few questions I have though are:
>
>
>
> Should 'rgw gc processor max time' and 'rgw gc processor period'
> always be set to the same value?
>
>
>
> Which would be better, increasing 'rgw gc max objs' to something like
> 1024, or reducing the 'rgw gc processor' times to something like 60 seconds?
>
>
>
> Any other guidance on the best way to adjust these values?
>
>
>
> Thanks,
>
> Bryan
>
>
>
>
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> 

Re: [ceph-users] Retrieve progress of volume flattening using RBD python library

2017-10-23 Thread Xavier Trilla
Hi guys,

No ideas about how to do that? Does anybody know where we could ask about 
librbd python library usage?

Thanks!
Xavier.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Xavier 
Trilla
Sent: Tuesday, 17 October 2017 11:55
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Retrieve progress of volume flattening using RBD python 
library

Hi,

Does anybody know if there is a way to inspect the progress of a volume 
flattening while using the python rbd library?

I mean, using the CLI is it possible to see the progress of the flattening, but 
when calling volume.flatten() it just blocks until it's done.

Is there any way to infer the progress?

Hope somebody may help.

Thanks!
Xavier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] UID Restrictions

2017-10-23 Thread Keane Wolter
Hi Gregory,

I did set the cephx caps for the client to:

caps: [mds] allow r, allow rw uid=100026 path=/user, allow rw path=/project
caps: [mon] allow r
caps: [osd] allow rw pool=cephfs_osiris, allow rw pool=cephfs_users
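
For reference, a sketch of the command used to set those caps (the client name here
is made up):

    ceph auth caps client.keane \
        mds 'allow r, allow rw uid=100026 path=/user, allow rw path=/project' \
        mon 'allow r' \
        osd 'allow rw pool=cephfs_osiris, allow rw pool=cephfs_users'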

Keane

On Fri, Oct 20, 2017 at 5:35 PM, Gregory Farnum  wrote:

> What did you actually set the cephx caps to for that client?
>
> On Fri, Oct 20, 2017 at 8:01 AM Keane Wolter  wrote:
>
>> Hello all,
>>
>> I am trying to limit what uid/gid a client is allowed to run as (similar
>> to NFS' root squashing). I have referenced this email,
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-February/016173.html,
>> with no success.  After generating the
>> keyring, moving it to a client machine, and mounting the filesystem with
>> ceph-fuse, I am still able to create files with the UID and GID of root.
>>
>> Is there something I am missing or can do to prevent root from working
>> with a ceph-fuse mounted filesystem?
>>
>> Thanks,
>> Keane
>> wolt...@umich.edu
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Continuous error: "libceph: monX session lost, hunting for new mon" on one host

2017-10-23 Thread Alwin Antreich
Hi Marco,

On Mon, Oct 23, 2017 at 04:10:34PM +0200, Marco Baldini - H.S. Amiata wrote:
> Thanks for reply
>
> My ceph.conf:
>
>[global]
>  auth client required = none
>  auth cluster required = none
>  auth service required = none
>  bluestore_block_db_size = 64424509440
>  *cluster network = 10.10.10.0/24*
>  fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
>  keyring = /etc/pve/priv/$cluster.$name.keyring
>  mon allow pool delete = true
>  osd journal size = 5120
>  osd pool default min size = 2
>  osd pool default size = 3
>  *public network = 10.10.10.0/24*
>
>[client]
>  rbd cache = true
>  rbd cache max dirty = 134217728
>  rbd cache max dirty age = 2
>  rbd cache size = 268435456
>  rbd cache target dirty = 67108864
>  rbd cache writethrough until flush = true
>
>[osd]
>  keyring = /var/lib/ceph/osd/ceph-$id/keyring
>
>[mon.pve-hs-3]
>  host = pve-hs-3
>  mon addr = 10.10.10.253:6789
>
>[mon.pve-hs-main]
>  host = pve-hs-main
>  mon addr = 10.10.10.251:6789
>
>[mon.pve-hs-2]
>  host = pve-hs-2
>  mon addr = 10.10.10.252:6789
>
>
> Each node has two ethernet cards in LACP bond on network 10.10.10.x
>
> auto bond0
> iface bond0 inet static
> address  10.10.10.252
> netmask  255.255.255.0
> slaves enp4s0 enp4s1
> bond_miimon 100
> bond_mode 802.3ad
> bond_xmit_hash_policy layer3+4
> #CLUSTER BOND
>
>
> The LAG on switch (TPLink TL-SG2008) is enabled, I see from "show run"
>
> #
> interface gigabitEthernet 1/0/1
>
>   channel-group 4 mode active
> #
> interface gigabitEthernet 1/0/2
>
>   channel-group 4 mode active
> #
> interface gigabitEthernet 1/0/3
>
>   channel-group 2 mode active
> #
> interface gigabitEthernet 1/0/4
>
>   channel-group 2 mode active
> #
> interface gigabitEthernet 1/0/5
>
>   channel-group 3 mode active
> #
> interface gigabitEthernet 1/0/6
>
>   channel-group 3 mode active
> #
> interface gigabitEthernet 1/0/7
>
> #
> interface gigabitEthernet 1/0/8
>
>
> Node 1 is on port 1 and 2, node 2 on port 3 and 4, node 3 on port 5 and 6
>
>
> Routing table, show with "ip -4 route show  table all"
>
> default via 192.168.2.1 dev vmbr0 onlink
> *10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252*
> 192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 linkdown
> 192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
> *broadcast 10.10.10.0 dev bond0 table local proto kernel scope link src
> 10.10.10.252*
> *local 10.10.10.252 dev bond0 table local proto kernel scope host src
> 10.10.10.252*
> *broadcast 10.10.10.255 dev bond0 table local proto kernel scope link src
> 10.10.10.252*
> broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
> local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
> local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
> broadcast 127.255.255.255 dev lo table local proto kernel scope link src 
> 127.0.0.1
> broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope link src 
> 192.168.1.252 linkdown
> local 192.168.1.252 dev vmbr1 table local proto kernel scope host src 
> 192.168.1.252
> broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope link src 
> 192.168.1.252 linkdown
> broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope link src 
> 192.168.2.252
> local 192.168.2.252 dev vmbr0 table local proto kernel scope host src 
> 192.168.2.252
> broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope link src 
> 192.168.2.252
>
>
> Network configuration
>
> *$ ip -4 a*
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
> default qlen 1000
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> 6: vmbr1:  mtu 1500 qdisc noqueue state 
> DOWN group default qlen 1000
> inet 192.168.1.252/24 brd 192.168.1.255 scope global vmbr1
>valid_lft forever preferred_lft forever
> *7: bond0:  mtu 1500 qdisc noqueue
> state UP group default qlen 1000inet 10.10.10.252/24 brd 10.10.10.255
> scope global bond0valid_lft forever preferred_lft forever***8: vmbr0:
>  mtu 1500 qdisc noqueue state UP group
> default qlen 1000
> inet 192.168.2.252/24 brd 192.168.2.255 scope global vmbr0
>valid_lft forever preferred_lft forever
>
> *$ ip -4 link*
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode 
> DEFAULT group default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 2: enp2s0:  mtu 1500 qdisc pfifo_fast master 
> vmbr0 

Re: [ceph-users] Continuous error: "libceph: monX session lost, hunting for new mon" on one host

2017-10-23 Thread Marco Baldini - H.S. Amiata

Thanks for reply

My ceph.conf:

   [global]
 auth client required = none
 auth cluster required = none
 auth service required = none
 bluestore_block_db_size = 64424509440
 *cluster network = 10.10.10.0/24*
 fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
 keyring = /etc/pve/priv/$cluster.$name.keyring
 mon allow pool delete = true
 osd journal size = 5120
 osd pool default min size = 2
 osd pool default size = 3
 *public network = 10.10.10.0/24*

   [client]
 rbd cache = true
 rbd cache max dirty = 134217728
 rbd cache max dirty age = 2
 rbd cache size = 268435456
 rbd cache target dirty = 67108864
 rbd cache writethrough until flush = true

   [osd]
 keyring = /var/lib/ceph/osd/ceph-$id/keyring

   [mon.pve-hs-3]
 host = pve-hs-3
 mon addr = 10.10.10.253:6789

   [mon.pve-hs-main]
 host = pve-hs-main
 mon addr = 10.10.10.251:6789

   [mon.pve-hs-2]
 host = pve-hs-2
 mon addr = 10.10.10.252:6789


Each node has two ethernet cards in LACP bond on network 10.10.10.x

auto bond0
iface bond0 inet static
address  10.10.10.252
netmask  255.255.255.0
slaves enp4s0 enp4s1
bond_miimon 100
bond_mode 802.3ad
bond_xmit_hash_policy layer3+4
#CLUSTER BOND
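
For reference, a quick way to double-check that the LACP negotiation actually
came up on each node (a sketch, assuming the bonding driver exposes its state
under /proc and the interface really is named bond0):

# show bond mode, MII status and the LACP aggregator IDs of both slaves
cat /proc/net/bonding/bond0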


The LAG on the switch (TP-Link TL-SG2008) is enabled, as I can see from "show run":

#
interface gigabitEthernet 1/0/1

  channel-group 4 mode active
#
interface gigabitEthernet 1/0/2

  channel-group 4 mode active
#
interface gigabitEthernet 1/0/3

  channel-group 2 mode active
#
interface gigabitEthernet 1/0/4

  channel-group 2 mode active
#
interface gigabitEthernet 1/0/5

  channel-group 3 mode active
#
interface gigabitEthernet 1/0/6

  channel-group 3 mode active
#
interface gigabitEthernet 1/0/7

#
interface gigabitEthernet 1/0/8


Node 1 is on ports 1 and 2, node 2 on ports 3 and 4, node 3 on ports 5 and 6


Routing table, show with "ip -4 route show  table all"

default via 192.168.2.1 dev vmbr0 onlink
*10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252*
192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 linkdown
192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
*broadcast 10.10.10.0 dev bond0 table local proto kernel scope link src 
10.10.10.252*
*local 10.10.10.252 dev bond0 table local proto kernel scope host src 
10.10.10.252*
*broadcast 10.10.10.255 dev bond0 table local proto kernel scope link src 
10.10.10.252*

broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 
127.0.0.1
broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope link src 
192.168.1.252 linkdown
local 192.168.1.252 dev vmbr1 table local proto kernel scope host src 
192.168.1.252
broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope link src 
192.168.1.252 linkdown
broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope link src 
192.168.2.252
local 192.168.2.252 dev vmbr0 table local proto kernel scope host src 
192.168.2.252
broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope link src 
192.168.2.252


Network configuration

*$ ip -4 a*
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
6: vmbr1:  mtu 1500 qdisc noqueue state DOWN 
group default qlen 1000
inet 192.168.1.252/24 brd 192.168.1.255 scope global vmbr1
   valid_lft forever preferred_lft forever
*7: bond0:  mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 10.10.10.252/24 brd 10.10.10.255 scope global bond0
   valid_lft forever preferred_lft forever*
8: vmbr0:  mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 192.168.2.252/24 brd 192.168.2.255 scope global vmbr0
   valid_lft forever preferred_lft forever

*$ ip -4 link*
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode 
DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0:  mtu 1500 qdisc pfifo_fast master 
vmbr0 state UP mode DEFAULT group default qlen 1000
link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
*3: enp4s0:  mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
4: enp4s1: 

Re: [ceph-users] Continuous error: "libceph: monX session lost, hunting for new mon" on one host

2017-10-23 Thread Denes Dolhay

Hi,

Maybe some routing issue?


"CEPH has public and cluster network on 10.10.10.0/24"

Does this mean that the nodes have the public and cluster networks configured 
separately, both on 10.10.10.0/24, or that you simply did not specify a 
separate cluster network?


Please provide route table, ifconfig, ceph.conf
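
For example (a sketch, assuming a standard iproute2 setup and the default
config path):

ip -4 route show table all
ip -4 addr show
cat /etc/ceph/ceph.conf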


Regards,

Denes


On 10/23/2017 03:35 PM, Marco Baldini - H.S. Amiata wrote:


Hello

I have a CEPH cluster with 3 nodes, each with 3 OSDs, running Proxmox, 
CEPH  versions:


{
 "mon": {
 "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous 
(stable)": 3
 },
 "mgr": {
 "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous 
(stable)": 3
 },
 "osd": {
 "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous 
(stable)": 9
 },
 "mds": {},
 "overall": {
 "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous 
(stable)": 15
 }
}

CEPH has public and cluster networks on 10.10.10.0/24; the three nodes 
are 10.10.10.251, 10.10.10.252, 10.10.10.253, and networking is working 
well (I kept a ping from one of the nodes to the other two running for 
hours and had 0 packet loss)


On one node with IP 10.10.10.252 I get strange messages in dmesg:

kern  :info  : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000749] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000312] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000730] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000330] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000951] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000733] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000328] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.001035] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.004221] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000927] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000524] libceph: mon1 10.10.10.252:6789 session established

and that goes on all day long.

In ceph -w I get

2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0

pve-hs-main is the host with ip 10.10.10.251

Actually, CEPH storage usage is very low, on average 200 kB/s read 
or write (as shown by ceph -s), so I don't think it's a problem with the 
load on the cluster.


The strange thing is that I see mon1 10.10.10.252:6789 session lost in 
the log of node 10.10.10.252 itself, so it's losing the connection with 
the monitor on the same node; I don't think it's network related.


I already tried rebooting the nodes and restarting ceph-mon and ceph-mgr, but 
the problem is still there.


Any ideas?

Thanks




--
*Marco Baldini*
*H.S. Amiata Srl*
Ufficio:0577-779396
Cellulare:  335-8765169
WEB:www.hsamiata.it 
EMAIL:  mbald...@hsamiata.it 



___

[ceph-users] Continuous error: "libceph: monX session lost, hunting for new mon" on one host

2017-10-23 Thread Marco Baldini - H.S. Amiata

Hello

I have a CEPH cluster with 3 nodes, each with 3 OSDs, running Proxmox, 
CEPH  versions:


{
"mon": {
"ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous 
(stable)": 3
},
"mgr": {
"ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous 
(stable)": 3
},
"osd": {
"ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous 
(stable)": 9
},
"mds": {},
"overall": {
"ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous 
(stable)": 15
}
}

CEPH has public and cluster networks on 10.10.10.0/24; the three nodes 
are 10.10.10.251, 10.10.10.252, 10.10.10.253, and networking is working 
well (I kept a ping from one of the nodes to the other two running for 
hours and had 0 packet loss)


On one node with IP 10.10.10.252 I get strange messages in dmesg:

kern  :info  : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000749] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000312] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000730] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000330] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000951] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000733] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000328] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.001035] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.004221] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000927] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, 
hunting for new mon
kern  :info  : [  +0.000524] libceph: mon1 10.10.10.252:6789 session established

and that goes on all day long.

In ceph -w I get

2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0

pve-hs-main is the host with ip 10.10.10.251

Actually, CEPH storage usage is very low, on average 200 kB/s read or 
write (as shown by ceph -s), so I don't think it's a problem with the load 
on the cluster.


The strange thing is that I see mon1 10.10.10.252:6789 session lost in the 
log of node 10.10.10.252 itself, so it's losing the connection with the monitor 
on the same node; I don't think it's network related.


I already tried rebooting the nodes and restarting ceph-mon and ceph-mgr, but 
the problem is still there.


Any ideas?

Thanks




--
*Marco Baldini*
*H.S. Amiata Srl*
Ufficio:0577-779396
Cellulare:  335-8765169
WEB:www.hsamiata.it 
EMAIL:  mbald...@hsamiata.it 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-23 Thread Russell Glaue
The two newest machines have the LSI MegaRAID SAS-3 3008 [Fury]. The first
one performs the best of the four; the second one is the problem host. The
Non-RAID option just takes RAID configuration out of the picture so Ceph
can have direct access to the disk. We need that so we can make use of Ceph's
SSD clipping support in the future. The RAID-only controllers do not support
SSD clipping, and they force us to configure a RAID volume for every
disk, which we don't like.
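
For reference, a quick way to compare the cache policy and BBU state across
the hosts (a sketch, assuming the MegaCli/storcli utilities are installed;
binary names and controller numbering vary):

# BBU health per adapter (the RAID0 hosts)
MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
# current vs. default cache policy of each logical drive
MegaCli64 -LDGetProp -Cache -LAll -aAll
# storcli equivalent for controller 0
storcli /c0/vall show all | grep -i cache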

If you know of problems with LSI MegaRAID, please elaborate.
Thanks.
-RG


On Fri, Oct 20, 2017 at 10:04 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Fri, 20 Oct 2017 13:35:55 -0500 Russell Glaue wrote:
>
> > On the machine in question, the 2nd newest, we are using the LSI MegaRAID
> > SAS-3 3008 [Fury], which allows us a "Non-RAID" option, and has no
> battery.
> > The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported
> > earlier, each single drive configured as RAID0.
> >
> There you go then, that's your explanation.
>
> And also the reason that these SSDs perform so "well" in the RAID0
> config despite my doubts about their suitability with Ceph.
>
> If you were to put Intel DC S36xx, S37xx or Samsung SM 863 in the IT mode
> host you'd likely get the speed you want if not better.
>
>
> Christian
>
> > Thanks for everyone's help.
> > I am going to run a 32 thread bench test after taking the 2nd machine out
> > of the cluster with noout.
> > After it is out of the cluster, I am expecting the slow write issue will
> > not surface.
> >
> >
> > On Fri, Oct 20, 2017 at 5:27 AM, David Turner 
> wrote:
> >
> > > I can attest that the battery in the raid controller is a thing. I'm used
> > > to using LSI controllers, but my current position has HP raid controllers,
> > > and we just tracked down 10 of our nodes that pretty much always had >100ms
> > > await; they were the only 10 nodes in the cluster with failed batteries on
> > > the raid controllers.
> > >
> > > On Thu, Oct 19, 2017, 8:15 PM Christian Balzer  wrote:
> > >
> > >>
> > >> Hello,
> > >>
> > >> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
> > >>
> > >> > That is a good idea.
> > >> > However, a previous rebalancing process brought the performance of our
> > >> > guest VMs to a slow drag.
> > >> >
> > >>
> > >> Never mind that I'm not sure that these SSDs are particular well
> suited
> > >> for Ceph, your problem is clearly located on that one node.
> > >>
> > >> Not that I think it's the case, but make sure your PG distribution is
> not
> > >> skewed with many more PGs per OSD on that node.
> > >>
> > >> Once you rule that out, my first guess is the RAID controller; you're
> > >> running the SSDs as single RAID0s, I presume?
> > >> If so, either a configuration difference or a failed BBU on the controller
> > >> could result in the writeback cache being disabled, which would explain
> > >> things beautifully.
> > >>
> > >> As for a temporary test/fix (with reduced redundancy of course), set
> noout
> > >> (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host
> > >> off.
> > >>
> > >> This should result in much better performance than you have now and of
> > >> course be the final confirmation of that host being the culprit.
> > >>
> > >> Christian
> > >>
> > >> >
> > >> > On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez <
> jelo...@redhat.com
> > >> >
> > >> > wrote:
> > >> >
> > >> > > Hi Russell,
> > >> > >
> > >> > > as you have 4 servers, assuming you are not doing EC pools, just
> stop
> > >> all
> > >> > > the OSDs on the second questionable server, mark the OSDs on that
> > >> server as
> > >> > > out, let the cluster rebalance and when all PGs are active+clean
> just
> > >> > > replay the test.
> > >> > >
> > >> > > All IOs should then go only to the other 3 servers.
> > >> > >
> > >> > > JC
> > >> > >
> > >> > > On Oct 19, 2017, at 13:49, Russell Glaue  wrote:
> > >> > >
> > >> > > No, I have not ruled out the disk controller and backplane making
> the
> > >> > > disks slower.
> > >> > > Is there a way I could test that theory, other than swapping out
> > >> hardware?
> > >> > > -RG
> > >> > >
> > >> > > On Thu, Oct 19, 2017 at 3:44 PM, David Turner <
> drakonst...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > >> Have you ruled out the disk controller and backplane in the
> server
> > >> > >> running slower?
> > >> > >>
> > >> > >> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue 
> > >> wrote:
> > >> > >>
> > >> > >>> I ran the test on the Ceph pool, and ran atop on all 4 storage
> > >> servers,
> > >> > >>> as suggested.
> > >> > >>>
> > >> > >>> Out of the 4 servers:
> > >> > >>> 3 of them performed with 17% to 30% disk %busy, and 11% CPU
> wait.
> > >> > >>> Momentarily spiking up to 50% on one server, and 80% on another
> > >> > >>> The 2nd newest server was almost averaging 90% disk %busy and
> 150%
> > >> CPU
> > >> > >>> wait. 

Re: [ceph-users] Efficient storage of small objects / bulk erasure coding

2017-10-23 Thread Jiri Horky
Hi John,

On 10/23/2017 02:59 PM, John Spray wrote:
> On Tue, Oct 17, 2017 at 9:42 PM, Jiri Horky  wrote:
>> Hi list,
>>
>> we are thinking of building relatively big CEPH-based object storage for
>> storage of our sample files - we have about 700M files ranging from very
>> small (1-4KiB) files to pretty big ones (several GiB). Median of file
>> size is 64KiB. Since the required space is relatively large (1PiB of
>> usable storage), we are thinking of utilizing erasure coding for this
>> case. On the other hand, we need to achieve at least 1200MiB/s
>> throughput on reads. The working assumption is 4+2 EC (thus 50% overhead).
>>
>> Since the EC is per-object, the small objects will be striped to even
>> smaller ones. With 4+2 EC, one needs (at least) 4 IOs to read a single
>> object in this scenario -> number of required IOPS when using EC is
>> relatively high. Some vendors (such as Hitachi, but I believe EMC as
>> well) do offline, predefined-chunk size EC instead. The idea is to first
>> write objects with replication factor of 3, wait for enough objects to
>> fill 4x 64MiB chunks and only do EC on that. This not only makes the EC
>> less computationally intensive, and repairs much faster, but it also
>> allows reading majority of the small objects directly by reading just
>> part of one of the chunk from it (assuming non degraded state) - one
>> chunk actually contains the whole object.
> How does the client know the name of the larger/bulk object, given the
> name of one of the small objects within it?  Presumably, there is some
> index?
The point is that the client does not need to care. The bulking for more
efficient EC storage is done by the underlying object store/storage system.
So the clients access objects the ordinary way, whereas the storage
layer takes care of tracking in which EC bulk the individual object is
stored. I understand this is completely different thinking from what
RADOS uses today.
>
>> I wonder if something similar is already possible with CEPH and/or is
>> planned. For our use case of very small objects, it would mean near 3-4x
>> performance boosts in terms of required IOPS performance.
>>
>> Another option how to get out of this situation is to be able to specify
>> different storage pools/policies based on file size - i.e. to do 3x
>> replication of the very small files and only use EC for bigger files,
>> where the performance hit with 4x IOPS won't be that painful. But I I am
>> afraid this is not possible...
> Surely there is nothing stopping you writing your small objects in one
> pool and your large objects in another?  Am I missing something?
Except that all the clients accessing the shared storage would need to
have that logic built in. It would just be better if I could make it
transparent to the clients.

Jiri Horky
>
> John
>
>> Any other hint is sincerely welcome.
>>
>> Thank you
>> Jiri Horky
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Efficient storage of small objects / bulk erasure coding

2017-10-23 Thread John Spray
On Tue, Oct 17, 2017 at 9:42 PM, Jiri Horky  wrote:
> Hi list,
>
> we are thinking of building relatively big CEPH-based object storage for
> storage of our sample files - we have about 700M files ranging from very
> small (1-4KiB) files to pretty big ones (several GiB). Median of file
> size is 64KiB. Since the required space is relatively large (1PiB of
> usable storage), we are thinking of utilizing erasure coding for this
> case. On the other hand, we need to achieve at least 1200MiB/s
> throughput on reads. The working assumption is 4+2 EC (thus 50% overhead).
>
> Since the EC is per-object, the small objects will be striped to even
> smaller ones. With 4+2 EC, one needs (at least) 4 IOs to read a single
> object in this scenario -> number of required IOPS when using EC is
> relatively high. Some vendors (such as Hitachi, but I believe EMC as
> well) do offline, predefined-chunk size EC instead. The idea is to first
> write objects with replication factor of 3, wait for enough objects to
> fill 4x 64MiB chunks and only do EC on that. This not only makes the EC
> less computationally intensive, and repairs much faster, but it also
> allows reading majority of the small objects directly by reading just
> part of one of the chunk from it (assuming non degraded state) - one
> chunk actually contains the whole object.

How does the client know the name of the larger/bulk object, given the
name of one of the small objects within it?  Presumably, there is some
index?

> I wonder if something similar is already possible with CEPH and/or is
> planned. For our use case of very small objects, it would mean near 3-4x
> performance boosts in terms of required IOPS performance.
>
> Another option how to get out of this situation is to be able to specify
> different storage pools/policies based on file size - i.e. to do 3x
> replication of the very small files and only use EC for bigger files,
> where the performance hit with 4x IOPS won't be that painful. But I I am
> afraid this is not possible...

Surely there is nothing stopping you writing your small objects in one
pool and your large objects in another?  Am I missing something?
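
(A minimal sketch of that split, with made-up pool names and PG counts:

ceph osd pool create small-objects 128 128 replicated
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create large-objects 128 128 erasure ec-4-2

and then the client picks the pool based on object size.)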

John

>
> Any other hint is sincerely welcome.
>
> Thank you
> Jiri Horky
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High osd cpu usage ( luminous )

2017-10-23 Thread Yair Magnezi
Hello Guys

We have a fresh 'luminous' (12.2.0)
(32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc),
installed using ceph-ansible.

The cluster contains 6 Intel server board S2600WTTR nodes (96 osds and 3
mons).

Each node has 64G of memory and an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 cores), with 16 * 1.6TB Dell SSD drives (SSDSC2BB016T7R), for a total
of 96 osds and 3 mons.

Main usage  are rbd's for our  openstack environment ( Okata )

We're at the beginning of our production tests and it looks like the OSDs
are too busy, although we don't generate many IOPS at this stage
(almost nothing).
All ceph-osd processes are using around 50% CPU and I can't figure out why
they are so busy:

top - 07:41:55 up 49 days,  2:54,  2 users,  load average: 6.85, 6.40, 6.37

Tasks: 518 total,   1 running, 517 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.8 us,  4.3 sy,  0.0 ni, 80.3 id,  0.0 wa,  0.0 hi,  0.6 si,
0.0 st
KiB Mem : 65853584 total, 23953788 free, 40342680 used,  1557116 buff/cache
KiB Swap:  3997692 total,  3997692 free,0 used. 18020584 avail Mem

PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
COMMAND
  36713 ceph  20   0 3869588 2.826g  28896 S  47.2  4.5   6079:20
ceph-osd
  53981 ceph  20   0 3998732 2.666g  28628 S  45.8  4.2   5939:28
ceph-osd
  55879 ceph  20   0 3707004 2.286g  28844 S  44.2  3.6   5854:29
ceph-osd
  46026 ceph  20   0 3631136 1.930g  29100 S  43.2  3.1   6008:50
ceph-osd
  39021 ceph  20   0 4091452 2.698g  28936 S  42.9  4.3   5687:39
ceph-osd
  47210 ceph  20   0 3598572 1.871g  29092 S  42.9  3.0   5759:19
ceph-osd
  52763 ceph  20   0 3843216 2.410g  28896 S  42.2  3.8   5540:11
ceph-osd
  49317 ceph  20   0 3794760 2.142g  28932 S  41.5  3.4   5872:24
ceph-osd
  42653 ceph  20   0 3915476 2.489g  28840 S  41.2  4.0   5605:13
ceph-osd
  41560 ceph  20   0 3460900 1.801g  28660 S  38.5  2.9   5128:01
ceph-osd
  50675 ceph  20   0 3590288 1.827g  28840 S  37.9  2.9   5196:58
ceph-osd
  37897 ceph  20   0 4034180 2.814g  29000 S  34.9  4.5   4789:10
ceph-osd
  50237 ceph  20   0 3379780 1.930g  28892 S  34.6  3.1   4846:36
ceph-osd
  48608 ceph  20   0 3893684 2.721g  28880 S  33.9  4.3   4752:43
ceph-osd
  40323 ceph  20   0 4227864 2.959g  28800 S  33.6  4.7   4712:36
ceph-osd
  44638 ceph  20   0 3656780 2.437g  28896 S  33.2  3.9   4793:58
ceph-osd
  61639 ceph  20   0  527512 114300  20988 S   2.7  0.2   2722:03
ceph-mgr
  31586 ceph  20   0  765672 304140  21816 S   0.7  0.5 409:06.09
ceph-mon
 68 root  20   0   0  0  0 S   0.3  0.0   3:09.69
ksoftirqd/12

strace doesn't show anything suspicious:

root@ecprdbcph10-opens:~# strace -p 36713
strace: Process 36713 attached
futex(0x563343c56764, FUTEX_WAIT_PRIVATE, 1, NUL

Ceph logs don't reveal anything.
Is this "normal" behavior in Luminous?
Looking at older threads, I can only find one about time gaps, which is not
our case.
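
For reference, a couple of ways to see where that CPU time actually goes
(a sketch, assuming perf is installed and the OSD admin sockets are in the
default location; OSD ids are examples):

perf top -p 36713                      # sample the busiest ceph-osd process
ceph daemon osd.0 perf dump            # per-OSD internal performance counters
ceph daemon osd.0 dump_historic_ops    # recent slow operations, if any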

Thanks In advance


Yair

-- 
This e-mail, as well as any attached document, may contain material which 
is confidential and privileged and may include trademark, copyright and 
other intellectual property rights that are proprietary to Kenshoo Ltd, 
 its subsidiaries or affiliates ("Kenshoo"). This e-mail and its 
attachments may be read, copied and used only by the addressee for the 
purpose(s) for which it was disclosed herein. If you have received it in 
error, please destroy the message and any attachment, and contact us 
immediately. If you are not the intended recipient, be aware that any 
review, reliance, disclosure, copying, distribution or use of the contents 
of this message without Kenshoo's express permission is strictly prohibited.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librbd on CentOS7

2017-10-23 Thread Jason Dillaman
Feel free to update the CentOS client libraries as well. The base EL7
packages are updated on an as-needed basis and due to layered product
dependencies, sometimes it takes a lot of push to get them to be
updated. I'd suspect that the packages will be updated again at some
point during the lifetime of EL7 release (they were already updated
from Firefly to Hammer about a year or so ago).
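
Something along these lines should work (a sketch, assuming the CentOS Storage
SIG packages; if you use the ceph.com repositories instead, the repo setup
differs but the package names are the same):

yum install -y centos-release-ceph-luminous
yum update -y librbd1 librados2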

On Mon, Oct 23, 2017 at 6:41 AM, Wolfgang Lendl
 wrote:
> Hello,
>
> we're testing KVM on CentOS 7 as Ceph (luminous) client.
> CentOS 7 has a librbd package in its base repository with version 0.94.5
>
> the question is (aside from feature support) if we should install a
> recent librbd from the ceph repositories (12.2.x) or stay with the
> default one.
> my main concern is performance and I'm not sure how the librbd version
> has impact on it.
>
>
> wolfgang
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph index is not complete

2017-10-23 Thread vyyy杨雨阳
Hello,
I found a bucket where some of the objects cannot be listed!

Bucket stats shows there are 3182 objects, but swift list or S3 only shows 2028
objects.
Listomapkeys also shows 2028 entries, excluding multipart.
I have run  radosgw-admin bucket check --fix --check-objects
--bucket=originalData, but it seems to have no effect.
Those ‘missing’ objects can in fact be downloaded.

Any help is appreciated


radosgw-admin  bucket stats --bucket=originalData
{
"bucket": "originalData",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets.index",
"id": "default.58735832.21",
"marker": "default.58735832.21",
"owner": "originalData",
"ver": "0#933671",
"master_ver": "0#0",
"mtime": "2017-05-22 11:30:55.00",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size_kb": 12178836,
"size_kb_actual": 12183236,
"num_objects": 3182
},
"rgw.multimeta": {
"size_kb": 0,
"size_kb_actual": 0,
"num_objects": 55
}
},
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}

[ceph@SVR7794HW2285 test]$ rados -p .rgw.buckets.index listomapkeys 
.dir.default.58735832.21 |grep -v _multipart_ |wc
   2028    2028  200818

[ceph@SVR7794HW2285 ~]$ swift list originalData |wc
   2028    2028  200818
[ceph@SVR7794HW2285 ~]$
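
For completeness, the counts above can also be cross-checked against the
bucket index itself (a sketch, assuming a recent radosgw-admin and the
default zone):

radosgw-admin bucket list --bucket=originalData --max-entries=10000 | grep -c '"name"'
radosgw-admin bi list --bucket=originalData | grep -c '"idx"'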



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] librbd on CentOS7

2017-10-23 Thread Wolfgang Lendl
Hello,

we're testing KVM on CentOS 7 as Ceph (luminous) client.
CentOS 7 has a librbd package in its base repository with version 0.94.5

The question is (aside from feature support) whether we should install a
recent librbd from the Ceph repositories (12.2.x) or stay with the
default one.
My main concern is performance, and I'm not sure what impact the librbd
version has on it.


wolfgang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Drive write cache recommendations for Luminous/Bluestore

2017-10-23 Thread Hans van den Bogert
Hi All,

For Jewel there is this page about drive cache:
http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/#hard-drive-prep

For Bluestore I can't find any documentation or discussions about drive
write cache, while I can imagine that revisiting this subject might be
necessary.

For our cluster specifically, we use HP gen 9 with a b140i controller where
disks are directly attached (i.e., not RAID). Often with the Linux kernel,
controllers automatically enable drive write cache for directly attached
hard disks, since this *should* be safe as long as the disks correctly adhere
to flush semantics. In the case of the b140i controller, I can confirm
with `hdparm -W /dev/sdx` that drive write cache is *not* enabled by
default.
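
For reference, the exact check and toggle (assuming a directly attached SATA
disk at /dev/sdx):

hdparm -W /dev/sdx     # report the current write-cache setting
hdparm -W1 /dev/sdx    # enable the drive's volatile write cache
hdparm -W0 /dev/sdx    # disable it again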

So my main two questions are:

1. Has anybody done extensive testing with Luminous in combination with
drive write cache enabled, or can anyone at least elaborate on the subject?
2. Depending on item 1, could and should I enable drive write cache for
the disks attached to an HP b140i controller?

Thanks!

Hans
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems with CORS

2017-10-23 Thread Rudenko Aleksandr
Thank you David for your suggestion.

We added our domain (Origin) to the zonegroup’s endpoints and hostnames:

{
"id": "default",
"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [
"https://console.{our_domain}.ru;,
],
"hostnames": [
"https://console.{our_domain}.ru;,
],
"hostnames_s3website": [],
"master_zone": "default",
"zones": [
{
"id": "default",
"name": "default",
"endpoints": [],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "9c7666df-132d-4db0-988e-6b28767ff3cf"
}

But it did not solve our problem.

In RGW logs:

2017-10-23 10:51:25.301934 7f39f2e73700  1 == starting new request 
req=0x7f39f2e6d190 =
2017-10-23 10:51:25.301956 7f39f2e73700  2 req 22:0.22::OPTIONS 
/aaa::initializing for trans_id = 
tx00016-0059ed9f7d-fc80-default
2017-10-23 10:51:25.301993 7f39f2e73700  2 req 22:0.58:s3:OPTIONS 
/aaa::getting op 6
2017-10-23 10:51:25.302004 7f39f2e73700  2 req 22:0.71:s3:OPTIONS 
/aaa:options_cors:verifying requester
2017-10-23 10:51:25.302013 7f39f2e73700  2 req 22:0.80:s3:OPTIONS 
/aaa:options_cors:normalizing buckets and tenants
2017-10-23 10:51:25.302018 7f39f2e73700  2 req 22:0.84:s3:OPTIONS 
/aaa:options_cors:init permissions
2017-10-23 10:51:25.302065 7f39f2e73700  2 req 22:0.000131:s3:OPTIONS 
/aaa:options_cors:recalculating target
2017-10-23 10:51:25.302070 7f39f2e73700  2 req 22:0.000136:s3:OPTIONS 
/aaa:options_cors:reading permissions
2017-10-23 10:51:25.302075 7f39f2e73700  2 req 22:0.000141:s3:OPTIONS 
/aaa:options_cors:init op
2017-10-23 10:51:25.302076 7f39f2e73700  2 req 22:0.000143:s3:OPTIONS 
/aaa:options_cors:verifying op mask
2017-10-23 10:51:25.302078 7f39f2e73700  2 req 22:0.000144:s3:OPTIONS 
/aaa:options_cors:verifying op permissions
2017-10-23 10:51:25.302080 7f39f2e73700  2 req 22:0.000146:s3:OPTIONS 
/aaa:options_cors:verifying op params
2017-10-23 10:51:25.302081 7f39f2e73700  2 req 22:0.000148:s3:OPTIONS 
/aaa:options_cors:pre-executing
2017-10-23 10:51:25.302111 7f39f2e73700  2 req 22:0.000149:s3:OPTIONS 
/aaa:options_cors:executing
2017-10-23 10:51:25.302124 7f39f2e73700  2 No CORS configuration set yet for 
this bucket
2017-10-23 10:51:25.302126 7f39f2e73700  2 req 22:0.000193:s3:OPTIONS 
/aaa:options_cors:completing
2017-10-23 10:51:25.302191 7f39f2e73700  2 req 22:0.000258:s3:OPTIONS 
/aaa:options_cors:op status=-13
2017-10-23 10:51:25.302198 7f39f2e73700  2 req 22:0.000264:s3:OPTIONS 
/aaa:options_cors:http status=403
2017-10-23 10:51:25.302203 7f39f2e73700  1 == req done req=0x7f39f2e6d190 
op status=-13 http_status=403 ==
2017-10-23 10:51:25.302260 7f39f2e73700  1 civetweb: 0x7f3a30c7c000: 
172.20.41.101 - - [23/Oct/2017:10:51:25 +0300] "OPTIONS /aaa HTTP/1.1" 1 0 - 
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:56.0) Gecko/20100101 
Firefox/56.0

OPTIONS requests failed with 403.
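
Side note: the "No CORS configuration set yet for this bucket" line above
suggests the bucket itself has no CORS rules. A minimal sketch of setting one
with s3cmd (assuming a reasonably recent s3cmd and that the bucket really is
named "aaa"):

cat > cors.xml <<'EOF'
<CORSConfiguration>
  <CORSRule>
    <AllowedOrigin>https://console.{our_domain}.ru</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
</CORSConfiguration>
EOF
s3cmd setcors cors.xml s3://aaa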

Robin thank you so much!

We have plans to use haproxy with Civetweb and your rules solve our problem 
with OPTIONS requests!

Thank you guys!


> On 22 Oct 2017, at 23:10, Robin H. Johnson  wrote:
> 
> On Sun, Oct 22, 2017 at 01:31:03PM +, Rudenko Aleksandr wrote:
>> In the past we rewrote the HTTP response headers with Apache rules for our
>> web interface and passed the CORS check. But now it’s impossible to solve at
>> the balancer level.
> You CAN modify the CORS responses at the load-balancer level.
> 
> Find below the snippets needed to do it in HAProxy w/ Jewel-Civetweb;
> specifically, this completely overrides the CORS if the Origin matches some
> strings.
> 
> We use this to override the CORS for access via our customer interface panel,
> so regardless of what CORS they set on the bucket, the panel always works.
> 
> frontend ...
>  # Store variable for using later in the response.
>  http-request set-var(txn.origin) req.hdr(Origin)
>  acl override_cors var(txn.origin) -m end -i SOMEDOMAIN
>  acl override_cors var(txn.origin) -m sub -i SOMEDOMAIN
>  # Export fact as a boolean
>  http-request set-var(txn.override_cors) bool(true) if override_cors
>  http-request set-var(txn.override_cors) bool(false) unless override_cors
> 
> backend ...
>  # We inject Origin headers for ..., so we must declare to the client
>  # that they might be different in other requests.
>  http-response add-header Vary Origin if { var(txn.origin) -m len gt 1 }
>  # If the origin is the Panel, then override the CORS headers
>  acl override_cors var(txn.override_cors),bool
>  # 1. if OPTIONS: Override any 403 error to say it's ok instead
>  # 403 means 

Re: [ceph-users] Efficient storage of small objects / bulk erasure coding

2017-10-23 Thread Gregory Farnum
On Mon, Oct 23, 2017 at 9:37 AM Jiri Horky  wrote:

> Hi Greg,
>
>
> On 10/17/2017 11:49 PM, Gregory Farnum wrote:
>
> On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky  wrote:
>
>> Hi list,
>>
>> we are thinking of building relatively big CEPH-based object storage for
>> storage of our sample files - we have about 700M files ranging from very
>> small (1-4KiB) files to pretty big ones (several GiB). Median of file
>> size is 64KiB. Since the required space is relatively large (1PiB of
>> usable storage), we are thinking of utilizing erasure coding for this
>> case. On the other hand, we need to achieve at least 1200MiB/s
>> throughput on reads. The working assumption is 4+2 EC (thus 50% overhead).
>>
>> Since the EC is per-object, the small objects will be striped to even
>> smaller ones. With 4+2 EC, one needs (at least) 4 IOs to read a single
>> object in this scenario -> number of required IOPS when using EC is
>> relatively high. Some vendors (such as Hitachi, but I believe EMC as
>> well) do offline, predefined-chunk size EC instead. The idea is to first
>> write objects with replication factor of 3, wait for enough objects to
>> fill 4x 64MiB chunks and only do EC on that. This not only makes the EC
>> less computationally intensive, and repairs much faster, but it also
>> allows reading majority of the small objects directly by reading just
>> part of one of the chunk from it (assuming non degraded state) - one
>> chunk actually contains the whole object.
>> I wonder if something similar is already possible with CEPH and/or is
>> planned. For our use case of very small objects, it would mean near 3-4x
>> performance boosts in terms of required IOPS performance.
>>
>> Another option how to get out of this situation is to be able to specify
>> different storage pools/policies based on file size - i.e. to do 3x
>> replication of the very small files and only use EC for bigger files,
>> where the performance hit with 4x IOPS won't be that painful. But I I am
>> afraid this is not possible...
>>
>>
> Unfortunately any logic like this would need to be handled in your
> application layer. Raw RADOS does not do object sharding or aggregation on
> its own.
> CERN did contribute the libradosstriper, which will break down your
> multi-gigabyte objects into more typical sizes, but a generic system for
> packing many small objects into larger ones is tough — the choices depend
> so much on likely access patterns and such.
>
> I would definitely recommend working out something like that, though!
> -Greg
>
> this is unfortunate. I believe that for storage of small objects, this
> would be a deal breaker. Hitachi claims they can do 20+6 erasure coding
> when using predefined-size EC, which is hardly imaginable with the
> current CEPH implementation. Actually, for us, I am afraid that the lack of
> this feature means we would buy an object store instead of building
> it on open source technology :-/
>
> From a technical point of view, I don't see why the access pattern of such
> objects would change the storage strategy. If you left the bulk block size
> configurable, that should be enough, shouldn't it?
>

Well, there's two different things. If you're doing replicated writes and
then erasure coding data, you assume the data changes slowly enough for
that to work, or at least that the cost of erasure coding it is worthwhile.

That's not a bad bet, but the RADOS architecture simply doesn't support
doing anything like that internally; all decisions about replication versus
erasure coding and data placement happen on the level of a pool, not on
objects inside of them. So bulk packing of objects isn't really possible
for RADOS to do on its own, and the application has to drive any data
movement. That requires understanding patterns to select the right coding
chunks (so that objects tend to exist in one chunk), to know when is a good
time to physically read and write the data, etc.

This use case you're describing is certainly useful, but so far as I know
it's not implemented in any open-source storage solutions because it's
pretty specialized and requires a lot of backend investment that doesn't
pay off incrementally.
-Greg


>
> Regards
>
> Jiri Horky
>
>
>
>
>
>
>> Any other hint is sincerely welcome.
>>
>> Thank you
>> Jiri Horky
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Efficient storage of small objects / bulk erasure coding

2017-10-23 Thread Jiri Horky
Hi Greg,

On 10/17/2017 11:49 PM, Gregory Farnum wrote:
> On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky  > wrote:
>
> Hi list,
>
> we are thinking of building relatively big CEPH-based object
> storage for
> storage of our sample files - we have about 700M files ranging
> from very
> small (1-4KiB) files to pretty big ones (several GiB). Median of file
> size is 64KiB. Since the required space is relatively large (1PiB of
> usable storage), we are thinking of utilizing erasure coding for this
> case. On the other hand, we need to achieve at least 1200MiB/s
> throughput on reads. The working assumption is 4+2 EC (thus 50%
> overhead).
>
> Since the EC is per-object, the small objects will be striped to even
> smaller ones. With 4+2 EC, one needs (at least) 4 IOs to read a single
> object in this scenario -> number of required IOPS when using EC is
> relatively high. Some vendors (such as Hitachi, but I believe EMC as
> well) do offline, predefined-chunk size EC instead. The idea is to
> first
> write objects with replication factor of 3, wait for enough objects to
> fill 4x 64MiB chunks and only do EC on that. This not only makes
> the EC
> less computationally intensive, and repairs much faster, but it also
> allows reading majority of the small objects directly by reading just
> part of one of the chunk from it (assuming non degraded state) - one
> chunk actually contains the whole object.
> I wonder if something similar is already possible with CEPH and/or is
> planned. For our use case of very small objects, it would mean
> near 3-4x
> performance boosts in terms of required IOPS performance.
>
> Another option how to get out of this situation is to be able to
> specify
> different storage pools/policies based on file size - i.e. to do 3x
> replication of the very small files and only use EC for bigger files,
> where the performance hit with 4x IOPS won't be that painful. But
> I I am
> afraid this is not possible...
>
>
> Unfortunately any logic like this would need to be handled in your
> application layer. Raw RADOS does not do object sharding or
> aggregation on its own.
> CERN did contribute the libradosstriper, which will break down your
> multi-gigabyte objects into more typical sizes, but a generic system
> for packing many small objects into larger ones is tough — the choices
> depend so much on likely access patterns and such.
>
> I would definitely recommend working out something like that, though!
> -Greg
This is unfortunate. I believe that for storage of small objects, this
would be a deal breaker. Hitachi claims they can do 20+6 erasure coding
when using predefined-size EC, which is hardly imaginable with
the current CEPH implementation. Actually, for us, I am afraid that the lack
of this feature means we would buy an object store instead of
building it on open source technology :-/

From a technical point of view, I don't see why the access pattern of such
objects would change the storage strategy. If you left the bulk block size
configurable, that should be enough, shouldn't it?

Regards
Jiri Horky



>  
>
> Any other hint is sincerely welcome.
>
> Thank you
> Jiri Horky
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous ubuntu 16.04 HWE (4.10 kernel). ceph-disk can't prepare a disk

2017-10-23 Thread Wido den Hollander

> Op 22 oktober 2017 om 18:45 schreef Sean Sullivan :
> 
> 
> On freshly installed ubuntu 16.04 servers with the HWE kernel selected
> (4.10). I can not use ceph-deploy or ceph-disk to provision osd.
> 
> 
>  whenever I try I get the following::
> 
> ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys
> --bluestore --cluster ceph --fs-type xfs -- /dev/sdy
> command: Running command: /usr/bin/ceph-osd --cluster=ceph
> --show-config-value=fsid
> get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
> set_type: Will colocate block with data on /dev/sdy
> command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> --lookup bluestore_block_size
> [command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> --lookup bluestore_block_db_size
> command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> --lookup bluestore_block_size
> command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> --lookup bluestore_block_wal_size
> get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
> get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
> get_dm_uuid: get_dm_uuid /dev/sdy uuid path is /sys/dev/block/65:128/dm/uuid
> Traceback (most recent call last):
>   File "/usr/sbin/ceph-disk", line 9, in 
> load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
>   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5704, in
> run
> main(sys.argv[1:])
>   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5655, in
> main
> args.func(args)
>   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2091, in
> main
> Prepare.factory(args).prepare()
>   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2080, in
> prepare
> self._prepare()
>   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2154, in
> _prepare
> self.lockbox.prepare()
>   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2842, in
> prepare
> verify_not_in_use(self.args.lockbox, check_partitions=True)
>   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 950, in
> verify_not_in_use
> raise Error('Device is mounted', partition)
> ceph_disk.main.Error: Error: Device is mounted: /dev/sdy5
> 
> Unmounting the disk does not seem to help either. I'm assuming something is
> triggering too early, but I'm not sure how to delay it or figure that out.
> 
> Has anyone deployed on Xenial with the 4.10 kernel? Am I missing something
> important?

Yes, I have, without any issues. I did:

$ ceph-disk prepare /dev/sdb

Luminous defaults to BlueStore and that worked just fine.

Yes, this is with a 4.10 HWE kernel from Ubuntu 16.04.
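
If the prepare keeps failing with "Device is mounted", it may be left-over
partitions from an earlier attempt being auto-activated. A sketch of cleaning
the device first (this destroys any data on /dev/sdy):

umount /dev/sdy* 2>/dev/null
ceph-disk zap /dev/sdy
# or, more aggressively: sgdisk --zap-all /dev/sdy && partprobe /dev/sdy
ceph-disk -v prepare --dmcrypt --bluestore --cluster ceph /dev/sdy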

Wido

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Qs on caches, and cephfs

2017-10-23 Thread Jeff
Hey everyone,

Long time listener first time caller.
Thank you to everyone who works on Ceph, docs and code, I'm loving Ceph.
I've been playing with Ceph for awhile and have a few Qs.

Ceph cache tiers, can you have multiple tiered caches?

Also with cache tiers, can you have one cache pool for multiple backing
storage pools? The docs seem to be very careful about specifying one
pool so I suspect I know the answer already.

For CephFS, how do you execute a manual install and manual removal for MDS?

The docs explain how to use ceph-deploy for MDS installs, but I'm trying
to do everything manually right now to get a better understanding of it
all.

The Ceph docs seem to be version controlled, but I can't seem to find the
repo to update; if you can point me to it I'd be happy to submit patches
to it.

Thnx in advance!
Jeff.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com