Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread Craig Chi
Hi,

What is your OS? The permissions of the journal partition should be set by the udev
rules in /lib/udev/rules.d/95-ceph-osd.rules.
In this file, the relevant rule is described as:
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
OWNER:="ceph", GROUP:="ceph", MODE:="660", \
RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"

You can also use the udevadm command to test whether the partition has been
processed by the correct udev rule, like the following:

#>udevadm test /sys/block/sdb/sdb2

...
starting 'probe-bcache -o udev /dev/sdb2'
Process 'probe-bcache -o udev /dev/sdb2' succeeded.
OWNER 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
GROUP 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
MODE 0660 /lib/udev/rules.d/95-ceph-osd.rules:16
RUN '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name' 
/lib/udev/rules.d/95-ceph-osd.rules:16
...

Then /dev/sdb2 will have ceph:ceph permission automatically.

#>ls -l /dev/sdb2
brw-rw---- 1 ceph ceph 8, 18 Feb 13 19:43 /dev/sdb2
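
If the owner is still root after boot, a quick way to check and re-apply the rule by
hand is something like the following (a sketch assuming a GPT journal partition
prepared by ceph-disk; /dev/sdb and partition 2 are just example names):

#>sgdisk -i 2 /dev/sdb | grep "Partition GUID code"
# should report 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 for a Ceph journal partition
#>udevadm trigger --action=add --sysname-match=sdb2
# re-runs the udev rules for that partition
#>ls -l /dev/sdb2
# should now show ceph:ceph

If the partition type GUID is not the Ceph journal GUID, the rule above will never
match, and you would have to retype the partition
(sgdisk -t 2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb) or add a custom udev rule.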

Sincerely,
Craig Chi

On 2017-02-13 19:06, Piotr Dzionek <piotr.dzio...@seqr.com> wrote:
> 
> Hi,
> 
> 
> I am running ceph Jewel 10.2.5 with separate journals on SSD disks. It runs 
> pretty smoothly, however I stumbled upon an issue after a system reboot. The journal 
> disks become owned by root and ceph fails to start.
> 
> 
> starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 
> /var/lib/ceph/osd/ceph-4/journal
> 2017-02-10 16:24:29.924126 7fd07ab40800 -1 
> filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal 
> /var/lib/ceph/osd/ceph-4/journal: (13) Permission denied
> 2017-02-10 16:24:29.924210 7fd07ab40800 -1 osd.4 0 OSD:init: unable to mount 
> object store
> 2017-02-10 16:24:29.924217 7fd07ab40800 -1 #033[0;31m ** ERROR: osd init 
> failed: (13) Permission denied#033[0m
> 
> 
> I fixed this issue by finding the journal disks in the /dev dir and chowning them to 
> ceph:ceph. I remember that I had a similar issue after I installed it for the 
> first time. Is it a bug? Or do I have to set some kind of udev rules for 
> these disks?
> 
> 
> FYI, I have this issue after every restart now.
> 
> 
> Kind regards,
> Piotr Dzionek
> 
> 
> ___ ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating data from a Ceph clusters to another

2017-02-09 Thread Craig Chi
Hi,

Sorry I gave the wrong feature.
The rbd mirroring method can only be used on rbd images with the "journaling" feature
(not layering).
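
For reference, journaling can be turned on per image (it also requires exclusive-lock).
A minimal sketch, assuming an image named vm-disk1 in a pool named volumes (both
hypothetical names):

rbd feature enable volumes/vm-disk1 exclusive-lock
rbd feature enable volumes/vm-disk1 journaling
rbd info volumes/vm-disk1    # "features" should now list journaling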

Sincerely,
Craig Chi

On 2017-02-09 16:41, Craig Chi <craig...@synology.com> wrote:
> Hi John,
>   
> rbd mirroring can be configured per
> pool: http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
> However, the rbd mirroring method can only be used on rbd with the layering 
> feature; it can not mirror objects other than rbd for you.
>   
> Sincerely,
> Craig Chi
>   
> On 2017-02-09 16:24, Irek Fasikhov <malm...@gmail.com> wrote:
> > Hi.
> > I recommend using rbd import/export.
> >   
> >   
> > С уважением, Фасихов Ирек Нургаязович
> > Моб.: +79229045757
> >   
> >   
> >   
> >   
> > 2017-02-09 11:13 GMT+03:00 
> > 林自均<johnl...@gmail.com(mailto:johnl...@gmail.com)>:
> > > Hi,
> > >   
> > > I have 2 Ceph clusters, cluster A and cluster B. I want to move all the 
> > > pools on A to B. The pool names don't conflict between clusters. I guess 
> > > it's like RBD mirroring, except that it's pool mirroring.Is there any 
> > > proper ways to do it?
> > >   
> > > Thanks for any suggestions.
> > >   
> > > Best,
> > > John Lin
> > >   
> > >   
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com(mailto:ceph-users@lists.ceph.com)
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
> > >  ceph-users mailing list ceph-users@lists.ceph.com 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating data from a Ceph clusters to another

2017-02-09 Thread Craig Chi
Hi John,

rbd mirroring can be configured per
pool: http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
However, the rbd mirroring method can only be used on rbd with the layering feature;
it can not mirror objects other than rbd for you.
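
As a rough sketch of the per-pool setup described in that document (the pool name
volumes is just an example, and an rbd-mirror daemon has to run on the receiving
cluster):

rbd mirror pool enable volumes pool
rbd mirror pool peer add volumes client.local@primary    # register the peer cluster
rbd mirror pool status volumes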

Sincerely,
Craig Chi

On 2017-02-09 16:24, Irek Fasikhov <malm...@gmail.com> wrote:
> Hi.
> I recommend using rbd import/export.
>   
>   
> С уважением, Фасихов Ирек Нургаязович
> Моб.: +79229045757
>   
>   
>   
>   
> 2017-02-09 11:13 GMT+03:00 林自均 <johnl...@gmail.com>:
> > Hi,
> >   
> > I have 2 Ceph clusters, cluster A and cluster B. I want to move all the 
> > pools on A to B. The pool names don't conflict between clusters. I guess 
> > it's like RBD mirroring, except that it's pool mirroring.Is there any 
> > proper ways to do it?
> >   
> > Thanks for any suggestions.
> >   
> > Best,
> > John Lin
> >   
> >   
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com(mailto:ceph-users@lists.ceph.com)
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
> >  ceph-users mailing list ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg stuck in peering while power failure

2017-01-10 Thread Craig Chi
Hi Sam,

Thank you for your precise inspection.

I reviewed the log at the time, and I discovered that the cluster marked an OSD
failed just after I shut the first unit down. Thus, as you said, the pg couldn't finish
peering because the second unit was then shut off suddenly.

I much appreciate your advice, but I aim to keep my cluster working when 2 
storage nodes are down. The unexpected OSD failed with the following log just 
at the time I shut the first unit down:

2017-01-10 12:30:07.905562 mon.1 172.20.1.3:6789/0 28484 : cluster [INF] 
osd.153 172.20.3.2:6810/26796 failed (2 reporters from different host after 
20.072026 >= grace 20.00)

But that OSD was not actually dead; more likely it had a slow response to
heartbeats. What I think is that increasing osd_heartbeat_grace may somehow
mitigate the issue.
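
If we experiment with that, a minimal sketch would be (30 is only an example value,
and raising it delays real failure detection; as far as I know the monitors also use
this option when evaluating failure reports, so it should be set on them too):

ceph tell osd.* injectargs '--osd_heartbeat_grace 30'    # runtime change
# and persist it in ceph.conf for the next restart:
[global]
osd heartbeat grace = 30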

Sincerely,
Craig Chi

On 2017-01-11 00:08, Samuel Just <sj...@redhat.com> wrote:
> { "name": "Started\/Primary\/Peering",
>   "enter_time": "2017-01-10 13:43:34.933074",
>   "past_intervals": [
>     { "first": 75858,
>       "last": 75860,
>       "maybe_went_rw": 1,
>       "up": [ 345, 622, 685, 183, 792, 2147483647, 2147483647, 401, 516 ],
>       "acting": [ 345, 622, 685, 183, 792, 2147483647, 2147483647, 401, 516 ],
>       "primary": 345,
>       "up_primary": 345 },
>
> Between 75858 and 75860, 345, 622, 685, 183, 792, 2147483647, 2147483647, 401, 516
> was the acting set. The current acting set 345, 622, 685, 183, 2147483647,
> 2147483647, 153, 401, 516 needs *all 7* of the osds from epochs 75858 through 75860
> to ensure that it has any writes completed during that time. You can make transient
> situations like that less of a problem by setting min_size to 8 (though it'll prevent
> writes with 2 failures until backfill completes). A possible enhancement for an EC
> pool would be to gather the infos from those osds anyway and use that to rule out
> writes (if they actually happened, you'd still be stuck).
> -Sam
>
> On Tue, Jan 10, 2017 at 5:36 AM, Craig Chi <craig...@synology.com> wrote:
> > Hi List,
> >
> > I am testing the stability of my Ceph cluster with power failure.
> >
> > I brutally powered off 2 Ceph units with each 90 OSDs on it while the client I/O
> > was continuing.
> >
> > Since then, some of the pgs of my cluster stucked in peering
> >
> > pgmap v3261136: 17408 pgs, 4 pools, 176 TB data, 5082 kobjects
> > 236 TB used, 5652 TB / 5889 TB avail
> > 8563455/38919024 objects degraded (22.003%)
> > 13526 active+undersized+degraded
> > 3769 active+clean
> > 104 down+remapped+peering
> > 9 down+peering
> >
> > I queried the peering pg (all on EC pool with 7+2) and got blocked information
> > (full query: http://pastebin.com/pRkaMG2h )
> >
> > "probing_osds": [ "153(6)", "183(3)", "345(0)", "401(7)", "516(8)", "622(1)", "685(2)" ],
> > "blocked": "peering is blocked due to down osds",
> > "down_osds_we_would_probe": [ 792 ],
> > "peering_blocked_by": [ { "osd": 792, "current_lost_at": 0,
> >     "comment": "starting or marking this osd lost may let us proceed" } ]
> >
> > osd.792 is exactly on one of the units I powered off. And I think the I/O
> > associated with this pg is paused too.
> >
> > I have checked the troubleshooting page on Ceph website
> > (http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/),
> > it says that start the OSD or mark it lost can make the procedure continue.
> >
> > I am sure that my cluster was healthy before the power outage occurred. I am
> > wondering if the power outage really happens in production environment, will
> > it also freeze my client I/O if I don't do anything? Since I just lost 2
> > redundancies (I have erasure code with 7+2), I think it should still serve
> > normal functionality.
> >
> > Or if I am doing something wrong? Please give me some suggestions, thanks.
> >
> > Sincerely,
> > Craig Chi
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg stuck in peering while power failure

2017-01-10 Thread Craig Chi
Hi List,

I am testing the stability of my Ceph cluster with power failure.

I brutally powered off 2 Ceph units with each 90 OSDs on it while the client 
I/O was continuing.

Since then, some of the pgs of my cluster stucked in peering

pgmap v3261136: 17408 pgs, 4 pools, 176 TB data, 5082 kobjects
236 TB used, 5652 TB / 5889 TB avail
8563455/38919024 objects degraded (22.003%)
13526 active+undersized+degraded
3769 active+clean
104 down+remapped+peering
9 down+peering

I queried the peering pg (all on EC pool with 7+2) and got blocked information 
(full query:http://pastebin.com/pRkaMG2h)

"probing_osds": [
"153(6)",
"183(3)",
"345(0)",
"401(7)",
"516(8)",
"622(1)",
"685(2)"
],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
792
],
"peering_blocked_by": [
{
"osd": 792,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
}
]


osd.792 is exactly on one of the units I powered off. And I think the I/O 
associated with this pg is paused too.

I have checked the troubleshooting page on the Ceph website
(http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/);
it says that starting the OSD or marking it lost can make the procedure continue.
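
For completeness, those two options look roughly like this (osd.792 as in the query
output above; marking an OSD lost tells Ceph to give up on data only that OSD holds,
so it is a last resort):

systemctl start ceph-osd@792                 # preferred: bring the down OSD back
ceph osd lost 792 --yes-i-really-mean-it     # only if the OSD and its data are truly gone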

I am sure that my cluster was healthy before the power outage occurred. I am
wondering, if a power outage really happens in a production environment, will it
also freeze my client I/O if I don't do anything? Since I just lost 2 
redundancies (I have erasure code with 7+2), I think it should still serve 
normal functionality.

Or am I doing something wrong? Please give me some suggestions, thanks.

Sincerely,
Craig Chi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High OSD apply latency right after new year (the leap second?)

2017-01-05 Thread Craig Chi
Hi ,

I'm glad to know that it happened not only to me.
Though it is harmless, it seems like some kind of bug...
Are there any Ceph developers who know exactly how the
"ceph osd perf" command is implemented?
Is the leap second really responsible for this behavior?
Thanks.
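
For anyone hitting the same thing, the numbers can be re-checked with something like
(ordinary commands, nothing exotic):

ceph osd perf     # per-OSD fs_commit_latency(ms) / fs_apply_latency(ms)
ntpq -p           # sanity-check the node's NTP peers and offset

Since the counters live inside the OSD process, restarting the affected ceph-osd
daemons should also clear a value that is simply stuck.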

Sincerely,
Craig Chi

On 2017-01-04 19:55, Alexandre DERUMIER <aderum...@odiso.com> wrote:
> yes, same here on 3 production clusters. no impact, but a nice happy new
> year alert ;)
> Seems that google provides ntp servers to avoid the brutal 1 second leap:
> https://developers.google.com/time/smear
>
> ----- Original Message -----
> From: "Craig Chi" <craig...@synology.com>
> To: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Wednesday, 4 January 2017 11:26:21
> Subject: [ceph-users] High OSD apply latency right after new year (the leap second?)
>
> Hi List,
>
> Three of our Ceph OSDs got unreasonably high latency right after the first second of
> the new year (2017/01/01 00:00:00 UTC, I have attached the metrics and I am in the
> UTC+8 timezone). There is exactly one pg (size=3) that contains just these 3 OSDs.
>
> The OSD apply latency is usually up to 25 minutes, and I can also see this large
> number randomly when I execute the "ceph osd perf" command. But the 3 OSDs do not
> show strange behavior and are performing fine so far.
>
> I have no idea how "ceph osd perf" is implemented, but does it have any relation to
> the leap second this year? Since the cluster is not in production, and the developers
> were all celebrating the new year at that time, I can not think of other possibilities.
>
> Did your clusters also get this interestingly unexpected new year's gift too?
>
> Sincerely,
> Craig Chi
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High OSD apply latency right after new year (the leap second?)

2017-01-04 Thread Craig Chi
Hi List,

Three of our Ceph OSDs got unreasonably high latency right after the first 
second of the new year (2017/01/01 00:00:00 UTC, I have attached the metrics 
and I am in the UTC+8 timezone). There is exactly one pg (size=3) that contains just
these 3 OSDs.

The OSD apply latency is usually up to 25 minutes, and I can also see this 
large number randomly when I execute the "ceph osd perf" command. But the 3 OSDs
do not show strange behavior and are performing fine so far.

I have no idea how "ceph osd perf" is implemented, but does it have any relation to
the leap second this year? Since the cluster is not in production, and the 
developers were all celebrating the new year at that time, I can not think of other
possibilities.

Did your clusters also get this interestingly unexpected new year's gift too?

Sincerely,
Craig Chi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph - Health and Monitoring

2017-01-02 Thread Craig Chi
Hello,

I suggest Prometheus
with ceph_exporter (https://github.com/digitalocean/ceph_exporter) and Grafana 
(UI). It can also monitor the nodes' health and any other services you want.
And it has a beautiful UI.
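
A minimal scrape config for that setup could look like the following sketch (9128 is,
if I remember correctly, the default port ceph_exporter listens on; adjust the target
to your deployment):

# prometheus.yml
scrape_configs:
  - job_name: 'ceph'
    static_configs:
      - targets: ['ceph-exporter-host:9128']

Grafana then only needs Prometheus added as a data source; there are ready-made Ceph
dashboards you can import.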

Sincerely,
Craig Chi

On 2017-01-02 21:32, ulem...@polarzone.de wrote:
> Hi Andre,
> I use check_ceph_dash on top of ceph-dash for this (it is a nagios/icinga plugin).
> https://github.com/Crapworks/ceph-dash
> https://github.com/Crapworks/check_ceph_dash
> ceph-dash provides a simple, clear overview as a web dashboard.
>
> Udo
>
> On 2017-01-02 12:42, Andre Forigato wrote:
> > Hello,
> >
> > I am responsible for the health of the servers and the entire Ceph system.
> > What should I use to monitor the entire Ceph environment? Monitor all objects.
> >
> > Which one is the best? Is it SNMP only?
> >
> > Thanks.
> >
> > Andre
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why I don't see "mon osd min down reports" in "config show" report result?

2016-12-23 Thread Craig Chi
Hi Stéphane Klein,

The two mail threads you sent describe similar situations.

1. Just a reminder: if you are using Jewel, you should look for the jewel page
in the URL. For example, you should see
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
instead of hammer.

2. The available configuration options and their default values change all the
time, but the Ceph documentation is not always up-to-date. For the correct
available options and accurate default values for your Ceph version,
please refer to the corresponding version tag of the source code.

https://github.com/ceph/ceph/blob/master/src/common/config_opts.h

You should switch to the branch you are using.
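
You can also ask a running daemon for a single option through its admin socket, which
avoids guessing from the docs; reusing the socket path from your own command:

ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon-1.asok config get mon_osd_min_down_reporters

If an option name is unknown to your binary (as "mon osd min down reports" appears to
be on Jewel), this should simply return an error, which is a quick way to confirm it
no longer exists in your version.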

Sincerely,
Craig Chi

On 2016-12-23 18:55, Stéphane Klein <cont...@stephane-klein.info> wrote:
> Hi,
> when I execute:
>   
> ```
> root@ceph-mon-1:/home/vagrant# ceph --admin-daemon 
> /var/run/ceph/ceph-mon.ceph-mon-1.asok config show | grep "down"
> "mon_osd_adjust_down_out_interval": "true",
> "mon_osd_down_out_interval": "300",
> "mon_osd_down_out_subtree_limit": "rack",
> "mon_pg_check_down_all_threshold": "0.5",
> "mon_warn_on_osd_down_out_interval_zero": "true",
> "mon_osd_min_down_reporters": "2",
> "mds_shutdown_check": "0",
> "mds_mon_shutdown_timeout": "5",
> "osd_max_markdown_period": "600",
> "osd_max_markdown_count": "5",
> "osd_mon_shutdown_timeout": "5",
> ```I don't see:
>   
> monosdmindownreports
>   
> Why? This field is present 
> here:http://docs.ceph.com/docs/hammer/rados/configuration/mon-osd-interaction/
>   
> Best regards,
> Stéphane
> --
> Stéphane 
> Klein<cont...@stephane-klein.info(mailto:cont...@stephane-klein.info)>
> blog:http://stephane-klein.info
> cv :http://cv.stephane-klein.info
> Twitter:http://twitter.com/klein_stephane___
>  ceph-users mailing list ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD creation and sequencing.

2016-12-16 Thread Craig Chi
Hi Daniel,

If you deploy your cluster with the manual method, you can specify the OSD number
as you wish.
Here are the steps of manual deployment: 
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#adding-osds
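
A condensed sketch of the relevant part (Jewel-era syntax; the id 341 and the hostname
are only examples, and the data directory must already be prepared and mounted as in
the document above):

UUID=$(uuidgen)
ceph osd create $UUID 341          # explicitly request OSD id 341
mkdir -p /var/lib/ceph/osd/ceph-341
ceph-osd -i 341 --mkfs --mkkey --osd-uuid $UUID
ceph auth add osd.341 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-341/keyring
ceph osd crush add osd.341 1.0 host=yourhostname

As far as I recall, the id argument of "ceph osd create" is only accepted if that id is
currently free, so this is mostly useful when replacing a failed OSD while keeping its
number.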

Sincerely,
Craig Chi

On 2016-12-16 21:51, Daniel Corley <r...@southernpenguin.com> wrote:
> 
> Is there a way to specify an OSD number on creation? We run into situations 
> where we have nodes where, if the OSDs are not created sequentially following 
> the sda,sdb naming convention, then the numbers are less than easy to correlate 
> to hardware. In the example shown below we know OSD #341 has gone out. But it 
> requires more research to find out what node and what drive are the issue. I 
> know a UUID can be specified for the partition name but it's not very helpful 
> (or I am using it wrong).
> 
> 
> Apologies for not specifying the version used or host OS, as we are still in pre-
> deployment testing; we are trying several versions from Hammer to Jewel with 
> different Debian and RHEL hosts to measure what fits best for our environment.
> 
> > 
> > ID   WEIGHT  TYPE NAME     UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
> > -1   0       root default
> > 0    0       osd.0         up       1.0       1.0
> > 1    0       osd.1         up       1.0       1.0
> > 2    0       osd.2         up       1.0       1.0
> > 3    0       osd.3         up       1.0       1.0
> > 4    0       osd.4         up       1.0       1.0
> > 5    0       osd.5         up       1.0       1.0
> > 6    0       osd.6         up       1.0       1.0
> > 
> > 
> > ~
> > 334  0       osd.334       up       1.0       1.0
> > 335  0       osd.335       up       1.0       1.0
> > 336  0       osd.336       up       1.0       1.0
> > 337  0       osd.337       up       1.0       1.0
> > 338  0       osd.338       up       1.0       1.0
> > 339  0       osd.339       up       1.0       1.0
> > 340  0       osd.340       up       1.0       1.0
> > 341  0       osd.341       down     0         1.0
> > 342  0       osd.342       up       1.0       1.0
> > 343  0       osd.343       up       1.0       1.0
> > 344  0       osd.344       up       1.0       1.0
> > 345  0       osd.345       up       1.0       1.0
> > 346  0       osd.346       up       1.0       1.0
> > 
> 
> 
> ___ ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Wrong pg count when pg number is large

2016-12-12 Thread Craig Chi
Hi Greg,

Sorry, I didn't preserve the environment due to urgent needs.

However, I think you are right, because at that time I had just purged all pools and
re-created them in a short time. Thank you very much!

Sincerely,
Craig Chi

On 2016-12-13 14:21, Gregory Farnum <gfar...@redhat.com> wrote:
> On Thu, Dec 1, 2016 at 8:35 AM, Craig Chi <craig...@synology.com> wrote:
> > Hi list,
> >
> > I am testing the Ceph cluster with unpractical pg numbers to do some experiments.
> >
> > But when I use ceph -w to watch my cluster status, I see pg numbers doubled.
> > From my ceph -w
> >
> > root@mon1:~# ceph -w
> > cluster 1c33bf75-e080-4a70-9fd8-860ff216f595
> > health HEALTH_WARN
> > too many PGs per OSD (514 > max 300)
> > noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> > monmap e1: 3 mons at {mon1=172.20.1.2:6789/0,mon2=172.20.1.3:6789/0,mon3=172.20.1.4:6789/0}
> > election epoch 634, quorum 0,1,2 mon1,mon2,mon3
> > osdmap e48791: 420 osds: 420 up, 420 in
> > flags noout,noscrub,nodeep-scrub,sortbitwise
> > pgmap v892347: 25600 pgs, 4 pools, 14321 GB data, 3579 kobjects
> > 23442 GB used, 3030 TB / 3053 TB avail
> > 25600 active+clean
> >
> > 2016-12-01 17:26:20.358407 mon.0 [INF] pgmap v892346: 51200 pgs: 51200 active+clean; 16973 GB data, 24609 GB used, 4556 TB / 4580 TB avail
> > 2016-12-01 17:26:22.877765 mon.0 [INF] pgmap v892347: 51200 pgs: 51200 active+clean; 16973 GB data, 24610 GB used, 4556 TB / 4580 TB avail
> >
> > From my ceph osd pool ls detail
> >
> > pool 81 'vms' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 48503 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0
> > pool 82 'images' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 48507 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0
> > pool 85 'objects' erasure size 20 min_size 17 crush_ruleset 1 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 48778 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 4352
> > pool 86 'volumes' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 16384 pgp_num 16384 last_change 48786 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0
> >
> > I think I created 25600 pgs totally, but ceph -s reported 25600 / 51200 randomly. However ceph -w always reported 51200 on the latest line.
> >
> > If this a kind of bug or just I was doing something wrong? Feel free to let me know if you need more information.
>
> Are you still seeing this? It certainly sounds like a bug, but I think the output
> you're seeing could also happen if you had some deleted pools which the OSDs hadn't
> been able to purge yet.
> -Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to start/restart osd and mon manually (not by init script or systemd)

2016-12-12 Thread Craig Chi
Hi Wang,

Did you ever check if there are error logs in /var/log/ceph/ceph-mon.xt2.log or 
/var/log/ceph/ceph-osd.0.log?

BTW, just out of curiosity, why don't you just use systemd to start your osd 
and mon? systemd automatically handles restarting the processes after a few kinds
of simple failure.
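
For reference, the systemd way on Jewel looks roughly like this (unit names follow
ceph-<type>@<id>; "xt2" and "0" are the ids from your commands):

systemctl start ceph-mon@xt2
systemctl enable ceph-mon@xt2       # start automatically on boot
systemctl start ceph-osd@0
systemctl status ceph-osd@0
journalctl -u ceph-osd@0 -e         # recent logs if it fails to start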

Sincerely,
Craig Chi

On 2016-12-11 10:18, WANG Siyuan <wangsiyuanb...@gmail.com> wrote:
> Hi, all
> I want to deploy ceph manually. When I finish config, I need to start mon and 
> osd manually.
> I used these commands. I found these commands in systemd/ceph-mon@.service 
> and systemd/ceph-osd@.service:
> 
> ceph-mon --id xt2 --setuser ceph --setgroup ceph
> ceph-osd --cluster ceph --id 0 --setuser ceph --setgroup ceph
> 
> xt2 is the hostname(and domain name) of mon server.
> 
> But there is a problem: I can't add new OSDs to the ceph cluster. Killing and 
> starting the MON does not work.
> 
> So I want to know the right command to start/restart OSD and MON. Thanks very 
> much.
> 
> 
> Yours sincerely,
> WANG Siyuan___ ceph-users mailing 
> list ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Wrong pg count when pg number is large

2016-12-01 Thread Craig Chi
Hi list,

I am testing the Ceph cluster with impractical pg numbers to do some 
experiments.

But when I use ceph -w to watch my cluster status, I see pg numbers doubled. 
From my ceph -w

root@mon1:~# ceph -w
cluster 1c33bf75-e080-4a70-9fd8-860ff216f595
health HEALTH_WARN
too many PGs per OSD (514 > max 300)
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
monmap e1: 3 mons at 
{mon1=172.20.1.2:6789/0,mon2=172.20.1.3:6789/0,mon3=172.20.1.4:6789/0}
election epoch 634, quorum 0,1,2 mon1,mon2,mon3
osdmap e48791: 420 osds: 420 up, 420 in
flags noout,noscrub,nodeep-scrub,sortbitwise
pgmap v892347: 25600 pgs, 4 pools, 14321 GB data, 3579 kobjects
23442 GB used, 3030 TB / 3053 TB avail
25600 active+clean

2016-12-01 17:26:20.358407 mon.0 [INF] pgmap v892346: 51200 pgs: 51200 
active+clean; 16973 GB data, 24609 GB used, 4556 TB / 4580 TB avail
2016-12-01 17:26:22.877765 mon.0 [INF] pgmap v892347: 51200 pgs: 51200 
active+clean; 16973 GB data, 24610 GB used, 4556 TB / 4580 TB avail

From my ceph osd pool ls detail

pool 81 'vms' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 512 pgp_num 512 last_change 48503 flags 
hashpspool,nodelete,nopgchange,nosizechange stripe_width 0
pool 82 'images' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 512 pgp_num 512 last_change 48507 flags 
hashpspool,nodelete,nopgchange,nosizechange stripe_width 0
pool 85 'objects' erasure size 20 min_size 17 crush_ruleset 1 object_hash 
rjenkins pg_num 8192 pgp_num 8192 last_change 48778 flags 
hashpspool,nodelete,nopgchange,nosizechange stripe_width 4352
pool 86 'volumes' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 16384 pgp_num 16384 last_change 48786 flags 
hashpspool,nodelete,nopgchange,nosizechange stripe_width 0

I think I created 25600 pgs totally, but ceph -s reported 25600 / 51200 
randomly. However ceph -w always reported 51200 on the latest line.
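
One way to cross-check what the monitors should be reporting is to sum pg_num over the
pools yourself, for example (a quick sketch against the `ceph osd pool ls detail`
output above):

ceph osd pool ls detail | grep -oE 'pg_num [0-9]+' | awk '{sum += $2} END {print sum}'

which gives 512 + 512 + 8192 + 16384 = 25600 for the four pools listed.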

Is this a kind of bug, or was I just doing something wrong? Feel free to let me 
know if you need more information.

Thanks.

Sincerely,
Craig Chi



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs cause kernel unresponsive

2016-11-28 Thread Craig Chi
Hi Brad,

We fully understand that the hardware we currently use is below Ceph's
recommendation, so we are seeking a method to lower or restrict the
resources needed by an OSD. Losing some performance is definitely acceptable for
us.

The reason why we did these experiments and discussed causes is that we want to
find the true factors that drive the memory usage. I think it is beneficial 
for the Ceph community, and we can convince our customers and other Ceph users 
of the feasibility and stability of Ceph on different hardware 
infrastructure for production.

With your comments, we have more confidence on the memory consumption of Ceph 
OSD.

We hope there still exist some methods or workarounds to bound the memory 
consumption (tuning configs?), or we will just accept the recommendations on 
the website. (Also, could we say 1GB / 1TB is the maximum requirement, or just 
enough under normal circumstances?)
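
One direction we are considering (a sketch only, not something we have validated at
this scale) is to cap each OSD with a systemd drop-in, so that a runaway ceph-osd is
OOM-killed and restarted individually instead of taking the whole node down:

# /etc/systemd/system/ceph-osd@.service.d/memory.conf
[Service]
MemoryLimit=4G

# then
systemctl daemon-reload
systemctl restart ceph-osd@<id>

The 4G value is just an example; an OSD that legitimately needs more during recovery
would simply be killed and restarted, so this trades recovery speed for node stability.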

Thank you very much.

Sincerely,
Craig Chi

On 2016-11-29 10:27, Brad Hubbard <bhubb...@redhat.com> wrote:
>   
>   
> On Tue, Nov 29, 2016 at 3:12 AM, Craig 
> Chi<craig...@synology.com(mailto:craig...@synology.com)>wrote:
> > Hi guys,
> >   
> > Thanks to both of your suggestions, we had some progression on this issue.
> >   
> > I tuned vm.min_free_kbytes to 16GB and raised vm.vfs_cache_pressure to 200, 
> > and I did observe that the OS keep releasing cache while the OSDs want more 
> > and more memory.
>   
> vfs_cache_pressure is a percentage so values>100 have always seemed odd to me.
> >   
> > OK. Now we are going to reproduce the hanging issue.
> >   
> > 1. set the cluster with noup flag
> > 2. restart all ceph-osd process (then we can see all OSDs are down from 
> > ceph monitor)
> > 3. unset noup flag
> >   
> > As expected the OSDs started to consume memory, and eventually the kernel 
> > still hanged without response.
> >   
> > Therefore I learned to gather the vmcore and tried to investigate further 
> > as Brad advised.
> >   
> > The vmcore dump file was unbeliviably huge -- about 6 GB per dump. However 
> > it's helpful that we quickly found the following abnormal things:
> >   
> > 1. The memory was exhausted as expected.
> >   
> > crash> kmem -i
> >                  PAGES      TOTAL  PERCENTAGE
> >    TOTAL MEM  63322527   241.6 GB
> >         FREE    676446     2.6 GB    1% of TOTAL MEM
> >         USED  62646081     239 GB   98% of TOTAL MEM
> >       SHARED    621336     2.4 GB    0% of TOTAL MEM
> >      BUFFERS     47307   184.8 MB    0% of TOTAL MEM
> >       CACHED    376205     1.4 GB    0% of TOTAL MEM
> >         SLAB    455400     1.7 GB    0% of TOTAL MEM
> >
> >   TOTAL SWAP   4887039    18.6 GB
> >    SWAP USED   3855938    14.7 GB   78% of TOTAL SWAP
> >    SWAP FREE   1031101     3.9 GB   21% of TOTAL SWAP
> >
> > COMMIT LIMIT  36548302   139.4 GB
> >    COMMITTED  92434847   352.6 GB  252% of TOTAL LIMIT
>   
> As Nick already mentioned,90x8TB disks is 720Tb of storage and, according 
> tohttp://docs.ceph.com/docs/jewel/start/hardware-recommendations/#ramduring 
> recovery you may require ~1GB per 1TB of storage per daemon.
> >   
> >   
> > 2. Each OSD used a lot of memory. (We have only total 256 GB RAM but there 
> > are 90 OSDs in a node)
> >   
> > # Find 10 largest memory consumption processes
> > crash>ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = 
> > $8/1024" MB"; print}' | tail -10
> > 100864 1 12 883a43e1b700 IN 1.1 7484884 2973.33 MB ceph-osd
> > 87400 1 27 8838538ae040 IN 1.1 7557500 3036.92 MB ceph-osd
> > 108126 1 22 882bcca91b80 IN 1.2 7273068 3045.8 MB ceph-osd
> > 39787 1 28 883f468ab700 IN 1.2 7300756 3067.88 MB ceph-osd
> > 44861 1 20 883cf925 IN 1.2 7327496 3067.89 MB ceph-osd
> > 30486 1 23 883f59e1c4c0 IN 1.2 7332828 3083.58 MB ceph-osd
> > 125239 1 15 882687018000 IN 1.2 6965560 3103.36 MB ceph-osd
> > 123807 1 19 88275d90ee00 IN 1.2 7314484 3173.48 MB ceph-osd
> > 116445 1 1 882863926e00 IN 1.2 7279040 3269.09 MB ceph-osd
> > 94442 1 0 882ed2d01b80 IN 1.3 7566148 3418.69 MB ceph-osd
>   
> Based on the information above this is not excessive memory usage AFAICS.
> >   
> >   
> > 3. The excessive amount of message threads.
> >   
> > crash>ps | grep ms_pipe_read | wc -l
> > 144112
> > crash>ps | grep ms_pipe_write | wc -l
> > 146692
> >   
> > Totally up to 290k threads in ms_pipe_*.
> >   
> >   
> > 4. Several tries we had, and we luckily got some memory profiles before oom 
> > killer started to work.
> >   
> > # Parse the smaps of a ceph-osd process
> > by parse_smaps.py (https://github.com/craig08/parse_smaps)
> &

Re: [ceph-users] Ceph OSDs cause kernel unresponsive

2016-11-28 Thread Craig Chi
13279168 (12.7 MiB) Bytes in transfer cache freelist
MALLOC: +96438792 (92.0 MiB) Bytes in thread cache freelists
MALLOC: +25817248 (24.6 MiB) Bytes in malloc metadata
MALLOC:
MALLOC: =6035574944 ( 5756.0 MiB) Actual memory used (physical + swap)
MALLOC: +35741696 (34.1 MiB) Bytes released to OS (aka unmapped)
MALLOC:
MALLOC: =6071316640 ( 5790.1 MiB) Virtual address space used
MALLOC:
MALLOC:     357627  Spans in use
MALLOC:         89  Thread heaps in use
MALLOC:       8192  Tcmalloc page size



6. google-pprof the heap dump
Total: 1916.6 MB
1036.9  54.1%  54.1%   1036.9  54.1%  ceph::buffer::create_aligned
 313.9  16.4%  70.5%    313.9  16.4%  ceph::buffer::list::append@a78c00
 220.0  11.5%  82.0%    220.0  11.5%  std::_Rb_tree::_M_emplace_hint_unique
 130.0   6.8%  88.7%    130.0   6.8%  leveldb::ReadBlock
 129.8   6.8%  95.5%    129.8   6.8%  std::vector::_M_default_append
  22.1   1.2%  96.7%     53.4   2.8%  PG::add_log_entry
   7.4   0.4%  97.1%      7.4   0.4%  ceph::buffer::list::crc32c
   7.0   0.4%  97.4%      7.0   0.4%  ceph::log::Log::create_entry
   5.1   0.3%  97.7%      5.1   0.3%  OSD::get_tracked_conf_keys
   4.7   0.2%  97.9%      4.8   0.2%  Pipe::Pipe
   3.3   0.2%  98.1%      9.0   0.5%  decode_message
   3.2   0.2%  98.3%      6.7   0.3%  SimpleMessenger::add_accept_pipe
   3.2   0.2%  98.4%      4.5   0.2%  OSD::_make_pg
...


We have some hypotheses after discussion:

1. We observed that the number of connections (counted by `netstat -ant | grep 
ESTABLISHED | wc -l`) rises rapidly along with the average memory used by 
ceph-osd, especially the [heap] section.
>Is there any relation between memory usage and the number of network 
>connections?

2. After unsetting the noup flag, the number of connections bursts to over 200k in a
few seconds.
>We have an EC pool created with k=17, m=3. Is the large combination of (k,m) 
>responsible for these connections?
>We have an average of 300 pgs per OSD in the crash experiment. Is the high pgs per 
>OSD responsible for these connections?
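
A rough back-of-envelope under those assumptions: with k+m = 20 chunks, each of the
~300 PGs on an OSD can involve up to 19 other OSDs, so a single OSD may need on the
order of a few thousand peer connections (bounded by the 630 OSDs in the cluster times
the handful of sockets kept per peer). 290k messenger threads / 2 threads per
connection is about 145k connections on a node with 90 OSDs, i.e. roughly 1,600
connections per OSD, which is consistent with that estimate.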

3. With the simple messenger, two messenger threads are forked for each single
network connection.
>We think 290k messenger threads at the same time can hardly work normally and 
>efficiently.
>Will it be better with the async messenger? We tried the async messenger and saw the
>thread count decrease, but the number of network connections stayed high and the 
>kernel hang issue continued.
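
For the record, the switch we tried is just a ceph.conf change (Jewel still defaults to
the simple messenger, and async was less mature there, so treat it as experimental):

[global]
ms type = async
ms async op threads = 5    # example value

followed by restarting the daemons.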


Now we are still struggling with this problem.

Please kindly instruct us if you have any directions.

Sincerely,
Craig Chi

On 2016-11-25 21:26, Nick Fisk <n...@fisk.me.uk> wrote:
>   
> Hi,
>   
>   
>   
>   
>   
> I didn’t so the maths, so maybe 7GB isn’t worth tuning for, although every 
> little helps ;-)
>   
>   
>   
>   
>   
> I don’t believe peering or recovery should effect this value, but other 
> things will consume memory during recovery, but I’m not aware if this can be 
> limited or tuned.
>   
>   
>   
>   
>   
> Yes, the write and read cache’s will consume memory and may limit Linux’s 
> ability to react quickly enough in tight memory conditions. I believe you can 
> be in a state where it looks like you have more memory potentially available 
> than actually is usable at that point in time. The min_free_bytes can help 
> here.
>   
>   
>   
>   
>   
> From:Craig Chi [mailto:craig...@synology.com]
> Sent:25 November 2016 01:46
> To:Brad Hubbard<bhubb...@redhat.com>
> Cc:Nick Fisk<n...@fisk.me.uk>; Ceph Users<ceph-users@lists.ceph.com>
> Subject:Re: [ceph-users] Ceph OSDs cause kernel unresponsive
>   
>   
>   
>   
>   
>   
>   
> Hi Nick,
>   
>   
>   
>   
>   
>   
>   
> I have seen the report before, if I understand correctly, the 
> osd_map_cache_size generally introduces a fixed amount of memory usage. We 
> are using the default value of 200, and a single osd map I got from our 
> cluster is 404KB.
>   
>   
>   
>   
>   
>   
>   
> That is totally 404KB * 200 * 90 (osds) = about 7GB on each node.
>   
>   
>   
>   
>   
>   
>   
> Will the memory consumption generated by this factor become larger when 
> unstably peering or recovering? If not, we still need to find the root cause 
> of why free memory drops without control.
>   
>   
>   
>   
>   
>   
>   
> Does anyone know that what is the relation between filestore or journal 
> configurations and the OSD's memory consumption? Is it possible that the 
> filestore queue or journal queue occupy huge memory pages and cause 
> filesystem cache hard to release (and result in oom)?
>   
>   
>   
>   
>   
>   
>   
> At last, about nobarrier, I fully knew the consequence and is seriously 
> testing on this option. Sincerely appreciate your kindness and useful 
> suggestions.
>   
>   
>   
>   
>   
>   
>   
> Sincerely,
> Craig Chi
>   
>   
>   
>   
> On 2016

Re: [ceph-users] Ceph OSDs cause kernel unresponsive

2016-11-24 Thread Craig Chi
Hi Brad,

Thank you for your investigation.

Here are the reasons why we thought the abnormal Ceph behavior was caused by
memory exhaustion. The following link redirects to the dmesg output of a Ceph node
that barely survived: http://pastebin.com/Aa1FDd4K

However, I can not ensure that this is responsible for every kernel hang,
since most of the time we could not retrieve any related logs once the kernel 
became inactive.

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan.

On 2016-11-25 09:46, Craig Chi <craig...@synology.com> wrote:
> Hi Nick,
>   
> I have seen the report before, if I understand correctly, the 
> osd_map_cache_size generally introduces a fixed amount of memory usage. We 
> are using the default value of 200, and a single osd map I got from our 
> cluster is 404KB.
>   
> That is totally 404KB * 200 * 90 (osds) = about 7GB on each node.
>   
> Will the memory consumption generated by this factor become larger when 
> unstably peering or recovering? If not, we still need to find the root cause 
> of why free memory drops without control.
>   
> Does anyone know that what is the relation between filestore or journal 
> configurations and the OSD's memory consumption? Is it possible that the 
> filestore queue or journal queue occupy huge memory pages and cause 
> filesystem cache hard to release (and result in oom)?
>   
> At last, about nobarrier, I fully knew the consequence and is seriously 
> testing on this option. Sincerely appreciate your kindness and useful 
> suggestions.
>   
> Sincerely,
> Craig Chi
> On 2016-11-25 07:23, Brad Hubbard <bhubb...@redhat.com> wrote:
> > Two of these appear to be hung task timeouts and the other is an invalid 
> > opcode.
> > There is no evidence here of memory exhaustion (although it remains to be 
> > seen whether this is a factor but I'd expect to see evidence of shrinker 
> > activity in the stacks) and I would speculate the increased memory 
> > utilisation is due to the issues with the OSD tasks.
> > I would suggest that the next step here is to work out specifically why the 
> > invalid opcode happened and/or why kernel tasks are hanging for>120 seconds.
> > To do that you may need to capture a vmcore and analyse it and/or engage 
> > your kernel support team to investigate further.
> >   
> > On Fri, Nov 25, 2016 at 8:26 AM, Nick 
> > Fisk<n...@fisk.me.uk(mailto:n...@fisk.me.uk)>wrote:
> > >   
> > > There’s a couple of things you can do to reduce memory usage by limiting 
> > > the number of OSD maps each OSD stores, but you will still be pushing up 
> > > against the limits of the ram you have available. There is a Cern 30PB 
> > > test (should be on google) which gives some details on some of the 
> > > settings, but quite a few are no longer relevant in jewel.
> > >   
> > >   
> > >   
> > >   
> > >   
> > > Once other thing, I saw you have nobarrier set on mount options. Please 
> > > please please understand the consequences of this option
> > >   
> > >   
> > >   
> > >   
> > >   
> > > From:ceph-users [mailto:ceph-users-boun...@lists.ceph.com]On Behalf 
> > > OfCraig Chi
> > > Sent:24 November 2016 10:37
> > > To:Nick Fisk<n...@fisk.me.uk(mailto:n...@fisk.me.uk)>
> > > Cc:ceph-users@lists.ceph.com(mailto:ceph-users@lists.ceph.com)
> > > Subject:Re: [ceph-users] Ceph OSDs cause kernel unresponsive
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > > Hi Nick,
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > > Thank you for your helpful information.
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > > I knew that Ceph recommends 1GB/1TB RAM, but we are not going to change 
> > > the hardware architecture now.
> > >   
> > >   
> > >   
> > > Are there any methods to set the resource limit one OSD can consume?
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > > And for your question, we currently set system configuration as:
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > >   
> > > vm.swappiness=10
> > > kernel.pid_max=4194303
> > > fs.file-max=26234859
> > > vm.zone_reclaim_mode=0
> > > vm.vfs_cache_pressure=

Re: [ceph-users] Ceph OSDs cause kernel unresponsive

2016-11-24 Thread Craig Chi
Hi Nick,

I have seen the report before, if I understand correctly, the 
osd_map_cache_size generally introduces a fixed amount of memory usage. We are 
using the default value of 200, and a single osd map I got from our cluster is 
404KB.

That is totally 404KB * 200 * 90 (osds) = about 7GB on each node.

Will the memory consumption generated by this factor become larger when 
unstably peering or recovering? If not, we still need to find the root cause of 
why free memory drops without control.

Does anyone know what the relation is between the filestore or journal 
configurations and the OSD's memory consumption? Is it possible that the 
filestore queue or journal queue occupies huge memory pages and makes the filesystem 
cache hard to release (and results in OOM)?
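
A rough upper bound from our own settings, assuming the queues actually fill:
filestore queue max bytes (1048576000, about 1 GB) + journal queue max bytes (about
1 GB) + ms dispatch throttle bytes (about 1 GB) per OSD is roughly 3 GB per OSD in the
worst case, i.e. up to ~270 GB for 90 OSDs on a 256 GB node. The queues are normally
far from full, but it suggests these limits are worth lowering on this hardware.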

Lastly, about nobarrier: I fully know the consequences and am seriously testing 
this option. I sincerely appreciate your kindness and useful suggestions.

Sincerely,
Craig Chi
On 2016-11-25 07:23, Brad Hubbard <bhubb...@redhat.com> wrote:
> Two of these appear to be hung task timeouts and the other is an invalid 
> opcode.
> There is no evidence here of memory exhaustion (although it remains to be 
> seen whether this is a factor but I'd expect to see evidence of shrinker 
> activity in the stacks) and I would speculate the increased memory 
> utilisation is due to the issues with the OSD tasks.
> I would suggest that the next step here is to work out specifically why the 
> invalid opcode happened and/or why kernel tasks are hanging for>120 seconds.
> To do that you may need to capture a vmcore and analyse it and/or engage your 
> kernel support team to investigate further.
>   
> On Fri, Nov 25, 2016 at 8:26 AM, Nick 
> Fisk<n...@fisk.me.uk(mailto:n...@fisk.me.uk)>wrote:
> >   
> > There’s a couple of things you can do to reduce memory usage by limiting 
> > the number of OSD maps each OSD stores, but you will still be pushing up 
> > against the limits of the ram you have available. There is a Cern 30PB test 
> > (should be on google) which gives some details on some of the settings, but 
> > quite a few are no longer relevant in jewel.
> >   
> >   
> >   
> >   
> >   
> > Once other thing, I saw you have nobarrier set on mount options. Please 
> > please please understand the consequences of this option
> >   
> >   
> >   
> >   
> >   
> > From:ceph-users [mailto:ceph-users-boun...@lists.ceph.com]On Behalf OfCraig 
> > Chi
> > Sent:24 November 2016 10:37
> > To:Nick Fisk<n...@fisk.me.uk(mailto:n...@fisk.me.uk)>
> > Cc:ceph-users@lists.ceph.com(mailto:ceph-users@lists.ceph.com)
> > Subject:Re: [ceph-users] Ceph OSDs cause kernel unresponsive
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> > Hi Nick,
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> > Thank you for your helpful information.
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> > I knew that Ceph recommends 1GB/1TB RAM, but we are not going to change the 
> > hardware architecture now.
> >   
> >   
> >   
> > Are there any methods to set the resource limit one OSD can consume?
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> > And for your question, we currently set system configuration as:
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> > vm.swappiness=10
> > kernel.pid_max=4194303
> > fs.file-max=26234859
> > vm.zone_reclaim_mode=0
> > vm.vfs_cache_pressure=50
> > vm.min_free_kbytes=4194303
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> > I would try to configure vm.min_free_kbytes larger and test.
> >   
> >   
> >   
> > I will be grateful if anyone has the experience of how to tune these values 
> > for Ceph.
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> > Sincerely,
> > Craig Chi
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> >   
> > On 2016-11-24 17:48, Nick 
> > Fisk<n...@fisk.me.uk(mailto:n...@fisk.me.uk)>wrote:
> >   
> >   
> >   
> >   
> > >   
> > > Hi Craig,
> > >   
> > >   
> > >   
> > >   
> > >   
> > > From:ceph-users [mailto:ceph-users-boun...@lists.ceph.com]On Behalf 
> > > OfCraig Chi
> > > Sent:24 November 2016 0

Re: [ceph-users] Ceph OSDs cause kernel unresponsive

2016-11-24 Thread Craig Chi
Hi Nick,

Thank you for your helpful information.

I knew that Ceph recommends 1GB/1TB RAM, but we are not going to change the 
hardware architecture now.
Are there any methods to set the resource limit one OSD can consume?

And for your question, we currently set system configuration as:

vm.swappiness=10
kernel.pid_max=4194303
fs.file-max=26234859
vm.zone_reclaim_mode=0
vm.vfs_cache_pressure=50
vm.min_free_kbytes=4194303

I would try to configure vm.min_free_kbytes larger and test.
I would be grateful if anyone can share experience on how to tune these values for 
Ceph.
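
As a concrete sketch of what I plan to try (16 GB expressed in kB; the current value of
4194303 above is only about 4 GB):

# /etc/sysctl.d/90-ceph-tuning.conf
vm.min_free_kbytes = 16777216

sysctl -p /etc/sysctl.d/90-ceph-tuning.conf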

Sincerely,
Craig Chi

On 2016-11-24 17:48, Nick Fisk <n...@fisk.me.uk> wrote:
>   
> Hi Craig,
>   
>   
>   
>   
>   
> From:ceph-users [mailto:ceph-users-boun...@lists.ceph.com]On Behalf OfCraig 
> Chi
> Sent:24 November 2016 08:34
> To:ceph-users@lists.ceph.com
> Subject:[ceph-users] Ceph OSDs cause kernel unresponsive
>   
>   
>   
>   
>   
>   
>   
> Hi Cephers,
>   
> We have encountered kernel hanging issue on our Ceph cluster. Just 
> like http://imgur.com/a/U2Flz, http://imgur.com/a/lyEko or http://imgur.com/a/IGXdu.
>   
> We believed it is caused by out of memory, because we observed that when OSDs 
> went crazy, the available memory of each node were decreasing rapidly (from 
> 50% available to lower than 10%). Then the node running Ceph OSD became 
> unresponsive with console showing hung_task_timout or slab_out_of_memory, 
> etc. The only thing we can do then is hard reset the unit.
>   
> It is hard to predict when the kernel hanging issue will happen. In my past 
> experiences, it usually happened after a long term benchmark procedure, and 
> followed by a manual trigger like 1) reboot a node 2) restart all OSDs 3) 
> modify CRUSH map.
>   
> Currently the cluster is back to normal, but we want to figure out the root 
> cause to avoid happening again. We think the high values of ceph.conf are 
> pretty suspicous, but without code tracing we are hard to realize the impact 
> of the values and the memory consumption.
>   
> Many thanks if you have any suggestions.
>   
>   
>   
>   
>   
>   
>   
> I think you are probably running out of memory, 90x8TB disks is 720Tb of 
> storage, that will need a lot of ram to run and also the fact that the 
> problems occur when PG’s start moving around after a node failure also 
> suggests this.
>   
>   
>   
>   
>   
> Have you adjusted your vm.vfs_cache_pressure?
>   
>   
>   
>   
>   
> You might also want to try setting vm.min_free_kbytes to 8-16GB to try and 
> keep some memory free and avoid fragmentation.
>   
>   
>   
>   
>   
>   
> =
>   
>   
>   
>   
> Following is our ceph cluster architecture:
>   
> OS: Ubuntu 16.04.1 LTS (4.4.0-31-generic #50-Ubuntu x86_64 GNU/Linux)
> Ceph: Jewel 10.2.3
>   
> 3 Ceph Monitors running on 3 dedicated machines
> 630 Ceph OSDs running on 7 storage machines (each machine has 256GB RAM and 
> 90 units of 8TB hard drives)
>   
> There are 4 pools with following settings:
> vms     512  pg x 3 replica
> images  512  pg x 3 replica
> volumes 8192 pg x 3 replica
> objects 4096 pg x (17,3) erasure code profile
>   
> ==> average 173.92 pgs per OSD
>   
> We tuned our ceph.conf by referencing many performance tuning resources 
> online ( mainly from slide 38 
> of https://goo.gl/Idkh41 )
>   
> [global]
> osd pool default pg num = 4096
> osd pool default pgp num = 4096
> err to syslog = true
> log to syslog = true
> osd pool default size = 3
> max open files = 131072
> fsid = 1c33bf75-e080-4a70-9fd8-860ff216f595
> osd crush chooseleaf type = 1
>   
> [mon.mon1]
> host = mon1
> mon addr = 172.20.1.2
>   
> [mon.mon2]
> host = mon2
> mon addr = 172.20.1.3
>   
> [mon.mon3]
> host = mon3
> mon addr = 172.20.1.4
>   
> [mon]
> mon osd full ratio = 0.85
> mon osd nearfull ratio = 0.7
> mon osd down out interval = 600
> mon osd down out subtree limit = host
> mon allow pool delete

[ceph-users] Ceph OSDs cause kernel unresponsive

2016-11-24 Thread Craig Chi
Hi Cephers,

We have encountered a kernel hanging issue on our Ceph cluster, just
like http://imgur.com/a/U2Flz, http://imgur.com/a/lyEko or http://imgur.com/a/IGXdu.

We believe it is caused by running out of memory, because we observed that when OSDs
went crazy, the available memory of each node was decreasing rapidly (from 50%
available to lower than 10%). Then the node running Ceph OSDs became
unresponsive, with the console showing hung_task_timeout or slab_out_of_memory, etc.
The only thing we can do then is hard reset the unit.

It is hard to predict when the kernel hanging issue will happen. In my past 
experiences, it usually happened after a long term benchmark procedure, and 
followed by a manual trigger like 1) reboot a node 2) restart all OSDs 3) 
modify CRUSH map.

Currently the cluster is back to normal, but we want to figure out the root 
cause to avoid it happening again. We think the high values in our ceph.conf are
pretty suspicious, but without code tracing it is hard for us to gauge the impact of
these values on memory consumption.

Many thanks if you have any suggestions.

=

Following is our ceph cluster architecture:

OS: Ubuntu 16.04.1 LTS (4.4.0-31-generic #50-Ubuntu x86_64 GNU/Linux)
Ceph: Jewel 10.2.3

3 Ceph Monitors running on 3 dedicated machines
630 Ceph OSDs running on 7 storage machines (each machine has 256GB RAM and 90 
units of 8TB hard drives)

There are 4 pools with following settings:
vms     512  pg x 3 replica
images  512  pg x 3 replica
volumes 8192 pg x 3 replica
objects 4096 pg x (17,3) erasure code profile

==> average 173.92 pgs per OSD

We tuned our ceph.conf by referencing many performance tuning resources online 
(mainly from slide 38 of https://goo.gl/Idkh41)

[global]
osd pool default pg num = 4096
osd pool default pgp num = 4096
err to syslog = true
log to syslog = true
osd pool default size = 3
max open files = 131072
fsid = 1c33bf75-e080-4a70-9fd8-860ff216f595
osd crush chooseleaf type = 1

[mon.mon1]
host = mon1
mon addr = 172.20.1.2

[mon.mon2]
host = mon2
mon addr = 172.20.1.3

[mon.mon3]
host = mon3
mon addr = 172.20.1.4

[mon]
mon osd full ratio = 0.85
mon osd nearfull ratio = 0.7
mon osd down out interval = 600
mon osd down out subtree limit = host
mon allow pool delete = true
mon compact on start = true

[osd]
public_network = 172.20.3.1/21
cluster_network = 172.24.0.1/24
osd disk threads = 4
osd mount options xfs = 
rw,noexec,nodev,noatime,nodiratime,nobarrier,inode64,logbsize=256k
osd crush update on start = false
osd op threads = 20
osd mkfs options xfs = -f -i size=2048
osd max write size = 512
osd mkfs type = xfs
osd journal size = 5120
filestore max inline xattrs = 6
filestore queue committing max bytes = 1048576000
filestore queue committing max ops = 5000
filestore queue max bytes = 1048576000
filestore op threads = 32
filestore max inline xattr size = 254
filestore max sync interval = 15
filestore min sync interval = 10
journal max write bytes = 1048576000
journal max write entries = 1000
journal queue max ops = 3000
journal queue max bytes = 1048576000
ms dispatch throttle bytes = 1048576000

Sincerely,
Craig Chi

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD lost parents after rados cppool

2016-11-21 Thread Craig Chi
Hi Jason,

This really did the trick!
I can now rescue my rbds, thank you very much!

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan.

On 2016-11-21 21:44, Jason Dillaman <jdill...@redhat.com> wrote:
> You are correct -- rbd uses the pool id as a reference and now your pool has a new
> id. There was a thread on this mailing list a year ago for the same issue [1].
>
> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001456.html
>
> On Sun, Nov 20, 2016 at 2:19 AM, Craig Chi <craig...@synology.com> wrote:
> > Hi Cephers,
> >
> > I am tuning the pg numbers of my OpenStack pools.
> >
> > As everyone knows, the pg number of a pool can not be decreased, so I came up with
> > an idea to copy my pools to new pools with lower pg_num and then delete the
> > original pool.
> >
> > I execute following commands:
> >
> > rados cppool volumes new-volumes
> > rados cppool images new-images
> > ceph osd pool rm volumes volumes --yes-i-really-really-mean-it
> > ceph osd pool rm images images --yes-i-really-really-mean-it
> > ceph osd pool rename new-volumes volumes
> > ceph osd pool rename new-images images
> >
> > But after that, when I want to query the usage of volumes by `rbd -p volumes du`,
> > it returns a mass of error messages like
> >
> > 2016-11-20 08:01:47.126068 7fa8337fe700 -1 librbd::image::OpenRequest: failed to retreive name: (2) No such file or directory
> > 2016-11-20 08:01:47.126119 7fa832ffd700 -1 librbd::image::RefreshParentRequest: failed to open parent image: (2) No such file or directory
> > 2016-11-20 08:01:47.126135 7fa832ffd700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (2) No such file or directory
> > 2016-11-20 08:01:47.126150 7fa832ffd700 -1 librbd::image::OpenRequest: failed to refresh image: (2) No such file or directory
> >
> > I think it may be caused by the change of "images" pool id, right?
> >
> > Is it possible to re-reference the rbds in "volumes" on new "images" pool?
> > Or is it possible to change or specify the pool id of new pool?
> >
> > Any suggestions are very welcome. Thanks
> >
> > Sincerely,
> > Craig Chi (Product Developer)
> > Synology Inc. Taipei, Taiwan.
> >
> > Sent from Synology MailPlus
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Jason




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD lost parents after rados cppool

2016-11-19 Thread Craig Chi
Hi Cephers,

I am tuning the pg numbers of my OpenStack pools.

As everyone knows, the pg number of a pool can not be decreased, so I came up 
with an idea to copy my pools to new pools with lower pg_num and then delete 
the original pool.

I execute following commands:

rados cppool volumes new-volumes
rados cppool images new-images
ceph osd pool rm volumes volumes --yes-i-really-really-mean-it
ceph osd pool rm images images --yes-i-really-really-mean-it
ceph osd pool rename new-volumes volumes
ceph osd pool rename new-images images

But after that, when I want to query the usage of volumes by `rbd -p volumes 
du`, it returns a mass of error messages like

2016-11-20 08:01:47.126068 7fa8337fe700 -1 librbd::image::OpenRequest: failed 
to retreive name: (2) No such file or directory
2016-11-20 08:01:47.126119 7fa832ffd700 -1 librbd::image::RefreshParentRequest: 
failed to open parent image: (2) No such file or directory
2016-11-20 08:01:47.126135 7fa832ffd700 -1 librbd::image::RefreshRequest: 
failed to refresh parent image: (2) No such file or directory
2016-11-20 08:01:47.126150 7fa832ffd700 -1 librbd::image::OpenRequest: failed 
to refresh image: (2) No such file or directory

I think it may be caused by the change of "images" pool id, right?

Is it possible to re-reference the rbds in "volumes" to the new "images" pool? Or 
is it possible to change or specify the pool id of the new pool?

Any suggestions are very welcome. Thanks
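
For reference, the old and new pool ids can be compared directly (output format as in
`ceph osd pool ls detail`, where each line starts with pool <id> '<name>'):

ceph osd pool ls detail | grep -E "'(images|volumes)'"

After rados cppool + rename, the name "images" points at a new pool id, which is what I
suspect the clone metadata in "volumes" still references.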

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how possible is that ceph cluster crash

2016-11-18 Thread Craig Chi
Hi Nick and other Cephers,

Thanks for your reply.
>2) Config Errors>This can be an easy one to say you are safe from. But I would 
>say most outages and data loss incidents I have seen on the mailing>lists have 
>been due to poor hardware choice or configuring options such as size=2, 
>min_size=1 or enabling stuff like nobarriers.

I am wondering about the pros and cons of the nobarrier option when used with Ceph.

It is well known that nobarrier is dangerous when a power outage happens, but if 
we already have replicas in different racks or PDUs, will Ceph reduce the risk 
of data loss with this option?

I have seen many performance tuning articles recommending the nobarrier option for xfs,
but not many of them mention the trade-offs of nobarrier.

Is it really unacceptable to use nobarrier in a production environment? I will be 
very grateful if you are willing to share any experiences about nobarrier 
and xfs.

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan. Ext. 361

On 2016-11-17 05:04, Nick Fisk<n...@fisk.me.uk>wrote:
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Pedro Benites
> > Sent: 16 November 2016 17:51
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] how possible is that ceph cluster crash
> >
> > Hi,
> >
> > I have a ceph cluster with 50 TB, with 15 osds, it is working fine for one year and I would like to grow it and migrate all my old storage, about 100 TB to ceph, but I have a doubt. How possible is that the cluster fail and everything went very bad?
>
> Everything is possible, I think there are 3 main risks
>
> 1) Hardware failure
> I would say Ceph is probably one of the safest options in regards to hardware failures, certainly if you start using 4TB+ disks.
>
> 2) Config Errors
> This can be an easy one to say you are safe from. But I would say most outages and data loss incidents I have seen on the mailing lists have been due to poor hardware choice or configuring options such as size=2, min_size=1 or enabling stuff like nobarriers.
>
> 3) Ceph Bugs
> Probably the rarest, but potentially the most scary as you have less control. They do happen and it's something to be aware of
>
> > How reliable is ceph? What is the risk about lose my data.? is necessary backup my data?
>
> Yes, always backup your data, no matter solution you use. Just like RAID != Backup, neither does ceph.
>
> > Regards.
> > Pedro.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon not starting on system startup (Ubuntu 16.04 / systemd)

2016-11-15 Thread Craig Chi
Hi,

You can try to manually fix this by adding the 
/lib/systemd/system/ceph-mon.target file, which contains:
===
[Unit]
Description=ceph target allowing to start/stop all ceph-mon@.service instances 
at once
PartOf=ceph.target
[Install]
WantedBy=multi-user.target ceph.target
===

and then execute the following command to tell systemd to start this target on 
bootup
systemctl enable ceph-mon.target

ceph-osd can be fixed with the same trick, as sketched below.
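
A sketch of the analogous /lib/systemd/system/ceph-osd.target, following the 
same pattern (please verify against what your ceph-osd package actually ships):
===
[Unit]
Description=ceph target allowing to start/stop all ceph-osd@.service instances at once
PartOf=ceph.target
[Install]
WantedBy=multi-user.target ceph.target
===

and enable it with:
systemctl enable ceph-osd.target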

Alternatively, you can manage the apt install repository as described in 
http://docs.ceph.com/docs/jewel/start/quick-start-preflight/#advanced-package-tool-apt
If you have a ceph-mon deb with /lib/systemd/system/ceph-mon.target installed, 
you can start ceph-mon automatically on bootup.

$ dpkg -c ceph-mon_10.2.2-1xenial_amd64.deb | grep ceph-mon.target
-rw-r--r-- root/root       162 2016-06-14 20:22 ./lib/systemd/system/ceph-mon.target

I would recommend the latter solution.

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan.

On 2016-11-15 18:33, Matthew Vernon<m...@sanger.ac.uk>wrote:
> Hi,
>
> On 15/11/16 01:27, Craig Chi wrote:
> > What's your Ceph version?
> > I am using Jewel 10.2.3 and systemd seems to work normally. I deployed Ceph by ansible, too.
>
> The version in Ubuntu 16.04, which is 10.2.2-0ubuntu0.16.04.2
>
> > You can check whether you have /lib/systemd/system/ceph-mon.target file.
> > I believe it was a bug existing in 10.2.1 before cfa2d0a08a0bcd0fac153041b9eff17cb6f7c9af has been merged.
>
> No, I have the following:
> /lib/systemd/system/ceph-create-keys.service
> /lib/systemd/system/ceph-create-keys@.service
> /lib/systemd/system/ceph-disk@.service
> /lib/systemd/system/ceph-mon.service
> /lib/systemd/system/ceph-mon@.service
> /lib/systemd/system/ceph-osd@.service
> /lib/systemd/system/ceph.target
>
> [so no ceph-osd.service ; ceph-osd@.service says its part of ceph-osd.target which I can't see defined anywhere explicitly]
>
> Also /etc/systemd/system/ceph-mon.target.wants (contains a link to ceph-mon@hostname.service) and ...ceph-osd.target.wants (which contains links to the ceph-osd services)
>
> ceph-mon.service says PartOf ceph.target.
>
> Regards,
> Matthew
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon not starting on system startup (Ubuntu 16.04 / systemd)

2016-11-14 Thread Craig Chi
Hi,

What's your Ceph version?
I am using Jewel 10.2.3 and systemd seems to work normally. I deployed Ceph by 
ansible, too.

You can check whether you have the /lib/systemd/system/ceph-mon.target file.
I believe it was a bug that existed in 10.2.1, before 
cfa2d0a08a0bcd0fac153041b9eff17cb6f7c9af was merged.
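
A quick way to check this, assuming a deb-based install (package and unit names 
as above):

# does the installed package ship the target unit?
dpkg -L ceph-mon | grep ceph-mon.target
# is the target known to and enabled in systemd?
systemctl is-enabled ceph-mon.target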

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan.

On 2016-11-15 01:32, David Turner<david.tur...@storagecraft.com>wrote:
>   
>   
>   
> I had to set my mons to sysvinit while my osds are systemd. That allows 
> everything to start up when my system boots. I don't know why the osds don't 
> work with sysvinit and the mon doesn't work with systemd... but that worked 
> to get me running.
>   
>   
>   
>   
> David Turner | Cloud Operations Engineer | StorageCraft Technology 
> Corporation (https://storagecraft.com)
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>   
>   
>   
>   
>   
>   
> If you are not the intended recipient of this message or received it 
> erroneously, please notify the sender and delete it, together with any 
> attachments, and be advised that any dissemination or copying of this message 
> is prohibited.
>   
>   
>   
>   
>   
>   
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Matthew 
> Vernon [m...@sanger.ac.uk]
> Sent: Monday, November 14, 2016 9:44 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] ceph-mon not starting on system startup (Ubuntu 16.04/ 
> systemd)
>   
> Hi,
>   
> I have a problem that my ceph-mon isn't getting started when my machine
> boots; the OSDs start up just fine. Checking logs, there's no sign of
> systemd making any attempt to start it, although it is seemingly enabled:
>   
> root@sto-1-1:~# systemctl status ceph-mon@sto-1-1
> ● ceph-mon@sto-1-1.service - Ceph cluster monitor daemon
> Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled;
> vendor preset
> Active: inactive (dead)
>   
> I see a thread on this issue in the list archives from May, but no sign
> of what the eventual solution was...
>   
> If it matters, I'm deploying Jewel using ceph-ansible (
> https://github.com/ceph/ceph-ansible ); that does (amongst other things)
> systemctl enable ceph-mon@sto-1-1
>   
> Thanks,
>   
> Matthew
>   
>   
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___ ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com