Re: [ceph-users] osd crash and high server load - ceph-osd crashes with stacktrace
We've upgraded ceph to 0.94.4 and the kernel to 3.16.0-51-generic, but the problem still persists. Lately we see these crashes on a daily basis. I'm leaning toward the conclusion that this is a software problem - this hardware ran stable before, and we're seeing all four nodes crash randomly with the same messages in the logs. I'm wondering whether this could be flashcache related - nothing else comes to mind. Can anyone look at the logs and help?

ceph-osd log: http://pastebin.com/AGGtvHr2
kernel log: http://pastebin.com/jVSa8eme

J

On 10/09/2015 09:15 AM, Jacek Jarosiewicz wrote:

Hi,

We've noticed a problem with our cluster setup:

4 x OSD nodes:
E5-1630 CPU
32 GB RAM
Mellanox MT27520 56Gbps network cards
SATA controller LSI Logic SAS3008
Storage nodes are connected to two SuperMicro chassis: 847E1C-R1K28JBOD
Each node has 2-3 spinning OSDs (6TB drives) and 2 SSD drives (240GB Intel DC S3710) for journal and cache
3 monitors running on the OSD nodes
ceph hammer 0.94.3
Ubuntu 14.04
standard replicated pools with size 2 (min_size 1)
40GB journal per OSD on the SSD drives, 40GB flashcache per OSD

Everything seems to work fine, but every few days or so one of the nodes (not always the same node - a different node each time) gets very high load, becomes inaccessible and needs to be rebooted.

After reboot we can start the OSDs and the cluster returns to HEALTH_OK pretty quickly.

After looking into the logfiles this seems to be related to the ceph-osd processes (links to the logs are at the bottom of this msg).

The cluster is a test setup - not used in production - and at the time the ceph-osd processes crash the cluster isn't doing anything.

Any help would be appreciated.

ceph-osd log: http://pastebin.com/AGGtvHr2
kernel log: http://pastebin.com/jVSa8eme

J

--
Jacek Jarosiewicz
IT Systems Administrator
SUPERMEDIA Sp. z o.o., ul. Senatorska 13/15, 00-075 Warszawa
http://www.supermedia.pl
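To rule flashcache in or out, a first step is to confirm which block devices sit behind a flashcache device-mapper target and to watch its counters and the kernel log while the load builds. A rough sketch - the /proc path assumes the stock flashcache module, and the cache names are whatever your setup uses:

# flashcache devices show up as device-mapper targets of type "flashcache"
dmsetup table | grep -i flashcache
# confirm the module is loaded and note its version
lsmod | grep flashcache
# per-cache counters exported by the module
cat /proc/flashcache/*/flashcache_stats
# look for flashcache/XFS/hung-task messages around the time the load climbs
dmesg -T | grep -iE 'flashcache|xfs|blocked for more than' | tail -n 50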
Re: [ceph-users] Question about hardware and CPU selection
Hello,

There are of course a number of threads in the ML archives about things like this.

On Sat, 24 Oct 2015 17:48:35 +0200 Mike Miller wrote:

> Hi,
>
> as I am planning to set up a ceph cluster with 6 OSD nodes with 10
> harddisks in each node, could you please give me some advice about
> hardware selection? CPU? RAM?
> I am planning a 10 GBit/s public and a separate 10 GBit/s private
> network.

If I read this correctly your OSDs are entirely HDD based (no journal SSDs). In that case you'll be lucky to see writes faster than 750MB/s, meaning your split network is wasted.

IMHO a split cluster/public network only makes sense if you can actually saturate either link, if not both. In your case a redundant (LACP) setup would be much more beneficial, unless your use case is vastly skewed to reads from hot (in page cache) objects.

As for CPU, pure HDD OSDs will do well with about 1GHz per OSD; the more small write I/Os you have, the more power you need. For OSDs with journal SSDs my rule of thumb is at least 2GHz, for purely SSD based OSDs whatever you can afford.

2GB RAM per OSD is generally sufficient, however more is definitely better in my book. This is especially true when you have hot (read) objects that may get evicted from local (in-VM) page caches, but still fit comfortably in the distributed page caches of your OSD nodes.

Regards,

Christian

> For a smaller test cluster with 5 OSD nodes and 4 harddisks each, 2
> GBit/s public and 4 GBit/s private network, I already tested this using
> core i5 boxes with 16GB RAM installed. In most of my test scenarios
> including load, node failure, backfilling, etc. the CPU usage was not
> at all the bottleneck, with a maximum of about 25% load per core. The
> private network was also far from being fully loaded.
>
> It would be really great to get some advice about hardware choices for
> my newly planned setup.
>
> Thanks very much and regards,
>
> Mike

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
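As a back-of-the-envelope check on the write-bandwidth point above, per OSD node (the per-disk figure is an assumption to adjust for the actual drives):

#!/bin/bash
# Rough write ceiling for one node with 10 HDD OSDs and collocated journals.
osds=10
hdd_mb_s=150        # assumed sequential write speed of a single 7.2k rpm HDD
journal_penalty=2   # journal and data share the same spindle
echo "per-node OSD write ceiling: ~$(( osds * hdd_mb_s / journal_penalty )) MB/s"   # ~750 MB/s
echo "single 10GbE link:          ~1250 MB/s"
# The ceiling sits below one 10GbE link, which is one (rough) way to see why a
# separate cluster network buys little here and bonding (LACP) the two ports on
# one network is usually the better use of the hardware.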
[ceph-users] 2-Node Cluster - possible scenario?
Hi,

In a little project of mine I plan to start a ceph storage cluster with a small setup and to be able to scale it up later. Perhaps someone can tell me whether the following would work (two nodes with OSDs, a third node with a monitor only):

- 2 nodes (enough RAM + CPU), 6 x 3TB harddisks for OSDs -> 9TB usable space in case of 3x redundancy, 1 monitor on each of the nodes
- 1 extra node that has no OSDs but runs a third monitor
- 10GBit Ethernet as storage backbone

Later I may add more nodes + OSDs to expand the cluster in case more storage / performance is needed.

Would this work / be stable? Or do I need to spread my OSDs over 3 ceph nodes (e.g. in order to achieve quorum)? In case one of the two OSD nodes fails, would the storage still be accessible?

The setup should be used for RBD/QEMU only, no cephfs or the like.

Any hints are appreciated!

Best Regards,
Hermann

--
herm...@qwer.tk
PGP/GPG: 299893C7 (on keyservers)
[ceph-users] locked up cluster while recovering OSD
Hi,

we have a Ceph cluster with:

- 12 OSDs on 6 physical nodes, 64 GB RAM each
- each OSD has a 6 TB spinning disk and a 10GB journal in RAM (tmpfs) [1]
- 3 redundant copies
- 25% space usage so far
- ceph 0.94.2
- data stored via radosgw, using sharded bucket indexes (64 shards)
- 500 PGs per node (as we are planning on scaling the number of nodes without adding more pools in the future)

We currently have a constant write load (about 60 PUTs per second of small objects, usually a few KB, but sometimes they can go up to a few MB).

If I restart an OSD, it seems that most operations get stuck for up to multiple minutes until the OSD is done recovering. (noout is set, but I understand it does not matter because the OSD is down for less than 5 minutes.)

Most of the "slow operation" messages had the following reasons:
- currently waiting for rw locks
- currently waiting for missing object
- currently waiting for degraded object

And were:
- [call rgw.bucket_prepare_op] ... ondisk+write+known_if_redirected
- [call rgw.bucket_complete_op] ... ondisk+write+known_if_redirected

operating mostly on the bucket index shard objects.

The monitors and gateways look completely unloaded. On the other hand, the IO on the OSDs looks very intense (average disk write completion time is 300 ms) and the disk IO utilization is around 50%. It looks to me like the storage layer needs to be improved (RAID controller with a big write-back cache maybe?). However, I do not understand exactly what is going wrong here. I would expect that operations keep being served as before, either writing to the primary PG or to the replica, and that the PGs would recover in the background.

Do you have any ideas? What path would you follow to understand what the problem is? I am happy to provide more logs if that helps.

Thanks in advance for any help,
Ludovico

[1] We had to disable filestore_fadvise, otherwise two threads per OSD would get stuck at 100% CPU moving pages from RAM (presumably the journal) to swap.
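One knob worth trying while investigating: throttle recovery so client ops on the bucket index shards are not starved while the restarted OSD catches up. A sketch with deliberately conservative values (not tuned for this cluster):

# lower recovery/backfill aggressiveness on all OSDs at runtime
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# if it helps, persist it in the [osd] section of ceph.conf:
#   osd max backfills = 1
#   osd recovery max active = 1
#   osd recovery op priority = 1
# then watch whether slow requests still pile up during an OSD restart
ceph -w | grep -iE 'slow request|recover'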
Re: [ceph-users] 2-Node Cluster - possible scenario?
Quorum can be achieved with one monitor node (for testing purposes this would be OK, but of course it is a single point of failure). However, the default for the OSD nodes is three-way replication (this can be changed), so it is easier to set up three OSD nodes to start with, plus one monitor node.

For your case the monitor node would not need to be very powerful, and a lower-spec system could be used, allowing your previously suggested mon node to be used instead as a third OSD node.

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Hermann Himmelbauer
Sent: Monday, October 26, 2015 12:17 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] 2-Node Cluster - possible scenario?

Hi,

In a little project of mine I plan to start ceph storage with a small setup and to be able to scale it up later. Perhaps someone can give me any advice if the following (two nodes with OSDs, third node with Monitor only):

- 2 Nodes (enough RAM + CPU), 6*3TB Harddisk for OSDs -> 9TB usable space in case of 3* redundancy, 1 Monitor on each of the nodes
- 1 extra node that has no OSDs but runs a third monitor.
- 10GBit Ethernet as storage backbone

Later I may add more nodes + OSDs to expand the cluster in case more storage / performance is needed.

Would this work / be stable? Or do I need to spread my OSDs to 3 ceph nodes (e.g. in order to achieve quorum). In case one of the two OSD nodes fail, would the storage still be accessible?

The setup should be used for RBD/QEMU only, no cephfs or the like.

Any hints are appreciated!

Best Regards,
Hermann

--
herm...@qwer.tk
PGP/GPG: 299893C7 (on keyservers)
[ceph-users] PG won't stay clean
I have a 0.94.4 cluster where, when I repair/deep-scrub a PG, it comes back clean, but as soon as I restart any OSD that hosts it, it goes back to inconsistent. If I deep-scrub that PG it clears up.

I determined that the bad copy was not on the primary and issued a pg repair command. I have shut down and deleted the PG folder on each OSD in turn and let it backfill. I tried taking the primary OSD down and issuing a repair command then. I took an md5sum of all files in the PG directory and compared all files across the OSDs and it came back clean. I shut down each OSD in turn and removed any PG_TEMP directories. I'm just not sure why the cluster is so confused as to the status of this PG.

Any ideas?

-
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
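A rough checklist of the commands for pinning down which replica and object the cluster is unhappy about on 0.94.x (the PG id 9.e3 and the log path below are placeholders):

# which PGs are inconsistent and which OSDs hold them
ceph health detail | grep inconsistent
# peering and scrub state of the suspect PG
ceph pg 9.e3 query | less
# re-run a deep-scrub and pull the exact error out of the primary's log
ceph pg deep-scrub 9.e3
grep -E 'scrub.*(error|inconsistent|mismatch)' /var/log/ceph/ceph-osd.*.log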
Re: [ceph-users] osd crash and high server load - ceph-osd crashes with stacktrace
----- Original Message -----
> From: "Jacek Jarosiewicz"
> To: ceph-users@lists.ceph.com
> Sent: Sunday, 25 October, 2015 8:48:59 PM
> Subject: Re: [ceph-users] osd crash and high server load - ceph-osd crashes
> with stacktrace
>
> We've upgraded ceph to 0.94.4 and the kernel to 3.16.0-51-generic,
> but the problem still persists. Lately we see these crashes on a daily
> basis. I'm leaning toward the conclusion that this is a software problem
> - this hardware ran stable before and we're seeing all four nodes crash
> randomly with the same messages in the log. I'm wondering whether this
> could be flashcache related - nothing else comes to mind.
>
> Can anyone look at the logs and help?
>
> ceph-osd log: http://pastebin.com/AGGtvHr2
> kernel log: http://pastebin.com/jVSa8eme

I'd suggest you focus on why the kernel threads are going into D-state (uninterruptible sleep), since that should probably be addressed first. Ceph is a userspace application, so it should not be able to "hang" the kernel. XFS filesystem code or the underlying storage appears to be implicated here, but it could be something else. The hung kernel threads are waiting for something - we need to work out what that is. It is likely Ceph is just triggering this problem.

Cheers,
Brad

> J
>
> On 10/09/2015 09:15 AM, Jacek Jarosiewicz wrote:
> > Hi,
> >
> > We've noticed a problem with our cluster setup:
> >
> > 4 x OSD nodes:
> > E5-1630 CPU
> > 32 GB RAM
> > Mellanox MT27520 56Gbps network cards
> > SATA controller LSI Logic SAS3008
> > Storage nodes are connected to two SuperMicro chassis: 847E1C-R1K28JBOD
> > Each node has 2-3 spinning OSDs (6TB drives) and 2 SSD drives (240GB
> > Intel DC S3710 drives) for journal and cache
> > 3 monitors running on OSD nodes
> > ceph hammer 0.94.3
> > Ubuntu 14.04
> > standard replicated pools with size 2 (min_size 1)
> > 40GB journal per osd on SSD drives, 40GB flashcache per osd.
> >
> > Everything seems to work fine, but every few days or so one of the nodes
> > (not always the same node - different nodes each time) gets very high
> > load, becomes inaccessible and needs to be rebooted.
> >
> > After reboot we can start osd's and the cluster returns to HEALTH_OK
> > state pretty quickly.
> >
> > After looking into logfiles this seems to be related to ceph-osd
> > processes (links to the logs are at the bottom of this msg).
> >
> > The cluster is a test setup - not used in production and at the time the
> > ceph-osd processes crash the cluster isn't doing anything.
> >
> > Any help would be appreciated.
> >
> > ceph-osd log: http://pastebin.com/AGGtvHr2
> > kernel log: http://pastebin.com/jVSa8eme
> >
> > J
>
> --
> Jacek Jarosiewicz
> IT Systems Administrator
> SUPERMEDIA Sp. z o.o.
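A rough sketch of how to see what the D-state threads are actually blocked on the next time a node locks up (run as root on the affected node; the PID is a placeholder):

# list tasks stuck in uninterruptible sleep (state D) and what they wait on
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
# kernel stack of one stuck task
cat /proc/12345/stack
# or dump the stacks of all blocked tasks to the kernel log in one go
echo w > /proc/sysrq-trigger
dmesg -T | tail -n 200
# make khungtaskd report hung tasks sooner while reproducing
echo 60 > /proc/sys/kernel/hung_task_timeout_secs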
Re: [ceph-users] 2-Node Cluster - possible scenario?
Hello,

On Sun, 25 Oct 2015 16:17:02 +0100 Hermann Himmelbauer wrote:

> Hi,
> In a little project of mine I plan to start ceph storage with a small
> setup and to be able to scale it up later. Perhaps someone can give me
> any advice if the following (two nodes with OSDs, third node with
> Monitor only):
>
> - 2 Nodes (enough RAM + CPU), 6*3TB Harddisk for OSDs -> 9TB usable
> space in case of 3* redundancy, 1 Monitor on each of the nodes

Just for the record, a monitor will be happy with 2GB RAM and 2GHz of CPU (more is better), but it does a LOT of time-critical writes, so running it on decent (also in the endurance sense) SSDs is recommended. Once you have SSDs in the game, using them for Ceph journals comes naturally.

Keep in mind that while you certainly can improve performance by just adding more OSDs later on, SSD journals are such a significant improvement when it comes to writes that you may want to consider them.

> - 1 extra node that has no OSDs but runs a third monitor.

Ceph uses the MON with the lowest IP address as leader, which is busier (sometimes a lot more so) than the other MONs. Plan your nodes with that in mind.

> - 10GBit Ethernet as storage backbone

Good for lower latency. I assume "storage backbone" means a single network (the "public" network in Ceph speak). Having 10GB for the Ceph private network in your case would be a bit of a waste, though.

> Later I may add more nodes + OSDs to expand the cluster in case more
> storage / performance is needed.
>
> Would this work / be stable? Or do I need to spread my OSDs to 3 ceph
> nodes (e.g. in order to achieve quorum). In case one of the two OSD
> nodes fails, would the storage still be accessible?

A monitor quorum of 3 is fine; OSDs don't enter that picture.

However 3 OSD storage nodes are highly advised, because with non-SSD journals your HDD OSD performance will already be low. It also saves you from having to deal with a custom CRUSH map.

As for accessibility, yes, in theory. I certainly have tested this with a 2 storage node cluster and a replication of 2 (min_size 1). With this setup (custom CRUSH map) you will need a min_size of 1 as well. So again, 3 storage nodes will give you a lot less headaches.

> The setup should be used for RBD/QEMU only, no cephfs or the like.

Depending on what these VMs do and the amount of them, see my comments about performance.

Christian

> Any hints are appreciated!
>
> Best Regards,
> Hermann

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
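For reference, the pool settings that go with the two-OSD-node variant discussed above, plus a quick quorum check; the pool name "rbd" is a placeholder for whatever pool the VMs use:

# two copies, and keep serving I/O with a single copy while one node is down
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1
# with three storage nodes the defaults (size 3, min_size 2) can stay as they are
# verify the three monitors have formed a quorum
ceph quorum_status --format json-pretty
ceph mon stat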
Re: [ceph-users] PG won't stay clean
I set debug_osd = 20/20 and restarted the primary OSD. The logs are at http://162.144.87.113/files/ceph-osd.110.log.xz . The PG in question is 9.e3, and it is one of 15 that have this same behavior. The cluster is currently idle.

-
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Sun, Oct 25, 2015 at 5:51 PM, Robert LeBlanc wrote:
> I have a 0.94.4 cluster that when I repair/deep-scrub a PG, it comes
> back clean, but as soon as I restart any OSD that hosts it, it goes
> back to inconsistent. If I deep-scrub that PG it clears up.
>
> I determined that the bad copy was not on the primary and issued a pg
> repair command. I have shut down and deleted the PG folder on each OSD
> in turn and let it backfill. I tried taking the primary OSD down and
> issuing a repair command then. I took an md5sum of all files in the PG
> directory and compared all files across the OSDs and it came back
> clean. I shut down each OSD in turn and removed any PG_TEMP
> directories. I'm just not sure why the cluster is so confused as to
> the status of this PG.
>
> Any ideas?
>
> -
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
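If raising and lowering the log levels becomes routine, it can also be done at runtime instead of editing ceph.conf and restarting; a sketch (osd.110 as in the log above; the filestore level and the 0/5 default are assumptions to double-check against your own config):

# raise logging on a running OSD
ceph tell osd.110 injectargs '--debug_osd 20/20 --debug_filestore 10/10'
# ... restart the other OSD / deep-scrub the PG to reproduce ...
# drop the levels back afterwards so the log doesn't fill the disk
ceph tell osd.110 injectargs '--debug_osd 0/5 --debug_filestore 1/3'
# the same works through the admin socket on the OSD's host
ceph daemon osd.110 config set debug_osd 20/20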
Re: [ceph-users] [performance] rbd kernel module versus qemu librbd
Hi,

New information: I think the poor performance is due to too many threads in the qemu-system-x86 process. In the normal case it uses about 200 threads; in the abnormal case it uses about 400 or even 700 threads, and the performance ranks: 200 threads > 400 threads > 700 threads.

So my guess now is that the performance drop is due to competition between the threads. As you can see, I pasted the perf record earlier. The problem has really got us stuck. Does anyone know why the number of qemu-system-x86 threads keeps increasing? And is there any way we could control it?

Thanks!

hzwuli...@gmail.com

From: hzwuli...@gmail.com
Date: 2015-10-23 13:15
To: Alexandre DERUMIER
CC: ceph-users
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Yeah, you are right. Testing the rbd volume from the host is fine. Now, at least we can confirm it's a qemu or kvm problem, not ceph.

hzwuli...@gmail.com

From: Alexandre DERUMIER
Date: 2015-10-23 12:51
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

>> Anyway, i could try to collect something, maybe there are some clues.

And you don't have a problem reading/writing to this rbd from the host with fio-rbd? (try a full read of the rbd volume, for example)

----- Mail original -----
De: hzwuli...@gmail.com
À: "aderumier"
Cc: "ceph-users"
Envoyé: Vendredi 23 Octobre 2015 06:42:41
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Oh, no. From the phenomenon, IO in the VM is waiting for the host to complete it. The CPU wait in the VM is very high. Anyway, I could try to collect something, maybe there are some clues.

hzwuli...@gmail.com

From: Alexandre DERUMIER
Date: 2015-10-23 12:39
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Have you tried to use perf inside the faulty guest too?

----- Mail original -----
De: hzwuli...@gmail.com
À: "aderumier"
Cc: "ceph-users"
Envoyé: Vendredi 23 Octobre 2015 06:15:07
Objet: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

btw, we used perf to track the process qemu-system-x86 (15801), and there is an abnormal function:

Samples: 1M of event 'cycles', Event count (approx.): 1057109744252
- 75.23% qemu-system-x86 [kernel.kallsyms] [k] do_raw_spin_lock
  - do_raw_spin_lock
    + 54.44% 0x7fc79fc769d9
    + 45.31% 0x7fc79fc769ab

So, maybe it's a kvm problem?

hzwuli...@gmail.com

From: hzwuli...@gmail.com
Date: 2015-10-23 11:54
To: Alexandre DERUMIER
CC: ceph-users
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi, list

We are still stuck on this problem. When it happens, the CPU usage of qemu-system-x86 is very high (1420%):

15801 libvirt- 20 0 33.7g 1.4g 11m R 1420 0.6 1322:26 qemu-system-x86

qemu-system-x86 process 15801 is responsible for the VM. Has anyone else ever run into this problem?

hzwuli...@gmail.com

From: hzwuli...@gmail.com
Date: 2015-10-22 10:15
To: Alexandre DERUMIER
CC: ceph-users
Subject: Re: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi,

Sure, all those could help, but not by much :-) Now we find it's a VM problem; CPU on the host is very high. Creating a new VM works around the problem, but we don't know why so far. Here is the detailed version info:

Compiled against library: libvirt 1.2.9
Using library: libvirt 1.2.9
Using API: QEMU 1.2.9
Running hypervisor: QEMU 2.1.2

Are there any already known bugs in those versions?

Thanks!

hzwuli...@gmail.com

From: Alexandre DERUMIER
Date: 2015-10-21 18:38
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

here is a libvirt sample to enable iothreads: 2

With this, you can scale with multiple disks. (but it should help a little bit with 1 disk too)

----- Mail original -----
De: hzwuli...@gmail.com
À: "aderumier"
Cc: "ceph-users"
Envoyé: Mercredi 21 Octobre 2015 10:31:56
Objet: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Hi, let me post the version and configuration here first.

host os: debian 7.8, kernel: 3.10.45
guest os: debian 7.8, kernel: 3.2.0-4

qemu version:
ii ipxe-qemu 1.0.0+git-2013.c3d1e78-2.1~bpo70+1 all PXE boot firmware - ROM images for qemu
ii qemu-kvm 1:2.1+dfsg-12~bpo70+1 amd64 QEMU Full virtualization on x86 hardware
ii qemu-system-common 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-x86 1:2.1+dfsg-12~bpo70+1 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:2.1+dfsg-12~bpo70+1 amd64 QEMU utilities

vm config: ***

Thanks!

hzwuli...@gmail.com

From: Alexandre DERUMIER
Date: 2015-10-21 14:01
To: hzwulibin
CC: ceph-users
Subject: Re: [ceph-users] [performance] rbd kernel module versus qemu librbd

Damn, that's a huge difference. What is your ...
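A quick way to see which VM is growing threads, and what kind of threads they are, is to look at the qemu process on the host; a rough sketch (PID 15801 from the ps output above is used as a placeholder):

# thread count per qemu process
ps -eo pid,nlwp,pcpu,comm | grep qemu-system-x86
# thread count of one specific VM over time
watch -n 10 'ls /proc/15801/task | wc -l'
# break the threads down by name and CPU usage (vCPU threads vs librbd/librados workers)
ps -L -p 15801 -o tid,pcpu,comm --sort=-pcpu | head -n 20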
Re: [ceph-users] when an osd is started up, IO will be blocked
Hi all,

When an OSD is started, I get a lot of slow requests in the corresponding OSD log, as follows:

2015-10-26 03:42:51.593961 osd.4 [WRN] slow request 3.967808 seconds old, received at 2015-10-26 03:42:47.625968: osd_repop(client.2682003.0:2686048 43.fcf d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347845) currently commit_sent
2015-10-26 03:42:51.593964 osd.4 [WRN] slow request 3.964537 seconds old, received at 2015-10-26 03:42:47.629239: osd_repop(client.2682003.0:2686049 43.b4b cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 9744'193029) currently commit_sent
2015-10-26 03:42:52.594166 osd.4 [WRN] 40 slow requests, 17 included below; oldest blocked for > 53.692556 secs
2015-10-26 03:42:52.594172 osd.4 [WRN] slow request 2.272928 seconds old, received at 2015-10-26 03:42:50.321151: osd_repop(client.3684690.0:191908 43.540 f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 9744'63645) currently commit_sent
2015-10-26 03:42:52.594175 osd.4 [WRN] slow request 2.270618 seconds old, received at 2015-10-26 03:42:50.323461: osd_op(client.3684690.0:191911 rbd_data.1fc5ca7429fc17.0209 [write 2633728~4096] 43.72b9f039 ack+ondisk+write e9744) currently commit_sent
2015-10-26 03:42:52.594264 osd.4 [WRN] slow request 4.968252 seconds old, received at 2015-10-26 03:42:47.625828: osd_repop(client.2682003.0:2686047 43.b4b cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 9744'193028) currently commit_sent
2015-10-26 03:42:52.594266 osd.4 [WRN] slow request 4.968111 seconds old, received at 2015-10-26 03:42:47.625968: osd_repop(client.2682003.0:2686048 43.fcf d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347845) currently commit_sent
2015-10-26 03:42:52.594318 osd.4 [WRN] slow request 4.964841 seconds old, received at 2015-10-26 03:42:47.629239: osd_repop(client.2682003.0:2686049 43.b4b cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 9744'193029) currently commit_sent
2015-10-26 03:42:53.594527 osd.4 [WRN] 40 slow requests, 16 included below; oldest blocked for > 54.692945 secs
2015-10-26 03:42:53.594533 osd.4 [WRN] slow request 16.004669 seconds old, received at 2015-10-26 03:42:37.589800: osd_repop(client.2682003.0:2686041 43.b4b cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 9744'193024) currently commit_sent
2015-10-26 03:42:53.594536 osd.4 [WRN] slow request 16.003889 seconds old, received at 2015-10-26 03:42:37.590580: osd_repop(client.2682003.0:2686040 43.fcf d1ddfcf/rbd_data.196483222ac2db.0010/head//43 v 9744'347842) currently commit_sent
2015-10-26 03:42:53.594538 osd.4 [WRN] slow request 16.000954 seconds old, received at 2015-10-26 03:42:37.593515: osd_repop(client.2682003.0:2686042 43.b4b cbcbbb4b/rbd_data.196483222ac2db.020b/head//43 v 9744'193025) currently commit_sent
2015-10-26 03:42:53.594541 osd.4 [WRN] slow request 29.138828 seconds old, received at 2015-10-26 03:42:24.455641: osd_repop(client.4764855.0:65121 43.dbe 169a9dbe/rbd_data.49a7a4633ac0b1.0021/head//43 v 9744'12509) currently commit_sent
2015-10-26 03:42:53.594543 osd.4 [WRN] slow request 15.998814 seconds old, received at 2015-10-26 03:42:37.595656: osd_repop(client.1800547.0:1205399 43.cc5 9285ecc5/rbd_data.1b794560c6e2ea.00d0/head//43 v 9744'36732) currently commit_sent
2015-10-26 03:42:54.594892 osd.4 [WRN] 39 slow requests, 17 included below; oldest blocked for > 55.693227 secs
2015-10-26 03:42:54.594908 osd.4 [WRN] slow request 4.273600 seconds old, received at 2015-10-26 03:42:50.321151: osd_repop(client.3684690.0:191908 43.540 f1858540/rbd_data.1fc5ca7429fc17.0280/head//43 v 9744'63645) currently commit_sent
2015-10-26 03:42:54.594911 osd.4 [WRN] slow request 4.271290 seconds old, received at 2015-10-26 03:42:50.323461: osd_op(client.3684690.0:191911 rbd_data.1fc5ca7429fc17.0209 [write 2633728~4096] 43.72b9f039 ack+ondisk+write e9744) currently commit_sent

Meanwhile, I was running a fio process with the rbd ioengine. The read and write IOPS were so low that the fio process got almost no responses. In other words, when an OSD is started, the IO of the whole cluster is blocked. Is there some parameter to adjust? How can this problem be explained?

The results of the fio run were as follows:

ebs: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=rbd, iodepth=64
fio-2.2.9-20-g1520
Starting 1 thread
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [m(1)] [0.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 05h:10m:14s]
ebs: (groupid=0, jobs=1): err= 0: pid=40323: Mon Oct 26 04:02:00 2015
  read : io=10904KB, bw=175183B/s, *iops=21*, runt= 63737msec
    slat (usec): min=0, max=61, avg= 1.11, stdev= 3.16
    clat (msec): min=1, max=63452, avg=1190.04, stdev=6046.28
    lat (msec): min=1, max=63452, avg=1190.04, stdev=6046.28
    clat percentiles (msec):
    | 1.00th=[3], 5.00th=[
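To see where those requests actually spend their time while the OSD rejoins, the admin socket on the OSD's host gives a timestamped event list for each op; a sketch (osd.4 as in the log above):

# in-flight and recently completed slow ops, with per-stage timestamps
ceph daemon osd.4 dump_ops_in_flight
ceph daemon osd.4 dump_historic_ops
# the peering/recovery picture while the OSD comes up
ceph -s
ceph pg dump_stuck unclean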
[ceph-users] randwrite iops of rbd volume in kvm decrease after several hours with qemu threads and cpu usage on host increasing
Hi experts,

When I test the IO performance of an rbd volume in a pure SSD pool with fio inside a KVM VM, the IOPS decreases from 15k to 5k, while the number of qemu threads on the host increases from about 200 to about 700 and the CPU usage of the qemu process on the host increases from 600% to 1400%.

My test scenario is as follows:
rw=randwrite
direct=1
numjobs=64
ioengine=sync
bsrange=4k-4k
runtime=180

The versions of some packages are:
ceph: 0.94.3
qemu-kvm: 2.1.2
host kernel: 3.10

What might the problem be? Appreciate any help.

Best Regards,
Jackie
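This looks like the same thread-growth issue discussed in the librbd thread above. One way to confirm where the extra CPU goes is to profile the qemu process on the host during the degraded phase of the fio run; a sketch, assuming the process can be found by name with pgrep:

# sample the qemu process for 30 seconds with call graphs
perf record -g -p $(pgrep -of qemu-system-x86) -- sleep 30
perf report --stdio | head -n 40
# live view of the hottest symbols (lock contention shows up as spinlock functions here)
perf top -p $(pgrep -of qemu-system-x86)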