Re: Client still connect failed leader after that mon down
On 17/12/15 21:27, Sage Weil wrote:

On Thu, 17 Dec 2015, Jaze Lee wrote:

Hello cephers:

In our test there are three monitors. We find that running a ceph command from a client is slow when the leader mon is down. Even after a long time, the first ceph command a client runs is still slow. From strace, we find that the client first tries to connect to the leader and then, after 3s, connects to the second mon. After some searching we find that the quorum has not changed; the leader is still the down monitor. Is that normal? Or is there something I missed?

It's normal. Even when the quorum does change, the client doesn't know that. It should be contacting a random mon on startup, though, so I would expect the 3s delay 1/3 of the time.

That's because the client randomly picks a mon from the monmap. But what we observed is that when a mon is down, no change is made to the monmap (neither the epoch nor the members). Is that the culprit for this phenomenon?

Thanks,
Jevon

A long-standing low-priority feature request is to have the client contact 2 mons in parallel so that it can still connect quickly if one is down. It requires some non-trivial work in mon/MonClient.{cc,h}, though, and I don't think anyone has looked at it seriously.

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
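A rough sketch of the arithmetic behind the "3s delay 1/3 of the time" estimate above. It assumes the client picks its first mon uniformly at random from the monmap and that the retry after the timeout lands on a live mon; the function name and the millisecond units are illustrative only and do not come from Ceph.

```shell
# Average extra startup delay in milliseconds when the first mon is chosen
# uniformly at random: (down mons / total mons) * connect timeout.
# Assumption: only the first attempt can hit a down mon.
expected_delay_ms() {
    total=$1; down=$2; timeout_ms=$3
    echo $(( down * timeout_ms / total ))
}

# Three mons in the monmap, one of them down, 3 s before the client moves on.
expected_delay_ms 3 1 3000
```

With the numbers from the thread this gives an average of about one second per command, even though any single command either pays the full 3s or nothing.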
Re: why package ceph-fuse needs packages ceph?
Hi Sage,

Here comes another question: does ceph-fuse support Unix OSes (like HP-UX or AIX)?

Thanks,
Jevon

On 26/10/15 15:19, Sage Weil wrote:

On Mon, 26 Oct 2015, Jaze Lee wrote:

Hello,

I think ceph-fuse is just a client, so why does it need the ceph package? I found that when I install ceph-fuse, it also installs the ceph package. But when I install ceph-common, it does not install the ceph package. Maybe ceph-fuse is not just a Ceph client?

It is, and the Debian packaging works as expected. This is a simple error in the spec file. I'll submit a patch.

sage
Re: Seek advice for using Ceph to provide NAS service
Any comments or suggestions?

Thanks,
Jevon

On 23/9/15 10:21, Jevon Qiao wrote:

Hi Sage and other Ceph experts,

This is a greeting from Jevon. I'm from China and work at a company that uses Ceph as its backend storage. At present, I'm evaluating the following two options for using a Ceph cluster to provide NAS service, and I need your advice from the perspective of stability and feasibility.

Option 1: Directly use CephFS

Since Ceph, as unified storage, can provide file system storage via CephFS, this looks like an ideal solution for my case if CephFS is ready to be used in a production environment. However, based on previous discussions of CephFS, I see that there are still some issues, such as not being ready to support multiple metadata servers, the lack of a fully functioning fsck, and so on. I also learned from the official Ceph website that CephFS has been evaluated by a large community of users and that there are production systems using it with a single MDS. So it is difficult for me to decide whether I should use it.

Option 2: Ceph RBD + NFS server

This might be a common architecture in current NAS storage. But the problem is how to get rid of the single point of failure at the NFS server. What I have right now is to use Corosync and Pacemaker (the typical HA solution on Linux) to form a cluster. It seems that Sebastien Han has verified the feasibility.

Your comments and advice would be highly appreciated, and I'm looking forward to your reply.

Thanks,
Jevon
Seek advice for using Ceph to provide NAS service
Hi Sage and other Ceph experts,

This is a greeting from Jevon. I'm from China and work at a company that uses Ceph as its backend storage. At present, I'm evaluating the following two options for using a Ceph cluster to provide NAS service, and I need your advice from the perspective of stability and feasibility.

Option 1: Directly use CephFS

Since Ceph, as unified storage, can provide file system storage via CephFS, this looks like an ideal solution for my case if CephFS is ready to be used in a production environment. However, based on previous discussions of CephFS, I see that there are still some issues, such as not being ready to support multiple metadata servers, the lack of a fully functioning fsck, and so on. I also learned from the official Ceph website that CephFS has been evaluated by a large community of users and that there are production systems using it with a single MDS. So it is difficult for me to decide whether I should use it.

Option 2: Ceph RBD + NFS server

This might be a common architecture in current NAS storage. But the problem is how to get rid of the single point of failure at the NFS server. What I have right now is to use Corosync and Pacemaker (the typical HA solution on Linux) to form a cluster. It seems that Sebastien Han has verified the feasibility.

Your comments and advice would be highly appreciated, and I'm looking forward to your reply.

Thanks,
Jevon
Re: [ceph-users] Is it safe to increase pg number in a production environment
Hi Jan,

Thank you very much for the suggestion.

Regards,
Jevon

On 5/8/15 19:36, Jan Schermer wrote:

Hi, comments inline.

On 05 Aug 2015, at 05:45, Jevon Qiao wrote:

Hi Jan,

Thank you for the detailed suggestion. Please see my reply in-line.

On 5/8/15 01:23, Jan Schermer wrote:

I think I wrote about my experience with this about 3 months ago, including what techniques I used to minimize impact on production. Basically we had to
1) increase pg_num in small increments only, because creating the placement groups themselves caused slow requests on OSDs
2) increase pgp_num in small increments and then go higher

So you totally completed step 1 before jumping into step 2. Have you ever tried mixing them together? Increase pg_num, increase pgp_num, increase pg_num…

Actually we first increased both to 8192 and then decided to go higher, but that doesn’t matter. The only reason for this was that the first step could run unattended at night without disturbing the workload.* The second step had to be attended.

* In other words, we didn’t see “slow requests” because of our threshold settings, but while PGs were being created the cluster paused IO for non-trivial amounts of time.

I suggest you do this in as small steps as possible, depending on your SLAs.

We went from 4096 placement groups up to 16384. pg_num (the number of on-disk created placement groups) was increased like this:
# for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; sleep 60 ; done
This ran overnight (and was upped to a step of 128 during the night).

Increasing pgp_num was trickier in our case, first because it was heavy production and we wanted to minimize the visible impact, and second because of wildly differing free space on the OSDs. We did it again in steps and waited for the cluster to settle before continuing. Each step upped pgp_num by about 2%, and as we got higher (>8192) we increased this to much more - the last step was 15360->16384 with the same impact the initial 4096->4160 had.
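The stepping procedure described above could be wrapped up roughly as follows. This is only a sketch, not what Jan actually ran: the helper names (pg_steps, bump_pool) and the DRY_RUN, PAUSE and WAIT_HEALTH variables are made up for illustration. It relies on nothing beyond `ceph osd pool set` and `ceph health`, and by default it only prints the commands it would run.

```shell
# Print every intermediate target from START to END, growing by STEP,
# and always finish exactly on END.
pg_steps() {
    start=$1; step=$2; end=$3
    i=$start
    while [ "$i" -lt "$end" ]; do
        echo "$i"
        i=$((i + step))
    done
    echo "$end"
}

# Walk a pool setting (pg_num or pgp_num) up in small steps.
# DRY_RUN=1 (the default in this sketch) only prints the commands.
bump_pool() {
    pool=$1; setting=$2; start=$3; step=$4; end=$5
    for n in $(pg_steps "$start" "$step" "$end"); do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ceph osd pool set $pool $setting $n"
        else
            ceph osd pool set "$pool" "$setting" "$n"
            sleep "${PAUSE:-60}"
            # Optionally wait for the cluster to settle before the next step.
            # Note: this loops forever if the cluster never reports HEALTH_OK.
            if [ "${WAIT_HEALTH:-0}" = "1" ]; then
                until ceph health | grep -q HEALTH_OK; do sleep 10; done
            fi
        fi
    done
}
```

For example, `bump_pool mypool pg_num 4096 64 16384` prints the same sequence of commands as the one-liner above (mypool is a placeholder pool name). Running it with DRY_RUN=0 and WAIT_HEALTH=1 for pgp_num would approximate the "wait for the cluster to settle" approach, with the caveat that a fixed step does not reproduce the roughly 2% growth per step mentioned above.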
The strategy you adopted looks great. I'll do some experiments on a test cluster to evaluate the real impact of each step.

The end result is much better but still nowhere near optimal - a bigger impact would come from upgrading to a newer Ceph release and setting the new tunables, because we’re running Dumpling. Be aware that PGs cost some space (rough estimate is 5GB per OSD in our case), and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now while it only had about 1GB before. That’s a lot of memory and space with higher OSD counts...

This is a good point. So along with increasing the number of PGs, we also need to take the current status of the cluster (the available disk space and memory for each OSD) into account and evaluate whether more resources need to be added.

Depends on how much free space you have. We had some OSDs at close to 85% capacity before we started (and other OSDs at only 30%). When increasing the number of PGs the data shuffled greatly - but this depends on what CRUSH rules you have (and what version you are running). Newer versions with newer tunables will make this a lot easier, I guess.

And while I haven’t calculated the number of _objects_ per PG, we have differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another hosts 1300) and this seems to be the cause of poor data balancing.

In our environment, we also encountered an imbalanced mapping between PGs and OSDs. What kind of bucket algorithm was used in your environment? Any idea on how to minimize it?

We are using straw because of Dumpling. Straw2 should make everything better :-)

Jan

Thanks,
Jevon

Jan

On 04 Aug 2015, at 18:52, Marek Dohojda wrote:

I have done this not that long ago. My original PG estimates were wrong and I had to increase them. After increasing the PG numbers Ceph rebalanced, and that took a while. To be honest, in my case the slowdown wasn’t really visible, but it took a while.
My strong suggestion to you would be to do it in a long IO time, and be prepared that this will take quite a long time to accomplish. Do it slowly and do not increase multiple pools at once.

It isn’t recommended practice but doable.

On Aug 4, 2015, at 10:46 AM, Samuel Just wrote:

It will cause a large amount of data movement. Each new pg after the split will relocate. It might be ok if you do it slowly. Experiment on a test cluster.
-Sam

On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 wrote:

Hi Cephers,

This is a greeting from Jevon. Currently, I'm experiencing an issue which troubles me a lot, so I'm writing to ask for your comments/help/suggestions. More details are provided below.

Issue:
I set up a cluster having 24 OSDs and created one pool with 1024 placement groups on it for a small startup company. The number 1024 was calculated per the equation 'OSDs * 100'/pool size. The cluster has been running quite well for a long time. But rece
Re: [ceph-users] Is it safe to increase pg number in a production environment
Hi Jan,

Thank you for the detailed suggestion. Please see my reply in-line.

On 5/8/15 01:23, Jan Schermer wrote:

I think I wrote about my experience with this about 3 months ago, including what techniques I used to minimize impact on production. Basically we had to
1) increase pg_num in small increments only, because creating the placement groups themselves caused slow requests on OSDs
2) increase pgp_num in small increments and then go higher

So you totally completed step 1 before jumping into step 2. Have you ever tried mixing them together? Increase pg_num, increase pgp_num, increase pg_num...

We went from 4096 placement groups up to 16384. pg_num (the number of on-disk created placement groups) was increased like this:
# for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; sleep 60 ; done
This ran overnight (and was upped to a step of 128 during the night).

Increasing pgp_num was trickier in our case, first because it was heavy production and we wanted to minimize the visible impact, and second because of wildly differing free space on the OSDs. We did it again in steps and waited for the cluster to settle before continuing. Each step upped pgp_num by about 2%, and as we got higher (>8192) we increased this to much more - the last step was 15360->16384 with the same impact the initial 4096->4160 had.

The strategy you adopted looks great. I'll do some experiments on a test cluster to evaluate the real impact of each step.

The end result is much better but still nowhere near optimal - a bigger impact would come from upgrading to a newer Ceph release and setting the new tunables, because we’re running Dumpling. Be aware that PGs cost some space (rough estimate is 5GB per OSD in our case), and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now while it only had about 1GB before. That’s a lot of memory and space with higher OSD counts...

This is a good point.
So along with increasing the number of PGs, we also need to take the current status of the cluster (the available disk space and memory for each OSD) into account and evaluate whether more resources need to be added.

And while I haven’t calculated the number of _objects_ per PG, we have differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another hosts 1300) and this seems to be the cause of poor data balancing.

In our environment, we also encountered an imbalanced mapping between PGs and OSDs. What kind of bucket algorithm was used in your environment? Any idea on how to minimize it?

Thanks,
Jevon

Jan

On 04 Aug 2015, at 18:52, Marek Dohojda wrote:

I have done this not that long ago. My original PG estimates were wrong and I had to increase them. After increasing the PG numbers Ceph rebalanced, and that took a while. To be honest, in my case the slowdown wasn’t really visible, but it took a while.

My strong suggestion to you would be to do it in a long IO time, and be prepared that this will take quite a long time to accomplish. Do it slowly and do not increase multiple pools at once.

It isn’t recommended practice but doable.

On Aug 4, 2015, at 10:46 AM, Samuel Just wrote:

It will cause a large amount of data movement. Each new pg after the split will relocate. It might be ok if you do it slowly. Experiment on a test cluster.
-Sam

On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 wrote:

Hi Cephers,

This is a greeting from Jevon. Currently, I'm experiencing an issue which troubles me a lot, so I'm writing to ask for your comments/help/suggestions. More details are provided below.

Issue:
I set up a cluster having 24 OSDs and created one pool with 1024 placement groups on it for a small startup company. The number 1024 was calculated per the equation 'OSDs * 100'/pool size. The cluster has been running quite well for a long time. But recently, our monitoring system keeps complaining that the usage of some disks exceeds 85%.
I logged into the system and found that the usage of some disks is really very high, while for others it is not (less than 60%). Each time the issue happens, I have to manually re-balance the distribution. This is a short-term solution; I'm not willing to do it all the time.

Two long-term solutions come to my mind:

1) Ask the customers to expand their clusters by adding more OSDs. But I think they will ask me to explain the reason for the imbalanced data distribution. We've already done some analysis on the environment, and we learned that the most imbalanced part of CRUSH is the mapping between objects and PGs. The biggest pg has 613 objects, while the smallest pg only has 226 objects.

2) Increase the number of placement groups. It can be of great help for statistically uniform data distribution, but it can also incur significant data movement as PGs are effectively being split. I just cannot do it in our customers' environment before we 100% understand the consequences. So has anyone done this in a production environment? How much does this operation affect the performance
Re: [ceph-users] Is it safe to increase pg number in a production environment
Got it, thank you for the suggestion.

Regards,
Jevon

On 5/8/15 00:51, Stefan Priebe wrote:

We've done the splitting several times. The most important thing is to run a Ceph version which does not have the linger ops bug. That means the latest dumpling release, giant, or hammer. The latest firefly release still has this bug, which results in wrong watchers and no working snapshots.

Stefan

Am 04.08.2015 um 18:46 schrieb Samuel Just:

It will cause a large amount of data movement. Each new pg after the split will relocate. It might be ok if you do it slowly. Experiment on a test cluster.
-Sam

On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 wrote:

Hi Cephers,

This is a greeting from Jevon. Currently, I'm experiencing an issue which troubles me a lot, so I'm writing to ask for your comments/help/suggestions. More details are provided below.

Issue:
I set up a cluster having 24 OSDs and created one pool with 1024 placement groups on it for a small startup company. The number 1024 was calculated per the equation 'OSDs * 100'/pool size. The cluster has been running quite well for a long time. But recently, our monitoring system keeps complaining that the usage of some disks exceeds 85%. I logged into the system and found that the usage of some disks is really very high, while for others it is not (less than 60%). Each time the issue happens, I have to manually re-balance the distribution. This is a short-term solution; I'm not willing to do it all the time.

Two long-term solutions come to my mind:

1) Ask the customers to expand their clusters by adding more OSDs. But I think they will ask me to explain the reason for the imbalanced data distribution. We've already done some analysis on the environment, and we learned that the most imbalanced part of CRUSH is the mapping between objects and PGs. The biggest pg has 613 objects, while the smallest pg only has 226 objects.

2) Increase the number of placement groups.
It can be of great help for statistically uniform data distribution, but it can also incur significant data movement as PGs are effectively being split. I just cannot do it in our customers' environment before we 100% understand the consequences. So has anyone done this in a production environment? How much does this operation affect the performance of clients?

Any comments/help/suggestions will be highly appreciated.

--
Best Regards
Jevon

___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
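For reference, the 'OSDs * 100'/pool size rule quoted in the original message can be worked through as below. This is a sketch under two assumptions that the thread does not state: a pool size (replica count) of 3, and rounding the result up to the next power of two, which is the usual Ceph guidance. The function names are made up for illustration.

```shell
# Smallest power of two that is >= n.
next_pow2() {
    n=$1; p=1
    while [ "$p" -lt "$n" ]; do p=$((p * 2)); done
    echo "$p"
}

# Suggested PG count: (OSDs * 100) / pool size, rounded up to a power of two.
pg_count() {
    osds=$1; size=$2
    next_pow2 $(( osds * 100 / size ))
}

# 24 OSDs, assumed pool size 3: 24 * 100 / 3 = 800, rounded up to 1024.
pg_count 24 3
```

Under those assumptions the result matches the 1024 placement groups mentioned in the message.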
Re: [ceph-users] Is it safe to increase pg number in a production environment
Thank you and Samuel for the prompt response.

On 5/8/15 00:52, Marek Dohojda wrote:

I have done this not that long ago. My original PG estimates were wrong and I had to increase them. After increasing the PG numbers Ceph rebalanced, and that took a while. To be honest, in my case the slowdown wasn’t really visible, but it took a while.

How many OSDs do you have in your cluster? How much did you adjust the PG numbers?

My strong suggestion to you would be to do it in a long IO time, and be prepared that this will take quite a long time to accomplish. Do it slowly and do not increase multiple pools at once.

Both you and Samuel said to do it slowly. Do you mean adjusting the PG numbers step by step rather than doing it in one step? Also, would you please explain 'a long IO time' in detail?

Thanks,
Jevon

It isn’t recommended practice but doable.

On Aug 4, 2015, at 10:46 AM, Samuel Just wrote:

It will cause a large amount of data movement. Each new pg after the split will relocate. It might be ok if you do it slowly. Experiment on a test cluster.
-Sam

On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 wrote:

Hi Cephers,

This is a greeting from Jevon. Currently, I'm experiencing an issue which troubles me a lot, so I'm writing to ask for your comments/help/suggestions. More details are provided below.

Issue:
I set up a cluster having 24 OSDs and created one pool with 1024 placement groups on it for a small startup company. The number 1024 was calculated per the equation 'OSDs * 100'/pool size. The cluster has been running quite well for a long time. But recently, our monitoring system keeps complaining that the usage of some disks exceeds 85%. I logged into the system and found that the usage of some disks is really very high, while for others it is not (less than 60%). Each time the issue happens, I have to manually re-balance the distribution. This is a short-term solution; I'm not willing to do it all the time.
Two long-term solutions come to my mind:

1) Ask the customers to expand their clusters by adding more OSDs. But I think they will ask me to explain the reason for the imbalanced data distribution. We've already done some analysis on the environment, and we learned that the most imbalanced part of CRUSH is the mapping between objects and PGs. The biggest pg has 613 objects, while the smallest pg only has 226 objects.

2) Increase the number of placement groups. It can be of great help for statistically uniform data distribution, but it can also incur significant data movement as PGs are effectively being split. I just cannot do it in our customers' environment before we 100% understand the consequences. So has anyone done this in a production environment? How much does this operation affect the performance of clients?

Any comments/help/suggestions will be highly appreciated.

--
Best Regards
Jevon