Hi Nova,

As the Cells v2 architecture is maturing, and CERN has adopted it with seemingly good results, *Huawei* is also willing to consider using it in our Public Cloud deployments. Since we still have concerns about the performance of listing instances across multiple cells, *Yikun Jiang* and I recently ran a performance test of ``instance list`` against a multi-cell deployment, and we would like to share our test results and findings.
First, a note about our testing environment. We (Yikun and I) did this as a proof of concept, to show the ratio between the time spent querying data from the DB, sorting, and so on, so we ran it on our own machine: 16 CPUs and 80 GB RAM, and since the machine is old the disk is probably slow. For that reason we will not judge the absolute time consumption, only the overall logic and the ratios between the different steps. The setup is a devstack deployment on this single machine.

As for the test plan: we set up 10 cells (cell1~cell10) and generated 10000 instance records in each of those cells (at 20 instances per host, that corresponds to about 500 hosts, which seems a good size for a cell). cell0 is kept empty, since the number of errored instances should be very small and it doesn't really matter here. We measured the time taken to list instances across 1, 2, 5 and 10 cells (cell0 is always queried as well, so it is actually 2, 3, 6 and 11 cells) with limits of 100, 200, 500 and 1000, 1000 being the default maximum limit. To make the results more general, we tested the listing with the default sort key and direction, sorted by uuid, and sorted by uuid & name.
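For readers who want to reproduce this kind of breakdown, the per-step costs (total / data gather / merge sort / construct view) can be collected with a small timer context manager. This is only a minimal sketch of the measurement idea, not the actual instrumentation we used; the `timed` helper and the stand-in workload are hypothetical:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(step):
    """Record the wall-clock duration of one step under the given name."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[step] = time.monotonic() - start

# Hypothetical usage mirroring the steps we measured; the list of dicts
# is a stand-in for real instance records coming back from the cell DBs.
with timed("data_gather"):
    rows = [{"uuid": "%04d" % i} for i in range(1000)]
with timed("merge_sort"):
    rows.sort(key=lambda r: r["uuid"])
with timed("construct_view"):
    view = [dict(r) for r in rows]
```

Summing the per-step entries and comparing them against a single timer around the whole path gives the same "total vs. parts" split used in the tables below.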
This is what we got (time unit: seconds; Total = total cost, Gather = data gather cost, Merge = merge sort cost, View = construct view cost; Cells = number of cells queried):

*Default sort*

Cells  Limit      Total     Gather     Merge      View
   10    100     2.3313     2.1306    0.1145    0.0672
   10    200     3.5979     3.2137    0.2287    0.1265
   10    500     7.1952     6.2597    0.5704    0.3029
   10   1000    13.5745    11.7012    1.1511    0.5966
    5    100     1.3142     1.1003    0.1163    0.0706
    5    200     2.0151     1.6063    0.2645    0.1255
    5    500     4.2109     3.1358    0.7033    0.3343
    5   1000     7.841      5.8881    1.2027    0.6802
    2    100     0.6736     0.4727    0.1113    0.0822
    2    200     1.1226     0.7229    0.2577    0.1255
    2    500     2.2358     1.3506    0.5595    0.3026
    2   1000     4.2079     2.3367    1.2053    0.5986
    1    100     0.4857     0.2869    0.1097    0.069
    1    200     0.6835     0.3236    0.2212    0.1256
    1    500     1.5848     0.6415    0.6251    0.3043
    1   1000     3.1692     1.2124    1.2246    0.6762

*Sort by uuid*

Cells  Limit      Total     Gather     Merge      View
   10    100     2.3693     2.1343    0.1148    0.1016
   10    200     3.5316     3.1509    0.2265    0.1255
   10    500     7.5057     6.4761    0.6263    0.341
   10   1000    13.8408    11.9007    1.2268    0.5939
    5    100     1.2458     1.0498    0.1163    0.0665
    5    200     1.9866     1.5386    0.2668    0.1615
    5    500     4.1605     3.0893    0.6951    0.3384
    5   1000     7.7135     5.9121    1.1363    0.5969
    2    100     0.605      0.4192    0.1105    0.0656
    2    200     1.0268     0.6671    0.2255    0.1254
    2    500     2.3307     1.2748    0.6581    0.3362
    2   1000     4.2384     2.4071    1.2017    0.633
    1    100     0.4205     0.233     0.1131    0.0672
    1    200     0.7777     0.3754    0.261     0.13
    1    500     1.6472     0.6554    0.6292    0.3053
    1   1000     3.0836     1.2286    1.2055    0.643

*Sort by uuid+name*

Cells  Limit      Total     Gather     Merge      View
   10    100     2.3284     2.1264    0.1145    0.0679
   10    200     3.481      3.054     0.2697    0.1284
   10    500     7.4885     6.4623    0.6239    0.3404
   10   1000    13.8813    11.913     1.2301    0.6187
    5    100     1.2528     1.0579    0.1161    0.066
    5    200     2.0352     1.6246    0.2646    0.1262
    5    500     4.1972     3.2461    0.6104    0.3028
    5   1000     7.8377     5.9385    1.1936    0.6376
    2    100     0.688      0.4613    0.1126    0.0682
    2    200     1.2805     0.8171    0.2222    0.1258
    2    500     2.741      1.6023    0.633     0.3365
    2   1000     4.3437     2.4136    1.217     0.6394
    1    100     0.6372     0.3305    0.196     0.0681
    1    200     0.9245     0.4527    0.227     0.129
    1    500     1.9455     0.8201    0.5918    0.3447
    1   1000     3.0991     1.2248    1.2615    0.6028

Our conclusions from the data are:

1. The time consumed by the *merge sort* step correlates strongly with the *limit* and appears *not* to be affected by the *number of cells*;

2.
The major time consumption in the whole process is actually the data gathering step, so we took a closer look at it. We added some audit logging to the code, and from the log we can see:

02:24:53.376705 db begin, nova_cell0
02:24:53.425836 db end, nova_cell0: 0.0487968921661
02:24:53.426622 db begin, nova_cell1
02:24:54.451235 db end, nova_cell1: 1.02400803566
02:24:54.451991 db begin, nova_cell2
02:24:55.715769 db end, nova_cell2: 1.26333093643
02:24:55.716575 db begin, nova_cell3
02:24:56.963428 db end, nova_cell3: 1.24626398087
02:24:56.964202 db begin, nova_cell4
02:24:57.980187 db end, nova_cell4: 1.01546406746
02:24:57.980970 db begin, nova_cell5
02:24:59.279139 db end, nova_cell5: 1.29762792587
02:24:59.279904 db begin, nova_cell6
02:25:00.311717 db end, nova_cell6: 1.03130197525
02:25:00.312427 db begin, nova_cell7
02:25:01.654819 db end, nova_cell7: 1.34187483788
02:25:01.655643 db begin, nova_cell8
02:25:02.689731 db end, nova_cell8: 1.03352093697
02:25:02.690502 db begin, nova_cell9
02:25:04.076885 db end, nova_cell9: 1.38588285446

Yes, the DB queries ran serially. After some investigation, it turned out that we are unable to perform eventlet.monkey_patch in uWSGI mode, so Yikun made this fix: https://review.openstack.org/#/c/592285/

After making this change, we tested again (10 cells, limit 1000, default sort) and got this data:

                        total     collect   sort     view
before monkey_patch    13.5745   11.7012   1.1511   0.5966
after monkey_patch     12.8367   10.5471   1.5642   0.6041

The performance improved a little, and from the log we can see:

Aug 16 02:14:46.383081 begin detail api
Aug 16 02:14:46.406766 cell gather begin
Aug 16 02:14:46.419346 db begin, nova_cell0
Aug 16 02:14:46.425065 db begin, nova_cell1
Aug 16 02:14:46.430151 db begin, nova_cell2
Aug 16 02:14:46.435012 db begin, nova_cell3
Aug 16 02:14:46.440634 db begin, nova_cell4
Aug 16 02:14:46.446191 db begin, nova_cell5
Aug 16 02:14:46.450749 db begin, nova_cell6
Aug 16 02:14:46.455461 db begin, nova_cell7
Aug 16 02:14:46.459959 db begin, nova_cell8
Aug 16 02:14:46.466066 db begin, nova_cell9
Aug 16 02:14:46.470550 db begin, nova_cell10
Aug 16 02:14:46.731882 db end, nova_cell0: 0.311906099319
Aug 16 02:14:52.667791 db end, nova_cell5: 6.22100400925
Aug 16 02:14:54.065655 db end, nova_cell1: 7.63998198509
Aug 16 02:14:54.939856 db end, nova_cell3: 8.50425100327
Aug 16 02:14:55.309017 db end, nova_cell6: 8.85762405396
Aug 16 02:14:55.309623 db end, nova_cell8: 8.84928393364
Aug 16 02:14:55.310240 db end, nova_cell2: 8.87976694107
Aug 16 02:14:56.057487 db end, nova_cell10: 9.58636116982
Aug 16 02:14:56.058001 db end, nova_cell4: 9.61698698997
Aug 16 02:14:56.058547 db end, nova_cell9: 9.59216403961
Aug 16 02:14:56.954209 db end, nova_cell7: 10.4981210232
Aug 16 02:14:56.954665 cell gather end: 10.5480799675
Aug 16 02:14:56.955010 begin heapq.merge
Aug 16 02:14:58.527040 end heapq.merge: 1.57150006294

So the queries are now issued in parallel, but the whole thing still seems serialized: each individual cell query now takes 6-10 seconds instead of roughly 1 second, and the total gather time barely improved. We tried adjusting database configuration options such as max_thread_pool and use_tpool, and we also tried using a separate database for some of the cells, but none of that made a big difference.

The above is what we have now; feel free to ping us if you have any questions or suggestions.

BR,

Zhenyu Zheng
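To make the listing pipeline concrete: results are gathered per cell and then merge-sorted with heapq.merge, which is why the merge cost tracks the limit rather than the cell count (heapq.merge is lazy, so only `limit` elements are ever pulled from the combined streams). Below is a minimal standalone sketch of that scatter/gather-plus-merge shape; it uses OS threads instead of eventlet greenthreads, and `query_cell` with its in-memory records is a hypothetical stand-in for the real per-cell DB query:

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor

def query_cell(cell, limit):
    """Hypothetical per-cell query: returns at most `limit` records,
    already sorted by the requested sort key (here: uuid), just as each
    per-cell DB query does."""
    return [{"cell": cell, "uuid": "%s-%04d" % (cell, i)} for i in range(limit)]

def list_instances(cells, limit):
    # Scatter: query every cell concurrently. Nova does this with
    # eventlet greenthreads once monkey-patching works; plain OS
    # threads show the same overall shape.
    with ThreadPoolExecutor(max_workers=len(cells)) as pool:
        results = list(pool.map(lambda c: query_cell(c, limit), cells))
    # Gather: lazily merge the per-cell sorted streams and stop after
    # `limit` records, so merge cost depends on `limit`, not on the
    # number of cells that contributed results.
    merged = heapq.merge(*results, key=lambda r: r["uuid"])
    return list(itertools.islice(merged, limit))

instances = list_instances(["cell1", "cell2", "cell3"], limit=100)
```

With this structure, the end-to-end latency of the gather phase is bounded by the slowest cell rather than the sum of all cells, which is exactly the behaviour the second log excerpt was meant to demonstrate.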
__________________________________________________________________________ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev