Hi, I would like to submit a set of patch i have written in order to improve the performances of the SubTotal functionality.
1/ What is the context Some people i work with are using the LO SubTotal functionality on very large spreadsheets. Let say something like 50 columns and 100 000 rows. We noticed that the performance of this function is really below compared to the performances of other spreadsheets like Excel. As an example, the computation of the subtotal for a single numeric column on such a sheet takes more than two days on a modern box (quad core and 8 gigs RAM) ! The code analysis i did have show that most of the performance is lost because of useless data structure iteration, increasing processing time, and sometimes algorithm complexity. I have created two set of patches to fix as much as i could. The first set applies on mdds library and is already have been merged by Kohei (thanks !). Today i submit the second set, applying on SC code. 2/ What are the benefits ? Now speaking about numbers... here is a view of the execution time of the different method before and after the patch was applied. Please keep in mind that this is only relevant for SubTotal on very large sheets. Of course it will benefits to others function, but since the gain is high don't expect too much magic ;) Inserting a row at the top of a document of about 100 000 rows Method ScDocument::InsertRow . current version 1206.8 ms . patch mdds 561.8 ms . patch mdds & lo 304.1 ms The first benefits is clearly that it runs faster :) There is another benefit to the current situation, it is memory consumption. The mdds patch reduce the memory footprint. For such a large files, memory used goes from about 8gigs down to 1.6 gigs. Without tha patch, the SubTotal cannot end because it keeps allocating memory during the computation. I strongly suggest you give a try to this mdds for the next release. On the LO side there is another issue that the second set of patches is solving. The execution duration of several method is not constant and depend on the number of rows stored in the structure. Which is not really the issue in itself, method complexity is linear. The issue is that the code was slow because of the data structures were fully iterated in some method, even if there is nothing to do. This may be negligible when working with small sheet, but when it comes to a processing the insertion of tens of thousand lines in a very large file, you can reach tens or hundreds of millions of vector iteration only to check there is nothing to do. I tried to remove the extra iterations, and here is the results. The following numbers compares execution time between version with mdds patches, and mdds + LO patches. Version without patch cannot do that much iterations. Initial condition are still a 100 000 row documents. Iteration number or "insertion count" refers to the number of iteration in DoSubTotal method, and to the number of lines that have been added to the original document. It. mdds mdds + Lo 1 572 ms 304 ms 500 605 ms 306 ms 1 500 645 ms 312 ms 5 000 833 ms 362 ms 12 000 1246 ms 394 ms 20 000 1927 ms 451 ms 40 000 2685 ms 718 ms 70 000 4488 ms 830 ms As you can notice, when the number of iterations increase, the execution time is still linear to the number of iteration, but situation is really better. There are two main method to optimize, and the remaining work will be more tricky. 3/ So what the patch is doing ? The analysis of time spent in each method show a few guilty methods. The following durations show the time spent in several method called by the ScDocument::InsertRow method (taken for the 50 000th row insertion, it the middle of the processing and duration are close to the overall average duration per row) SetAutoCalc 0.004 ms TestInsertRow loop 118.759 ms lcl_GetFirstTabRange 0.054 ms UpdateBroadcastAreas loop 127.657 ms Boucle while UpdateReference 1099.61 ms SetNeedsListeningGroups 0.048 ms InsertRow loop on column 708.855 ms Boucle for UpdateDrawRef 0.306 ms StartNeededListeners 500.723 ms SetDirtyIfPostponed 510.201 ms BroadcastRecalcOnRefMoveHandler 509.798 ms UpdateDirtyCharts : 0.004 Overall ScDocument::InsertRow execution time 3576.07 ms. More the 3.5 seconds to insert one row... looks like it needs a little rework here. Having a look to every method i noticed that several of the method were dealing with formula and iterating everytime and fully the list of cells to apply formula processing. Even if the document has no formula... When close to iteration number 50 000, each column contains up to 30 000 blocks allocated in the mdds backend. Each time such a method is called, it iterates the 30 000 blocks, for each row, for each columns, thats millions of call to vector containing no formula at all. So i tried to add a flag the the ScColumn used to prevent iteration on the column data if there is no formula in this column. This seems to work and the savings were really important. I write seems to, because i really need some review on this patch :) Unit tests are ok, and my personal tests are ok too, but it is such a large piece of software that there may be cases i am not aware of. Here are the new timings with the "formula flag" (only for long durations) TestInsertRow loop 118.759 ms => no change UpdateBroadcastAreas loop 127.657 ms => 0.027 ms Boucle while UpdateReference 1099.61 ms => 2.262 ms InsertRow loop on column 708.855 ms => no change StartNeededListeners 500.723 ms => 1.735 ms SetDirtyIfPostponed 510.201 ms => 1.624 ms BroadcastRecalcOnRefMoveHandler 509.798 ms => 1.603 ms The execution time goes down by about 2.7 secs for a single row (please remeber there is about 100 000 to process...) Talking about the implementation now... a flag has been added to the ScColumn, and is called mbMayHaveFormula (please feel free to rename it). It flags the fact that the column contains at least one cell with a formula. First of all, i am aware that there is already an existing method, but its implementation iterates the full sheet, testing each cell to compare its type with the formula type. Which is exactly what we want to get rid of. This flag is set to true each time an operation of adding or cloning a cell is done (same for cloning rows). And this flag is set to false when a method that iterates all the formula cell of a column find no formula. This switch to false is currently implemented only in StartListeners. This is correct, since if the value is true even if there is no formula. Having a true value instead of a false not an issue. The consequence will only be that vectors will be iterated even if there is no formulas. But.... Having a false value when the column contains formulas is really bad. Functions won't be called. And this is one of the subject i'd like a have a deep review please. In the method implantation, the chance are really simple. I first check the flag at the beginning, and return if needed. If and only if the method does only formula processing. Which is the case of several methods. This is done instead of iterating the full list of cells and check cell's type against sc::element_type_formula. 4/ What's next First of all i look forward for your review, comment, improvement, etc :) Then if the general idea of this patch is accepted, i'll continue my optimization, trying to reduce the execution time of the TestInsertRow and InsertRow methods. The optimization is back to mdds, since it is about the get_block_position and insert_empty methods. 5/ Gerrit references Sorry i was not aware that all my local commits would become a distinct gerrit entry... first time use :( next time i'll group them https://gerrit.libreoffice.org/18540 https://gerrit.libreoffice.org/18541 https://gerrit.libreoffice.org/18542 https://gerrit.libreoffice.org/18543 https://gerrit.libreoffice.org/18544 https://gerrit.libreoffice.org/18545 https://gerrit.libreoffice.org/18546 https://gerrit.libreoffice.org/18547 https://gerrit.libreoffice.org/18548 https://gerrit.libreoffice.org/18549 https://gerrit.libreoffice.org/18550 https://gerrit.libreoffice.org/18551 https://gerrit.libreoffice.org/18552 https://gerrit.libreoffice.org/18553 Cheers W. -- William http://www.wbonnet.net http://france.debian.net Association Debian France http://www.opencsw.org Community SoftWare for Solaris _______________________________________________ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice