On Thu, 03 Mar 2011, Ferran Jorba wrote:

> But now, with full text index activated, updating records with
> documents that sum up several thousand pages of PDF, it may take hours
> for bibindex to complete its tasks.

The indexing speed is a slightly different issue. The indexing process
already does several things in a non-optimised way on the Python side,
and optimising this part would gain quite a lot. I have been planning
to ticketise this for a long time, so maybe I'll take this opportunity
and do it one of these days.

Invenio indexes were designed in such a way that the response to users
should be ultra fast, at the price of an indexing speed that is
allowed to be ultra slow. The theory was that documents are more often
read than updated, so once all the documents are pre-indexed, a slow
indexing speed for the usual daily load is still perfectly acceptable.

But even here there are things to be optimised at a higher level. For
example, full-text indexing gets triggered after every minor metadata
change, which is not necessary and which may contribute to a large
extent to the indexing slowness if you often change metadata without
changing the attached full-text files. IOW, boosting my.cnf may help
for some use cases, but for your concrete use case some higher-level
optimisations may rather be called for.
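One such higher-level workaround is to run the cheap metadata indexes
and the expensive full-text index as separate bibsched tasks, so that
metadata edits do not drag the full-text index along every time. A
sketch only (the index names here are examples; check `bibindex
--help' on your installation for the exact option syntax):

    $ bibindex -w title,author,abstract  # frequent: metadata indexes only
    $ bibindex -w fulltext               # infrequent: the expensive one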
> Sure?

1-2 GB is not 8 GB. Is this a dedicated MySQL box, or a box where
Apache and MySQL co-exist? (It helps a lot to separate them.) Note
also that the OS filesystem cache nicely helps MySQL: even if MySQL
does not appear to have everything in its memory buffers, it can
usually access table indexes pretty fast via the OS filesystem cache.
(And it helps if it does not have to compete against Apache for that
same cache.)

BTW, have you enabled log-slow-queries in your my.cnf? If yes, then
this log file will show you whether you have some SQL queries that
take an unusually long time. That would be a good basis for MySQL
optimisations. E.g. is the load on your DB server higher than the
load on your web application server? Etc.

> If I understood Samuele well last summer, when he was explaining to
> me how full text works, with the list of records being continuously
> compressed and uncompressed (serialised and deserialised), a word
> that is found in several thousand documents has all its record ids
> in one blob, right?

Yes, but such a blob is still very small in size, so
max_allowed_packet does not need to be large. In the Invenio INSTALL
file, we still recommend only that the value be larger than 4M.
Increasing max_allowed_packet beyond that is usually needed only for
big blobs such as big citation dictionaries. E.g. for INSPIRE, the
dictionaries are about 800 MB in size, so we set max_allowed_packet
to 1G. If it were lower, the site would not even have worked, since
the query that loads/updates the citation dictionary would have been
killed by the MySQL server for going over the limit.
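To make the size argument concrete, here is a minimal Python sketch
of serialising such a list of record ids into a compressed blob. This
is not Invenio's actual index code, just an illustration of the
serialise-and-compress idea:

    import marshal
    import zlib

    # a word found in ~21,758 records, as in the `of' query below
    recids = list(range(1, 21759))

    # serialise and compress, roughly as a word->hitlist blob is stored
    blob = zlib.compress(marshal.dumps(recids))
    print(len(blob))  # tens of KB at most, far below a 4M packet limit

    # reading it back: decompress and deserialise
    assert marshal.loads(zlib.decompress(blob)) == recids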
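Putting the two my.cnf points together, the relevant settings could
look like the following sketch (the file path and the 2-second
threshold are examples only; adjust them to your box):

    [mysqld]
    # log queries slower than 2 seconds to find optimisation targets
    log-slow-queries = /var/log/mysql/mysql-slow.log
    long_query_time  = 2

    # 4M is plenty for word blobs; raise it only for big blobs such as
    # citation dictionaries (e.g. 1G on INSPIRE for its ~800 MB ones)
    max_allowed_packet = 4M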
> http://ddd.uab.cat/search?sc=1&f=fulltext&p=of (21,758 records)

BTW, you can install the python-profile package and add `&profile=t'
to the URL, e.g.:

  http://ddd.uab.cat/search?sc=1&f=fulltext&p=of&profile=t

to see some timings.

Best regards
--
Tibor Simko