On Thu, 03 Mar 2011, Ferran Jorba wrote:
> But now, with full text index activated, updating records with
> documents that sum up several thousand pages of PDF, it may take hours
> for bibindex to complete its tasks.

The indexing speed is a slightly different issue.  The indexing process
already does several things in a non-optimised way on the Python side,
so optimising that part alone would gain quite a lot.  I have been
meaning to ticketise this for a long time, so maybe I'll take this
opportunity and do it one of these days.

Invenio indexes were designed in such a way that the response to the
users should be ultra fast, at the price of the indexing speed, which is
allowed to be ultra slow.  The theory was that documents are read more
often than they are updated, so once all the documents are pre-indexed,
a slow indexing speed for the usual daily load is still perfectly
acceptable.  But even here there are things to be optimised at a higher
level.  For example, full-text indexing gets triggered after every minor
metadata change, which is not necessary and which may contribute to a
large extent to the indexing slowness if you often change metadata
without changing the attached full-text files.
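
Concretely, one possible workaround (just a sketch; please check
`bibindex --help' on your installation, as the exact options may differ
between Invenio versions) is to update the metadata word indexes
frequently but run the full-text index as a separate, less frequent
bibsched task:

  $ bibindex -w global,title,author    # frequent runs, metadata indexes only
  $ bibindex -w fulltext -s 1d         # full-text indexing, say, once a day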

IOW, boosting my.cnf may help for some use cases, but for your concrete
use case, some higher-level optimisations may rather be called for.

> Sure?  1-2 GB is not 8 GB.

Is this a dedicated MySQL box, or a box where Apache and MySQL co-exist?
(It helps a lot to separate them.)  Note also that the OS filesystem
cache nicely helps MySQL: even if MySQL does not appear to have
everything in its own memory buffers, it can usually access table
indexes pretty fast via the OS filesystem cache.  (And it helps if it
does not have to compete against Apache for that same cache.)

BTW, have you enabled log-slow-queries in your my.cnf?  If yes, then
this log file will show you if you have some SQL queries that take an
unusually long time.  This would be a good basis for MySQL
optimisations.  E.g. is the load on your DB server higher than the load
on your web application server?  Etc.
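
For reference, a minimal my.cnf snippet along these lines (option names
as in MySQL 5.0/5.1; newer versions use slow_query_log and
slow_query_log_file instead):

  [mysqld]
  log-slow-queries = /var/log/mysql/mysql-slow.log
  long_query_time  = 2
  log-queries-not-using-indexes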

> As I understood well Samuele last summer, when he was explaining me
> how full text works, with the list of records being compressed and
> uncompressed (serialised and deserialised) continuously, a word that is
> found in several thousand documents has all their record ids in a
> blob, right?

Yes, but such a blob is still very small in size, so max_allowed_packet
does not need to be large.  In the Invenio INSTALL file, we still only
recommend a value larger than 4M.  Increasing max_allowed_packet beyond
that is usually needed only for big blobs such as big citation
dictionaries.  E.g. for INSPIRE, the size of the dictionaries is about
800 MB, so we set max_allowed_packet to 1G.  If it were lower, the site
would not even have worked, since the query that loads/updates the
citation dictionary would have been killed by the MySQL server for going
over the limit.
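
To give a rough idea of the sizes involved, here is a small illustration
(not Invenio's actual code, just an assumed zlib+marshal-style
serialisation) showing that even a hit list of ~22,000 record IDs fits
in a blob far below a 4M packet:

  # illustration only, not Invenio's actual implementation
  import zlib, marshal
  recids = range(1, 21759)          # hypothetical hit list of ~22k record ids
  blob = zlib.compress(marshal.dumps(recids))
  print len(blob)                   # well under 1 MB, so 4M is plenty

The 1G value is really needed only for the citation-dictionary case
mentioned above, i.e. a line like `max_allowed_packet = 1G' in the
[mysqld] section of my.cnf.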

>  http://ddd.uab.cat/search?sc=1&f=fulltext&p=of (21,758 records)

BTW, you can install the python-profile package and add `&profile=t' to
the URL to see some timings.
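
E.g. for the search above, that would be:

  http://ddd.uab.cat/search?sc=1&f=fulltext&p=of&profile=t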

Best regards
-- 
Tibor Simko
