On Fri, 25 Feb 2011, Cornelia Plott wrote:
> We have loaded and indexed about 1.8 million records into our own
> local Invenio instance.  Normally the records have a large reference
> block (like below).

Do you know how many citer-citee pairs your records generate?  How
many references do you have in total for these 1.8M records?  Do the
references usually refer to other records that exist in your system, or
do they refer to outside records that you do not store?

> 2010-12-05 10:09:41 --> rnkWORD01F contains 503657 words from 224197 records
> 2010-12-05 10:09:41 --> rnkWORD01F is in consistent state

Note that rnkWORD01 is the `word similarity' ranking method, so this
bit is not related to citations.  To optimise the word similarity
ranking calculations a bit, you can simply use a bigger flushing
interval when calling bibrank, say `-f 100000' or more, depending on
your available RAM.
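For example, on our demo site the word similarity method is registered
as `wrd', so a full re-run with a bigger flushing interval would look
something like this (adapt the method name and the record ID range to
your installation; both are assumptions here):

  $ sudo -u www-data /opt/invenio/bin/bibrank -u admin -w wrd \
          -a -i 1-1800000 -f 100000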

> 2011-02-22 03:03:39 --> d_report_numbers done 0 of 15000
> 2011-02-23 10:14:24 --> d_report_numbers done fully

The citation ranking method works with big citation dictionaries that
are usually held in memory.  Do you have enough RAM on your box to hold
them, or did your box perhaps start to swap?  Have you tuned your MySQL
DB settings, and do you have a large enough max_allowed_packet and
friends in your /etc/my.cnf?
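For example, you can inspect the current value from the MySQL client
(a standard MySQL statement) and, if needed, raise it in /etc/my.cnf;
the 64M below is only an illustrative value, not a recommendation tuned
to your corpus:

  mysql> SHOW VARIABLES LIKE 'max_allowed_packet';

  [mysqld]
  max_allowed_packet = 64M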

Also, the system has to look up forward and reverse citations for every
(modified) record and for every reference field in it.  This can lead
to executing a large number of queries, which can take quite some time,
especially if you have many reference fields per record, as your sample
record seems to indicate (~250).
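Just as a back-of-the-envelope sketch (plain Python; the numbers for
records and references come from your logs and sample record, while the
per-reference query count is my assumption):

  # rough estimate of the number of look-up queries involved
  num_records = 15000        # from your d_report_numbers run
  refs_per_record = 250      # from your sample record
  queries_per_ref = 2        # say, forward + reverse look-up (assumed)
  print num_records * refs_per_record * queries_per_ref   # ~7.5 million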

Some optimisation is definitely possible, but to find the best
approach, we would need to know more about the document corpus, the
memory usage while bibrank runs, the DB usage, the slow queries, the
my.cnf parameters, etc.
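To spot the slow queries, you can enable the MySQL slow query log,
e.g. via the standard MySQL 5.x options in /etc/my.cnf (the two-second
threshold is just an example value):

  [mysqld]
  log-slow-queries = /var/log/mysql/mysql-slow.log
  long_query_time  = 2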

Moreover, it would be helpful if you could also run bibrank for, say,
~100 sample records via the Python profiler, so that we'd know where
the internal bottlenecks are.  Here is an example of how to submit such
a profiled bibrank task:

  $ sudo -u www-data /opt/invenio/bin/bibrank -u admin -w citation \
          -a -i 1-100 --profile=t

You would then run the submitted task as usual, and the profiling
information will appear in the task output file (provided you have the
`python-profiler' package installed).
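If I recall the default installation layout correctly, the task output
file lives under /opt/invenio/var/log/ and is named after the task ID,
e.g.:

  $ less /opt/invenio/var/log/bibsched_task_<taskid>.log

where <taskid> is the number printed when the task is submitted.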

For example, for the demo site I get:

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   25.894   25.894 bibtask.py:690(_task_run)
        1    0.000    0.000   25.856   25.856 bibrank.py:128(task_run_core)
        1    0.000    0.000   25.854   25.854 bibrank_tag_based_indexer.py:482(citation)
        1    0.000    0.000   25.854   25.854 bibrank_tag_based_indexer.py:329(bibrank_engine)
        1    0.000    0.000   25.731   25.731 bibrank_tag_based_indexer.py:86(citation_exec)
        1    0.000    0.000   25.720   25.720 bibrank_citation_indexer.py:60(get_citation_weight)
        1    0.113    0.113   24.808   24.808 bibrank_citation_indexer.py:570(ref_analyzer)
    17711    0.554    0.000   23.602    0.001 dbquery.py:121(run_sql)
     2197    0.045    0.000   22.481    0.010 search_engine.py:2024(search_unit)
    17711    0.845    0.000   21.522    0.001 cursors.py:129(execute)
     1391    0.122    0.000   20.522    0.015 search_engine.py:2068(search_unit_in_bibwords)
    17711    0.167    0.000   19.753    0.001 cursors.py:311(_query)
    17711    0.278    0.000   18.639    0.001 cursors.py:273(_do_query)
    17711   16.792    0.001   16.792    0.001 {method 'query' of '_mysql.connection' objects}
     2789    0.025    0.000   16.080    0.006 data_cacher.py:71(recreate_cache_if_needed)
     2782    0.028    0.000   16.055    0.006 search_engine.py:352(get_index_stemming_language)
     2795    0.131    0.000   16.031    0.006 dbquery.py:255(get_table_update_time)
     2782    0.024    0.000   16.003    0.006 search_engine.py:342(timestamp_verifier)
     6345    0.145    0.000    4.919    0.001 search_engine.py:570(get_index_id_from_field)

This shows that on my laptop 16 out of 26 seconds were spent in
timestamp verification... not very efficient.  The good news is that we
already know about the inefficiency of the timestamp verifiers; we have
wanted to change them for a year or so, but have not got around to it
yet.  So maybe we can take this opportunity and finally DTRT, which
would also help your problem.
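Just to illustrate the idea (a sketch only, not necessarily what we
will end up implementing): the table update times could be memoized for
the duration of a task, so that get_table_update_time() does not hit
MySQL thousands of times per run:

  # sketch: cache table update times for the lifetime of the process
  _update_time_cache = {}

  def get_table_update_time_cached(tablename):
      """Like dbquery.get_table_update_time(), but cached per process."""
      if tablename not in _update_time_cache:
          from invenio.dbquery import get_table_update_time
          _update_time_cache[tablename] = get_table_update_time(tablename)
      return _update_time_cache[tablename]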

(However, you may get somewhat different profiles, depending on your
concrete data set, and maybe in your case the bottleneck is
elsewhere...)

Best regards
-- 
Tibor Simko
