On Fri, 25 Feb 2011, Cornelia Plott wrote:
> We have loaded and indexed about 1.8 million records into our own local
> Invenio instance. Normally the records have a large reference block
> (like below).
Do you know how many citer-citee pairs your records generate? How many
references do you have in total for these 1.8M records? Do the
references usually refer to other records that exist in your system, or
to outside records that you do not store?
> 2010-12-05 10:09:41 --> rnkWORD01F contains 503657 words from 224197 records
> 2010-12-05 10:09:41 --> rnkWORD01F is in consistent state
Note that rnkWORD01 is the `word similarity' ranking method, so this bit
is not related to citations. To speed up the word similarity ranking
calculations a little, you can simply use a bigger flushing interval
when calling bibrank, say `-f 100000' or more, depending on your
available RAM.
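For instance, assuming the word similarity method is configured under
the name `wrd' on your installation (as in the demo configuration), the
call could look like this:

$ sudo -u www-data /opt/invenio/bin/bibrank -u admin -w wrd -f 100000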
> 2011-02-22 03:03:39 --> d_report_numbers done 0 of 15000
> 2011-02-23 10:14:24 --> d_report_numbers done fully
The citation ranking method works with big citation dictionaries that
are usually held in memory. Do you have enough RAM on your box to hold
them, or did your box perhaps start to swap? Have you tuned your MySQL
DB settings, and do you have a large enough max_allowed_packet and
friends in your /etc/my.cnf?
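For example, you can check the current value with:

$ mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet'"

and, if needed, raise it in the [mysqld] section of /etc/my.cnf, e.g.
with something like `max_allowed_packet = 64M' (an illustrative value
only, not a recommendation; the right figure depends on your setup).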
Also, the system has to look up forward and reverse citations for every
(modified) paper and for every reference field in it. This can lead to
executing a large number of queries, which can take quite some time,
especially if you have many reference fields per record, as your sample
record seems to indicate (~250 or so).
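As a very rough back-of-the-envelope figure, the batch of 15000 records
seen in your log, at ~250 references each, already amounts to something
on the order of 3-4 million reference fields to resolve.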
Some optimisation is definitely possible, but in order to find the best
approach we would need to know more about the document corpus, the
memory usage while bibrank runs, the DB usage, the slow queries, your
my.cnf parameters, etc.
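To catch the slow queries, you can enable MySQL's slow query log, e.g.
by putting something like the following in the [mysqld] section of
/etc/my.cnf (illustrative values; on MySQL 5.1+ the variable is
slow_query_log, while older versions use log-slow-queries instead):

  slow_query_log = 1
  long_query_time = 1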
Moreover, it would be helpful if you could also run bibrank for, say,
~100 sample records via the Python profiler, so that we'd know where
the internal bottlenecks are. Here is an example of how to submit such
a profiled bibrank task:
$ sudo -u www-data /opt/invenio/bin/bibrank -u admin -w citation \
-a -i 1-100 --profile=t
You would then run the submitted task as usual, and the profiling
information will be found in the task output file (provided you have
the `python-profiler' package installed).
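On Debian/Ubuntu the package can typically be installed via apt, and
the task output ends up in the bibsched log directory (assuming the
default /opt/invenio prefix; the actual task id is the one shown by
bibsched when the task is submitted):

$ sudo apt-get install python-profiler
$ less /opt/invenio/var/log/bibsched_task_<taskid>.log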
For example, for the demo site I get:
Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    0.000    0.000   25.894   25.894  bibtask.py:690(_task_run)
        1    0.000    0.000   25.856   25.856  bibrank.py:128(task_run_core)
        1    0.000    0.000   25.854   25.854  bibrank_tag_based_indexer.py:482(citation)
        1    0.000    0.000   25.854   25.854  bibrank_tag_based_indexer.py:329(bibrank_engine)
        1    0.000    0.000   25.731   25.731  bibrank_tag_based_indexer.py:86(citation_exec)
        1    0.000    0.000   25.720   25.720  bibrank_citation_indexer.py:60(get_citation_weight)
        1    0.113    0.113   24.808   24.808  bibrank_citation_indexer.py:570(ref_analyzer)
    17711    0.554    0.000   23.602    0.001  dbquery.py:121(run_sql)
     2197    0.045    0.000   22.481    0.010  search_engine.py:2024(search_unit)
    17711    0.845    0.000   21.522    0.001  cursors.py:129(execute)
     1391    0.122    0.000   20.522    0.015  search_engine.py:2068(search_unit_in_bibwords)
    17711    0.167    0.000   19.753    0.001  cursors.py:311(_query)
    17711    0.278    0.000   18.639    0.001  cursors.py:273(_do_query)
    17711   16.792    0.001   16.792    0.001  {method 'query' of '_mysql.connection' objects}
     2789    0.025    0.000   16.080    0.006  data_cacher.py:71(recreate_cache_if_needed)
     2782    0.028    0.000   16.055    0.006  search_engine.py:352(get_index_stemming_language)
     2795    0.131    0.000   16.031    0.006  dbquery.py:255(get_table_update_time)
     2782    0.024    0.000   16.003    0.006  search_engine.py:342(timestamp_verifier)
     6345    0.145    0.000    4.919    0.001  search_engine.py:570(get_index_id_from_field)
This shows that on my laptop 16 out of 26 seconds were spent in
timestamp verification... not very efficient. The good news is that we
already know about the inefficiency of the timestamp verifiers; we have
wanted to change them for a year or so, but have not got around to it
yet. So maybe we can take this opportunity and finally DTRT, which
would also help with your problem.
(However, you may get somewhat different profiles, depending on your
concrete data set, and maybe in your case the bottleneck is
elsewhere...)
Best regards
--
Tibor Simko