Hi Tibor,

Thanks for your answer and your hints.

We have loaded and indexed about 1.8 million records into our only local
open Invenio instance.  Normally the records have a large reference
block (like below).
> Do you know how many citer-citee pairs your records generate?  How
> many references do you have in total for these 1.8M records?  Do
> references usually refer to other existing records in your system, or do
> they refer to outside records that you do not store?

In total we have 41.6 million references for these 1.8 million records.
Yes, there are references that point to outside records that we do not
store. We don't know how many citer-citee pairs our records will generate.
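
As a rough sanity check on these two totals, the average number of references per record can be worked out directly. This is only a sketch of the arithmetic; the variable names are illustrative and it says nothing about how many references resolve to records inside the system.

```python
# Totals quoted above; the variable names are illustrative only.
total_records = 1.8e6       # about 1.8 million records
total_references = 41.6e6   # about 41.6 million references

# Average size of the reference block per record.
refs_per_record = total_references / total_records
print(f"average references per record: {refs_per_record:.1f}")
```

The number of citer-citee pairs depends on how many of these references match a stored record, which the totals alone do not reveal.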

2011-02-22 03:03:39 -->  d_report_numbers done 0 of 15000
2011-02-23 10:14:24 -->  d_report_numbers done fully
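
For reference, the elapsed time between these two log lines can be computed as follows. This is a small sketch that assumes the two timestamps bracket the d_report_numbers step and that both are in the same time zone.

```python
from datetime import datetime

# Timestamp layout used in the two log lines above.
FMT = "%Y-%m-%d %H:%M:%S"

started = datetime.strptime("2011-02-22 03:03:39", FMT)
finished = datetime.strptime("2011-02-23 10:14:24", FMT)

# Wall-clock duration of the step, assuming nothing else ran in between.
elapsed = finished - started
elapsed_hours = elapsed.total_seconds() / 3600
print(f"elapsed: {elapsed} (about {elapsed_hours:.1f} hours)")
```
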
> Citation ranking method works with big citation dictionaries that are
> usually held in memory.  Do you have enough RAM on your box to hold
> them, or did your box start to swap perhaps?  Have you tuned your MySQL
> DB settings and do you have large enough max_allowed_packet and friends
> in your /etc/my.cnf?

This Invenio instance does not run on a virtual machine and really has
16 GB of RAM.

MemTotal:     16627700 kB
MemFree:       5153924 kB
Buffers:        327200 kB
Cached:        9792016 kB
SwapCached:          0 kB
Active:        2613668 kB
Inactive:      8401064 kB
HighTotal:    15854912 kB
HighFree:      5144616 kB
LowTotal:       772788 kB
LowFree:          9308 kB
SwapTotal:     5144568 kB
SwapFree:      5144476 kB
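
To make the swap question easier to answer from these figures, here is a minimal sketch that parses a subset of the snapshot above and derives the swap in use. It assumes the snapshot is representative of the machine during the bibrank run, which may not be the case, and it treats buffers and page cache as reclaimable, which is a common approximation only.

```python
# Subset of the /proc/meminfo snapshot pasted above (values in kB).
MEMINFO = """\
MemTotal:     16627700 kB
MemFree:       5153924 kB
Buffers:        327200 kB
Cached:        9792016 kB
SwapTotal:     5144568 kB
SwapFree:      5144476 kB
"""

def parse_meminfo(text):
    """Return a dict mapping each field name to its value in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return values

info = parse_meminfo(MEMINFO)

# Swap currently in use.
swap_used_kb = info["SwapTotal"] - info["SwapFree"]

# Free memory plus buffers and page cache (approximation of what is available).
reclaimable_kb = info["MemFree"] + info["Buffers"] + info["Cached"]

print(f"swap used: {swap_used_kb} kB")
print(f"free + buffers + cached: {reclaimable_kb / 1024 / 1024:.1f} GB")
```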

We have also tuned our MySQL DB settings like this:
[mysqld]
...
#key_buffer = 384M
key_buffer = 2G
#key_buffer_size = 2M
key_buffer_size = 512M
max_allowed_packet = 16M
table_cache = 512
#sort_buffer_size = 2M
sort_buffer_size = 16M
#read_buffer_size = 2M
read_buffer_size = 64M
#read_rnd_buffer_size = 8M
read_rnd_buffer_size = 128M
#myisam_sort_buffer_size = 64M
myisam_sort_buffer_size = 256M
thread_cache_size = 8
query_cache_size = 32M
...
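
To put these sizes into perspective, the sketch below converts some of the values above to bytes and adds up the buffers that MySQL can allocate per connection (sort_buffer_size, read_buffer_size, read_rnd_buffer_size). The helper to_bytes and the settings dictionary are illustrative, not part of MySQL or Invenio, and the sum is only a rough figure because these buffers are allocated on demand.

```python
# Size suffixes accepted in my.cnf values.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def to_bytes(value):
    """Convert a my.cnf size such as '16M' to a number of bytes."""
    value = value.strip().upper()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)

# Active (uncommented) values copied from the configuration above.
settings = {
    "sort_buffer_size": "16M",
    "read_buffer_size": "64M",
    "read_rnd_buffer_size": "128M",
    "max_allowed_packet": "16M",
}

# Buffers that MySQL can allocate separately for each connection.
per_session = ("sort_buffer_size", "read_buffer_size", "read_rnd_buffer_size")
per_session_bytes = sum(to_bytes(settings[name]) for name in per_session)

print(f"per-connection buffers: {per_session_bytes // 1024 ** 2} MB")
print(f"max_allowed_packet: {to_bytes(settings['max_allowed_packet'])} bytes")
```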

We changed the settings following the recommendations in Baron
Schwartz's "High Performance MySQL: Optimization, Backups, Replication,
and More".  We did not change max_allowed_packet. What would be a good
size?


> Moreover, it would be helpful if you could also run bibrank for say ~100
> sample records via Python profiler so that we'd know where the inside
> bottlenecks are.  Here is an example of how to submit such a profiled
> bibrank task:

Here is our result from the profiled bibrank task:

./bibrank -u admin -w citation -a -i 1-100 --profile=t

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.029   12.029 bibtask.py:755(_task_run)
        1    0.000    0.000   12.025   12.025 bibrank.py:128(task_run_core)
        1    0.000    0.000   12.025   12.025 bibrank_tag_based_indexer.py:482(citation)
        1    0.043    0.043   12.025   12.025 bibrank_tag_based_indexer.py:329(bibrank_engine)
        1    0.016    0.016   11.737   11.737 bibrank_tag_based_indexer.py:86(citation_exec)
        1    0.001    0.001   11.656   11.656 bibrank_citation_indexer.py:60(get_citation_weight)
        1    0.118    0.118   11.310   11.310 bibrank_citation_indexer.py:570(ref_analyzer)
    17303    0.330    0.000    9.560    0.001 dbquery.py:121(run_sql)
     2141    0.021    0.000    9.300    0.004 search_engine.py:1988(search_unit)
    17303    0.545    0.000    8.506    0.000 cursors.py:127(execute)
     1360    0.059    0.000    8.293    0.006 search_engine.py:2032(search_unit_in_bibwords)
    17303    0.099    0.000    7.480    0.000 cursors.py:308(_query)
    17303    6.400    0.000    7.045    0.000 cursors.py:270(_do_query)
     2725    0.011    0.000    6.139    0.002 data_cacher.py:71(recreate_cache_if_needed)
     2720    0.012    0.000    6.130    0.002 search_engine.py:320(get_index_stemming_language)
     2729    0.056    0.000    6.117    0.002 dbquery.py:256(get_table_update_time)
     2720    0.011    0.000    6.108    0.002 search_engine.py:310(timestamp_verifier)
     6193    0.083    0.000    2.499    0.000 search_engine.py:536(get_index_id_from_field)
        8    0.001    0.000    1.186    0.148 bibrank_citation_indexer.py:947(insert_into_cit_db)
      892    0.005    0.000    1.044    0.001 bibrank_citation_indexer.py:47(__call__)
      781    0.015    0.000    1.039    0.001 bibrank_citation_indexer.py:54(get_recids_matching_query)
      782    0.023    0.000    1.023    0.001 search_engine.py:1726(search_pattern)
        9    0.936    0.104    0.936    0.104 dbquery.py:315(serialize_via_marshal)
      666    0.025    0.000    0.725    0.001 search_engine.py:2091(search_unit_in_idxphrases)
    17287    0.353    0.000    0.608    0.000 cursors.py:105(_do_get_result)
     2113    0.028    0.000    0.575    0.000 bibrank_citation_indexer.py:997(insert_into_missing)
    17303    0.074    0.000    0.481    0.000 cursors.py:55(__del__)
    17303    0.086    0.000    0.408    0.000 cursors.py:60(close)
        1    0.000    0.000    0.398    0.398 bibrank_citation_indexer.py:921(insert_cit_ref_list_intodb)
...
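
A few ratios derived from the profile above may help when reading it. The sketch below is plain arithmetic on the pasted numbers. It assumes that `-i 1-100` covered 100 existing records and, for the projection, that run time grows linearly with the number of records, which is a simplification and may well not hold.

```python
# Figures copied from the profiler output above.
total_seconds = 12.029         # cumtime of bibtask.py:755(_task_run)
sql_calls = 17303              # ncalls of dbquery.py:121(run_sql)
sql_cumtime = 9.560            # cumtime of dbquery.py:121(run_sql)
table_update_cumtime = 6.117   # cumtime of dbquery.py:256(get_table_update_time)

records_profiled = 100         # assumption: records 1-100 all exist
total_records = 1.8e6          # size of the full collection

# Share of the run spent inside run_sql and inside get_table_update_time.
sql_share = sql_cumtime / total_seconds
table_update_share = table_update_cumtime / total_seconds

# Average number of SQL queries issued per profiled record.
queries_per_record = sql_calls / records_profiled

# Naive linear projection to the full collection (assumption, see above).
projected_hours = total_seconds / records_profiled * total_records / 3600

print(f"time inside run_sql: {sql_share:.0%}")
print(f"time inside get_table_update_time: {table_update_share:.0%}")
print(f"SQL queries per record: {queries_per_record:.0f}")
print(f"projected run time for all records: {projected_hours:.0f} hours")
```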

Do you already have some optimisation hints, or do you need more
information about our system?

Thanks & Kind Regards
Cornelia

Cornelia Plott
Zentralbibliothek
Forschungszentrum Jülich
D-52425 Jülich
GERMANY

Tel: ++49-2461-616206
Email: [email protected]
Web: http://www.fz-juelich.de/zb
