Hi. We have 10M records and 50M references, which translate to very big citation dictionaries (several ).
When Giovanni was at CERN, he and Tibor discovered that some queries took much longer to execute with the citation dictionaries loaded in memory. I've investigated the problem and this is what I've found so far. It seems to be a general problem that does not specifically involve Invenio, only Python and MySQLdb.

Here is a test showing the problem. We run the same query three times. It's very basic and returns 1M integers. The first run is right after starting IPython (143M of memory used), the second is with a list of 50M integers in memory (1681M of memory), and the last is with two such lists (2854M of memory). You can see that the query takes longer each time.

    In [1]: from invenio.dbquery import run_sql

    In [2]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 1.44 s, sys: 0.08 s, total: 1.52 s
    Wall time: 1.92 s

    In [3]: i = range(50000000)

    In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 11.36 s, sys: 0.07 s, total: 11.43 s
    Wall time: 11.67 s

    In [5]: j = range(50000000)

    In [6]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 21.21 s, sys: 0.06 s, total: 21.27 s
    Wall time: 21.54 s

Interestingly, if we repeat the experiment using a string instead of a list to increase the memory footprint (2051M of memory used), we don't see the same slowdown:

    In [1]: from invenio.dbquery import run_sql

    In [2]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 1.39 s, sys: 0.08 s, total: 1.47 s
    Wall time: 1.77 s

    In [3]: i = 'a' * 2000000000

    In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
    Wall time: 2.30 s

Any idea why we're seeing this and how we can fix it? It is quite a big problem for us because our citation dictionaries are so big.

Cheers.
--
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/
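
P.S. In case it helps anyone try this without a database: below is a minimal sketch, assuming (and this is only an assumption, not something I have confirmed) that the extra time is spent on the Python side while the million result rows are created, rather than in MySQL itself. The names `build_rows` and `timed_build` are made up for this example and are not part of Invenio or MySQLdb. The third timing, with the garbage collector switched off, is only there to check whether the collector plays any role at all.

```python
import gc
import time


def build_rows(n):
    """Build n one-element tuples, roughly the shape of a one-column result set."""
    return [(k,) for k in range(n)]


def timed_build(n, disable_gc=False):
    """Return the seconds taken to build n tuples, optionally with the cyclic GC off."""
    was_enabled = gc.isenabled()
    if disable_gc:
        gc.disable()
    try:
        start = time.time()
        rows = build_rows(n)
        elapsed = time.time() - start
    finally:
        # Restore the collector only if we were the ones who turned it off.
        if disable_gc and was_enabled:
            gc.enable()
    assert len(rows) == n
    return elapsed


if __name__ == "__main__":
    n_rows = 1000000  # same row count as the LIMIT in the query
    print("baseline:         %.2f s" % timed_build(n_rows))

    # Same ballast as in the session above: a list of 50M integers.
    # This needs well over 1 GB of RAM; reduce the size on a small machine.
    ballast = list(range(50000000))
    print("with ballast:     %.2f s" % timed_build(n_rows))
    print("ballast, GC off:  %.2f s" % timed_build(n_rows, disable_gc=True))
```

If the second timing grows with the ballast while the third does not, that would point at the collector rather than at MySQLdb; if all three are similar, the cause is somewhere else and this sketch does not explain it.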