On Thu, 07 Oct 2010, Roman Chyla wrote:
> But what I, perhaps naively, proposed (just thinking aloud...) is
> trading space for speed - so the 10M pairs index actually holds the
> pairs (report_no-->X, citing_author-->X...) ?

No, 10M citer-citee pairs would only hold the record ID references in a
dictionary, as described for example at
<http://invenio-software.org/ticket/21>.

> Do you have some reasons to believe that the pairs are more storage
> effective, than the points in the index?

A web app node does not have to contact the DB node in order to walk
over the citation map to provide a cite summary, because the full
citation map is readily available in its memory.  Good for load
distribution, hence speed and scalability.

> if I understood the problem correctly, the 10M pairs would need 20M
> pairs to allow for fast both-directional lookup, would they?

Indeed, the 10M citer-citee pairs are currently stored in two
dictionaries, one for the citer->citees direction and one for the
citee->citers direction.  This is so that we can reply ultra-fast to
both refersto and citedby queries.  Solr-wise, storing 10M pairs should
be enough, I guess.

> searching is obviously more important for the user, but lack of data
> (due to indexing speed issues) may be as serious as the search speed,
> thus I think we inevitably deal with a cycle

It may be, but if you pre-index everything, even very slowly, then you
are done.  And if you have only ~1000 changes per day under usual
operational conditions, then the indexing speed should not matter much
for typical operational costs, because you only need to process those
~1000 daily changes.

Typically, we are in a situation where old stuff gets updated little, so
the number of SELECTs is much larger than the number of
INSERTs/UPDATEs, so to speak, borrowing SQL parlance.  Which is why
Invenio cares a lot about SELECT speed, and much less about
INSERT/UPDATE speed, in various parts of the codebase.

Best regards
--
Tibor Simko
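
For illustration, here is a minimal Python sketch of the two-dictionary
layout described above.  It only assumes that citations are available as
(citer, citee) record ID pairs; the function names and the sample record
IDs are hypothetical and are not taken from the Invenio codebase.

```python
from collections import defaultdict


def build_citation_maps(pairs):
    """Build both lookup directions from (citer, citee) record ID pairs.

    Hypothetical helper: each pair is stored twice, once per direction,
    trading memory for fast lookups both ways.
    """
    citer_to_citees = defaultdict(set)
    citee_to_citers = defaultdict(set)
    for citer, citee in pairs:
        citer_to_citees[citer].add(citee)
        citee_to_citers[citee].add(citer)
    return citer_to_citees, citee_to_citers


def citees_of(citer_to_citees, recid):
    """Record IDs that `recid` cites (empty set if it cites nothing)."""
    # .get() avoids creating an empty entry in the defaultdict on a miss.
    return citer_to_citees.get(recid, set())


def citers_of(citee_to_citers, recid):
    """Record IDs that cite `recid` (empty set if nobody cites it)."""
    return citee_to_citers.get(recid, set())


# Made-up record IDs, for illustration only.
sample_pairs = [(1, 2), (1, 3), (4, 2)]
citer_to_citees, citee_to_citers = build_citation_maps(sample_pairs)
```

Both lookups are then plain in-memory dictionary accesses, with no
round trip to the DB node.  The cost is that every pair is held twice,
which is the space-for-speed trade discussed in the thread.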
