On Thu, Oct 7, 2010 at 4:12 PM, Tibor Simko <[email protected]> wrote:
> On Thu, 07 Oct 2010, Roman Chyla wrote:
>> But what I, perhaps naively, proposed (just thinking aloud...) is
>> trading space for speed - so the 10M pairs index actually holds the
>> pairs (report_no-->X, citing_author-->X...) ?
>
> No, 10M citer-citee pairs would only hold the record ID references in a
> dictionary, as described for example at
> <http://invenio-software.org/ticket/21>.

Thank you. So I guess there is something that interprets those relations.

>
>> Do you have some reasons to believe that the pairs are more storage
>> effective, than the points in the index?
>
> A web app node does not have to contact the DB node in order to walk
> over the citation map to provide a cite summary, because the full
> citation map is readily available in its memory. Good for load
> distribution, hence speed and scalability.

Strictly speaking, we were discussing storage, so it still does not seem
to be so much more bloated.

>
>> if I understood the problem correctly, the 10M pairs would need 20M
>> pairs to allow for fast both-directional lookup, would they?
>
> 10M citer-citee pairs are currently stored in two dictionaries indeed,
> one for storing citer->citees direction, one for storing citee->citers
> direction. This is so that we could reply ultra-fast to both refersto
> and citedby queries. Solr-wise, storing 10M pairs should be enough, I
> guess.

Yep.

>
>> searching is obviously more important for the user, but lack of data
>> (due to indexing speed issues) may be as serious as the search speed,
>> thus I think we inevitably deal with a cycle
>
> It may be, but if you pre-index everything, even very slowly, then you
> are done, and if you have only ~1000 changes per day during usual
> operational conditions, then the indexing speed should not matter so
> much for typical operational costs, because you only care about getting
> those ~1000 daily changes.

But it can happen, and if something can go wrong, ...

>
> Typically, we are in a situation where old stuff gets updated little, so
> number of SELECTs is much larger than number of INSERTs/UPDATEs, so to
> speak, borrowing SQL parlance. Which is why Invenio cares a lot about
> SELECT speed, and a lot less about INSERT/UPDATE speed, in various parts
> of the codebase.
>
> Best regards
> --
> Tibor Simko
>
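
To check that I understand the two-dictionary layout described above, here is a minimal sketch of how I picture it. This is not the actual Invenio code: the class name `CitationMap` and its methods are hypothetical, and the record IDs are made up. Each citer-citee pair is written into both dictionaries, so a lookup in either direction is a single dictionary access.

```python
from collections import defaultdict


class CitationMap:
    """Hypothetical in-memory citation map holding only record IDs."""

    def __init__(self):
        # citer record ID -> set of record IDs it cites
        self._citees = defaultdict(set)
        # citee record ID -> set of record IDs citing it
        self._citers = defaultdict(set)

    @classmethod
    def from_pairs(cls, pairs):
        """Build the map from an iterable of (citer, citee) ID pairs."""
        cmap = cls()
        for citer, citee in pairs:
            cmap.add(citer, citee)
        return cmap

    def add(self, citer, citee):
        """Store one pair in both directions."""
        self._citees[citer].add(citee)
        self._citers[citee].add(citer)

    def citees(self, recid):
        """Records cited by recid (empty set if none)."""
        return set(self._citees.get(recid, ()))

    def citers(self, recid):
        """Records that cite recid (empty set if none)."""
        return set(self._citers.get(recid, ()))

    def pair_count(self):
        """Number of distinct citer-citee pairs stored."""
        return sum(len(citees) for citees in self._citees.values())


# Made-up record IDs, purely for illustration.
cmap = CitationMap.from_pairs([(1, 2), (1, 3), (4, 2)])
print(sorted(cmap.citees(1)))
print(sorted(cmap.citers(2)))
print(cmap.pair_count())
```

Every pair appears once in each dictionary, which is where my "20M for 10M pairs" figure came from. With ~1000 changes per day, the cost of the double write in `add` should be negligible next to the read traffic.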
