On Thu, 07 Oct 2010, Roman Chyla wrote:
> But what I, perhaps naively, proposed (just thinking aloud...) is
> trading space for speed - so the 10M pairs index actually holds the
> pairs (report_no-->X, citing_author-->X...) ?

No, 10M citer-citee pairs would only hold the record ID references in a
dictionary, as described for example at
<http://invenio-software.org/ticket/21>.

> Do you have some reasons to believe that the pairs are more storage
> effective, than the points in the index?

A web app node does not have to contact the DB node in order to walk
over the citation map to provide a cite summary, because the full
citation map is readily available in its memory.  Good for load
distribution, hence speed and scalability.

> if I understood the problem correctly, the 10M pairs would need 20M
> pairs to allow for fast both-directional lookup, would they?

10M citer-citee pairs are currently stored in two dictionaries indeed,
one for storing citer->citees direction, one for storing citee->citers
direction.  This is so that we could reply ultra-fast to both refersto
and citedby queries.  Solr-wise, storing 10M pairs should be enough, I
guess.
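
To make the two-dictionary layout concrete, here is a minimal sketch,
assuming plain Python dicts of sets keyed by record ID.  The function
and variable names are made up for illustration, and this is not the
actual Invenio code:

```python
from collections import defaultdict


def build_citation_maps(pairs):
    """Build forward and reverse maps from (citer, citee) record ID pairs."""
    citer_to_citees = defaultdict(set)  # record ID -> IDs of records it cites
    citee_to_citers = defaultdict(set)  # record ID -> IDs of records citing it
    for citer, citee in pairs:
        citer_to_citees[citer].add(citee)
        citee_to_citers[citee].add(citer)
    return citer_to_citees, citee_to_citers


# Hypothetical record IDs, for illustration only.
pairs = [(1, 2), (1, 3), (4, 2)]
cites, cited_by = build_citation_maps(pairs)
```

Each pair is stored twice, once per direction, so a lookup in either
direction is a single dictionary access rather than a scan over all
pairs.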

> searching is obviously more important for the user, but lack of data
> (due to indexing speed issues) may be as serious as the search speed,
> thus I think we inevitably deal with a cycle

It may be, but if you pre-index everything, even very slowly, then you
are done, and if you have only ~1000 changes per day during usual
operational conditions, then the indexing speed should not matter so
much for typical operational costs, because you only care about getting
those ~1000 daily changes.
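
As a rough illustration of why the daily changes are cheap, here is a
sketch of updating one changed record in place, assuming the citation
map is kept as two plain Python dicts of sets.  The helper name
apply_change is hypothetical and not part of Invenio:

```python
def apply_change(citer_to_citees, citee_to_citers, citer, new_citees):
    """Replace the reference list of one citer record in both maps."""
    old = citer_to_citees.get(citer, set())
    new = set(new_citees)
    # Drop reverse entries for references that were removed.
    for citee in old - new:
        citee_to_citers[citee].discard(citer)
        if not citee_to_citers[citee]:
            del citee_to_citers[citee]
    # Add reverse entries for references that are new.
    for citee in new - old:
        citee_to_citers.setdefault(citee, set()).add(citer)
    # Store the new forward entry, or remove it if the list is now empty.
    if new:
        citer_to_citees[citer] = new
    else:
        citer_to_citees.pop(citer, None)


# Hypothetical record IDs: record 1 used to cite 2 and 3, now cites 3 and 4.
forward = {1: {2, 3}}
reverse = {2: {1}, 3: {1}}
apply_change(forward, reverse, 1, [3, 4])
```

The work per change is proportional to the size of that one record's
reference list, not to the size of the whole map.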

Typically, we are in a situation where old records are rarely updated,
so the number of SELECTs is much larger than the number of
INSERTs/UPDATEs, so to speak, borrowing SQL parlance.  Which is why
Invenio cares a lot about SELECT speed, and a lot less about
INSERT/UPDATE speed, in various parts of the codebase.

Best regards
--
Tibor Simko
