Re: Lucene indexing questions

Roman Chyla Thu, 7 Oct 2010 16:59:20 +0200

On Thu, Oct 7, 2010 at 4:12 PM, Tibor Simko <[email protected]> wrote:
> On Thu, 07 Oct 2010, Roman Chyla wrote:
>> But what I, perhaps naively, proposed (just thinking aloud...) is
>> trading space for speed - so the 10M pairs index actually holds the
>> pairs (report_no-->X, citing_author-->X...) ?
>
> No, 10M citer-citee pairs would only hold the record ID references in a
> dictionary, as described for example at
> <http://invenio-software.org/ticket/21>.


thank you, so I guess there is something that interprets those relations

>
>> Do you have some reason to believe that the pairs are more
>> storage-efficient than the points in the index?
>
> A web app node does not have to contact the DB node in order to walk
> over the citation map to provide a cite summary, because the full
> citation map is readily available in its memory.  Good for load
> distribution, hence speed and scalability.

strictly speaking, we were discussing storage, and on that point the
index still does not seem to be so much more bloated
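
To make the "citation map in memory" point concrete, here is a minimal
sketch of walking such a map to produce a summary without touching the
database. The function name, the summary fields, and the sample data are
assumptions made for illustration; they are not taken from Invenio.

```python
def cite_summary(record_ids, citee_to_citers):
    """Summarise citation counts for record_ids using an in-memory map.

    citee_to_citers maps a record ID to the set of record IDs citing it
    (hypothetical layout, assumed for this sketch).
    """
    # Records missing from the map simply have zero citations.
    counts = [len(citee_to_citers.get(recid, ())) for recid in record_ids]
    total = sum(counts)
    average = total / len(counts) if counts else 0.0
    return {
        "records": len(counts),
        "total_citations": total,
        "average_citations": average,
    }


# Toy data: record 2 is cited by records 1 and 4, record 3 by record 1.
citee_to_citers = {2: {1, 4}, 3: {1}}
summary = cite_summary([2, 3, 5], citee_to_citers)
```

Every lookup here is a plain dictionary access in the web node's own
process, which is the load-distribution argument made above.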

>
>> if I understood the problem correctly, the 10M pairs would need 20M
>> pairs to allow for fast both-directional lookup, would they?
>
> 10M citer-citee pairs are currently stored in two dictionaries indeed,
> one for storing citer->citees direction, one for storing citee->citers
> direction.  This is so that we could reply ultra-fast to both refersto
> and citedby queries.  Solr-wise, storing 10M pairs should be enough, I
> guess.

yep
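
For reference, the two-dictionary layout described above could be
sketched roughly like this. The names and the sample pairs are
hypothetical; this only illustrates the idea of keeping one dictionary
per lookup direction, not the actual Invenio code.

```python
from collections import defaultdict


def build_citation_maps(pairs):
    """Build both lookup directions from (citer, citee) record ID pairs."""
    citer_to_citees = defaultdict(set)  # which records does X cite?
    citee_to_citers = defaultdict(set)  # which records cite X?
    for citer, citee in pairs:
        citer_to_citees[citer].add(citee)
        citee_to_citers[citee].add(citer)
    return citer_to_citees, citee_to_citers


# Three citer-citee pairs: record 1 cites 2 and 3, record 4 cites 2.
pairs = [(1, 2), (1, 3), (4, 2)]
citer_to_citees, citee_to_citers = build_citation_maps(pairs)

cited_by_record_1 = citer_to_citees[1]  # records that record 1 cites
citing_record_2 = citee_to_citers[2]    # records that cite record 2
```

Each pair is stored twice, once per direction, so N pairs cost roughly
2N set entries in exchange for a constant-time lookup either way.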

>
>> searching is obviously more important for the user, but lack of data
>> (due to indexing speed issues) may be as serious as the search speed,
>> thus I think we inevitably deal with a cycle
>
> It may be, but if you pre-index everything, even very slowly, then you
> are done, and if you have only ~1000 changes per day during usual
> operational conditions, then the indexing speed should not matter so
> much for typical operational costs, because you only care about getting
> those ~1000 daily changes.

but it can happen, and if something can go wrong,...

>
> Typically, we are in a situation where old stuff gets updated little, so
> the number of SELECTs is much larger than the number of INSERTs/UPDATEs,
> so to speak, borrowing SQL parlance.  Which is why Invenio cares a lot
> about SELECT speed, and a lot less about INSERT/UPDATE speed, in various
> parts of the codebase.
>
> Best regards
> --
> Tibor Simko
>
