Re: Lucene indexing questions

Tibor Simko Wed, 6 Oct 2010 18:40:09 +0200

Hi Jay:

On Wed, 06 Oct 2010, Jay Luker wrote:
> I tried to explain how we would need to rerank results according to
> metadata in Invenio. He thought that was odd and wondered why that
> additional ranking data couldn't also be indexed.


In principle, the additional ranking data could be ported to Solr as
well, of course.  But that would mean re-implementing all the search
goodies that we have in Invenio so that Solr would do them natively.
Otherwise some Invenio-Solr combination would still be necessary.

And, in this regard, the second-order search capabilities come into the
picture as troublesome.  The `ranking data' to be ported to Solr would
not necessarily mean only `recID---score' pairs.  Solr would have to
have access to the full raw ranking data in order to process
second-order operators such as citation summaries over the citation map.
So, doing this in Solr would essentially mean starting to port all the
second-order algorithms to Solr.  This is doable, but it is work, it
takes time, and it requires a committed decision.  It is not unlike
re-building `everything' search-related inside Solr, in Java, so to
speak.

You can think of queries like:

  author:ellis AND citedby:author:witten NOT refersto:author:witten
   AND cited:10->20 AND refersto:keyword:muon

that would find all papers authored by Ellis that are cited by Witten
but that do not cite any of Witten's papers themselves and that are
cited between 10 and 20 times by other papers and that refer to some
other papers that were tagged with the keyword muon.
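
To make the dependency on the raw citation map concrete, here is a
minimal Python sketch of how such a query could be evaluated with plain
set operations.  It is not Invenio's actual code: the record IDs, the
helper names (by_author, by_keyword, refersto, citedby, cited) and the
toy data are all made up for illustration, the citation range is
treated as inclusive, and it is shrunk to 1->2 so that the tiny data
set still yields a hit.

```python
from collections import Counter

# Toy data: every record ID, author and keyword below is hypothetical.
authors = {1: "ellis", 2: "witten", 3: "ellis", 4: "witten",
           5: "smith", 6: "jones", 7: "ellis"}
keywords = {5: {"muon"}}
# Raw citation map: citing recID -> set of cited recIDs.
cites = {1: {5}, 2: {1, 3}, 3: {4, 5}, 6: {1}, 7: {5}}


def by_author(name):
    """Records written by the given author."""
    return {rec for rec, author in authors.items() if author == name}


def by_keyword(word):
    """Records tagged with the given keyword."""
    return {rec for rec, words in keywords.items() if word in words}


def refersto(hits):
    """Records that cite at least one record in `hits`."""
    return {citing for citing, targets in cites.items() if targets & hits}


def citedby(hits):
    """Records cited by at least one record in `hits`."""
    result = set()
    for citing in hits:
        result |= cites.get(citing, set())
    return result


def cited(low, high):
    """Records cited between `low` and `high` times (inclusive here)."""
    counts = Counter(rec for targets in cites.values() for rec in targets)
    return {rec for rec, n in counts.items() if low <= n <= high}


witten = by_author("witten")
result = by_author("ellis") & citedby(witten)   # author:ellis AND citedby:author:witten
result -= refersto(witten)                      # NOT refersto:author:witten
result &= cited(1, 2)                           # stands in for cited:10->20
result &= refersto(by_keyword("muon"))          # AND refersto:keyword:muon
print(sorted(result))                           # prints [1]
```

Only the cited() step could be answered from precomputed citation
counts.  Both refersto() and citedby() have to walk the citation map
itself, which is why `recID---score' pairs alone would not be enough.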

Or there are co-cited-with stats on pages like:
<http://inspirebeta.net/record/201469/citations>

Or there are various author summaries on author pages like:
<http://inspirebeta.net/author/Dixon%2C%20Lance%20J.>

Or people-who-have-read-have-also-read similarity recommendations.

Or virtual collections based on some dynamic metadata query.

To sum up, everything could be reproduced in Solr, but Solr would have
to have direct access to the raw ranking data (=citation map), not only
to ranked values (=citation counts), otherwise generation of things like
cite summaries (which is one of the most used features) would be slow.
And we would have to port everything that operates on these raw data
sets to Solr/Java, which is a very time-consuming project when compared
to alternative options such as dispatching only certain indexes (such as
full-text) to Solr/Lucene and combining results back in Invenio.
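
The alternative could look roughly like the Python sketch below: the
full-text part of a query is sent to an external backend, the returned
record IDs are intersected with the hits computed on the Invenio side,
and the ranking is then applied locally.  Everything here is an
assumption made for illustration: fake_fulltext_backend is a stand-in
for a real Solr call, and search_combined and the sample numbers are
hypothetical.

```python
def fake_fulltext_backend(query):
    """Stand-in for a full-text engine; returns matching record IDs."""
    index = {"muon": [1, 3, 5, 8], "higgs": [2, 3]}
    return index.get(query, [])


def search_combined(fulltext_query, metadata_hits, citation_counts, backend):
    """Intersect full-text hits with metadata hits, then rank locally."""
    fulltext_hits = set(backend(fulltext_query))
    combined = metadata_hits & fulltext_hits
    # Most cited first; ties are broken by record ID for a stable order.
    return sorted(combined, key=lambda rec: (-citation_counts.get(rec, 0), rec))


metadata_hits = {1, 3, 7}            # e.g. the result of a metadata query
citation_counts = {1: 2, 3: 15}      # ranking data kept on the Invenio side
ranked = search_combined("muon", metadata_hits, citation_counts,
                         fake_fulltext_backend)
print(ranked)                        # prints [3, 1]
```

In this arrangement the backend only ever sees the full-text query and
returns record IDs, so the citation map and the second-order operators
can stay where they are.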

(just some quick top-of-the-head musings)

Best regards
--
Tibor Simko
