Hi Jay:

On Wed, 06 Oct 2010, Jay Luker wrote:
> I tried to explain how we would need to rerank results according to
> metadata in Invenio. He thought that was odd and wondered why that
> additional ranking data couldn't also be indexed.

In principle, the additional ranking data could be ported to Solr as
well, of course. But that would mean re-implementing all the search
goodies that we have in Invenio so that Solr would do them natively.
Otherwise some Invenio-Solr combination would still be necessary.

And, in this regard, the second-order search capabilities come into
the picture as troublesome. The `ranking data' to be ported to Solr
would not necessarily mean only `recID---score' pairs. Solr would have
to have access to the full raw ranking data in order to process
second-order operators such as citation summaries over the citation
map. So, doing this in Solr would essentially mean starting to port all
the second-order algorithms to Solr. This is doable, but it is work,
it takes time, and it requires a committed decision. It is not unlike
re-building `everything' search related inside Solr, in Java, so to
speak.

You can think of queries like:

    author:ellis AND citedby:author:witten NOT refersto:author:witten
    AND cited:10->20 AND refersto:keyword:muon

that would find all papers authored by Ellis that are cited by Witten
but that do not cite any of Witten's papers themselves and that are
cited more than 10 and less than 20 times by other papers and that
refer to some other papers that were tagged with the keyword muon.

Or there are co-cited-with stats on pages like:

    <http://inspirebeta.net/record/201469/citations>

Or there are various author summaries on author pages like:

    <http://inspirebeta.net/author/Dixon%2C%20Lance%20J.>

Or people-who-have-read-have-also-read similarity recommendations. Or
virtual collections based on some dynamic metadata query.

To sum up, everything could be reproduced in Solr, but Solr would have
to have direct access to the raw ranking data (=citation map), not
only to ranked values (=citation counts), otherwise generation of
things like cite summaries (which is one of the most used features)
would be slow.
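
To make the raw-data-versus-ranked-values point concrete, here is a
minimal Python sketch. Everything in it is an assumption made up for
illustration: the toy citation map, the hit sets and the function
names are hypothetical and are not Invenio's actual data structures
or API. It only shows that citation counts can be derived from the
citation map, while operators such as citedby: and refersto: need
the map itself.

```python
# Toy citation map, invented for illustration (not real Invenio data).
# cites[recID] = set of recIDs that this record refers to.
cites = {
    1: {2, 3},
    2: {3},
    3: set(),
    4: {1, 3},
}

# Hypothetical hit sets of two first-order queries.
ellis = {1, 2}    # e.g. author:ellis
witten = {4}      # e.g. author:witten


def cited_by(hitset):
    """Records cited by at least one record in hitset (like citedby:)."""
    result = set()
    for recid in hitset:
        result |= cites.get(recid, set())
    return result


def refers_to(hitset):
    """Records citing at least one record in hitset (like refersto:)."""
    return {recid for recid, refs in cites.items() if refs & hitset}


def citation_counts():
    """Ranked values (counts per record), derived from the raw map."""
    counts = {recid: 0 for recid in cites}
    for refs in cites.values():
        for target in refs:
            counts[target] = counts.get(target, 0) + 1
    return counts


# author:ellis AND citedby:author:witten NOT refersto:author:witten
hits = (ellis & cited_by(witten)) - refers_to(witten)
```

The counts alone would be enough to sort results or to apply a filter
like cited:10->20, but the last line cannot be computed from them: it
needs to know who cites whom.
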
We would also have to port everything that operates on these raw data
sets to Solr/Java, which is a very time-consuming project compared to
alternative options such as dispatching only certain indexes (such as
full-text) to Solr/Lucene and combining the results back in Invenio.

(just some quick top-of-the-head musings)

Best regards
--
Tibor Simko
