Hi,

It will be interesting to hear what Grant Ingersoll tells you about the 2nd
order queries, but let me muse about the same query:

  author:ellis AND citedby:author:witten NOT refersto:author:witten AND
  cited:10->20 AND refersto:keyword:muon

First, let's assume we built the index with the necessary information:

  doc: 10
  cited: 3,6,80,90,89...
  citing_author: witten, frank, lagra, ngeyen, chu, thuey...
  year: 2000
  title: something something
  authors: witten,ellis

  doc: 90
  cites: 3,8,90....

The Lucene query with the same effect is then:

  ((author:ellis +citedby:witten -author:witten) +keyword:muon) --> cluster_by(len(cited))

notes:
 - citedby:author:witten -- it doesn't make sense to me that citedby could
   refer to anything other than an author
 - the 'cluster_by' is pseudocode, I don't know how to write it -- but the
   same query could be done by filtering, i.e. execute the inner query, let
   Lucene filter by the cited field, read the results, and stop when the
   number of citing_authors is <10

limitations:
 - 2nd order links must be carefully prepared (but honestly, how many of
   those 2nd order relations are really needed, and really used? this number
   is probably low...)
 - the index grows (but you can compare its size with the size of the
   current in-memory dictionary, which is effectively doubled and occupies
   precious RAM -- because of cited<->citing)

opportunities:
 - it is extremely easy to put any field/relation into the index (and to
   reindex, which is both easy and fast)
 - it allows us to combine the full power of the search engine (but
   inevitably, things are done differently)
 - the assumption that it will be slower than a Python in-memory dictionary
   is just that, an assumption (and should be _recognized_ as such)
 - it is simply a different paradigm from an RDBMS

Thanks, Jay, for the offer of questions. It would be great if you could also
ask about these two:

1) Is it possible to use a payload for search?
[I know it can influence scoring and be useful for display, but as I
understand it, it is metadata about the given position]

Example: assume we index authors and add a payload to them:

  field:author | payload [affiliation, field_of_study, email]
  ------------------------------------------------------------
  ellis        | cern,umi  hep-theory  [email protected]
  swank        | umi       hep-ex      [email protected]

Is it possible to query this structure directly? E.g.
"author:swink~4 and author:affiliation:cern"
(I want to find all names similar to swink, schwink, sink... and I also know
the person is working at cern -- but I am not interested in a record which
was written by swink@umi and ellis@cern --> I want only swink@cern, and for
that I need the payload)

2) What would be the best strategy for having several separate indexes?
I.e. a separate index for metadata, for recently-changed-metadata, for
fulltext, for citation-pairs? Presumably, all those indexes contain only
records (so the results from them are mergeable on the recid match), but
obviously the scoring function makes sense only inside one index. If one
would like to combine results (in a meaningful way) from the several
indexes, what would be the best strategy?

thanks and cheers,

  roman

On Wed, Oct 6, 2010 at 9:14 PM, Jay Luker <[email protected]> wrote:
> On Wed, Oct 6, 2010 at 12:40 PM, Tibor Simko <[email protected]> wrote:
>>
>> To sum up, everything could be reproduced in Solr, but Solr would have
>> to have direct access to the raw ranking data (=citation map), not only
>> to ranked values (=citation counts), otherwise generation of things like
>> cite summaries (which is one of the most used feature) would be slow.
>> And we would have to port everything that operates on these raw data
>> sets to Solr/Java, which is a very time consuming project when compared
>> to alternative options such as dispatching only certain index (such as
>> full-text) to Solr/Lucene and combining results back in Invenio.
>
> OK, yes, the 2nd order stuff is tricky. Sometimes when you're just trying to
> get an answer about how to do something from these "experts" you have to
> first get past the phase where they try to convince you that you don't need
> to do what you're trying to do.
>
>
> --
> ******************************************************
> Jay Luker           Astrophysics Data System (ADS)
> [email protected]   Center for Astrophysics
> 617-495-4588        60 Garden Street MS 67
> 617-495-7356 fax    Cambridge, MA 02138
> ******************************************************
>
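
To make the filtering idea from the notes concrete, here is a minimal sketch in plain Python (not Lucene). It mirrors the Lucene query as written in this message, treating every clause as required, and stands in for cluster_by(len(cited)) with a simple range check on the number of citations. The field names follow the example index; the sample records and the function names inner_query and filter_by_cited_count are invented for illustration only.

```python
# Hypothetical denormalized records; field names follow the example index,
# the values are made up for illustration.
docs = [
    {"doc": 1, "author": ["ellis"], "citing_author": ["witten", "frank"],
     "cited": list(range(100, 112)), "keyword": ["muon"]},  # 12 citations
    {"doc": 2, "author": ["ellis", "witten"], "citing_author": ["witten"],
     "cited": list(range(200, 215)), "keyword": ["muon"]},  # witten is an author
    {"doc": 3, "author": ["ellis"], "citing_author": ["witten"],
     "cited": [300, 301], "keyword": ["muon"]},             # only 2 citations
    {"doc": 4, "author": ["ellis"], "citing_author": ["chu"],
     "cited": list(range(400, 415)), "keyword": ["muon"]},  # not cited by witten
]


def inner_query(doc):
    """Boolean part: author:ellis +citedby:witten -author:witten +keyword:muon."""
    return ("ellis" in doc["author"]
            and "witten" in doc["citing_author"]
            and "witten" not in doc["author"]
            and "muon" in doc["keyword"])


def filter_by_cited_count(records, low, high):
    """Post-filter standing in for cited:10->20 (inclusive range on citation count)."""
    return [d["doc"] for d in records
            if inner_query(d) and low <= len(d["cited"]) <= high]


print(filter_by_cited_count(docs, 10, 20))  # [1]
```

In a real index the range check would more likely be a numeric field holding the citation count, so that the engine can filter on it without reading the whole cited list; that is an assumption about the schema, not something stated in the thread.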

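
The payload question (1) comes down to keeping each author paired with that author's own affiliation. The sketch below, again in plain Python rather than Lucene, shows why two independent multi-valued fields give a false positive for "swink at cern" while a paired structure does not. The record layout, the 0.6 similarity threshold, and the helper names similar, match_flat and match_paired are assumptions made for illustration; difflib's similarity ratio is only a stand-in for Lucene's fuzzy matching.

```python
from difflib import SequenceMatcher

# One hypothetical record with two authors, following the example table:
# ellis is affiliated with cern and umi, swank only with umi.
record = {
    "recid": 1,
    "authors": [("ellis", {"cern", "umi"}), ("swank", {"umi"})],
}


def similar(a, b, threshold=0.6):
    """Crude stand-in for a fuzzy name match (threshold is an assumption)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def match_flat(rec, name, affiliation):
    """Names and affiliations treated as two independent multi-valued fields."""
    names = [n for n, _ in rec["authors"]]
    all_affiliations = set().union(*(affs for _, affs in rec["authors"]))
    return any(similar(name, n) for n in names) and affiliation in all_affiliations


def match_paired(rec, name, affiliation):
    """The affiliation must belong to the same author whose name matched."""
    return any(similar(name, n) and affiliation in affs
               for n, affs in rec["authors"])


print(match_flat(record, "swink", "cern"))    # True: swank matches, cern comes from ellis
print(match_paired(record, "swink", "cern"))  # False: swank is not at cern
```

Whether payloads, a combined author-plus-affiliation token, or one sub-document per author is the right way to get the paired behaviour in Lucene is exactly the open question above; the sketch only illustrates the matching semantics being asked for.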