Re: Lucene indexing questions

Roman Chyla Wed, 6 Oct 2010 22:08:10 +0200

Hi,

It will be interesting to hear what Grant Ingersoll says about the 2nd
order queries, but let me muse about the same query:


 author:ellis AND citedby:author:witten NOT refersto:author:witten
  AND cited:10->20 AND refersto:keyword:muon


first, let's assume we built the index with the necessary information:

doc: 10
  cited: 3,6,80,90,89...
  citing_author: witten, frank, lagra, ngeyen, chu, thuey...
  year: 2000
  title: something something
  authors: witten,ellis

doc: 90
  cites: 3,8,90....


the lucene query with the same effect would then be:

((author:ellis +citedby:witten -author:witten) +keyword:muon) -->
cluster_by(len(cited))


notes:
 - citedby:author:witten -- it doesn't make sense to me that it could
be somebody other than another author
 - 'cluster_by' is pseudocode, i don't know how to write it in lucene -
but the same query could be done by filtering -- i.e. execute the inner
query, let lucene filter by the cited field, read the results, and stop
when the number of citing_authors is < 10
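
A minimal sketch in Python of the filtering idea from the notes above. It is not lucene code: the records are toy data, the field names follow the "doc: 10" example, and I am assuming that "cited" holds the ids of the records citing a document and that "cited:10->20" means "cited between 10 and 20 times". The helper names (inner_query, cited_between) are made up.

```python
from itertools import takewhile

# Toy records; ids, names and citation lists are made up for illustration.
DOCS = [
    {"id": 1, "authors": ["ellis"], "citing_author": ["witten", "frank"],
     "keyword": ["muon"], "cited": list(range(100, 112))},   # cited 12 times
    {"id": 2, "authors": ["witten", "ellis"], "citing_author": ["witten"],
     "keyword": ["muon"], "cited": list(range(200, 215))},   # witten is an author
    {"id": 3, "authors": ["ellis"], "citing_author": ["witten"],
     "keyword": ["muon"], "cited": [5, 6, 7]},               # cited only 3 times
    {"id": 4, "authors": ["ellis"], "citing_author": ["witten"],
     "keyword": ["muon"], "cited": list(range(300, 325))},   # cited 25 times
    {"id": 5, "authors": ["ellis"], "citing_author": ["frank"],
     "keyword": ["muon"], "cited": list(range(400, 415))},   # not cited by witten
]


def inner_query(doc):
    """Boolean part: author:ellis +citedby:witten -author:witten +keyword:muon."""
    return ("ellis" in doc["authors"]
            and "witten" in doc["citing_author"]
            and "witten" not in doc["authors"]
            and "muon" in doc["keyword"])


def cited_between(docs, low, high):
    """Run the inner query, then keep hits cited between low and high times."""
    hits = [d for d in docs if inner_query(d)]
    # Sort by citation count, highest first, so reading can stop early.
    hits.sort(key=lambda d: len(d["cited"]), reverse=True)
    # Stop at the first hit that falls below the lower bound.
    enough = takewhile(lambda d: len(d["cited"]) >= low, hits)
    return [d["id"] for d in enough if len(d["cited"]) <= high]


print(cited_between(DOCS, 10, 20))
```

The point is only that the range on the citation count can be applied as a post-filter over sorted hits; how to express the same thing inside lucene is the open question.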

limitations:
 - 2nd order links must be carefully prepared (but honestly, how many
of those 2nd order relations are really needed, and really used? the
number is probably low...)
 - the index grows (but you can compare its size with the size of the
current in-memory dictionary, which is effectively doubled and takes up
precious RAM - because of cited<->citing)
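
To make the "carefully prepared" part concrete, here is a rough Python sketch of how the 2nd order fields could be precomputed from plain citation pairs before indexing. The function name denormalize and the input shapes are my assumptions, and the data is made up; the field names follow the doc example above.

```python
def denormalize(authors, pairs):
    """Build index-ready records from authors and (citing, cited) pairs.

    authors: dict mapping record id -> list of author names
    pairs:   iterable of (citing_id, cited_id) tuples
    """
    records = {
        doc_id: {"authors": list(names), "cites": [], "cited": [],
                 "citing_author": []}
        for doc_id, names in authors.items()
    }
    for citing, cited in pairs:
        # Each pair is stored in both directions (cited<->citing),
        # which is where the size doubles.
        records[citing]["cites"].append(cited)
        records[cited]["cited"].append(citing)
        # 2nd order link: copy the citing record's authors onto the cited one.
        for name in authors[citing]:
            if name not in records[cited]["citing_author"]:
                records[cited]["citing_author"].append(name)
    return records


AUTHORS = {10: ["witten", "ellis"], 90: ["frank"], 3: ["chu"]}
PAIRS = [(90, 10), (3, 10), (90, 3)]
RECORDS = denormalize(AUTHORS, PAIRS)
print(RECORDS[10]["citing_author"])
```

Each additional 2nd order relation (citing keywords, citing years, ...) would be one more copied field of this kind, so the cost is paid at indexing time rather than at query time.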

opportunities:
 - it is extremely easy to put any field/relation into the index (and
to reindex, which is both easy and fast)
 - it allows us to combine this with the full power of the search
engine (but inevitably, things are done differently)
 - the assumption that it will be slower than a python in-memory
dictionary is just that, an assumption (and should be _recognized_ as
such)
 - it is just a different paradigm than an rdbms

Thanks, Jay, for offering to pass on questions; it would be great if you
could also ask about these two:

1)
-- is it possible to use payloads for search? [i know they can influence
scoring and be useful for display, but as i understand it, a payload is
metadata about a given position]

example: assume we index authors and attach a payload to each of them

field:author | payload [affiliation, field_of_study, email]
-------------+----------------------------------------------
ellis        | cern,umi  hep-theory  [email protected]
swank        | umi       hep-ex      [email protected]

is it possible to query this structure directly? e.g.

"author:swink~4 and author:affiliation:cern"

(I want to find all names similar to swink, schwink, sink... and i
also know the person works at cern -- but i am not interested in a
record that was written by swink@umi and ellis@cern --> i want only
swink@cern, and for that i need the payload)
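
To illustrate the semantics i am after (not how lucene payloads work, which is exactly the question), here is a small Python sketch. The records are made up, edit_distance is a plain Levenshtein stand-in for the fuzzy operator, and the threshold of 2 edits is arbitrary.

```python
def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in for fuzzy matching."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


# recid -> list of (author, affiliations); made-up data.
RECORDS = {
    1: [("swink", ["umi"]), ("ellis", ["cern", "umi"])],
    2: [("schwink", ["cern"])],
}


def flat_match(authors, name, affiliation, max_edits=2):
    """Name and affiliation checked independently, as with two flat fields."""
    name_hit = any(edit_distance(n, name) <= max_edits for n, _ in authors)
    affil_hit = any(affiliation in affs for _, affs in authors)
    return name_hit and affil_hit


def paired_match(authors, name, affiliation, max_edits=2):
    """Name and affiliation must belong to the same author entry."""
    return any(edit_distance(n, name) <= max_edits and affiliation in affs
               for n, affs in authors)


for recid, authors in RECORDS.items():
    print(recid, flat_match(authors, "swink", "cern"),
          paired_match(authors, "swink", "cern"))
```

Record 1 is the unwanted hit: with flat fields it matches because swink and cern both occur somewhere in the record, while the paired check rejects it.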


2)
What would be the best strategy for maintaining several separate
indexes? I.e. a separate index for metadata, for recently changed
metadata, for fulltext, and for citation pairs?

presumably, all those indexes contain only records (so their results
can be merged by matching on recid), but obviously a scoring function
only makes sense inside its own index; so if one would like to combine
results from several indexes in a meaningful way, what would be the
best strategy?
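
One candidate i can think of (only a sketch, not necessarily the best answer and not something lucene provides as far as i know) is to ignore the raw scores and merge on rank position instead, e.g. reciprocal rank fusion. The function name, the example recids and the constant k=60 (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked recid lists from several indexes.

    Each record gets the sum of 1 / (k + rank) over the lists it appears in,
    so scores from different indexes never have to be compared directly.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, recid in enumerate(ranking, start=1):
            fused[recid] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by recid for a stable order.
    return sorted(fused, key=lambda recid: (-fused[recid], recid))


metadata_hits = [11, 42, 7]    # recids ranked by the metadata index
fulltext_hits = [42, 99, 11]   # recids ranked by the fulltext index
print(reciprocal_rank_fusion([metadata_hits, fulltext_hits]))
```

Records found by both indexes float to the top, and a record found by only one index still appears, just lower down.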

thanks and cheers,

  roman

On Wed, Oct 6, 2010 at 9:14 PM, Jay Luker <[email protected]> wrote:
> On Wed, Oct 6, 2010 at 12:40 PM, Tibor Simko <[email protected]> wrote:
>>
>> To sum up, everything could be reproduced in Solr, but Solr would have
>> to have direct access to the raw ranking data (=citation map), not only
>> to ranked values (=citation counts), otherwise generation of things like
>> cite summaries (which is one of the most used feature) would be slow.
>> And we would have to port everything that operates on these raw data
>> sets to Solr/Java, which is a very time consuming project when compared
>> to alternative options such as dispatching only certain index (such as
>> full-text) to Solr/Lucene and combining results back in Invenio.
>
> OK, yes, the 2nd order stuff is tricky. Sometimes when you're just trying to
> get an answer about how to do something from these "experts" you have to
> first get past the phase where they try to convince you that you don't need
> to do what you're trying to do.
>
>
> --
> ******************************************************
> Jay Luker               Astrophysics Data System (ADS)
> [email protected]  Center for Astrophysics
> 617-495-4588            60 Garden Street  MS 67
> 617-495-7356 fax        Cambridge, MA  02138
> ******************************************************
>
>
