Re: Lucene indexing questions

Roman Chyla Thu, 7 Oct 2010 14:32:09 +0200

Hi,

This is an interesting thread, please bear with me and my questions
a little longer; I am trying to understand more...


On Wed, Oct 6, 2010 at 10:36 PM, Brooks, Travis C.
<[email protected]> wrote:
>
> On Oct 6, 2010, at 1:08 PM, Roman Chyla wrote:
>
>> Hi,
>>
>> It will be interesting, what Grant Ingersoll tells you for the 2nd
>> order queries, but let me muse about the same query
>
> yes
>
>>
>> author:ellis AND citedby:author:witten NOT refersto:author:witten
>>  AND cited:10->20 AND refersto:keyword:muon
>>
>>
>> first, let's assume we built the index with this necessary information
>>
>> doc: 10
>>  cited: 3,6,80,90,89...
>>  citing_author: witten, frank, lagra, ngeyen, chu, thuey...
>>  year: 2000
>>  title: something something
>>  authors: witten,ellis
>>
>> doc: 90
>>  cites: 3,8,90....
>>
>>
>> the lucene query with the same effect then is:
>>
>> ((author:ellis +citedby:witten -author:witten) +keyword:muon) -->
>> cluster_by(len(cited))
>>
>>
>> notes:
>> - citedby:author:witten -- it doesn't make sense to me that it could
>> be somebody else than another author
>
> I'm not sure I know what you mean here.    Recall that citedby:author:witten 
> means "find me the papers that are cited by the papers that are written by 
> witten"
>
> One can equally well do citedby:reportnumber:hep-th* meaning "find me all the 
> papers that are cited by papers with a hep-th reportnumber"
>
> Or even better citedby:"topic:neutrino cited:500->99999" (i.e. the papers 
> cited by the highly cited papers in neutrino physics...a very interesting 
> list)
>

OK, I see -- I didn't realize that what I was saying was so limited. But
basically the relation is paper1-->paper2 (always from document to
document), and there are different ways to find the document: in the
Lucene way, through different fields (or different values inside the
citing field, for example a_witten, id_arxiv0909.0909, id_hep....)
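
A minimal sketch of that paper1-->paper2 idea in Python, only to make
the two steps of citedby:author:witten explicit. The records, recids and
function names below are made up for illustration; this is not how
Invenio, SPIRES or Lucene implement it.

```python
# Toy records; all recids, authors and citations are invented.
RECORDS = {
    1: {"authors": {"witten"}, "cites": {2, 3}},
    2: {"authors": {"ellis"}, "cites": {3}},
    3: {"authors": {"ellis", "witten"}, "cites": set()},
    4: {"authors": {"frank"}, "cites": {2}},
}


def find(field, value):
    """First-order lookup: recids whose `field` contains `value`."""
    return {recid for recid, rec in RECORDS.items() if value in rec[field]}


def citedby(recids):
    """Second-order step: papers cited by any paper in `recids`."""
    cited = set()
    for recid in recids:
        cited |= RECORDS[recid]["cites"]
    return cited


# citedby:author:witten -> papers cited by the papers written by witten
witten_papers = find("authors", "witten")
cited_by_witten = citedby(witten_papers)

# author:ellis AND citedby:author:witten
ellis_cited_by_witten = find("authors", "ellis") & cited_by_witten
```

Whatever finds the first set (author, report number, a whole sub-query)
does not matter to the second step, which only follows document-to-document
links.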

>
> In order to fully reproduce the behavior one needs to have citing_author   
> AND citing_reportnumber AND citing_year AND .... in fact all indexes need to 
> be reproduced as "citing".   Since there are something like 10 million 
> citing<->cited pairs in the DB, we've just swelled the ranks of the indexes 
> by a substantial factor, I think.

OK, I see two ways - if this citing<->cited pairs index is
used/maintained by Lucene/Solr, it will work in a similar way.

But what I, perhaps naively, proposed (just thinking aloud...) is
trading space for speed - so the 10M pairs index actually holds the
pairs (report_no-->X, citing_author-->X...) ?

When a document is indexed with those 2nd order relations, naturally
those 10M pairs must increase the size, but am I wrong in thinking that
the data just lives with the document index instead of with the citation
index? Do you have some reason to believe that the pairs are more
storage-efficient than the points in the index?
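
To illustrate what "lives with the document index" could mean, here is a
rough Python sketch that folds the citation pairs into a citing_author
field on each cited document before indexing. The field name follows the
example above; the data and the denormalize() helper are hypothetical.

```python
# Made-up records and (citing recid, cited recid) pairs.
AUTHORS = {10: ["witten", "ellis"], 90: ["frank"], 91: ["witten", "chu"]}
PAIRS = [(90, 10), (91, 10), (91, 90)]


def denormalize(authors, pairs):
    """Return one flat document per recid with a citing_author field."""
    docs = {
        recid: {"recid": recid, "authors": sorted(names), "citing_author": set()}
        for recid, names in authors.items()
    }
    # Copy the authors of every citing paper onto the cited document.
    for citing, cited in pairs:
        docs[cited]["citing_author"].update(authors[citing])
    # Turn the sets into sorted lists so the documents are easy to inspect.
    for doc in docs.values():
        doc["citing_author"] = sorted(doc["citing_author"])
    return docs


docs = denormalize(AUTHORS, PAIRS)

# citedby:author:witten becomes a plain single-field lookup.
hits = [recid for recid, doc in sorted(docs.items())
        if "witten" in doc["citing_author"]]
```

The price is visible here too: each relation to be searched (citing_author,
citing_reportnumber, ...) needs its own copied field.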

It should also be noted that in the index they are stored only once,
unlike the case with the Python dictionary, which is built in two
copies, one with reversed key-value pairs. If I understood the problem
correctly, the 10M pairs would need 20M entries to allow for fast
lookup in both directions, wouldn't they?
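
A small sketch of the doubling, assuming the in-memory structure is a
pair of plain Python dictionaries (my assumption; the pairs and names
below are invented):

```python
from collections import defaultdict

# Made-up (citing recid, cited recid) pairs.
PAIRS = [(90, 3), (90, 8), (10, 3), (11, 10)]


def build_maps(pairs):
    """Build the forward and the reversed dictionary from the same pairs."""
    cites = defaultdict(set)     # citing -> cited
    cited_by = defaultdict(set)  # cited -> citing
    for citing, cited in pairs:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


def count_entries(mapping):
    """Number of stored key-value relations in one dictionary."""
    return sum(len(values) for values in mapping.values())


cites, cited_by = build_maps(PAIRS)
total = count_entries(cites) + count_entries(cited_by)
```

Here total is twice len(PAIRS), which is the 10M --> 20M effect on a
small scale.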


>
>>
>> limitations:
>> - 2nd order links must be carefully prepared (but honestly, how many
>> of those 2nd order relations are really needed, and really used? this
>> number is probably low...)
>
> See above.   OTOH we haven't had this ability in SPIRES, so it is 
> theoretically possible to live without the 2nd order stuff.   But I think it 
> is more and more, not less and less important.   The citation relationship is 
> central to the utility of these systems, as I think ADS would agree, and this 
> stuff has immense power for metrics and discovering relationships.    The 
> first physicist I showed "citedby" to referred to it immediately as "crack 
> cocaine"
>
>
>> - the index grows (but you can compare its size with the size of
>> current in-memory dictionary, which is effectively doubled and holds
>> the precious RAM - because of cited<->citing)
>>
>
> I think the index grows a lot...
>
>> opportunities:
>> - it is extremely easy to put any field/relation into the index (and
>> reindex, which is both easy and fast)
>
> I agree, and that's an opportunity, though since we have a few admins and 50K 
> users, I'm not sure it is the right prioritization.   I.e. we may not care so 
> much about indexing speed, preferring to deliver ultra-fast search, or 
> semi-fast search with lots of added value(2nd order, summary, etc).

Searching is obviously more important for the user, but a lack of data
(due to indexing speed issues) may be as serious as slow search, so I
think we inevitably deal with a cycle.

>
>> - it allows to combine the full power of the search engine (but
>> inevitably, things are done differently)
>
> This is clearly an advantage, plus the maint. advantage of not 
> writing/maintaining code that has already been written/is being maintained
>
>> - the assumption that it will be slower than a Python in-memory
>> dictionary is an assumption (and should be _recognized_ as such)
>
> Agreed
>
>> - it is just a different paradigm than rdbms
>
> Yep, interestingly enough, so was SPIRES.

;-)

>
>>
>> Thanks, Jay, for the offer of questions, it would be great if you
>> could ask also about these two:
>>
>> 1)
>> -- is it possible to use payload for search? [i know it can influence
>> scoring and be useful for display, but as i understand it, it is a
>> metadata about the given position]
>>
>> example, if we assume situation when we index authors <-- and add
>> payload to them
>>
>> field:author | payload [affiliation, field_of_study, email]
>> -------------+---------------------------------------------
>> ellis        | cern,umi  hep-theory  [email protected]
>> swank        | umi       hep-ex      [email protected]
>>
>> is it possible to query this structure directly? ex.
>>
>> "author:swink~4 and author:affiliation:cern"
>>
>> (I want to find all names similar to swink, schwink, sink... and i
>> also know the person is working at cern -- but i am not interested in
>> a record which was written by swink@umi, and ellis@cern --> i want
>> only swink@cern and for that i need payload)
>>
>>
>> 2)
>> What would be the best strategy to have several separate indexes? Ie.
>> to have a separate index for metadata, for recently-changed-metadata,
>> fulltext, citation-pairs?
>>
>> presumably, all those indexes contain only records (so the results
>> from them are mergeable on the recid match), but obviously the scoring
>> function makes sense only inside the index; but if one would like to
>> combine results (in a meaningful way) from the several indexes, what
>> would be the best strategy?
>>
>> thanks and cheers,
>>
>>  roman
>>
>> On Wed, Oct 6, 2010 at 9:14 PM, Jay Luker <[email protected]> wrote:
>>> On Wed, Oct 6, 2010 at 12:40 PM, Tibor Simko <[email protected]> wrote:
>>>>
>>>> To sum up, everything could be reproduced in Solr, but Solr would have
>>>> to have direct access to the raw ranking data (=citation map), not only
>>>> to ranked values (=citation counts), otherwise generation of things like
>>>> cite summaries (which is one of the most used features) would be slow.
>>>> And we would have to port everything that operates on these raw data
>>>> sets to Solr/Java, which is a very time consuming project when compared
>>>> to alternative options such as dispatching only certain index (such as
>>>> full-text) to Solr/Lucene and combining results back in Invenio.
>>>
>>> OK, yes, the 2nd order stuff is tricky. Sometimes when you're just trying to
>>> get an answer about how to do something from these "experts" you have to
>>> first get past the phase where they try to convince you that you don't need
>>> to do what you're trying to do.
>>>
>>>
>>> --
>>> ******************************************************
>>> Jay Luker               Astrophysics Data System (ADS)
>>> [email protected]  Center for Astrophysics
>>> 617-495-4588            60 Garden Street  MS 67
>>> 617-495-7356 fax        Cambridge, MA  02138
>>> ******************************************************
>>>
>>>
>
> Travis C. Brooks
> Manager of Information Systems & SPIRES/INSPIRE
> SLAC National Accelerator Laboratory Library
> http://www.slac.stanford.edu/spires/
>
>
>
>
>
