[MarkLogic Dev General] Relevance and Fields

Andy Townsend Thu, 31 May 2007 09:15:58 -0700

Hi folks,

Could some kind soul (probably a kindly ML soul) please expand a little on 
how the new 3.2 Fields and Relevance interplay?


Slide 14 from Stephen's presentation on relevance from the User Conference 
(I'm afraid I was in another session) hints that Fields can have an effect 
as it says down the bottom:

        Relevance may be calculated with respect to 
        an element or a field
                More focused relevance measurement

However all the rest of the slides and the 3.2 developers guide (section 
23.2) refer only to fragments and the calculation of TF and IDF from 
fragment based stats.

I ran some very simple tests in a DB with about a hundred documents and 
turned on the Relevance trace (as explained at the conference).  I was 
able to demonstrate that creating a Field appears to create a new index 
from which TF is calculated: when doing a cts:field-word-query() I could 
see a lower TF value in the trace output (for a document where some term 
occurrences fell in the field and some fell outside). 
Marvellous!

However, when doing a simple word query across all docs I found that 
relevance actually varied depending on whether the Field existed. 


i.e. 
- DB, no fields, run cts:search(doc(), "myword") and docA gets relevance X
- create field, wait for DB to settle down after reindexing
- DB, with field, re-run cts:search(doc(), "myword") and now docA gets 
relevance Y where Y < X   (!!)
- drop field, wait for reindexing to settle
- DB, no fields, re-run cts:search(doc(), "myword") and now docA gets 
relevance X again.     (!!!)

The Relevance trace shows that the only value changing is the value for TF 
(so IDF is still the same and the number of total fragments is still the 
same); however, the number of term occurrences has not changed, and neither 
(as far as I know) has the fragment size.  This makes me wonder:
a) what the creation of a field is really doing to my DB in order to 
affect TF
b) what the TF normalization function is.  This function is referred to on 
slide 12 (normalization for fragment length) and in 23.1.1 in the developer 
docs, where it also says:

        "a word that occurs 10 times in a 100 word document will get a 
        higher score than a word that occurs 100 times in a 1,000 word 
        document"

but gives no further details of what this function is or why docs with 
10/100 should count more than docs with 100/1000.
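
To illustrate why the quoted sentence implies more than a plain ratio, here 
is a small Python sketch (not MarkLogic code).  The normalized_tf function 
is purely hypothetical: it dampens the raw count with a log and divides by 
fragment length.  It is only one of many functions that would produce the 
ordering in the quote, and nothing in the docs says it is the one 
MarkLogic uses.

```python
import math


def raw_ratio(occurrences, fragment_words):
    """Plain occurrences / length: identical for 10/100 and 100/1000."""
    return occurrences / fragment_words


def normalized_tf(occurrences, fragment_words):
    """Hypothetical normalization (an assumption, not MarkLogic's function):
    log-dampened count divided by fragment length."""
    return math.log(1 + occurrences) / fragment_words


short_doc = (10, 100)    # 10 occurrences in a 100 word document
long_doc = (100, 1000)   # 100 occurrences in a 1,000 word document

# The plain ratio cannot separate the two documents.
print(raw_ratio(*short_doc), raw_ratio(*long_doc))

# The dampened version ranks the short document higher, as in the quote.
print(normalized_tf(*short_doc), normalized_tf(*long_doc))
```

With a plain ratio the two documents tie, so whatever the real function 
is, it must grow more slowly than linearly in the occurrence count, or 
penalize length more than linearly, for the documented ordering to hold.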

Any clarifications on Fields, Field indexes and how these interplay with 
relevance calculations?

Thanks in advance,

Andy

P.S.  As an aside - the developer docs describe "inverse document 
frequency" as "log(1/df) where df (document frequency) is the number of 
documents in which the term occurs."

I think this is a little misleading - it really means log(D/df) where D 
is the total number of documents (a.k.a. fragments), or else a variant 
definition of df is needed.  log(D/df) is the behaviour that can be seen 
in the log trace.  Also, just to be pedantic (who me?) it should probably 
be ln(D/df) rather than log(D/df) since it's the natural log :-)
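
A quick Python sketch of the difference (plain arithmetic, not MarkLogic 
code).  The function names and the value D = 100 are made up for 
illustration; the two formulas are the ones discussed above.

```python
import math


def idf_as_documented(df):
    """Literal reading of the docs: log(1/df), with df a document count."""
    return math.log(1 / df)


def idf_as_traced(total_fragments, df):
    """The form described above as matching the trace: ln(D/df)."""
    return math.log(total_fragments / df)


D = 100  # hypothetical total number of fragments

for df in (1, 10, 100):
    print(df, idf_as_documented(df), idf_as_traced(D, df))
```

Read literally, log(1/df) is zero or negative for any df of 1 or more, 
whereas ln(D/df) is zero only when the term occurs in every fragment and 
positive otherwise.  The two differ by the constant ln(D), which is why a 
variant definition of df (as a fraction of D) would also make the 
documented formula work.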





