RE: [MarkLogic Dev General] Relevance and Fields

Stephen Buxton Tue, 31 Jul 2007 22:30:48 -0700

Andy,
 
  I finally managed to find a few cycles to try this out, and I'm
puzzled.
 
  You said: "... creating a Field appears to create a new index from
which TF is calculated ..."
  Creating a Field causes new termlists to be created. So if you create
a field f1 that includes an element called title that contains the word
"pig", a new termlist for "the-word-pig-in-the-field-f1" is created (in
much the same way as when you turn on fast element word searches, a new
termlist such as "the-word-pig-in-the-element-title" is created). You
can think of this as creating "a new index", though we don't normally
describe it that way - it's just creating a set of new termlists.
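
To make the termlist idea concrete, here is a toy sketch in Python. It is
not MarkLogic code and says nothing about how the server stores its
indexes; the documents, the field name f1 and the build_termlists helper
are all invented for illustration. It only shows that adding a field adds
new field-scoped termlists and leaves the plain word termlists alone.

```python
from collections import defaultdict

# Toy corpus: each document is a list of (element name, text) pairs.
# All names and content here are made up for illustration.
docs = {
    "doc1": [("title", "pig tales"), ("body", "the pig ate")],
    "doc2": [("title", "farm life"), ("body", "a pig and a cow")],
}

# Hypothetical field f1 that includes only the title element.
fields = {"f1": {"title"}}


def build_termlists(docs, fields):
    """Map each term key to the set of documents that contain it."""
    termlists = defaultdict(set)
    for doc_id, elements in docs.items():
        for element, text in elements:
            for word in text.split():
                # "the-word-pig" style termlist
                termlists[("word", word)].add(doc_id)
                # "the-word-pig-in-the-element-title" style termlist
                termlists[("element", element, word)].add(doc_id)
                # "the-word-pig-in-the-field-f1" style termlist
                for field, included in fields.items():
                    if element in included:
                        termlists[("field", field, word)].add(doc_id)
    return dict(termlists)


without_field = build_termlists(docs, {})
with_field = build_termlists(docs, fields)

print(sorted(with_field[("word", "pig")]))
print(sorted(with_field[("field", "f1", "pig")]))
print(("field", "f1", "pig") in without_field)
```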
 
  Then you described an experiment - here's where I'm puzzled.
  Presumably when you say you ran 'cts:query(doc(), "myword")', you mean
'cts:search(doc(), "myword")'? Or maybe 'cts:search(fn:collection(),
"myword")'?
 
  If you ran the same word query over the same corpus with the same
database index settings, you should've seen the same scores.
  If you ran a different query - e.g. if you used cts:field-word-query()
instead of cts:word-query() - then, as you described in your "simple
tests", you should see a different score. Now the TF is the number of
times the term occurs *in the field*, not in the whole fragment.
  I tried to reproduce your results with just a few documents - the
"pig" documents I used in the User Conference presentation - and, as
expected, I got the same score for a simple word query whether or not a
field existed.
  Could you possibly send me a test case? Or at least an excerpt from
the trace?
  The existence of a field should not affect the scores returned by a
simple word query.
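
As a rough illustration of the two kinds of count, the Python sketch below
tallies a term over a whole fragment and then over only the elements that
belong to a field. The fragment, the set of field elements and the
term_count helper are assumptions made up for this example; the real
scores also involve normalization and IDF, which are left out here.

```python
def term_count(elements, term, only_elements=None):
    """Count occurrences of term, optionally restricted to some elements."""
    return sum(
        text.lower().split().count(term)
        for name, text in elements
        if only_elements is None or name in only_elements
    )


# One toy fragment, as (element name, text) pairs.
fragment = [
    ("title", "The pig"),
    ("body", "A pig met another pig"),
]

# Hypothetical field f1 made up of the title element only.
f1_elements = {"title"}

# Count seen by a plain word query: every occurrence in the fragment.
fragment_count = term_count(fragment, "pig")

# Count seen by a field word query: only occurrences inside the field.
field_count = term_count(fragment, "pig", f1_elements)

print(fragment_count, field_count)
```

Defining or dropping f1 changes nothing about fragment_count in this
sketch, which is the behaviour I would expect for a simple word query.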
 
  You asked:
"a) what the creation of a field is really doing to my DB in order to
affect TF"
-- as described above, the creation of a field creates additional,
field-specific termlists, so that TF on a cts:field-word-query() is
based on the number of times the term appears in the field.

"b) what the TF normalization function is"
-- the TF normalization function adjusts the count of the occurrences of
a term according to the length of the document (strictly, the fragment).
If we didn't adjust for document length, then longer documents would
always dominate the results since they are more likely to contain more
occurrences of any given term. We don't publish the exact algorithm -
partly because it's "secret sauce", and partly because we may tweak it
from time to time.
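
To show why some adjustment for length is needed, here is a small Python
sketch comparing a raw occurrence count with a simple occurrences-per-word
density. The density function is only a stand-in chosen for illustration;
it is an assumption, not the normalization the server applies.

```python
def raw_count(occurrences, length):
    """No length adjustment: the score is just the number of occurrences."""
    return occurrences


def density(occurrences, length):
    """Illustrative adjustment only: occurrences per word of text."""
    return occurrences / length


# (occurrences of the term, total words in the fragment)
short_doc = (10, 100)
long_doc = (100, 1000)

for name, score in (("raw", raw_count), ("density", density)):
    print(name, score(*short_doc), score(*long_doc))
```

With raw counts the longer document wins simply because it is longer. The
plain density ties the two, whereas the developer docs quoted later in this
thread say the 10-in-100 case scores higher, so the real function evidently
does more than divide by length.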
 
You said:
 
"P.S.  As an aside - the developer docs describes "inverse document
frequency" as "log(1/df) where df (document frequency) is the number of
documents in which the term occurs." 
I think this is a little misleading  - it really means log( D/df) where
D is the total number of documents (a.k.a fragments) or a variant
definition of df is needed.  This is the behaviour that can be seen in
the log trace.  Also, just to be pedantic (who me?) it should probably
be ln(D/df) rather than log(D/df)  since it's the natural log :-) "
 
Yes, correct. IDF depends on the proportion of documents that contain a
term, not on the absolute number of documents that contain it.
I'll log a doc bug.
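
For a quick check of the arithmetic, the Python sketch below compares the
formula as worded in the docs, log(1/df), with the ln(D/df) form Andy saw
in the trace. The function names and the sample numbers (100 fragments, a
term in 10 of them) are made up for illustration.

```python
import math


def idf_as_documented(df):
    """log(1/df): ignores corpus size and is never positive for df >= 1."""
    return math.log(1 / df)


def idf_as_traced(total, df):
    """ln(D/df): D is the total number of documents (fragments)."""
    return math.log(total / df)


total_fragments = 100  # D, invented for this example
df = 10                # fragments containing the term, invented

print(idf_as_documented(df))
print(idf_as_traced(total_fragments, df))
# A term that appears in every fragment carries no weight:
print(idf_as_traced(total_fragments, total_fragments))
```

The two forms differ by the constant ln(D), so they rank terms the same
way within one database, but only the second gives a non-negative value
that reflects the share of documents containing the term.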
 
- Steve B.
 
Stephen Buxton
Director of Product Management
Mark Logic Corporation
999 Skyway Road
Suite 200
San Carlos, CA 94070

+1 650 655 2317 Phone
[EMAIL PROTECTED]
www.marklogic.com <http://www.marklogic.com/> 
 
 

________________________________

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Andy
Townsend
Sent: Thursday, May 31, 2007 9:16 AM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] Relevance and Fields 



Hi folks, 

Could some kind soul (probably a kindly ML soul) please expand a little
on how the new 3.2 Fields and Relevance interplay?

Slide 14 from Stephen's presentation on relevance from the User
Conference (I'm afraid I was in another session) hints that Fields can
have an effect as it says down the bottom: 

        Relevance may be calculated with respect to 
        an element or a field 
                More focused relevance measurement 

However, all the rest of the slides and the 3.2 developer's guide (section
23.2) refer only to fragments and the calculation of TF and IDF from
fragment-based stats.

I ran some very simple tests in a DB with about a hundred documents and
turned on the Relevance trace (as explained at the conference).  I was
able to demonstrate that creating a Field appears to create a new index
from which TF is calculated: when doing a cts:field-word-query() I could
see a lower TF value in the trace output (for a document where some term
occurrences fell in the field and some fell outside).
Marvellous!

However, when doing a simple word-query across all docs I found that
relevance actually varied depending on whether the Field existed.

i.e. 
- DB, no fields, run cts:query(doc(), "myword") and docA gets relevance
X 
- create field, wait for DB to settle down after reindexing 
- DB, with field, re-run cts:query(doc(), "myword") and now docA gets
relevance Y where Y < X   (!!) 
- drop field, wait for reindexing to settle 
- DB, no fields, re-run cts:query(doc(), "myword") and now docA gets
relevance X again.     (!!!) 

The Relevance trace shows that the only value changing is the value for
TF (so IDF is still the same, and the number of total fragments is still
the same); however, the number of term occurrences has not changed, and
neither (as far as I know) has the fragment size.  This makes me wonder:
a) what the creation of a field is really doing to my DB in order to
affect TF 
b) what the TF normalization function is  - this function is referred to
on slide 12 (normalization for fragment length) and in 23.1.1 in the
developer docs, where it also says:

        "a word that occurs 10 times in a 100 word document will get a
higher score than a word that occurs 100 times in a 1,000 word document"


but gives no further details of what this function is and why docs with
10/100 should count more than docs with 100/1000

Any clarifications on Fields, Field indexes and how these interplay with
relevance calculations? 

Thanks in advance, 

Andy 

P.S.  As an aside - the developer docs describes "inverse document
frequency" as "log(1/df) where df (document frequency) is the number of
documents in which the term occurs." 

I think this is a little misleading  - it really means log( D/df) where
D is the total number of documents (a.k.a fragments) or a variant
definition of df is needed.  This is the behaviour that can be seen in
the log trace.  Also, just to be pedantic (who me?) it should probably
be ln(D/df) rather than log(D/df)  since it's the natural log :-) 









________________________________

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
