RE: [BULK] RE: [MarkLogic Dev General] Relevance and Fields

Stephen Buxton Mon, 06 Aug 2007 11:02:01 -0700

 
  Aha!
  OK, now I see what's going on.
 
  Relevance measurement is a balancing act between performance and
accuracy. It is somewhat amorphous at best, and it has to be done many
times per query, so it has to be *really* fast. Trying for pinpoint
accuracy at the cost of performance therefore doesn't make sense. Also,
MarkLogic Server is optimized for large-scale, steady-state, very
high-performance workloads, so experiments with 100 documents on a
system that may not have reached steady state may highlight some edge
cases that you may not see in production.
 
  That said, changing index settings, especially creating a field
definition, may cause changes to IDF and TF and therefore score. 
  In general, the changes will be very small. 
  In general, ranking will not be affected.
 
  Please contact me off-line if you'd like to dig deeper.
 
- Steve B.
 
Stephen Buxton
Director of Product Management
Mark Logic Corporation
999 Skyway Road
Suite 200
San Carlos, CA 94070

+1 650 655 2317 Phone
[EMAIL PROTECTED]
www.marklogic.com <http://www.marklogic.com/> 
This e-mail and any accompanying attachments are confidential. The
information is intended solely for the use of the individual to whom it
is addressed. Any review, disclosure, copying, distribution, or use of
this e-mail communication by others is strictly prohibited. If you are
not the intended recipient, please notify us immediately by returning
this message to the sender and delete all copies.  Thank you for your
cooperation.
 
 

________________________________

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Andy
Townsend
Sent: Wednesday, August 01, 2007 7:15 AM
To: General Mark Logic Developer Discussion
Subject: [BULK] RE: [MarkLogic Dev General] Relevance and Fields
Importance: Low



Stephen, 

Thanks for the response - I confess I had relegated this to the pile of
unknowns.  I have tried to recreate the scenario this morning and have
not yet fully recreated it - I suspect it is some kind of edge case that
changes to my DB have affected.  However, I have repeated some pieces
and have attached an annotated ErrorLog.txt extract.  To touch on your
responses first - 

> Presumably when you say you ran 'cts:query(doc(), "myword")', you mean
> 'cts:search(doc(), "myword")' ?? 
Yes of course I mean cts:search() - sorry for the confusion, clearly
typing way too quickly. 

> "a) what the creation of a field is really doing to my DB in order to
> affect TF " 
> -- as described above, the creation of a field creates additional,
> field-specific termlists, so that TF on a cts:field-word-query() is 
> based on the number of times the term appears in the field. 
Okay - but perhaps we can clarify more with regard to the attachment. 

> b) what the TF normalization function is   
> -- the TF normalization function adjusts the count of the occurrences
> of a term according to the length of the document (strictly, the
> fragment). 
I do (and did) understand the principle; I guess I was asking what the
algorithm is to see if that helped me understand other things.  Of
course I respect that you consider the algorithm to be "secret sauce",
though can you indicate whether it is a "well-behaved" function or
whether there are 'transition document sizes' where the function might
cause quirky behaviour? 

And so to the attachment - from my ML installation this morning,
Windows, version 3.2-1 

It seems to show IDF changing from 316/2 to 508/4 depending on the
existence of the field. 
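
As a rough check of the magnitudes involved, the sketch below compares the two IDF values. It assumes the trace figures 316/2 and 508/4 are "total fragments / fragments containing the term" and that IDF is the natural log of that ratio, as the P.S. further down the thread suggests. The function name `idf` is illustrative only and is not a MarkLogic API.

```python
import math

def idf(total_fragments: int, doc_frequency: int) -> float:
    """ln(D/df): an assumed form of IDF, not confirmed server internals."""
    return math.log(total_fragments / doc_frequency)

# The two ratios reported in the trace (which one corresponds to the
# field being present is not stated, so they are named neutrally).
idf_a = idf(316, 2)   # ratio 158
idf_b = idf(508, 4)   # ratio 127

print(round(idf_a, 2), round(idf_b, 2))
print("difference:", round(idf_a - idf_b, 2))
```

Under these assumptions the two values differ by only about 0.2, even though both the fragment count and the document frequency changed.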

It also shows TF for the two matching documents/fragments changing
before and after the creation of the field, though not currently (unlike
my earlier example) changing back again after the field is deleted. 

Can you explain why/how these should change? 
Can you respond to / comment on the lines marked with "-- ??" ? 



Thanks in advance for any cycles that you can engage. 

Andy 







"Stephen Buxton" <[EMAIL PROTECTED]> 
Sent by: [EMAIL PROTECTED] 

01/08/2007 06:30 
Please respond to
General Mark Logic Developer Discussion
<[email protected]>


To
"General Mark Logic Developer Discussion"
<[email protected]> 
cc
Subject
RE: [MarkLogic Dev General] Relevance and Fields

        




Andy, 
  
  I finally managed to find a few cycles to try this out, and I'm
puzzled. 
  
  You said: "... creating a Field appears to create a new index from
which TF is calculated ..." 
  Creating a Field causes new termlists to be created. So if you create
a field f1 that includes an element called title that contains the word
"pig", a new termlist for "the-word-pig-in-the-field-f1" is created (in
much the same way as when you turn on fast element word searches, a new
termlist such as "the-word-pig-in-the-element-title" is created). You
can think of this as creating "a new index", though we don't normally
describe it that way - it's just creating a set of new termlists. 
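
One way to picture the extra termlists is a toy inverted index, sketched below in Python. This is only a mental model under stated assumptions, not how MarkLogic Server stores termlists: the documents, the field name `f1`, and both dictionary structures are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents: element name -> text content.
docs = {
    "doc1": {"title": "pig tales", "body": "the pig met another pig"},
    "doc2": {"title": "farm life", "body": "a pig and a cow"},
}
field_f1 = {"title"}  # assume field f1 includes only the title element

# word -> doc -> count ("the-word-pig")
word_termlist = defaultdict(lambda: defaultdict(int))
# (field, word) -> doc -> count ("the-word-pig-in-the-field-f1")
field_termlist = defaultdict(lambda: defaultdict(int))

for doc_id, elements in docs.items():
    for element, text in elements.items():
        for word in text.split():
            word_termlist[word][doc_id] += 1
            if element in field_f1:
                field_termlist[("f1", word)][doc_id] += 1

print(dict(word_termlist["pig"]))           # counts over the whole document
print(dict(field_termlist[("f1", "pig")]))  # counts inside field f1 only
```

In this model a plain word lookup and a field-word lookup read different termlists, so the term count for the same document can differ between the two.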
  
  Then you described an experiment - here's where I'm puzzled. 
  Presumably when you say you ran 'cts:query(doc(), "myword")', you mean
'cts:search(doc(), "myword")' ?? Or maybe 'cts:search(fn:collection(),
"myword")' ?? 
  
  If you ran the same word query over the same corpus with the same
database index settings, you should've seen the same scores. 
  If you ran a different query - e.g. if you used cts:field-word-query()
instead of cts:word-query() - then, as you described in your "simple
tests", you should see a different score. Now the TF is the number of
times the term occurs *in the field*, not in the whole fragment. 
  I tried to reproduce your results with just a few documents - the
"pig" documents I used in the User Conference presentation - and, as
expected, I got the same score for a simple word query whether or not a
field existed. 
  Could you possibly send me a test case? Or at least an excerpt from
the trace? 
  The existence of a field should not affect the scores returned by a
simple word query. 
  
  You asked: 
"a) what the creation of a field is really doing to my DB in order to
affect TF " 
-- as described above, the creation of a field creates additional,
field-specific termlists, so that TF on a cts:field-word-query() is
based on the number of times the term appears in the field. 

b) what the TF normalization function is   
-- the TF normalization function adjusts the count of the occurrences of
a term according to the length of the document (strictly, the fragment).
If we didn't adjust for document length, then longer documents would
always dominate the results since they are more likely to contain more
occurrences of any given term. We don't publish the exact algorithm -
partly because it's "secret sauce", and partly because we may tweak it
from time to time. 
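
Since the real function is not published, the sketch below uses a plain relative frequency (count divided by length) purely as a stand-in, to show why some adjustment for length is needed. It is not MarkLogic's algorithm, and the function names are illustrative.

```python
def raw_tf(count: int, length: int) -> float:
    """No length adjustment: the score is just the occurrence count."""
    return float(count)

def relative_tf(count: int, length: int) -> float:
    """Stand-in normalization: occurrences per word of document length."""
    return count / length

short_doc = (10, 100)     # 10 occurrences in a 100-word fragment
long_doc = (100, 1000)    # 100 occurrences in a 1,000-word fragment

print(raw_tf(*short_doc), raw_tf(*long_doc))            # long fragment dominates
print(relative_tf(*short_doc), relative_tf(*long_doc))  # the two tie
```

With raw counts the longer fragment always wins; with this simple ratio the two tie. The developer-guide quotation elsewhere in this thread says the 10-in-100 case scores higher, so the server's function evidently does more than a simple ratio.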
  
You said: 
  
"P.S.  As an aside - the developer docs describes "inverse document
frequency" as "log(1/df) where df (document frequency) is the number of
documents in which the term occurs." 
I think this is a little misleading  - it really means log( D/df) where
D is the total number of documents (a.k.a fragments) or a variant
definition of df is needed.  This is the behaviour that can be seen in
the log trace.  Also, just to be pedantic (who me?) it should probably
be ln(D/df) rather than log(D/df)  since it's the natural log :-) " 
  
Yes, correct. IDF is about the percentage of documents that contain a
term, not the absolute number of documents that contain that term. 
I'll log a doc bug. 
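
Written out, the two definitions under discussion are as follows, with D the total number of fragments and df_t the number of fragments containing term t (the notation is illustrative, not taken from the documentation):

```latex
\mathrm{idf}_{\text{as documented}}(t) = \log\frac{1}{df_t} \;\le\; 0
\quad (df_t \ge 1),
\qquad
\mathrm{idf}(t) = \ln\frac{D}{df_t} \;\ge\; 0
\quad (1 \le df_t \le D)
```

The documented form can never be positive and ignores the size of the corpus, which is why it reads as misleading.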
  
- Steve B. 
  
Stephen Buxton 
Director of Product Management 
Mark Logic Corporation 
999 Skyway Road 
Suite 200 
San Carlos, CA 94070 

+1 650 655 2317 Phone 
[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>  
www.marklogic.com <http://www.marklogic.com/>  
  
  


________________________________

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Andy
Townsend
Sent: Thursday, May 31, 2007 9:16 AM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] Relevance and Fields 


Hi folks, 

Could some kind soul (probably a kindly ML soul) please expand a little
on how the new 3.2 Fields and Relevance interplay. 

Slide 14 from Stephen's presentation on relevance from the User
Conference (I'm afraid I was in another session) hints that Fields can
have an effect as it says down the bottom: 

       Relevance may be calculated with respect to 
       an element or a field 
               More focused relevance measurement 

However all the rest of the slides and the 3.2 developers guide (section
23.2) refer only to fragments and the calculation of TF and IDF from
fragment based stats. 

I ran some very simple tests in a DB with about a hundred documents and
turned on the Relevance trace (as explained at the conference).  I was
able to demonstrate that creating a Field appears to create a new index
from which TF is calculated: when doing a cts:field-word-query() I could
see a lower TF value in the trace output (for a document where some term
occurrences fell in the field and some fell outside).
Marvellous! 

However......  when doing a simple word-query across all docs I found
that relevance actually varied depending on whether the Field actually
existed. 

i.e. 
- DB, no fields, run cts:query(doc(), "myword") and docA gets relevance
X 
- create field, wait for DB to settle down after reindexing 
- DB, with field, re-run cts:query(doc(), "myword") and now docA gets
relevance Y where Y < X   (!!) 
- drop field, wait for reindexing to settle 
- DB, no fields, re-run cts:query(doc(), "myword") and now docA gets
relevance X again.     (!!!) 

The Relevance trace shows that the only value changing is the value for
TF (so IDF still the same, number of total fragments still the same);
however, the number of term occurrences has not changed, and neither (as
far as I know) has the fragment size.  This makes me wonder: 
a) what the creation of a field is really doing to my DB in order to
affect TF 
b) what the TF normalization function is  - this function is referred to
on slide 12 (normalization for fragment length) and in 23.1.1 in the
developer docs, where it also says: 

       "a word that occurs 10 times in a 100 word document will get a
higher score than a word that occurs 100 times in a 1,000 word document"


but gives no further details of what this function is and why docs with
10/100 should count more than docs with 100/1000 

Any clarifications on Fields, Field indexes and how these interplay with
relevance calculations? 

Thanks in advance, 

Andy 

P.S.  As an aside - the developer docs describes "inverse document
frequency" as "log(1/df) where df (document frequency) is the number of
documents in which the term occurs." 

I think this is a little misleading  - it really means log( D/df) where
D is the total number of documents (a.k.a fragments) or a variant
definition of df is needed.  This is the behaviour that can be seen in
the log trace.  Also, just to be pedantic (who me?) it should probably
be ln(D/df) rather than log(D/df)  since it's the natural log :-) 





________________________________

The information contained in this e-mail and any subsequent
correspondence is private and confidential and intended solely 
for the named recipient(s).  If you are not a named recipient, 
you must not copy, distribute, or disseminate the information, 
open any attachment, or take any action in reliance on it.  If you 
have received the e-mail in error, please notify the sender and delete
the e-mail.  

Any views or opinions expressed in this e-mail are those of the 
individual sender, unless otherwise stated.  Although this e-mail has 
been scanned for viruses you should rely on your own virus check, as 
the sender accepts no liability for any damage arising out of any bug 
or virus infection.

John Wiley & Sons Limited is a private limited company registered in
England with registered number 641132.

Registered office address: The Atrium, Southern Gate, Chichester,
West Sussex, PO19 8SQ.


________________________________

 _______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general

