Aha! OK, now I see what's going on.

Relevance measurement is a balancing act between performance and accuracy. It is somewhat amorphous at best, and it has to be done many times per query, so it has to be *really* fast. Trying for pinpoint accuracy at the cost of performance doesn't make sense.

Also, MarkLogic Server is optimized for large-scale, steady-state, very-high-performance operation, so experiments with 100 documents on a system that may not have reached steady state may highlight edge cases that you would not see in production.

That said, changing index settings, especially creating a field definition, may cause changes to IDF and TF, and therefore to score. In general, the changes will be very small and ranking will not be affected.

Please contact me off-line if you'd like to dig deeper.

- Steve B.

Stephen Buxton
Director of Product Management
Mark Logic Corporation
999 Skyway Road
Suite 200
San Carlos, CA 94070
+1 650 655 2317 Phone
[EMAIL PROTECTED]
www.marklogic.com <http://www.marklogic.com/>

This e-mail and any accompanying attachments are confidential. The information is intended solely for the use of the individual to whom it is addressed. Any review, disclosure, copying, distribution, or use of this e-mail communication by others is strictly prohibited. If you are not the intended recipient, please notify us immediately by returning this message to the sender and delete all copies. Thank you for your cooperation.

________________________________

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andy Townsend
Sent: Wednesday, August 01, 2007 7:15 AM
To: General Mark Logic Developer Discussion
Subject: [BULK] RE: [MarkLogic Dev General] Relevance and Fields
Importance: Low

Stephen,

Thanks for the response - I confess I had relegated this to the pile of unknowns.

I have tried to recreate the scenario this morning and have not yet fully recreated it - I suspect it is some kind of edge case that changes to my DB have affected. However, I have repeated some pieces and have attached an annotated ErrorLog.txt extract.

To touch on your responses first -

> Presumably when you say you ran 'cts:query(doc(), "myword")', you mean 'cts:search(doc(), "myword")' ??

Yes, of course I mean cts:search() - sorry for the confusion, clearly typing way too quickly.

> "a) what the creation of a field is really doing to my DB in order to affect TF "
> -- as described above, the creation of a field creates additional, field-specific termlists, so that TF on a cts:field-word-query() is based on the number of times the term appears in the field.

Okay - but perhaps we can clarify more with regard to the attachment.

> b) what the TF normalization function is
> -- the TF normalization function adjusts the count of the occurrences of a term according to the length of the document (strictly, the fragment).
I do (and did) understand the principle; I guess I was asking what the algorithm is to see if that helped me understand other things. Of course I respect that you consider the algorithm to be "secret sauce", though can you indicate whether it is a "well-behaved" function or whether there are 'transition document sizes' where the function might cause quirky behaviour?

And so to the attachment - from my ML installation this morning, Windows, version 3.2-1.

It seems to show IDF changing from 316/2 to 508/4 depending on the existence of the field. It also shows TF for the two matching documents/fragments changing before and after the creation of the field, though not currently (unlike my earlier example) changing back again after the field is deleted.

Can you explain why/how these should change? Can you respond to / comment on the lines marked with "-- ??" ?

Thanks in advance for any cycles that you can engage.

Andy

"Stephen Buxton" <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
01/08/2007 06:30
Please respond to General Mark Logic Developer Discussion <[email protected]>
To "General Mark Logic Developer Discussion" <[email protected]>
cc
Subject RE: [MarkLogic Dev General] Relevance and Fields

Andy,

I finally managed to find a few cycles to try this out, and I'm puzzled.

You said: "... creating a Field appears to create a new index from which TF is calculated ..."

Creating a Field causes new termlists to be created. So if you create a field f1 that includes an element called title that contains the word "pig", a new termlist for "the-word-pig-in-the-field-f1" is created (in much the same way as when you turn on fast element word searches, a new termlist such as "the-word-pig-in-the-element-title" is created). You can think of this as creating "a new index", though we don't normally describe it that way - it's just creating a set of new termlists.

Then you described an experiment - here's where I'm puzzled.
Presumably when you say you ran 'cts:query(doc(), "myword")', you mean 'cts:search(doc(), "myword")' ?? Or maybe 'cts:search(fn:collection(), "myword")' ??

If you ran the same word query over the same corpus with the same database index settings, you should've seen the same scores. If you ran a different query - e.g. if you used cts:field-word-query() instead of cts:word-query() - then, as you described in your "simple tests", you should see a different score. In that case the TF is the number of times the term occurs *in the field*, not in the whole fragment.

I tried to reproduce your results with just a few documents - the "pig" documents I used in the User Conference presentation - and, as expected, I got the same score for a simple word query whether or not a field existed. Could you possibly send me a test case? Or at least an excerpt from the trace? The existence of a field should not affect the scores returned by a simple word query.

You asked:

"a) what the creation of a field is really doing to my DB in order to affect TF "
-- as described above, the creation of a field creates additional, field-specific termlists, so that TF on a cts:field-word-query() is based on the number of times the term appears in the field.

b) what the TF normalization function is
-- the TF normalization function adjusts the count of the occurrences of a term according to the length of the document (strictly, the fragment). If we didn't adjust for document length, then longer documents would always dominate the results, since they are more likely to contain more occurrences of any given term. We don't publish the exact algorithm - partly because it's "secret sauce", and partly because we may tweak it from time to time.

You said: "P.S. As an aside - the developer docs describe "inverse document frequency" as "log(1/df) where df (document frequency) is the number of documents in which the term occurs."
I think this is a little misleading - it really means log(D/df) where D is the total number of documents (a.k.a. fragments), or a variant definition of df is needed. This is the behaviour that can be seen in the log trace. Also, just to be pedantic (who me?) it should probably be ln(D/df) rather than log(D/df) since it's the natural log :-) "

Yes, correct. IDF is about the percentage of documents that contain a term, not the absolute number of documents that contain that term. I'll log a doc bug.

- Steve B.

________________________________

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andy Townsend
Sent: Thursday, May 31, 2007 9:16 AM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] Relevance and Fields

Hi folks,

Could some kind soul (probably a kindly ML soul) please expand a little on how the new 3.2 Fields and Relevance interplay?

Slide 14 from Stephen's presentation on relevance from the User Conference (I'm afraid I was in another session) hints that Fields can have an effect, as it says down the bottom:

    Relevance may be calculated with respect to an element or a field
    More focused relevance measurement

However, all the rest of the slides and the 3.2 developer's guide (section 23.2) refer only to fragments and the calculation of TF and IDF from fragment-based stats.
I ran some very simple tests in a DB with about a hundred documents and turned on the Relevance trace (as explained at the conference).

I was able to demonstrate that creating a Field appears to create a new index from which TF is calculated, since when doing a cts:field-word-query() I could see a lower TF value in the trace output (for a document where some term occurrences fell in the field and some fell outside). Marvellous!

However...... when doing a simple word-query across all docs I found that relevance actually varied depending on whether the Field actually existed, i.e.

- DB, no fields, run cts:query(doc(), "myword") and docA gets relevance X
- create field, wait for DB to settle down after reindexing
- DB, with field, re-run cts:query(doc(), "myword") and now docA gets relevance Y where Y < X (!!)
- drop field, wait for reindexing to settle
- DB, no fields, re-run cts:query(doc(), "myword") and now docA gets relevance X again. (!!!)

The Relevance trace shows that the only value changing is the value for TF (so IDF still the same, number of total fragments still the same); however, the number of term occurrences has not changed, and neither (as far as I know) has the fragment size.

This makes me wonder:

a) what the creation of a field is really doing to my DB in order to affect TF
b) what the TF normalization function is - this function is referred to on slide 12 ("normalization for fragment length") and in 23.1.1 in the developer docs, where it also says: "a word that occurs 10 times in a 100 word document will get a higher score than a word that occurs 100 times in a 1,000 word document" but gives no further details of what this function is and why docs with 10/100 should count more than docs with 100/1000

Any clarifications on Fields, Field indexes and how these interplay with relevance calculations?

Thanks in advance,
Andy

P.S.
As an aside - the developer docs describe "inverse document frequency" as "log(1/df) where df (document frequency) is the number of documents in which the term occurs." I think this is a little misleading - it really means log(D/df) where D is the total number of documents (a.k.a. fragments), or a variant definition of df is needed. This is the behaviour that can be seen in the log trace. Also, just to be pedantic (who me?) it should probably be ln(D/df) rather than log(D/df) since it's the natural log :-)

________________________________

The information contained in this e-mail and any subsequent correspondence is private and confidential and intended solely for the named recipient(s). If you are not a named recipient, you must not copy, distribute, or disseminate the information, open any attachment, or take any action in reliance on it. If you have received the e-mail in error, please notify the sender and delete the e-mail. Any views or opinions expressed in this e-mail are those of the individual sender, unless otherwise stated. Although this e-mail has been scanned for viruses you should rely on your own virus check, as the sender accepts no liability for any damage arising out of any bug or virus infection. John Wiley & Sons Limited is a private limited company registered in England with registered number 641132. Registered office address: The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ.

________________________________

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
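For anyone following the IDF aside in this thread, here is a minimal Python sketch of the arithmetic. It is an illustration only, not MarkLogic code: the function names are hypothetical, reading the trace values "316/2" and "508/4" as (D, df) pairs is an assumption, and the server's actual scoring algorithm is not published. It contrasts the ln(D/df) form agreed on in the thread with the log(1/df) wording quoted from the developer docs.

```python
import math


def idf(total_fragments: int, doc_frequency: int) -> float:
    """IDF as ln(D/df): D = total fragments, df = fragments containing the term."""
    if total_fragments <= 0 or doc_frequency <= 0:
        raise ValueError("counts must be positive")
    # math.log with one argument is the natural logarithm.
    return math.log(total_fragments / doc_frequency)


def idf_as_documented(doc_frequency: int) -> float:
    """The log(1/df) wording quoted from the docs, kept for comparison."""
    if doc_frequency <= 0:
        raise ValueError("count must be positive")
    return math.log(1 / doc_frequency)


if __name__ == "__main__":
    # Assumption: the trace values are (D, df) pairs, without and with the field.
    without_field = idf(316, 2)
    with_field = idf(508, 4)
    print(f"ln(316/2) = {without_field:.4f}")
    print(f"ln(508/4) = {with_field:.4f}")
    # log(1/df) ignores D entirely and is never positive for df >= 1.
    print(f"log(1/2)  = {idf_as_documented(2):.4f}")
```

Under that reading, the IDF with the field present comes out lower than without it, because df doubles while D grows by less than a factor of two. The log(1/df) form cannot show any such effect, since it does not depend on D at all.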
