Andy,

I finally managed to find a few cycles to try this out, and I'm puzzled.

You said: "... creating a Field appears to create a new index from which TF is calculated ..."

Creating a Field causes new termlists to be created. So if you create a field f1 that includes an element called title that contains the word "pig", a new termlist for "the-word-pig-in-the-field-f1" is created (in much the same way as when you turn on fast element word searches, a new termlist such as "the-word-pig-in-the-element-title" is created). You can think of this as creating "a new index", though we don't normally describe it that way - it's just creating a set of new termlists.

Then you described an experiment - here's where I'm puzzled. Presumably when you say you ran 'cts:query(doc(), "myword")', you mean 'cts:search(doc(), "myword")'? Or maybe 'cts:search(fn:collection(), "myword")'? If you ran the same word query over the same corpus with the same database index settings, you should've seen the same scores. If you ran a different query - e.g. if you used cts:field-word-query() instead of cts:word-query() - then, as you described in your "simple tests", you should see a different score: in that case the TF is the number of times the term occurs *in the field*, not in the whole fragment.

I tried to reproduce your results with just a few documents - the "pig" documents I used in the User Conference presentation - and, as expected, I got the same score for a simple word query whether or not a field existed. Could you possibly send me a test case? Or at least an excerpt from the trace? The existence of a field should not affect the scores returned by a simple word query.

You asked:

"a) what the creation of a field is really doing to my DB in order to affect TF" -- as described above, the creation of a field creates additional, field-specific termlists, so that TF on a cts:field-word-query() is based on the number of times the term appears in the field.
"b) what the TF normalization function is" -- the TF normalization function adjusts the count of the occurrences of a term according to the length of the document (strictly, the fragment). If we didn't adjust for document length, then longer documents would always dominate the results, since they are more likely to contain more occurrences of any given term. We don't publish the exact algorithm - partly because it's "secret sauce", and partly because we may tweak it from time to time.

You said: "P.S. As an aside - the developer docs describe "inverse document frequency" as "log(1/df) where df (document frequency) is the number of documents in which the term occurs." I think this is a little misleading - it really means log(D/df) where D is the total number of documents (a.k.a. fragments) or a variant definition of df is needed. This is the behaviour that can be seen in the log trace. Also, just to be pedantic (who me?) it should probably be ln(D/df) rather than log(D/df) since it's the natural log :-)"

Yes, correct. IDF is about the percentage of documents that contain a term, not the absolute number of documents that contain that term. I'll log a doc bug.

- Steve B.

Stephen Buxton
Director of Product Management
Mark Logic Corporation
999 Skyway Road Suite 200
San Carlos, CA 94070
+1 650 655 2317 Phone
[EMAIL PROTECTED]
www.marklogic.com <http://www.marklogic.com/>
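
To make the two points in the reply above concrete, here is a minimal sketch in Python. It is an illustration only, not MarkLogic's scoring code: the document contents, the field definition `field_f1`, and the helper names `count_term`, `fragment_tf`, `field_tf` and `idf` are all hypothetical, and the counts are raw occurrence counts because the TF normalization function is not published. It shows (1) that a term count restricted to a field can be lower than the count over the whole fragment, and (2) IDF written as ln(D/df), as discussed in the P.S.

```python
import math
import re


def count_term(text, term):
    """Count whole-word, case-insensitive occurrences of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical fragment, modelled as element name -> text content.
doc = {
    "title": "The pig",
    "body": "A pig met another pig on the road.",
}

# Hypothetical field f1 that includes only the title element.
field_f1 = ["title"]


def fragment_tf(doc, term):
    """Raw (unnormalized) count of term over every element in the fragment."""
    return sum(count_term(text, term) for text in doc.values())


def field_tf(doc, field_elements, term):
    """Raw (unnormalized) count of term over the elements in the field only."""
    return sum(count_term(doc[name], term)
               for name in field_elements if name in doc)


def idf(total_fragments, df):
    """Inverse document frequency as ln(D/df), using the natural log."""
    return math.log(total_fragments / df)


print(fragment_tf(doc, "pig"))         # 3: one in title, two in body
print(field_tf(doc, field_f1, "pig"))  # 1: only the title is in the field
print(idf(100, 10))                    # ln(100/10), a positive value
print(math.log(1 / 10))                # ln(1/df) is negative for any df > 1
```

The last two lines show why the "log(1/df)" wording is misleading: for any term that occurs in more than one fragment it gives a negative number, whereas ln(D/df) is zero when every fragment contains the term and grows as the term gets rarer.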
________________________________

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andy Townsend
Sent: Thursday, May 31, 2007 9:16 AM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] Relevance and Fields

Hi folks,

Could some kind soul (probably a kindly ML soul) please expand a little on how the new 3.2 Fields and Relevance interplay?

Slide 14 from Stephen's presentation on relevance from the User Conference (I'm afraid I was in another session) hints that Fields can have an effect, as it says down the bottom:

  Relevance may be calculated with respect to an element or a field
  More focused relevance measurement

However, all the rest of the slides and the 3.2 developer's guide (section 23.2) refer only to fragments and the calculation of TF and IDF from fragment-based stats.

I ran some very simple tests in a DB with about a hundred documents and turned on the Relevance trace (as explained at the conference). I was able to demonstrate that creating a Field appears to create a new index from which TF is calculated since, when doing a cts:field-word-query(), I could see a lower TF value in the trace output (for a document where some term occurrences fell in the field and some fell outside). Marvellous!

However... when doing a simple word-query across all docs I found that relevance actually varied depending on whether the Field existed, i.e.

- DB, no fields, run cts:query(doc(), "myword") and docA gets relevance X
- create field, wait for DB to settle down after reindexing
- DB, with field, re-run cts:query(doc(), "myword") and now docA gets relevance Y where Y < X (!!)
- drop field, wait for reindexing to settle
- DB, no fields, re-run cts:query(doc(), "myword") and now docA gets relevance X again. (!!!)

The Relevance trace shows that the only value changing is the value for TF (so IDF is still the same, and the total number of fragments is still the same); however, the number of term occurrences has not changed, and neither (as far as I know) has the fragment size.

This makes me wonder:

a) what the creation of a field is really doing to my DB in order to affect TF

b) what the TF normalization function is - this function is referred to on slide 12 (normalization for fragment length) and in 23.1.1 in the developer docs, where it also says: "a word that occurs 10 times in a 100 word document will get a higher score than a word that occurs 100 times in a 1,000 word document" but gives no further details of what this function is and why docs with 10/100 should count for more than docs with 100/1000

Any clarifications on Fields, Field indexes and how these interplay with relevance calculations?

Thanks in advance,
Andy

P.S. As an aside - the developer docs describe "inverse document frequency" as "log(1/df) where df (document frequency) is the number of documents in which the term occurs." I think this is a little misleading - it really means log(D/df) where D is the total number of documents (a.k.a. fragments) or a variant definition of df is needed. This is the behaviour that can be seen in the log trace. Also, just to be pedantic (who me?) it should probably be ln(D/df) rather than log(D/df) since it's the natural log :-)
_______________________________________________ General mailing list [email protected] http://xqzone.com/mailman/listinfo/general
