While debugging some word queries that did not return the expected
results for case-sensitive stemmed searches, I discovered via cts:stem()
that the handling of proper nouns (capitalized terms) is inconsistent.
I'm trying to figure out whether there's a pattern. Here are some results:
cts:stem("Baptists"), cts:stem("Buddhists"), cts:stem("Quakers")
==> 'Baptist', 'Buddhist', 'Quakers' [note the last one]
cts:stem("baptists"), cts:stem("buddhists"), cts:stem("quakers")
==> 'baptist', 'buddhist', 'quaker' [note the last one]
cts:stem("Democrats"), cts:stem("Republicans"), cts:stem("Whigs")
==> 'Democrat', 'Republican', 'Whigs'
cts:stem("democrats"), cts:stem("republicans"), cts:stem("whigs")
==> 'democrat', 'republican', 'whig'
In practice, this means that a case-sensitive search on "Baptist" will
match "Baptists", as expected, but a search on "Quaker" will not match
"Quakers". (This assumes a cts:word-query() where case-sensitivity is
not specified, so that the capitalization of the query text triggers a
case-sensitive search.)
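To make the failing case concrete, here is a minimal sketch of the kind of query involved. It assumes stemmed searches are enabled on the database and searches all documents via fn:doc(); the second query is only one possible stopgap (listing both forms explicitly), not a general fix.

```xquery
xquery version "1.0-ml";

(: Sketch only: assumes stemmed searches are enabled on the database. :)

(: No case option is given, so the capital "Q" makes this search
   case-sensitive. Per the cts:stem() results above, "Quakers" keeps
   its own stem when capitalized, so documents that contain only
   "Quakers" are not returned. :)
let $misses-plurals :=
  cts:search(fn:doc(), cts:word-query("Quaker", "stemmed"))

(: Possible stopgap: supply both forms in a single word query. :)
let $both-forms :=
  cts:search(fn:doc(), cts:word-query(("Quaker", "Quakers"), "stemmed"))

return (fn:count($misses-plurals), fn:count($both-forms))
```

Listing the forms by hand only works for terms already known to be affected, which is why a way to find the full set matters.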
I don't want to treat all queries as case-insensitive, because that
would lose important distinctions such as the one between generic
"young" and "Young" as a name.
If I had some clue as to the set of words like "Quakers" and "Whigs"
that do not stem to singular nouns, I could create a custom dictionary
to handle such cases. Are MarkLogic's decisions here based on an
internal dictionary, on algorithms, or on both?
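
In the meantime, candidate terms could be probed with something like the following rough sketch. The word list is hypothetical, and the "strip a trailing s" test is a naive assumption that only suits regular plurals like the ones above.

```xquery
xquery version "1.0-ml";

(: Hypothetical probe list; replace with the proper nouns that matter
   in your own content. :)
for $plural in ("Baptists", "Buddhists", "Quakers",
                "Democrats", "Republicans", "Whigs")
(: Naive singular: drop the final character (assumes a plural in "s"). :)
let $singular := fn:substring($plural, 1, fn:string-length($plural) - 1)
let $stems := cts:stem($plural)
return
  fn:concat(
    $plural, " -> ", fn:string-join($stems, ", "),
    (: cts:stem() may return more than one stem, hence "=" not "eq". :)
    if ($stems = $singular) then ""
    else "  [does not stem to the singular]"
  )
```

Terms flagged this way would be the ones to put in a custom dictionary.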
--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: [email protected] Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/