OK. I can see the logic that says it might be useful/convenient to filter case-sensitive search terms using a case-insensitive list of stop words.
What seems slightly odd is that you want exactness in the choice of case yet are using an imprecise matching technique (MoreLikeThis) - effectively saying "I really care about the exact use of case but really don't care exactly which words match". Is this really the requirement? I would have thought in most cases the user would be willing to relax the exact case match requirement along with their choice of precisely which words are used to match. If this applies to your app you could run MoreLikeThis on the lower-cased version of the field in the index. Cheers Mark ----- Original Message ---- From: Jong Kim <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Monday, 9 July, 2007 3:55:03 PM Subject: RE: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project >>Or are you saying that you have deliberately chosen to index the content with a case-sensitive analyzer and that you want to supply stop words in a case-insensitive fashion? Correct. To be precise, we index each token up to twice - original token and its all-lowercase equivalent. Due to a product requirement, no token is thrown away at the time of indexing, that is, no stopwords filtering at indexing time. However, when executing MoreLikeThis feature, we do use a stopwords list (the fact that we indexed each and every word does not mean that they have to be included in the execution of MoreLikeThis), and we want the stopwords filtering to be case insensitive. /Jong -----Original Message----- From: mark harwood [mailto:[EMAIL PROTECTED] Sent: Monday, July 09, 2007 10:33 AM To: java-user@lucene.apache.org Subject: Re: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project >>My application stores term vectors with the index And those stored term vectors contain terms produced by your choice of analyzer, no? Or are you saying that you have deliberately chosen to index the content with a case-sensitive analyzer and that you want to supply stop words in a case-insensitive fashion? ----- Original Message ---- From: Jong Kim <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Monday, 9 July, 2007 3:00:05 PM Subject: RE: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project My application stores term vectors with the index, and use that information to implement more-like-this rather than tokenizing the original text using an analyzer. Consequently the option of achieving the effect by specifying different analyzer is no good for my case. /Jong -----Original Message----- From: mark harwood [mailto:[EMAIL PROTECTED] Sent: Monday, July 09, 2007 5:01 AM To: java-user@lucene.apache.org Subject: Re: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project >>I need this comparison to be case-insensitive The choice of case-sensitivity (and preservation of punctuation, numbers etc etc) is controlled by your choice of analyzer that you pass to MoreLikeThis. If you want to ensure your list of stop words adheres to the same logic - use the same analyzer to construct the set from wherever you store your stop words e.g. a file. I don't imagine there should be a need to change the MoreLikeThis source. Cheers Mark ----- Original Message ---- From: Jong Kim <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Sunday, 8 July, 2007 10:12:08 PM Subject: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project Hi, The MoreLikeThis class in Lucene's contrib/queries project performs noise word filtering based on the case-sensitive comparison of the terms against the user-supplied stopwords set. I need this comparison to be case-insensitive, but I don't see any way of achieving it by extending this class. I would have created a subclass of MoreLikeThis and override the isNoiseWord() method. However, the problem is that, neither isNoiseWord() method nor the instance variables referenced inside that method are declared protected. They are all private. Has anyone solved this problem without modifying and building MoreLikeThis class directly? An alternative approach would be to supply a stopwords list containing all variants of the stop words with all possible mixed cases. Needless to say, that isn't likely to be a workable solution for many. Ultimately it would be nice if those methods and variables would have been made protected so that applications could override some of the default behaviors without having to modify the class directly. Any help would be appreciated. Thanks /Jong ___________________________________________________________ Yahoo! Answers - Got a question? Someone out there knows the answer. Try it now. http://uk.answers.yahoo.com/ --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] ___________________________________________________________ Yahoo! Mail is the world's favourite email. Don't settle for less, sign up for your free account today http://uk.rd.yahoo.com/evt=44106/*http://uk.docs.yahoo.com/mail/winter07.htm l --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] ___________________________________________________________ Yahoo! Mail is the world's favourite email. Don't settle for less, sign up for your free account today http://uk.rd.yahoo.com/evt=44106/*http://uk.docs.yahoo.com/mail/winter07.html --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]