Re: French and SpellingQueryConverter
Jonathan Mamou wrote:

> Thanks Michael for your answer!
> I think that (?:(?!(\w+:|\d+)))[\p{L}]+ should also be OK.

Oh yes, that's much simpler and clearer than my suggestion. (Newbieness
factor for Java style regular expressions, too.)

Or maybe this:

    (?:(?!(\w+:|\d+)))[\p{L}\d_]+

:-)

Michael Ludwig
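The candidate patterns discussed in this thread can be compared side by side with plain java.util.regex, without Solr on the classpath. The following is only a sketch: the class name PatternComparison and its helper are made up for illustration, and it exercises the regular expressions alone, not SpellingQueryConverter or any analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
public class PatternComparison {
    // Lookahead shared by all variants in the thread: a match may not start
    // at a "field:" prefix or at a digit.
    public static final String PREFIX = "(?:(?!(\\w+:|\\d+)))";

    // Pattern quoted from SpellingQueryConverter (\w is ASCII-only by default).
    public static final Pattern ASCII_WORD = Pattern.compile(PREFIX + "\\w+");
    // Michael's first suggestion.
    public static final Pattern JAVA_CASE =
            Pattern.compile(PREFIX + "[\\p{javaLowerCase}\\p{javaUpperCase}\\d_]+");
    // Jonathan's suggestion.
    public static final Pattern LETTERS = Pattern.compile(PREFIX + "[\\p{L}]+");
    // Michael's follow-up, which keeps digits and underscore like \w does.
    public static final Pattern LETTERS_DIGITS = Pattern.compile(PREFIX + "[\\p{L}\\d_]+");

    // Collects every match of the pattern in the input, in order.
    public static List<String> tokens(Pattern pattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its encoding.
        String[] inputs = {"fran\u00e7ais", "K\u00e4se"};
        Pattern[] patterns = {ASCII_WORD, JAVA_CASE, LETTERS, LETTERS_DIGITS};
        for (Pattern p : patterns) {
            for (String input : inputs) {
                System.out.println(p.pattern() + "  " + input + " -> " + tokens(p, input));
            }
        }
    }
}
```

With the original pattern both words are cut at the accented letter, while the three alternatives return each word whole. Two differences between the alternatives may matter in practice: [\p{L}]+ no longer accepts digits or underscore inside a token, and the javaLowerCase/javaUpperCase variant leaves out letters that have no case, which \p{L} includes.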
Re: French and SpellingQueryConverter
Thanks Michael for your answer!
I think that (?:(?!(\w+:|\d+)))[\p{L}]+ should also be OK.

Jonathan

Michael Ludwig wrote to solr-user@lucene.apache.org on 19/05/2009 15:22
(Subject: Re: French and SpellingQueryConverter; please respond to
solr-u...@lucene.apache.org):

Shalin Shekhar Mangar wrote:
> On Mon, May 11, 2009 at 2:46 PM, Michael Ludwig wrote:
>
>> Could you give an example of how the spellcheck.q parameter can be
>> brought into play to (take non-ASCII characters into account, so
>> that "Käse" isn't mishandled) given the following example:
>
> You will need to set the correct tokenizer and filters for your field
> which can handle your language correctly. Look at the GermanAnalyzer
> in Lucene contrib-analysis. It uses StandardTokenizer, StandardFilter,
> LowerCaseFilter, StopFilter, GermanStemFilter with a custom stopword
> list.

Hello Shalin,

thanks for your kind answer, and sorry for my delay in responding.
Due to my newbieness in this domain, I misphrased my question. What I
wanted to say (and Jonathan, too, I think) is that the regular
expression in that SpellingQueryConverter only deals with ASCII, which
is insufficient for most languages, including French and German.

I think the regular expression in SpellingQueryConverter should be
something like:

    (?:(?!(\w+:|\d+)))[\p{javaLowerCase}\p{javaUpperCase}\d_]+

vs.

    (?:(?!(\w+:|\d+)))\w+

Then, correct German and French TokenStreams are generated in the
example program I posted.

But I may well have misunderstood the purpose of this class. You will
know.

Michael Ludwig
Re: French and SpellingQueryConverter
Shalin Shekhar Mangar wrote:
> On Mon, May 11, 2009 at 2:46 PM, Michael Ludwig wrote:
>
>> Could you give an example of how the spellcheck.q parameter can be
>> brought into play to (take non-ASCII characters into account, so
>> that "Käse" isn't mishandled) given the following example:
>
> You will need to set the correct tokenizer and filters for your field
> which can handle your language correctly. Look at the GermanAnalyzer
> in Lucene contrib-analysis. It uses StandardTokenizer, StandardFilter,
> LowerCaseFilter, StopFilter, GermanStemFilter with a custom stopword
> list.

Hello Shalin,

thanks for your kind answer, and sorry for my delay in responding.
Due to my newbieness in this domain, I misphrased my question. What I
wanted to say (and Jonathan, too, I think) is that the regular
expression in that SpellingQueryConverter only deals with ASCII, which
is insufficient for most languages, including French and German.

I think the regular expression in SpellingQueryConverter should be
something like:

    (?:(?!(\w+:|\d+)))[\p{javaLowerCase}\p{javaUpperCase}\d_]+

vs.

    (?:(?!(\w+:|\d+)))\w+

Then, correct German and French TokenStreams are generated in the
example program I posted.

But I may well have misunderstood the purpose of this class. You will
know.

Michael Ludwig
Re: French and SpellingQueryConverter
On Mon, May 11, 2009 at 2:46 PM, Michael Ludwig wrote:

> Could you give an example of how the spellcheck.q parameter can be
> brought into play to (take non-ASCII characters into account, so
> that "Käse" isn't mishandled) given the following example:

You will need to set the correct tokenizer and filters for your field
which can handle your language correctly. Look at the GermanAnalyzer in
Lucene contrib-analysis. It uses StandardTokenizer, StandardFilter,
LowerCaseFilter, StopFilter, GermanStemFilter with a custom stopword
list.

Use the analysis.jsp on the admin page to see how queries on that field
type are tokenized. Tweak until it works as desired.

Once that is set up, you need to send all the spell check queries
through the spellcheck.q parameter. The query-time analyzer for that
field will be used by the spellchecker to analyze the query.

-- 
Regards,
Shalin Shekhar Mangar.
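To make the spellcheck.q advice concrete, here is a rough sketch of how a request carrying both q and spellcheck.q might be assembled. Several things in it are assumptions rather than facts from this thread: the base URL http://localhost:8983/solr/select, the idea that the spellcheck component is attached to that handler, the field name title, and the class name SpellcheckRequestUrl. Only the parameter names q, spellcheck and spellcheck.q are taken from Solr itself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; not part of Solr or SolrJ.
public class SpellcheckRequestUrl {

    // Percent-encodes a parameter value as UTF-8 so that non-ASCII
    // characters such as the umlaut in "Käse" survive the HTTP request.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds a URL where q carries the full query and spellcheck.q carries
    // only the raw words to be checked, so the field's query-time analyzer
    // sees them instead of SpellingQueryConverter.
    public static String build(String baseUrl, String query, String spellingQuery) {
        return baseUrl
                + "?q=" + encode(query)
                + "&spellcheck=true"
                + "&spellcheck.q=" + encode(spellingQuery);
    }

    public static void main(String[] args) {
        // Assumed handler URL and field name, for illustration only.
        System.out.println(build("http://localhost:8983/solr/select",
                "title:K\u00e4se", "K\u00e4se"));
        // prints http://localhost:8983/solr/select?q=title%3AK%C3%A4se&spellcheck=true&spellcheck.q=K%C3%A4se
    }
}
```

Whether the servlet container decodes the query string as UTF-8 is a separate configuration question that this sketch does not address.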
Re: French and SpellingQueryConverter
Shalin Shekhar Mangar wrote:
> On Fri, May 8, 2009 at 2:14 AM, Jonathan Mamou wrote:
>
>> SpellingQueryConverter always splits words with special characters.
>> I think that the issue is in the SpellingQueryConverter class:
>> Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");
>> According to
>> http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html,
>> \w  A word character: [a-zA-Z_0-9]
>> I think that special characters should also be added to the regex.

Same issue for the GermanAnalyzer as for the FrenchAnalyzer.

http://wiki.apache.org/solr/SpellCheckComponent says:

  The SpellingQueryConverter class does not deal properly with
  non-ASCII characters. In this case, you have either to use
  spellcheck.q, or to implement your own QueryConverter.

> If you use spellcheck.q parameter for specifying the spelling query,
> then the field's analyzer will be used (in this case, FrenchAnalyzer).
> If you use the q parameter, then the SpellingQueryConverter is used.

Could you give an example of how the spellcheck.q parameter can be
brought into play to (take non-ASCII characters into account, so
that "Käse" isn't mishandled) given the following example:

    package org.apache.solr.spelling;

    import org.apache.lucene.analysis.de.GermanAnalyzer;

    public class GermanTest {
        public static void main(String[] args) {
            SpellingQueryConverter sqc = new SpellingQueryConverter();
            sqc.analyzer = new GermanAnalyzer();
            System.out.println(sqc.convert("Käse"));
        }
    }

Note the result of the above, which is plain wrong, reads:

    [(k,0,1,type=), (se,2,4,type=)]

Thanks.

Michael Ludwig
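The offsets in that output can be reproduced with the quoted pattern alone, which suggests the split happens at the regex stage, before the analyzer sees the text. This is a minimal sketch using only java.util.regex; the class name AsciiWordSplitDemo is invented for illustration, and it does not call SpellingQueryConverter, so lowercasing and any stemming by the analyzer are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
public class AsciiWordSplitDemo {
    // Same pattern string as quoted from SpellingQueryConverter in this thread.
    static final Pattern QUERY_REGEX = Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");

    // Returns each regex match formatted as "text,start,end".
    public static List<String> matches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = QUERY_REGEX.matcher(input);
        while (m.find()) {
            out.add(m.group() + "," + m.start() + "," + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Unicode escapes stand for the umlaut and the cedilla.
        System.out.println(matches("K\u00e4se"));      // [K,0,1, se,2,4]
        System.out.println(matches("fran\u00e7ais"));  // [fran,0,4, ais,5,8]
    }
}
```

The start and end offsets line up with the tokens reported in the thread for both "Käse" and "français": the accented letter is not matched by \w, so it ends one match and the next match begins after it.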
Re: French and SpellingQueryConverter
On Fri, May 8, 2009 at 2:14 AM, Jonathan Mamou wrote:

> Hi
> It does not seem to be related to FrenchStemmer, the stemmer does not
> split a word into 2 words. I have checked with other words and
> SpellingQueryConverter always splits words with special characters.
> I think that the issue is in the SpellingQueryConverter class:
> Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");
> According to
> http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html,
> \w  A word character: [a-zA-Z_0-9]
> I think that special characters should also be added to the regex.

If you use spellcheck.q parameter for specifying the spelling query,
then the field's analyzer will be used (in this case, FrenchAnalyzer).
If you use the q parameter, then the SpellingQueryConverter is used.

-- 
Regards,
Shalin Shekhar Mangar.
Re: French and SpellingQueryConverter
Hi

It does not seem to be related to FrenchStemmer, the stemmer does not
split a word into 2 words. I have checked with other words and
SpellingQueryConverter always splits words with special characters.
I think that the issue is in the SpellingQueryConverter class:

    Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");

According to
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html,

    \w  A word character: [a-zA-Z_0-9]

I think that special characters should also be added to the regex.

Best regards,
Jonathan

Jay Hill wrote to solr-user@lucene.apache.org on 07/05/2009 20:33
(Subject: Re: French and SpellingQueryConverter; please respond to
solr-u...@lucene.apache.org):

It seems to me that this is just the expected behavior of the
FrenchAnalyzer using the FrenchStemmer. I'm not familiar with the French
language, but in English words like running, runner, and runs are all
stemmed down to "run" as intended. I don't know what other words in
French would stem down to "franc", but wouldn't this be what you would
want? If not, maybe experiment with some of the other Analyzers to see
if they give you what you need.

-Jay

On Thu, May 7, 2009 at 6:51 AM, Jonathan Mamou wrote:

> Hi
> I have tried to run the following code
>
>     package org.apache.solr.spelling;
>
>     import org.apache.lucene.analysis.fr.FrenchAnalyzer;
>
>     public class Test {
>         public static void main(String[] args) {
>             SpellingQueryConverter sqc = new SpellingQueryConverter();
>             sqc.analyzer = new FrenchAnalyzer();
>             System.out.println(sqc.convert("français"));
>         }
>     }
>
> I would expect to get [(français,0,8,type=)]
> However I get [(fran,0,4,type=), (ais,5,8,type=)]
> Is there any issue with the support of special characters?
> Thanks
> Jonathan
Re: French and SpellingQueryConverter
It seems to me that this is just the expected behavior of the
FrenchAnalyzer using the FrenchStemmer. I'm not familiar with the French
language, but in English words like running, runner, and runs are all
stemmed down to "run" as intended. I don't know what other words in
French would stem down to "franc", but wouldn't this be what you would
want? If not, maybe experiment with some of the other Analyzers to see
if they give you what you need.

-Jay

On Thu, May 7, 2009 at 6:51 AM, Jonathan Mamou wrote:

> Hi
> I have tried to run the following code
>
>     package org.apache.solr.spelling;
>
>     import org.apache.lucene.analysis.fr.FrenchAnalyzer;
>
>     public class Test {
>         public static void main(String[] args) {
>             SpellingQueryConverter sqc = new SpellingQueryConverter();
>             sqc.analyzer = new FrenchAnalyzer();
>             System.out.println(sqc.convert("français"));
>         }
>     }
>
> I would expect to get [(français,0,8,type=)]
> However I get [(fran,0,4,type=), (ais,5,8,type=)]
> Is there any issue with the support of special characters?
> Thanks
> Jonathan
French and SpellingQueryConverter
Hi

I have tried to run the following code:

    package org.apache.solr.spelling;

    import org.apache.lucene.analysis.fr.FrenchAnalyzer;

    public class Test {
        public static void main(String[] args) {
            SpellingQueryConverter sqc = new SpellingQueryConverter();
            sqc.analyzer = new FrenchAnalyzer();
            System.out.println(sqc.convert("français"));
        }
    }

I would expect to get [(français,0,8,type=)]
However I get [(fran,0,4,type=), (ais,5,8,type=)]

Is there any issue with the support of special characters?

Thanks
Jonathan