Re: French and SpellingQueryConverter

2009-05-19 Thread Michael Ludwig

Jonathan Mamou wrote:

> Thanks Michael for your answer!
> I think that (?:(?!(\w+:|\d+)))[\p{L}]+
> should also be OK.


Oh yes, that's much simpler and clearer than my suggestion.
(Newbieness factor for Java style regular expressions, too.)

Or maybe this: (?:(?!(\w+:|\d+)))[\p{L}\d_]+ :-)
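
For comparison, here is a minimal sketch of how the candidate patterns behave under plain java.util.regex, outside of Solr. Only the three patterns come from this thread; the class name PatternComparison and the sample inputs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Solr.
class PatternComparison {
    // Prefix shared by all patterns in the thread: a token may not start
    // where a "field:" prefix or a run of digits begins.
    static final String GUARD = "(?:(?!(\\w+:|\\d+)))";

    static final Pattern LETTERS = Pattern.compile(GUARD + "[\\p{L}]+");
    static final Pattern LETTERS_DIGITS = Pattern.compile(GUARD + "[\\p{L}\\d_]+");
    static final Pattern CASED =
            Pattern.compile(GUARD + "[\\p{javaLowerCase}\\p{javaUpperCase}\\d_]+");

    // Collect every match of the pattern in the input, in order.
    static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = {
            "fran\u00e7ais",            // the French word from the thread
            "title:K\u00e4se",          // a field prefix followed by the German word
            "caf\u00e92009",            // letters followed by digits
            "\u05e9\u05dc\u05d5\u05dd"  // a Hebrew word: letters without case
        };
        for (String s : samples) {
            System.out.println(s
                    + " -> letters=" + tokens(LETTERS, s)
                    + " letters+digits=" + tokens(LETTERS_DIGITS, s)
                    + " cased=" + tokens(CASED, s));
        }
    }
}
```

With these inputs, the letters-only pattern drops the digits in the third sample while the variant with \d_ keeps them, and the case-based class finds nothing in the Hebrew word, since its letters are neither lower nor upper case.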

Michael Ludwig


Re: French and SpellingQueryConverter

2009-05-19 Thread Jonathan Mamou
Thanks Michael for your answer!
I think that (?:(?!(\w+:|\d+)))[\p{L}]+
should also be OK.
Jonathan


   
Re: French and SpellingQueryConverter

2009-05-19 Thread Michael Ludwig

Shalin Shekhar Mangar wrote:

> On Mon, May 11, 2009 at 2:46 PM, Michael Ludwig wrote:
>
>> Could you give an example of how the spellcheck.q parameter can be
>> brought into play (to take non-ASCII characters into account, so
>> that "Käse" isn't mishandled), given the following example:
>
> You will need to set the correct tokenizer and filters for your field
> which can handle your language correctly. Look at the GermanAnalyzer
> in Lucene contrib-analysis. It uses StandardTokenizer, StandardFilter,
> LowerCaseFilter, StopFilter, GermanStemFilter with a custom stopword
> list.


Hello Shalin,

thanks for your kind answer, and sorry for my delay in responding.

Due to my newbieness in this domain, I misphrased my question. What
I wanted to say (and Jonathan, too, I think) is that the regular
expression in that SpellingQueryConverter only deals with ASCII,
which is insufficient for most languages, including French and
German.

I think the regular expression in SpellingQueryConverter should be
something like:

  (?:(?!(\w+:|\d+)))[\p{javaLowerCase}\p{javaUpperCase}\d_]+

instead of the current:

  (?:(?!(\w+:|\d+)))\w+

With that change, the example program I posted generates correct
German and French TokenStreams.
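
The difference between the two expressions can be checked with plain java.util.regex, without Solr on the classpath. This is only a sketch: the class name ConverterPatternDemo and the split helper are hypothetical, and it does not run the analyzer step that SpellingQueryConverter applies to each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Solr.
class ConverterPatternDemo {
    // The expression quoted in this thread as the current one.
    static final Pattern ORIGINAL =
            Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");

    // The replacement suggested above.
    static final Pattern PROPOSED = Pattern.compile(
            "(?:(?!(\\w+:|\\d+)))[\\p{javaLowerCase}\\p{javaUpperCase}\\d_]+");

    // Return the substrings the pattern picks out of the query string.
    static List<String> split(Pattern pattern, String query) {
        List<String> words = new ArrayList<>();
        Matcher matcher = pattern.matcher(query);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String[] queries = {"K\u00e4se", "fran\u00e7ais"};
        for (String query : queries) {
            System.out.println(query
                    + ": original=" + split(ORIGINAL, query)
                    + " proposed=" + split(PROPOSED, query));
        }
    }
}
```

The original expression breaks each word at the accented letter, while the proposed one returns it whole.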

But I may well have misunderstood the purpose of this class. You
will know.

Michael Ludwig


Re: French and SpellingQueryConverter

2009-05-11 Thread Shalin Shekhar Mangar
On Mon, May 11, 2009 at 2:46 PM, Michael Ludwig wrote:

> Could you give an example of how the spellcheck.q parameter can be
> brought into play (to take non-ASCII characters into account, so
> that "Käse" isn't mishandled), given the following example:
>

You will need to set the correct tokenizer and filters for your field which
can handle your language correctly. Look at the GermanAnalyzer in Lucene
contrib-analysis. It uses StandardTokenizer, StandardFilter,
LowerCaseFilter, StopFilter, GermanStemFilter with a custom stopword list.

Use the analysis.jsp on the admin page to see how queries on that field type
are tokenized. Tweak until it works as desired. Once that is set up, you need
to send all the spell check queries through the spellcheck.q parameter. The
query-time analyzer for that field will be used by the spellchecker to analyze
the query.
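
As a rough sketch of what such a request could look like, the snippet below builds a URL that sends the user's text both as q and as spellcheck.q. The parameter names q, spellcheck and spellcheck.q are the ones used by the SpellCheckComponent; the host, port and handler path are assumptions about a local setup, and the class SpellcheckRequest is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Solr or SolrJ.
class SpellcheckRequest {
    // Build a request URL that passes the user's text in spellcheck.q so
    // that the field's query-time analyzer handles it.
    static String buildUrl(String handlerUrl, String userQuery) {
        try {
            // Percent-encode as UTF-8 so non-ASCII characters survive.
            String encoded = URLEncoder.encode(userQuery, "UTF-8");
            return handlerUrl
                    + "?q=" + encoded
                    + "&spellcheck=true"
                    + "&spellcheck.q=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed handler URL; adjust to wherever the spellcheck
        // component is registered in solrconfig.xml.
        String handler = "http://localhost:8983/solr/select";
        System.out.println(buildUrl(handler, "K\u00e4se"));
    }
}
```

The handler must have the spellcheck component configured for this to return suggestions.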

-- 
Regards,
Shalin Shekhar Mangar.


Re: French and SpellingQueryConverter

2009-05-11 Thread Michael Ludwig

Shalin Shekhar Mangar wrote:

> On Fri, May 8, 2009 at 2:14 AM, Jonathan Mamou wrote:
>
>> SpellingQueryConverter always splits words containing special
>> characters. I think that the issue is in the SpellingQueryConverter
>> class: Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");
>> According to
>> http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html,
>> \w A word character: [a-zA-Z_0-9]
>> I think that special characters should also be added to the regex.

Same issue for the GermanAnalyzer as for the FrenchAnalyzer.

http://wiki.apache.org/solr/SpellCheckComponent says:

  The SpellingQueryConverter class does not deal properly with
  non-ASCII characters. In this case, you have either to use
  spellcheck.q, or to implement your own QueryConverter.

> If you use the spellcheck.q parameter for specifying the spelling
> query, then the field's analyzer will be used (in this case,
> FrenchAnalyzer). If you use the q parameter, then the
> SpellingQueryConverter is used.

Could you give an example of how the spellcheck.q parameter can be
brought into play (to take non-ASCII characters into account, so
that "Käse" isn't mishandled), given the following example:

package org.apache.solr.spelling;

import org.apache.lucene.analysis.de.GermanAnalyzer;

public class GermanTest {
    public static void main(String[] args) {
        SpellingQueryConverter sqc = new SpellingQueryConverter();
        // Assign the analyzer directly (this test lives in the same package).
        sqc.analyzer = new GermanAnalyzer();
        // A single token for "Käse" would be the expected output.
        System.out.println(sqc.convert("Käse"));
    }
}

Note that the result of the above, which is plainly wrong, is:

  [(k,0,1,type=), (se,2,4,type=)]
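
One possible way to keep the same expression and still match accented letters is the Pattern.UNICODE_CHARACTER_CLASS flag, which makes \w and \d follow Unicode definitions. This is only a sketch under two assumptions: it needs Java 7 or later (the flag does not exist in Java 6), and it is not what SpellingQueryConverter itself does. The class name UnicodeWordClassDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Solr. Requires Java 7 or later.
class UnicodeWordClassDemo {
    static final String REGEX = "(?:(?!(\\w+:|\\d+)))\\w+";

    // Default behaviour: \w only covers [a-zA-Z_0-9].
    static final Pattern ASCII = Pattern.compile(REGEX);

    // Same expression, but \w and \d use Unicode character properties.
    static final Pattern UNICODE =
            Pattern.compile(REGEX, Pattern.UNICODE_CHARACTER_CLASS);

    // Collect every match of the pattern in the input, in order.
    static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String word = "K\u00e4se";
        System.out.println("ASCII \\w:   " + tokens(ASCII, word));
        System.out.println("Unicode \\w: " + tokens(UNICODE, word));
    }
}
```

Without the flag the word is split at the umlaut; with it the word comes back as a single match.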

Thanks.

Michael Ludwig


Re: French and SpellingQueryConverter

2009-05-08 Thread Shalin Shekhar Mangar
On Fri, May 8, 2009 at 2:14 AM, Jonathan Mamou wrote:

> Hi
> It does not seem to be related to FrenchStemmer: the stemmer does not split
> a word into two words. I have checked with other words, and
> SpellingQueryConverter always splits words containing special characters.
> I think that the issue is in the SpellingQueryConverter class:
> Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");
> According to
> http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html,
> \w A word character: [a-zA-Z_0-9]
> I think that special characters should also be added to the regex.
>

If you use the spellcheck.q parameter for specifying the spelling query, then
the field's analyzer will be used (in this case, FrenchAnalyzer). If you use
the q parameter, then the SpellingQueryConverter is used.

-- 
Regards,
Shalin Shekhar Mangar.


Re: French and SpellingQueryConverter

2009-05-07 Thread Jonathan Mamou
Hi
It does not seem to be related to FrenchStemmer: the stemmer does not split
a word into two words. I have checked with other words, and
SpellingQueryConverter always splits words containing special characters.
I think that the issue is in the SpellingQueryConverter class:
Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");
According to
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html,
\w A word character: [a-zA-Z_0-9]
I think that special characters should also be added to the regex.
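
The split can be reproduced with the regular expression alone, which suggests the analyzer is not involved. The following is a minimal sketch using only java.util.regex; the class name AsciiWordClassDemo and the spans helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Solr.
class AsciiWordClassDemo {
    // The expression quoted above from SpellingQueryConverter.
    static final Pattern QUERY_PATTERN =
            Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+");

    // Report each match together with its start and end offsets.
    static List<String> spans(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = QUERY_PATTERN.matcher(input);
        while (matcher.find()) {
            result.add("(" + matcher.group() + ","
                    + matcher.start() + "," + matcher.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // By default \w is ASCII-only, so the c-cedilla is not a word character.
        System.out.println(Pattern.matches("\\w", "\u00e7")); // prints false
        // prints [(fran,0,4), (ais,5,8)]
        System.out.println(spans("fran\u00e7ais"));
    }
}
```

The offsets are the same as in the output reported for the converter, with the break exactly at the non-ASCII letter.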
Best regards,
Jonathan


   
Re: French and SpellingQueryConverter

2009-05-07 Thread Jay Hill
It seems to me that this is just the expected behavior of the FrenchAnalyzer
using the FrenchStemmer. I'm not familiar with the French language, but in
English, words like running, runner, and runs are all stemmed down to "run"
as intended. I don't know what other words in French would stem down to
"franc", but wouldn't this be what you would want? If not, maybe experiment
with some of the other Analyzers to see if they give you what you need.

-Jay

On Thu, May 7, 2009 at 6:51 AM, Jonathan Mamou wrote:

> Hi
> I have tried to run the following code:
>
> package org.apache.solr.spelling;
>
> import org.apache.lucene.analysis.fr.FrenchAnalyzer;
>
> public class Test {
>
>   public static void main(String[] args) {
>     SpellingQueryConverter sqc = new SpellingQueryConverter();
>     sqc.analyzer = new FrenchAnalyzer();
>     System.out.println(sqc.convert("français"));
>   }
>
> }
>
> I would expect to get [(français,0,8,type=)]
> However, I get [(fran,0,4,type=), (ais,5,8,type=)]
> Is there any issue with the support of special characters?
> Thanks
> Jonathan


French and SpellingQueryConverter

2009-05-07 Thread Jonathan Mamou

Hi
I have tried to run the following code:

package org.apache.solr.spelling;

import org.apache.lucene.analysis.fr.FrenchAnalyzer;

public class Test {

  public static void main(String[] args) {
    SpellingQueryConverter sqc = new SpellingQueryConverter();
    // Assign the analyzer directly (this test lives in the same package).
    sqc.analyzer = new FrenchAnalyzer();
    System.out.println(sqc.convert("français"));
  }

}

I would expect to get [(français,0,8,type=)]
However, I get [(fran,0,4,type=), (ais,5,8,type=)]
Is there any issue with the support of special characters?
Thanks
Jonathan