On 11/29/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
If I were to analyze Greek text, I might do something like this:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
          <filter class="solr.SnowballPorterFilterFactory" language="Greek"/>
      </analyzer>
 </fieldtype>

Hmm, I just discovered that the Porter2 Snowball stemmers don't support Greek.
Here is the relevant code from GreekAnalyzer. To duplicate it, I'd write
a FilterFactory for GreekLowerCaseFilter and reuse the existing
factories for the rest.

public TokenStream tokenStream(String fieldName, Reader reader)
{
  // Tokenize, lowercase using the analyzer's Greek charset, then drop stop words.
  TokenStream result = new StandardTokenizer(reader);
  result = new GreekLowerCaseFilter(result, charset);
  result = new StopFilter(result, stopSet);
  return result;
}
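
As a rough sketch only, the matching schema entry might look something like the following once such a factory exists. The class name com.example.GreekLowerCaseFilterFactory and the field type name text_greek are hypothetical placeholders, and stopwords.txt is assumed to hold Greek stop words. The factory would also have to supply the charset that GreekLowerCaseFilter expects.

```xml
<fieldtype name="text_greek" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Same chain as GreekAnalyzer: tokenize, Greek lowercase, stop words -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Hypothetical custom factory wrapping GreekLowerCaseFilter -->
    <filter class="com.example.GreekLowerCaseFilterFactory"/>
    <!-- Assumes stopwords.txt lists Greek stop words -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
</fieldtype>
```

This drops the synonym and Snowball filters from the earlier example so that the chain mirrors the GreekAnalyzer code step for step.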

At some point I'd like to get to automatically generating a
FilterFactory when none exists, so that new Lucene filters can be used
without any extra coding.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
