TermAnalyzer# tokenStream ( final String fieldName, final Reader reader ) ------------------------------------------------------------------------------------------ TokenStream t = new WhitespaceAnalyzer( Version.LUCENE_31 ).tokenStream( fieldName, cf); t = new StopFilter( Version.LUCENE_31, t, stopWordSet, true ); t = new ShingleAnalyzerWrapper( t, 4 ).tokenStream( fieldName, reader ); t = new LowerCaseFilter( Version.LUCENE_31, t ); return t;
Thx Clemens > -----Ursprüngliche Nachricht----- > Von: Simon Willnauer [mailto:simon.willna...@googlemail.com] > Gesendet: Montag, 25. April 2011 12:13 > An: java-user@lucene.apache.org > Betreff: Re: "Umlaute" getting lost > > On Sun, Apr 24, 2011 at 8:30 AM, Grant Ingersoll <gsing...@apache.org> > wrote: > > > > On Apr 21, 2011, at 5:02 PM, Clemens Wyss wrote: > > > >> I keep my search terms in a dedicated RAMDirectory (the termIndex). > >> In there I palce all the term of my real index. When putting the > >> terms into the termIndex I can still see [using the debugger] the > >> Umlaute (äöü). Unfortunately when searching the termIndex the > documents no more contain these Umlaute. > >> > >> Populating the termIndex: > >> termIndex = new RAMDirectory(); > >> IndexWriterConfig config = new IndexWriterConfig( Version.LUCENE_31, > >> new TermAnalyzer( locale ) ); termIndexWriter = new IndexWriter( > >> termIndex, config ); TermEnum tEnum = realIndexReader.terms(); while > >> ( tEnum.next() ) { > >> Term t = tEnum.term(); > >> String termText = t.text(); > >> Document termDocument = new Document(); > >> Field field = new Field( FIELDNAME_TERM, termText, > >> Field.Store.YES, Field.Index.ANALYZED ); > >> termDocument.add( field ); > >> // and add term into the index > >> termIndexWriter.addDocument( termDocument ); } > >> termIndexWriter.commit(); termIndexWriter.optimize(); > >> termIndexWriter.close(); > >> > >> termIndexReader = IndexReader.open( termIndex, true ); > >> ---------- searching terms > >> Query q = fuzzy ? new FuzzyQuery( new Term( FIELDNAME_TERM, > termFilter.toLowerCase() ) ) : > >> new WildcardQuery( new Term( > >> FIELDNAME_TERM, "*" + termFilter.toLowerCase() + "*" ) ); TopDocs > >> topDocs = new IndexSearcher( getTermIndexReader() ).search( q, 100 ); > >> for ( ScoreDoc hit : topDocs.scoreDocs ) { > >> Document doc = getTermIndexReader().document( hit.doc ); > >> String indexTerm = doc.get( FIELDNAME_TERM ); > >> if ( !returnValue.contains( indexTerm ) ) > >> { > >> returnValue.add( indexTerm ); > >> } > >> } > >> ---------- > >> The TermAbnalyzer is the same analyzer as the main index analyzer with > the exception that a LowerCaseFilter is applied. > > > > What is the Analyzer for the Main Index? What is the tokenizer and token > filters used? > > in other words, can you provide what TermAnalyzer is composed of? > > > simon > > > > Out of curiosity, what is the problem you are trying to solve? > > > > -------------------------- > > Grant Ingersoll > > http://www.lucidimagination.com/ > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org