SimpleAnalyzer does two things: it splits tokens on non-letter characters and lowercases them.
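To see why the hyphen matters, here's a plain-Java sketch that mimics that split-on-non-letters-and-lowercase behavior (this is not Lucene's actual implementation, and `SimpleAnalyzerSketch`/`tokenize` are made-up names for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleAnalyzerSketch {

    // Mimics SimpleAnalyzer's behavior: split on any non-letter character
    // and lowercase each resulting token. The separator itself is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Non-letter ends the current token; the character is discarded.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The hyphen is a non-letter, so it acts as a separator and vanishes.
        System.out.println(SimpleAnalyzerSketch.tokenize("Jack-Bauer")); // [jack, bauer]
    }
}
```

So the single input "jack-bauer" never exists as one token in the index; only "jack" and "bauer" do.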
So in your test you've indexed the tokens "jack" and "bauer" for your second document; the hyphen is completely lost during tokenization, and you have two tokens for that document. A term query for "jack-bauer" looks for a *single* token that is exactly "jack-bauer", which is nowhere in your index. Term queries assume that you know exactly what term you want; they don't transform the input at all.

Had you run "jack-bauer" through the query parser with SimpleAnalyzer, you'd be searching for documents containing the terms "jack" and "bauer". With the parser's default OR operator that would match both of your documents (the first one contains "jack"); with AND, only the second.

I'd strongly recommend you get a copy of Luke; it's invaluable for questions like this because it lets you look at what's actually in your index. It will also show you how queries get broken down when pushed through various analyzers.

BTW, nice test case for demonstrating what you were seeing; it makes answering *vastly* easier.

HTH
Erick

On Sun, May 31, 2009 at 5:55 AM, legrand thomas <thomaslegran...@yahoo.fr> wrote:
> Hi,
>
> I have a problem using TermQuery and FuzzyQuery for terms containing the
> character "-". Considering I've indexed "jack" and "jack-bauer" as two
> tokenized captions, I get no result when searching for "jack-bauer".
> Moreover, "jack" with a TermQuery returns both captions.
>
> What should I do to get "jack-bauer" with new TermQuery("jack-bauer")?
>
> A full test case is given below.
>
> Thanks,
> Tom
>
> import junit.framework.Assert;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.store.FSDirectory;
> import org.junit.Test;
>
> public class IDebugIndexTest {
>
>     @Test
>     public void TermQueryTest() {
>
>         Analyzer analyser = new SimpleAnalyzer();
>
>         try {
>             // write docs to new index
>             IndexWriter writer = new IndexWriter(FSDirectory
>                     .getDirectory("/tmp/idx_test"), analyser, true);
>
>             Document jack = new Document();
>             jack.add(new Field("caption", "jack", Field.Store.YES,
>                     Field.Index.TOKENIZED));
>             writer.addDocument(jack);
>
>             Document jackBauer = new Document();
>             jackBauer.add(new Field("caption", "jack-bauer", Field.Store.YES,
>                     Field.Index.TOKENIZED));
>             writer.addDocument(jackBauer);
>
>             writer.close();
>
>             // try to search
>             IndexSearcher s = new IndexSearcher(IndexReader.open(FSDirectory
>                     .getDirectory("/tmp/idx_test")));
>
>             // The next assertion is ok
>             Hits jackHits = s.search(new TermQuery(new Term("caption", "jack")));
>             Assert.assertEquals(jackHits.length(), 2);
>
>             // The next assertion fails!
>             Hits jackBauerHits = s.search(new TermQuery(new Term("caption",
>                     "jack-bauer")));
>             Assert.assertEquals(jackBauerHits.length(), 1);
>
>         } catch (Exception e) {
>             Assert.fail();
>         }
>     }
> }
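The contrast Erick describes above can be sketched without Lucene at all: an exact term lookup against the indexed tokens fails, while splitting the query text the same way the analyzer split the document succeeds. The class name and the regex-based split below are illustrative assumptions, not Lucene API:

```java
import java.util.Arrays;
import java.util.List;

public class TermVsAnalyzedQuery {

    public static void main(String[] args) {
        // Tokens that a split-on-non-letters, lowercasing analyzer
        // would have produced for the field value "jack-bauer".
        List<String> indexedTokens = Arrays.asList("jack", "bauer");

        // A term query compares the raw query string against single
        // tokens in the index, with no transformation: no match.
        System.out.println(indexedTokens.contains("jack-bauer")); // false

        // Running the query text through the same analysis (lowercase,
        // split on anything that is not a letter) yields terms that do match.
        String[] analyzed = "jack-bauer".toLowerCase().split("[^a-z]+");
        System.out.println(indexedTokens.containsAll(Arrays.asList(analyzed))); // true
    }
}
```

The moral is the usual one: a query only matches if it is analyzed the same way the field was at index time.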