Re: Index and search terms containing character "-"

Erick Erickson Wed, 03 Jun 2009 06:05:11 -0700

Just be aware that KeywordAnalyzer won't tokenize at all. That is,if you
expect to index "jack-bauer" and hit on "jack" or "bauer" it won't.


Best
Erick

On Wed, Jun 3, 2009 at 2:25 AM, legrand thomas <thomaslegran...@yahoo.fr>wrote:

> Hi,
>
> A KeywordAnalyzer solved my problem.
> Luke allowed me to understand the queries and the content of the index.
>
> Thanks  (Erick & Balasubramanian Sudaakeran)
> Tom
>
> --- En date de : Dim 31.5.09, Erick Erickson <erickerick...@gmail.com> a
> écrit :
>
> De: Erick Erickson <erickerick...@gmail.com>
> Objet: Re: Index and search terms containing character "-"
> À: java-user@lucene.apache.org
> Date: Dimanche 31 Mai 2009, 15h33
>
> Simple analyzer does two things: splits tokens on non-letter characters and
> lowercases them.
>
> So, in your test you've indexed the tokens "jack" and "bauer" in your
> second
> document, the hyphen is completely lost during tokenization and you have
> two tokens for that document.
>
> Using the term query "jack-bauer" is looking for a *single* token that is
> exactly "jack-bauer", which is nowhere in your index. Term queries assume
> that you know exactly what result you want, they don't try to transform the
> input at all. Had you run "jack-bauer" through the query parser with
> SimpleAnalyzer, you'd be searching for a document with the two terms
> "jack" and "bauer", and you'd have hit your second document but not your
> first.
>
> I'd strongly recommend you get a copy of Luke, it's invaluable for
> questions
> like this
> because it lets you look at what's actually in your index. It'll also show
> you how
> queries get broken down when pushed through various analyzers...
>
> BTW, nice test case for demonstrating what you were seeing, it makes
> answering
> *vastly* easier.....
>
> HTH
> Erick
>
> On Sun, May 31, 2009 at 5:55 AM, legrand thomas <thomaslegran...@yahoo.fr
> >wrote:
>
> > Hi,
> >
> > I have a problem using TermQuery and FuzzyQuery for terms containing the
> > character "-". Considering I've indexed "jack" and "jack-bauer" as 2
> > tokenized captions, I get no result when searching for "jack-bauer".
> > Moreover, "jack" with a TermQuery returns the two captions.
> >
> > What should I do to get "jack-bauer" with new TermQuery("jack-bauer") ?
> >
> > A full test case is given below.
> >
> > Thanks,
> > Tom
> >
> >
> > import junit.framework.Assert;
> >
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.SimpleAnalyzer;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.Field;
> > import org.apache.lucene.index.IndexReader;
> > import org.apache.lucene.index.IndexWriter;
> > import org.apache.lucene.index.Term;
> > import org.apache.lucene.search.Hits;
> > import org.apache.lucene.search.IndexSearcher;
> > import org.apache.lucene.search.TermQuery;
> > import org.apache.lucene.store.FSDirectory;
> > import org.junit.Test;
> >
> > public class IDebugIndexTest {
> >
> >     @Test
> >     public void TermQueryTest() {
> >
> >         Analyzer analyser = new SimpleAnalyzer();
> >
> >         try {
> >             // write docs to new index
> >             IndexWriter writer = new IndexWriter(FSDirectory
> >                     .getDirectory("/tmp/idx_test"), analyser, true);
> >
> >             Document jack = new Document();
> >             jack.add(new Field("caption", "jack", Field.Store.YES,
> >                     Field.Index.TOKENIZED));
> >             writer.addDocument(jack);
> >
> >             Document jackBauer = new Document();
> >             jackBauer.add(new Field("caption", "jack-bauer",
> > Field.Store.YES,
> >                     Field.Index.TOKENIZED));
> >             writer.addDocument(jackBauer);
> >
> >             writer.close();
> >
> >             // try to search
> >             IndexSearcher s = new
> > IndexSearcher(IndexReader.open(FSDirectory
> >                     .getDirectory("/tmp/idx_test")));
> >
> >             // The next assertion is ok
> >             Hits jackHits = s
> >             .search(new TermQuery(new Term("caption", "jack")));
> >             Assert.assertEquals(jackHits.length(), 2);
> >
> >             // The next assertion fails !!!
> >             Hits jackBauerHits = s.search(new TermQuery(new
> Term("caption",
> >             "jack-bauer")));
> >             Assert.assertEquals(jackBauerHits.length(), 1);
> >
> >         } catch (Exception e) {
> >             Assert.fail();
> >         }
> >
> >     }
> > }
> >
> >
> >
> >
> >
> >
>
>
>
>
>

Re: Index and search terms containing character "-"

Reply via email to