Thanks Jack. I'll try this out. I'll have to see if that creates other side effects :-(. Tokenization is already causing a great deal of confusion. I want to make it as intuitive as possible.
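
To check my understanding of the suggestion, here is a toy model in plain Java. It is not Lucene code and makes no Lucene calls. The class and method names (`PrefixMatchSketch`, `standardLikeTerms`, `preserveOriginalTerms`, `prefixMatches`) are made up for illustration. It assumes the standard analyzer's effect on this one value can be approximated by lower-casing and splitting on the period, and that the whitespace tokenizer plus word delimiter filter can be approximated by keeping the whole token alongside its period-separated parts. The real filter can also split on case changes and letter/digit boundaries depending on its flags, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PrefixMatchSketch {

    // Toy stand-in for the index-time behaviour reported in the thread:
    // lower-case the value, then split on whitespace and on '.'.
    static List<String> standardLikeTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[\\s.]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Toy stand-in for "whitespace tokenizer, then a delimiter filter that
    // preserves the original token": keep each whole token and also add
    // its '.'-separated parts.
    static List<String> preserveOriginalTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(token);
            for (String part : token.split("\\.")) {
                if (!part.isEmpty() && !part.equals(token)) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    // A prefix query is only lower-cased, never tokenized, so the whole
    // query text (minus the trailing '*') must be the prefix of ONE term.
    static boolean prefixMatches(List<String> terms, String query) {
        String prefix = query.toLowerCase(Locale.ROOT);
        if (prefix.endsWith("*")) {
            prefix = prefix.substring(0, prefix.length() - 1);
        }
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String value = "C0001.DevNm001";
        List<String> split = standardLikeTerms(value);
        List<String> preserved = preserveOriginalTerms(value);
        for (String query : new String[] {"DevNm00*", "C0001.DevNm00*"}) {
            System.out.println(query + " vs " + split + ": "
                    + prefixMatches(split, query));
            System.out.println(query + " vs " + preserved + ": "
                    + prefixMatches(preserved, query));
        }
    }
}
```

In this model, C0001.DevNm00* finds nothing against the split terms because no single term starts with "c0001.devnm00", while both queries match once the original token is kept. It says nothing about the term frequency and phrase query side effects Mike mentioned.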

On Wed, Aug 27, 2014 at 10:45 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> Yes, the white space tokenizer will preserve all punctuation, but... then
> the query for DevNm00* will fail. A "smarter" set of filters is probably
> needed here... start with white space tokenization, keep that overall
> token, then trim external punctuation and keep that token as well, and then
> use word delimiter filter to split out the embedded words, like DevNm00,
> and add them.
>
> The word delimiter filter will do most of that, but not the part of
> trimming out external punctuation. But depending on your use case, it may
> be close enough.
>
> See:
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
>
> -- Jack Krupansky
>
> -----Original Message----- From: Michael Sokolov
> Sent: Wednesday, August 27, 2014 10:26 AM
> To: java-user@lucene.apache.org
> Subject: Re: Why does this search fail?
>
> Tokenization is tricky. You might consider using whitespace tokenizer
> followed by word delimiter filter (instead of standard tokenizer); it
> does a kind of secondary tokenization pass that can preserve the
> original token in addition to its component parts. There are some weird
> side effects to do with term frequencies and phrase-like queries, but it
> would make all these wildcard queries work I think.
>
> -Mike
>
> On 08/27/2014 09:54 AM, Milind wrote:
>
>> I see. This is going to be extremely difficult to explain to end users.
>> It doesn't work as they would expect. Some of the tokenizing rules are
>> already somewhat confusing. Their expectation is that it should work the
>> way their searches work in Google.
>>
>> It's difficult enough to recognize that because the period is surrounded
>> by a digit and alphabet (as opposed to 2 digits or 2 alphabets), it gets
>> tokenized. So I'd have expected that C0001.DevNm00* would effectively
>> become a search for C0001 OR DevNm00*. But now, because of the presence
>> of the wildcard, it's considered as 1 term and the period is not a
>> tokenizer. That's actually good, but now the fact that it's still
>> considered as 2 terms for wildcard searches makes it very unintuitive.
>> I don't suppose that I can do anything about making wildcard search use
>> multiple terms if joined together with a tokenizer. But is there any way
>> that I can force it to go through an analyzer prior to doing the search?
>>
>> On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky <j...@basetechnology.com>
>> wrote:
>>
>>> Sorry, but you can only use a wildcard on a single term. "C0001.DevNm001"
>>> gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't
>>> match any term (at least in this case.)
>>>
>>> Also, if your query term includes a wildcard, it will not be fully
>>> analyzed. Some filters such as lower case are defined as "multi-term", so
>>> they will be performed, but the standard tokenizer is not being called,
>>> so the dot remains and this whole term is treated as one term, unlike the
>>> index analysis.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Milind
>>> Sent: Tuesday, August 26, 2014 12:24 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Why does this search fail?
>>>
>>> I have a field with the value C0001.DevNm001. If I search for
>>>
>>> C0001.DevNm001 --> Get Hit
>>> DevNm00* --> Get Hit
>>> C0001.DevNm00* --> Get No Hit
>>>
>>> The field gets tokenized on the period since it's surrounded by a letter
>>> and a number. The query gets evaluated as a prefix query. I'd have
>>> thought that this should have found the document. Any clues on why this
>>> doesn't work?
>>>
>>> The full code is below.
>>>
>>> Directory theDirectory = new RAMDirectory();
>>> Version theVersion = Version.LUCENE_47;
>>> Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
>>> IndexWriterConfig theConfig =
>>>         new IndexWriterConfig(theVersion, theAnalyzer);
>>> IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
>>>
>>> String theFieldName = "Name";
>>> String theFieldValue = "C0001.DevNm001";
>>> Document theDocument = new Document();
>>> theDocument.add(new TextField(theFieldName, theFieldValue,
>>>         Field.Store.YES));
>>> theWriter.addDocument(theDocument);
>>> theWriter.close();
>>>
>>> String theQueryStr = theFieldName + ":C0001.DevNm00*";
>>> Query theQuery =
>>>         new QueryParser(theVersion, theFieldName,
>>>                 theAnalyzer).parse(theQueryStr);
>>> System.out.println(theQuery.getClass() + ", " + theQuery);
>>> IndexReader theIndexReader = DirectoryReader.open(theDirectory);
>>> IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
>>> TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
>>> theSearcher.search(theQuery, collector);
>>> ScoreDoc[] theHits = collector.topDocs().scoreDocs;
>>> System.out.println("Hits found: " + theHits.length);
>>>
>>> Output:
>>>
>>> class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
>>> Hits found: 0
>>>
>>> --
>>> Regards
>>> Milind
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Regards
Milind