Hi Rahil,

Rahil wrote:
> I couldn't figure out a valid regular expression to pass to
> Pattern.compile(String regex) that can tokenise the string
> "O/E - visual acuity R-eye=6/24" into "O", "/", "E", "-", "visual",
> "acuity", "R", "-", "eye", "=", "6", "/", "24".

The following regular expression matches the boundary between a word and a
non-word character, or between whitespace and non-whitespace, in either
order, and also swallows any contiguous whitespace:

    \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*

Note that with the above regex, the "(%$#!)" in "some (%$#!) text" will be
tokenized as a single token.
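In case a concrete usage example helps, here is a minimal, untested sketch of
that pattern driving java.util.regex on your example string. The class name is
just a placeholder, and the empty strings that Pattern.split() leaves around
the zero-width boundary matches are simply skipped:

    import java.util.regex.Pattern;

    public class BoundaryTokenizerDemo {
        // the boundary regex from above: split wherever a word/non-word or
        // whitespace/non-whitespace boundary occurs, swallowing the whitespace
        private static final Pattern BOUNDARY =
                Pattern.compile("\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*");

        public static void main(String[] args) {
            String text = "O/E - visual acuity R-eye=6/24";
            String[] pieces = BOUNDARY.split(text);
            for (int i = 0; i < pieces.length; i++) {
                // drop the empty strings left by zero-width boundary matches
                if (pieces[i].length() > 0) {
                    System.out.println(pieces[i]);
                }
            }
            // prints, one per line: O / E - visual acuity R - eye = 6 / 24
        }
    }

The same regex should be usable anywhere you can hand Lucene a splitting
pattern, since (as Erick notes below) such a pattern describes what is *not*
a token rather than what is.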
Hope it helps,
Steve

> Erick Erickson wrote:
>
>> Well, I'm not the greatest expert, but a quick look doesn't show me
>> anything obvious. But I have to ask, wouldn't WhitespaceAnalyzer work for
>> you? Although I don't remember whether WhitespaceAnalyzer lowercases or
>> not.
>>
>> It sure looks like you're getting reasonable results given how you're
>> tokenizing.
>>
>> If not that, you might want to think about PatternAnalyzer. It's in the
>> memory contribution section; see
>> org.apache.lucene.index.memory.PatternAnalyzer. One note of caution: the
>> regex identifies what is NOT a token, rather than what is. This threw me
>> for a bit.
>>
>> I still claim that you could break the tokens up like "6", "/", "12" and
>> make SpanNearQuery work with a span of 0 (or 1, I don't remember right
>> now), but that may well be more trouble than it's worth; it's up to you of
>> course. What you get out of this, essentially, is a query that's only
>> satisfied if the terms you specify are right next to each other. So you'd
>> find both your documents in your example, since you would have tokenized
>> "6", "/", "12" in, say, positions 0, 1, 2 in doc1 and 4, 5, 6 in the
>> second doc. Since those tokens are next to each other in each doc,
>> searching with a SpanNearQuery for "6", "/", and "12" that are "right next
>> to each other", which you specify with a slop of 0 as I remember, should
>> return both.
>>
>> Alternatively, if you tokenize it this way, a PhraseQuery might work as
>> well. Thus, searching for "6 / 12" (as a phrase query, and note the
>> spaces) might be just what you want. You'd have to tokenize the query, but
>> that's relatively easy. This is probably much simpler than a SpanNearQuery
>> now that I think about it.....
>>
>> Be aware that if you use the *TermEnums we've been talking about, you'll
>> probably wind up wrapping them in a ConstantScoreQuery. And if you have no
>> *other* terms, you won't get any relevancy out of your search. This may be
>> important.....
>>
>> Anyway, that's as creative as I can be Sunday night <G>. Best of luck....
>>
>> Erick
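For what it's worth, if you do index "6", "/" and "12" as separate tokens the
way Erick describes above, the two queries he mentions might look roughly like
this. This is only a rough, untested sketch against the Lucene 2.0 API; the
field name "TERM" comes from your examples, while the index path and class
name are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class SixTwelveQueries {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("index"); // placeholder path

            // Erick's first idea: the three tokens must appear with nothing
            // between them (slop 0) and in their original order (true).
            SpanQuery[] parts = {
                new SpanTermQuery(new Term("TERM", "6")),
                new SpanTermQuery(new Term("TERM", "/")),
                new SpanTermQuery(new Term("TERM", "12"))
            };
            SpanNearQuery spanQuery = new SpanNearQuery(parts, 0, true);

            // Erick's simpler alternative: the same three tokens as an exact phrase.
            PhraseQuery phraseQuery = new PhraseQuery();
            phraseQuery.add(new Term("TERM", "6"));
            phraseQuery.add(new Term("TERM", "/"));
            phraseQuery.add(new Term("TERM", "12"));
            phraseQuery.setSlop(0);

            Hits spanHits = searcher.search(spanQuery);
            Hits phraseHits = searcher.search(phraseQuery);
            System.out.println("span hits: " + spanHits.length()
                    + ", phrase hits: " + phraseHits.length());
            searcher.close();
        }
    }

Both should match "6/12 (finding)" and "R-eye=6/12 (finding)" if the tokens
really end up at consecutive positions.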
>>
>> On 10/1/06, Rahil <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Erick
>>>
>>> Thanks for your response. There's a lot to chew on in your reply and I'm
>>> looking at the suggestions you've made.
>>>
>>> Yeah, I have Luke installed and have queried my index, but there isn't
>>> any great explanation I'm getting out of it. A query for "6/12" is sent
>>> as "TERM:6/12", which is quite straightforward. I did an explanation of
>>> the query in my code and got some more information, but that wasn't of
>>> much help either.
>>>
>>> --
>>> Explanation explain = searcher.explain(query, 0);
>>>
>>> OUTPUT:
>>> query: +TERM:6/12
>>> explain.getDescription() : weight(TERM:6/12 in 0), product of:
>>>   Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
>>>     2.0986123 = idf(docFreq=1)
>>>     0.47650534 = queryNorm
>>>
>>>   Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
>>>     0.0 = tf(termFreq(TERM:6/12)=0)
>>>     2.0986123 = idf(docFreq=1)
>>>     0.5 = fieldNorm(field=TERM, doc=0)
>>>
>>> Number of results returned: 1
>>> SampleLucene.displayIndexResults
>>> SCORE  DESCRIPTIONSTATUS  CONCEPTID  TERM
>>> 1.0    0                  260278007  6/12 (finding)
>>> --
>>>
>>> My tokeniser, called BaseAnalyzer, extends Analyzer. Since I wanted to
>>> retain all non-whitespace characters and not just letters and digits, I
>>> introduced the following block of code in the overridden tokenStream():
>>>
>>> --
>>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>>
>>>     return new CharTokenizer(reader) {
>>>
>>>         protected char normalize(char c) {
>>>             return Character.toLowerCase(c);
>>>         }
>>>
>>>         protected boolean isTokenChar(char c) {
>>>             boolean type = false;
>>>             boolean space = Character.isWhitespace(c);
>>>             boolean letDig = Character.isLetterOrDigit(c);
>>>
>>>             if (letDig && !space)        // letter or digit, not whitespace
>>>                 type = true;
>>>             else if (!letDig && !space)  // not a letter or digit, but not
>>>                 type = true;             // whitespace either: retain it
>>>             else if (!letDig && space)   // whitespace: token boundary
>>>                 type = false;
>>>             return type;
>>>         }
>>>     };
>>> }
>>> ---
>>>
>>> The problem is that when the term "6/12 (finding)" is tokenised, two
>>> tokens are generated, viz. '6/12' and '(finding)'. Therefore, when I
>>> search for '6/12' this term is returned, as in a way it is an EXACT token
>>> match.
>>>
>>> However, when the term "R-eye=6/12 (finding)" is tokenised it again
>>> results in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if I
>>> look for '6/12' it's no longer an exact match, since there is no token
>>> with this EXACT value. A simple searcher.search(query) isn't useful for
>>> pulling out the partial token match.
>>>
>>> I think it won't be useful to create separate tokens for "6", "/", "12"
>>> or "R", "-", "eye", "=", and so on. I'm having a look at RegexTermEnum
>>> and WildcardTermEnum, as they might possibly help.
>>>
>>> Would appreciate your comments on the BaseAnalyzer tokeniser and the
>>> query explanation I've received so far.
>>>
>>> Thanks
>>> Rahil
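On the partial-token match described above: before reaching for the raw term
enumerators, a WildcardQuery built directly (the QueryParser normally rejects
a leading '*') already behaves like a 'contains' over whole tokens. Again only
a rough, untested sketch; the index path and class name are placeholders and
"TERM" is taken from the explain output:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.WildcardQuery;

    public class ContainsQueryDemo {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("index"); // placeholder path

            // "*6/12*" matches any indexed token that contains "6/12", e.g.
            // both "6/12" and "r-eye=6/12" (lower-cased by the analyzer).
            // The leading "*" forces Lucene to enumerate every term in the
            // field, so this can be slow on a large index.
            WildcardQuery query = new WildcardQuery(new Term("TERM", "*6/12*"));

            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("TERM"));
            }
            searcher.close();
        }
    }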
>>>
>>> Erick Erickson wrote:
>>>
>>> > Most often, from what I've seen on this e-mail list, unexpected results
>>> > are because you're not indexing on the tokens you *think* you're
>>> > indexing. Or not searching on them. By that I mean that the analyzers
>>> > you're using are behaving in ways you don't expect.
>>> >
>>> > That said, I think you're getting exactly what you should. I suspect
>>> > you're indexing tokens as follows:
>>> > doc1: "6/12" and "(finding)"
>>> > doc2: "R-eye=6/12" and "(finding)"
>>> >
>>> > So it makes perfect sense that searching on 6/12 returns doc1 and
>>> > searching on R-eye=6/12 returns doc2.
>>> >
>>> > So, first question: have you actually used something like Luke (google
>>> > "luke lucene") to examine your index and see if what you've put in
>>> > there is what you expect? What analyzer is your custom analyzer built
>>> > upon, and is it doing anything you're unaware of (for instance,
>>> > lower-casing the 'R' in your second example)?
>>> >
>>> > Here's what I'd do:
>>> > 1> get Luke and see what's actually in your index.
>>> > 2> use searcher.explain (?) to see the query you're actually emitting.
>>> > 3> if you make no headway, post the smallest code snippets you can that
>>> >    illustrate the problem. Folks would need the indexing AND searching
>>> >    code.
>>> >
>>> > As far as queries like "contains" in Java.... Well, sure. Write a
>>> > filter that filters on regular expressions or wildcards (you'll need
>>> > WildcardTermEnum and RegexTermEnum). Or index things differently (e.g.
>>> > index "6/12" and "finding" on doc1, and "r", "eye", "6/12" and
>>> > "finding" on doc2; now your searches for "6/12" will work). Or index
>>> > "6", "/", "12" and "finding" on doc1, index similarly for doc2, and use
>>> > a SpanNearQuery with an appropriate span value. Or....
>>> >
>>> > This is all gobbledygook if you haven't gotten a copy of "Lucene in
>>> > Action", which you should read in order to get the most out of Lucene.
>>> > It's for the 1.4 code base, but the 2.0 Lucene code base isn't that
>>> > much different. More importantly, it ties lots of stuff together. Also,
>>> > the junit tests that come along with the Lucene code can be invaluable
>>> > to show you how to do something.
>>> >
>>> > Hope this helps
>>> > Erick
>>> >
>>> > On 10/1/06, Rahil <[EMAIL PROTECTED]> wrote:
>>> >
>>> >> Hi
>>> >>
>>> >> I have a custom-built Analyzer where I tokenise all non-whitespace
>>> >> characters in the field "TERM" (which is the only field being
>>> >> tokenised).
>>> >>
>>> >> If I now query my index file for a term "6/12", for instance, I get
>>> >> back only ONE result
>>> >>
>>> >> SCORE  DESCRIPTIONSTATUS  CONCEPTID  TERM
>>> >> 1.0    0                  260278007  6/12 (finding)
>>> >>
>>> >> instead of TWO. There is another entry in the index file of the form
>>> >>
>>> >> 2561280012  0  163939000  R-eye=6/12 (finding)  0  3  en
>>> >>
>>> >> At first it wasn't quite obvious to me why this was happening. But
>>> >> after playing around a bit I realised that if I pass a query
>>> >> "R-eye=6/12" instead, I will get this second result (but not the first
>>> >> one now). Is there a way to tweak the Query query =
>>> >> parser.parse(searchString) method so that I can get both records if I
>>> >> query for "6/12"? Something like a 'contains' query in Java.
>>> >>
>>> >> Will appreciate all help. Thanks a lot
>>> >>
>>> >> Regards
>>> >> Rahil
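Coming back to Erick's "index things differently" suggestion in the quoted
thread above: if "6/12" should survive as a single token in both
"6/12 (finding)" and "R-eye=6/12 (finding)", one option is a variant of your
BaseAnalyzer that keeps '/' inside tokens but breaks on the rest of the
punctuation. A rough, untested sketch; the class name and the choice of '/'
as the only punctuation character to keep are assumptions you would adapt:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharTokenizer;
    import org.apache.lucene.analysis.TokenStream;

    public class SlashFriendlyAnalyzer extends Analyzer {

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new CharTokenizer(reader) {

                protected char normalize(char c) {
                    // keep the lower-casing behaviour of the original BaseAnalyzer
                    return Character.toLowerCase(c);
                }

                protected boolean isTokenChar(char c) {
                    // letters, digits and '/' stay inside a token; everything
                    // else (whitespace, '-', '=', '(', ')', ...) ends the token,
                    // so "R-eye=6/12 (finding)" becomes "r", "eye", "6/12", "finding"
                    return Character.isLetterOrDigit(c) || c == '/';
                }
            };
        }
    }

With both records indexed through something like this, an ordinary TermQuery
(or the QueryParser using the same analyzer) for 6/12 should return both of
them, which seems to be what the original question was after.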