Hi Paul,

What version of Lucene are you using?  The JFlex spec you quote below looks pre-v3.1?

Steve

> -----Original Message-----
> From: Paul Taylor [mailto:paul_t...@fastmail.fm]
> Sent: Wednesday, October 19, 2011 6:50 AM
> To: Steven A Rowe; java-user@lucene.apache.org >> "'java-u...@lucene.apache.org'"
> Subject: Re: How do you see if a tokenstream has tokens without consuming the tokens ?
>
> On 18/10/2011 05:19, Steven A Rowe wrote:
> > Hi Paul,
> >
> > You could add a rule to the StandardTokenizer JFlex grammar to handle
> > this case, bypassing its other rules.
> This seemed to be working. Just to test it out, I changed the EMAIL one
> to this:
>
> EMAIL = ("!"|"*"|"^"|"!"|"."|"@"|"%"|"♠"|"\"")+
>
> And changed the order the tokens were checked:
>
> %%
>
> {ALPHANUM}       { return ALPHANUM; }
> {APOSTROPHE}     { return APOSTROPHE; }
> {ACRONYM}        { return ACRONYM; }
> {COMPANY}        { return COMPANY; }
> {HOST}           { return HOST; }
> {NUM}            { return NUM; }
> {CJ}             { return CJ; }
> {ACRONYM_DEP}    { return ACRONYM_DEP; }
> {EMAIL}          { return EMAIL; }
>
> /** Ignore the rest */
> . | {WHITESPACE} { /* ignore */ }
>
> So then if I passed '!!!' to the tokenizer, it kept it, which was exactly
> what I wanted.
>
> However, if I passed it 'fred!!!' it split it into two tokens
>
> 'fred' and '!!!'
>
> which is not what I wanted; I just wanted to get back
>
> fred
>
> I tried changing EMAIL to
>
> EMAIL = ^("!"|"*"|"^"|"!"|"."|"@"|"%"|"♠"|"\"")+
>
> but use of ^ and $ seems to be disallowed, so I can't see if there is
> any way to do what I want in the JFlex. If that's the case, can I drop the
> 2nd token somehow in a subsequent filter?
>
> Paul
>
> >
> > Another option is to create a char filter that substitutes
> > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc.,
> > but only when the entire input consists exclusively of whitespace and
> > punctuation.  These symbols would then be left intact by
> > StandardTokenizer.
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: Paul Taylor [mailto:paul_t...@fastmail.fm]
> >> Sent: Monday, October 17, 2011 8:13 AM
> >> To: 'java-user@lucene.apache.org'
> >> Subject: How do you see if a tokenstream has tokens without consuming the
> >> tokens ?
> >>
> >> We have a modified version of a Lucene StandardAnalyzer; we use it for
> >> tokenizing music metadata such as artist names & song titles, so
> >> typically only a few words. When tokenizing, it usually strips out
> >> punctuation, which is correct. However, if the input text consists of
> >> only punctuation characters then we end up with nothing; for these
> >> particular RARE cases I want to use a mapping filter.
> >>
> >> So what I try to do is have my analyzer tokenize as normal, then if the
> >> result is no tokens, retokenize with the mapping filter. I check it has
> >> no tokens using incrementToken(), but then can't see how I
> >> decrementToken(). How can I do this, or is there a more efficient way of
> >> doing this? Note that of maybe 10,000,000 records only a few hundred
> >> will have this problem, so I need a solution which doesn't impact
> >> performance unreasonably.
> >>
> >>      NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
> >>      specialcharConvertMap.add("!", "Exclamation");
> >>      specialcharConvertMap.add("?", "QuestionMark");
> >>      ...............
> >>
> >>      public TokenStream tokenStream(String fieldName, Reader reader)
> >>      {
> >>          CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap, reader);
> >>
> >>          StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
> >>          try
> >>          {
> >>              if(tokenStream.incrementToken()==false)
> >>              {
> >>                  tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
> >>              }
> >>              else
> >>              {
> >>                  //TODO **************** set tokenstream back as it was
> >>                  // before increment token
> >>              }
> >>          }
> >>          catch(IOException ioe)
> >>          {
> >>
> >>          }
> >>          TokenStream result = new LowercaseFilter(tokenStream);
> >>          return result;
> >>      }
> >>
> >> thanks for any help
> >>
> >> Paul
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
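
The tokenStream() attempt quoted above has to consume a token just to learn whether one exists. One way to sidestep that, along the lines of the char filter idea earlier in the thread, is to decide before tokenizing: read the (short) field value into a String, and substitute names for punctuation only when the whole value is whitespace and punctuation. The following is a minimal sketch in plain Java with no Lucene dependency. The class name PunctuationOnlyMapper and its method names are hypothetical, and the two mappings simply mirror the NormalizeCharMap entries from the original post; it assumes field values are small enough to buffer in memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: rewrites the input only when it
// consists entirely of whitespace and mapped punctuation characters.
final class PunctuationOnlyMapper {

    // Assumed mappings, copied from the NormalizeCharMap entries in the post.
    private static final Map<Character, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put('!', "Exclamation");
        NAMES.put('?', "QuestionMark");
    }

    // True when the text holds at least one mapped character and nothing
    // other than mapped characters and whitespace.
    static boolean isOnlyMappedPunctuation(String text) {
        boolean sawPunctuation = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            if (!NAMES.containsKey(c)) {
                return false; // ordinary text present: leave the input alone
            }
            sawPunctuation = true;
        }
        return sawPunctuation;
    }

    // Returns the text unchanged unless it is punctuation-only, in which case
    // each mapped character is replaced by its name followed by a space.
    static String mapIfOnlyPunctuation(String text) {
        if (!isOnlyMappedPunctuation(text)) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = NAMES.get(c);
            if (name == null) {
                out.append(c); // whitespace is passed through as-is
            } else {
                out.append(name).append(' ');
            }
        }
        return out.toString().trim();
    }
}
```

With this sketch, mapIfOnlyPunctuation("!!!") yields "Exclamation Exclamation Exclamation", while "fred!!!" comes back untouched, so the normal tokenizer rules would still strip the trailing punctuation and return only fred. The result could then be wrapped in a StringReader and handed to the tokenizer; that wiring is not shown because it depends on the Lucene version in use. Since the check is a single pass over a few characters, the cost for the millions of ordinary records should be small.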