Re: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-20 Thread Paul Taylor
On 19/10/2011 15:17, Steven A Rowe wrote: Hi Paul, What version of Lucene are you using? The JFlex spec you quote below looks pre-v3.1? Yes, we copied a version of StandardTokenizer from 2.4 to make some changes, we are actually on 3.1 now but haven't spent any time looking at the new token

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
>> "'java- > u...@lucene.apache.org'" > Subject: Re: How do you see if a tokenstream has tokens without consuming > the tokens ? > > On 18/10/2011 05:19, Steven A Rowe wrote: > > Hi Paul, > > > > You could add a rule to the StandardTokenizer JFlex

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul, On 10/19/2011 at 5:26 AM, Paul Taylor wrote: > On 18/10/2011 15:25, Steven A Rowe wrote: > > On 10/18/2011 at 4:57 AM, Paul Taylor wrote: > > > On 18/10/2011 06:19, Steven A Rowe wrote: > > > > Another option is to create a char filter that substitutes > > > > PUNCT-EXCLAMATION for exclam

Re: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Paul Taylor
y when the entire input consists exclusively of whitespace and punctuation. These symbols would then be left intact by StandardTokenizer. Steve -Original Message- From: Paul Taylor [mailto:paul_t...@fastmail.fm] Sent: Monday, October 17, 2011 8:13 AM To: 'java-user@lucene.a

Re: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Paul Taylor
On 18/10/2011 15:25, Steven A Rowe wrote: Hi Paul, On 10/18/2011 at 4:57 AM, Paul Taylor wrote: On 18/10/2011 06:19, Steven A Rowe wrote: Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc., Yes that is how I firs

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-18 Thread Steven A Rowe
Hi Paul, On 10/18/2011 at 4:57 AM, Paul Taylor wrote: > On 18/10/2011 06:19, Steven A Rowe wrote: > > Another option is to create a char filter that substitutes > > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, > > etc., > > Yes that is how I first did it No, I don't think

Re: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-18 Thread Paul Taylor
On 18/10/2011 06:19, Steven A Rowe wrote: Hi Paul, You could add a rule to the StandardTokenizer JFlex grammar to handle this case, bypassing its other rules. Hmm, don't really understand jflex, but that is a possibility, but would prefer to do in Java c

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Steven A Rowe
exclusively of whitespace and punctuation. These symbols would then be left intact by StandardTokenizer. Steve > -Original Message- > From: Paul Taylor [mailto:paul_t...@fastmail.fm] > Sent: Monday, October 17, 2011 8:13 AM > To: 'java-user@lucene.apache.org' > Sub

Re: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Sujit Pal
Hi Paul, Since you have modified the StandardAnalyzer (I presume you mean StandardFilter), why not do a check on the term.text() and if it's all punctuation, skip the analysis for that term? Something like this in your StandardFilter: public final boolean incrementToken() throws IOException { Ch
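
The "all punctuation" check suggested above could be sketched in plain Java roughly as follows. This is only an illustration and does not use any Lucene API; the class and method names are hypothetical, and treating "punctuation" as "anything that is not a letter or digit" (so whitespace counts too) is an assumption, not something stated in the thread.

```java
/** Hypothetical helper, not part of Lucene: detects text with no letters or digits. */
public final class PunctuationCheck {

    private PunctuationCheck() {
        // static utility, not meant to be instantiated
    }

    /**
     * Returns true when the text contains no letter or digit code points,
     * i.e. it is made up only of punctuation, symbols and whitespace.
     * Note that this also returns true for the empty string.
     */
    public static boolean hasNoLettersOrDigits(CharSequence text) {
        int i = 0;
        while (i < text.length()) {
            // Read a full code point so supplementary characters are handled correctly.
            int codePoint = Character.codePointAt(text, i);
            if (Character.isLetterOrDigit(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }
}
```

A caller could run this on the raw field value before analysis, for example `PunctuationCheck.hasNoLettersOrDigits("!!!")`, and choose a different handling path when it returns true, which avoids having to peek into the token stream at all.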

How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Paul Taylor
We have a modified version of a Lucene StandardAnalyzer, we use it for tokenizing music metadata such as artist names & song titles, so typically only a few words. When tokenizing, it usually strips out punctuation, which is correct, however if the input text consists of only punctuation