On 19/10/2011 15:17, Steven A Rowe wrote:
Hi Paul,
What version of Lucene are you using? The JFlex spec you quote below looks
pre-v3.1?
Yes, we copied a version of StandardTokenizer from 2.4 to make some
changes. We are actually on 3.1 now but haven't spent any time looking
at the new token
>> "'java-
> u...@lucene.apache.org'"
> Subject: Re: How do you see if a tokenstream has tokens without consuming
> the tokens ?
>
> On 18/10/2011 05:19, Steven A Rowe wrote:
> > Hi Paul,
> >
> > You could add a rule to the StandardTokenizer JFlex
Hi Paul,
On 10/19/2011 at 5:26 AM, Paul Taylor wrote:
> On 18/10/2011 15:25, Steven A Rowe wrote:
> > On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
> > > On 18/10/2011 06:19, Steven A Rowe wrote:
> > > > Another option is to create a char filter that substitutes
> > > > PUNCT-EXCLAMATION for exclam
y when the
entire input consists exclusively of whitespace and punctuation. These symbols
would then be left intact by StandardTokenizer.
Steve
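
A minimal plain-Java sketch of the substitution idea described above, applied only when the whole input is whitespace and punctuation. It deliberately uses no Lucene classes; the class name, the method names, and the placeholder map are hypothetical, and only the PUNCT-EXCLAMATION and PUNCT-PERIOD names come from the thread. Treating every character that is neither a letter, a digit, nor whitespace as punctuation is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class PunctuationPlaceholders {

    // Placeholder names taken from the thread; extend the map as needed.
    private static final Map<Character, String> PLACEHOLDERS = new LinkedHashMap<>();
    static {
        PLACEHOLDERS.put('!', "PUNCT-EXCLAMATION");
        PLACEHOLDERS.put('.', "PUNCT-PERIOD");
    }

    // True when the input has at least one punctuation character and
    // no letters or digits (assumption: "punctuation" means anything
    // that is not a letter, digit, or whitespace).
    static boolean isOnlyWhitespaceAndPunctuation(String input) {
        boolean sawPunctuation = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            if (Character.isLetterOrDigit(c)) {
                return false;
            }
            sawPunctuation = true;
        }
        return sawPunctuation;
    }

    // Replaces mapped punctuation with space-delimited placeholders, but
    // only for punctuation-only input; all other input is returned unchanged.
    static String substitute(String input) {
        if (!isOnlyWhitespaceAndPunctuation(input)) {
            return input;
        }
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            String placeholder = PLACEHOLDERS.get(c);
            if (placeholder != null) {
                out.append(' ').append(placeholder).append(' ');
            } else {
                out.append(c);
            }
        }
        return out.toString().trim();
    }
}
```

In a real analyzer this logic would presumably sit in a char filter ahead of the tokenizer; how to wire that up depends on the Lucene version in use and is not shown here.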
-Original Message-
From: Paul Taylor [mailto:paul_t...@fastmail.fm]
Sent: Monday, October 17, 2011 8:13 AM
To: 'java-user@lucene.a
On 18/10/2011 15:25, Steven A Rowe wrote:
Hi Paul,
On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
On 18/10/2011 06:19, Steven A Rowe wrote:
Another option is to create a char filter that substitutes
PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
etc.,
Yes that is how I firs
Hi Paul,
On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
> On 18/10/2011 06:19, Steven A Rowe wrote:
> > Another option is to create a char filter that substitutes
> > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
> > etc.,
>
> Yes that is how I first did it
No, I don't think
On 18/10/2011 06:19, Steven A Rowe wrote:
Hi Paul,
You could add a rule to the StandardTokenizer JFlex grammar to handle
this case, bypassing its other rules.
Hmm, I don't really understand JFlex, but that is a possibility, but would
prefer to do in Java c
exclusively of whitespace and punctuation. These symbols
would then be left intact by StandardTokenizer.
Steve
> -Original Message-
> From: Paul Taylor [mailto:paul_t...@fastmail.fm]
> Sent: Monday, October 17, 2011 8:13 AM
> To: 'java-user@lucene.apache.org'
> Sub
Hi Paul,
Since you have modified the StandardAnalyzer (I presume you mean
StandardFilter), why not do a check on the term.text() and if it's all
punctuation, skip the analysis for that term? Something like this in
your StandardFilter:
public final boolean incrementToken() throws IOException {
Ch
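
The quoted code is cut off in this listing, so the following is not a reconstruction of it. It is only a plain-Java sketch of the "is the term all punctuation?" test suggested above, with no Lucene classes involved. The class and method names are hypothetical, and it assumes "punctuation" means any code point that is not a letter, digit, or whitespace.

```java
// Hypothetical helper, not part of Lucene.
final class PunctuationCheck {

    // Returns true when the term is non-empty and contains no letters,
    // digits, or whitespace. Iterates by code point so that characters
    // outside the Basic Multilingual Plane are classified correctly.
    static boolean isAllPunctuation(CharSequence term) {
        if (term.length() == 0) {
            return false;
        }
        int i = 0;
        while (i < term.length()) {
            int codePoint = Character.codePointAt(term, i);
            if (Character.isLetterOrDigit(codePoint) || Character.isWhitespace(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }
}
```

A filter could call such a method on the current term text and pass the token through unmodified when it returns true; how the term text is obtained depends on the Lucene version.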
We have a modified version of the Lucene StandardAnalyzer; we use it for
tokenizing music metadata such as artist names & song titles, so
typically only a few words. When tokenizing, it usually strips out
punctuation, which is correct; however, if the input text consists of
only punctuation