Because I wanted to use the JavaCC input code from Lucene. 99.99% of
what the standard parser did was very good. Having worked with
computer-generated compilers in the past, I realized that if I were to
modify the parser itself, I would eventually get into real trouble. So
I took the time to
Your problem is that StandardTokenizer doesn't fit your requirements.
Since you know how to implement a new one, just do it.
If you just want to modify StandardTokenizer, you can get the code,
rename it to your class, then modify whatever you dislike. I think
it's such simple stuff, why
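Before copying the whole grammar, it is worth noting that most of what gets changed when adapting a tokenizer is the predicate deciding which characters belong inside a token (in Lucene you would typically override `isTokenChar` in a `CharTokenizer` subclass). A minimal, Lucene-free sketch of that idea, so it runs standalone; the class name `LogTokenChars` and the exact character set are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the character predicate you end up changing when adapting
// a tokenizer. LogTokenChars is a hypothetical name; in Lucene the
// predicate would live in a CharTokenizer subclass's isTokenChar.
public class LogTokenChars {
    // Treat letters, digits, and embedded dashes as token characters,
    // so a reference like 310N-P-Q survives as a single token.
    public static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-';
    }

    // Minimal scanner built on the predicate: emit a token for each
    // maximal run of token characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

With this predicate, `tokenize("see doc 310N-P-Q now")` keeps the dashed reference as one token instead of three.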
On Aug 29, 2006, at 7:12 PM, Mark Miller wrote:
2. The ParseException that is generated when making the
StandardAnalyzer must be killed because there is another
ParseException class (maybe in queryparser?) that must be used
instead. The Lucene build file excludes the StandardAnalyzer
Parse
Bill Taylor wrote:
I have copied Lucene's StandardTokenizer.jj into my directory, renamed
it, and did a global change of the names to my class name, LogTokenizer.
The issue is that the generated LogTokenizer.java does not compile for
2 reasons:
1) in the constructor, this(new FastCharStream(reader)); fails bec
Tucked away in the contrib section of Lucene (I'm using 2.0) there is
org.apache.lucene.index.memory.PatternAnalyzer
which takes a regular expression and tokenizes with it. Would that help?
Word of warning... the regex determines what is NOT a token, not what IS a
token (as I remember),
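The warning matters because the two regex conventions give the same pattern opposite meanings. A small standalone sketch of both readings using `java.util.regex` (class and method names are made up for illustration; check PatternAnalyzer's javadoc for which convention it actually uses):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The two regex-tokenization conventions the warning is about.
public class RegexTokenizeDemo {
    // Reading 1: the pattern matches the tokens themselves.
    public static List<String> matchTokens(String text, String tokenPattern) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile(tokenPattern).matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }

    // Reading 2: the pattern matches the separators; split on it.
    public static List<String> splitOnDelimiters(String text, String delimPattern) {
        List<String> out = new ArrayList<String>();
        for (String s : text.split(delimPattern)) {
            if (s.length() > 0) out.add(s);
        }
        return out;
    }
}
```

With a token pattern like `[A-Za-z0-9-]+` (reading 1) or a delimiter pattern like `\s+` (reading 2), "310N-P-Q" stays whole; feed a token pattern into a delimiter-style analyzer and the output is inverted.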
Bill Taylor wrote:
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
I'm in a real rush here, so pardon my brevity, but... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which
can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should
fix you
right up.
On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:
: Have a look at PerFieldAnalyzerWrapper:
:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
...which can be specified in the constructors for IndexWriter and
QueryParser.
As I understand
that almos
I'm in a real rush here, so pardon my brevity, but... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.
Same kind of thing for a Query.
Erick
On 8/29/06, Bill Taylor <[EM
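The contract PerFieldAnalyzerWrapper offers is simple: a default analyzer plus per-field overrides, consulted at index and query time. A standalone sketch of that dispatch idea, with strings standing in for Analyzer instances so it runs without Lucene on the classpath (`PerFieldDispatchDemo` and the field names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the default-plus-overrides dispatch that
// PerFieldAnalyzerWrapper provides. Strings stand in for Analyzers.
public class PerFieldDispatchDemo {
    private final String defaultAnalyzer;
    private final Map<String, String> overrides = new HashMap<String, String>();

    public PerFieldDispatchDemo(String defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Mirrors PerFieldAnalyzerWrapper.addAnalyzer(fieldName, analyzer).
    public void addAnalyzer(String field, String analyzer) {
        overrides.put(field, analyzer);
    }

    // Fields without an override fall back to the default.
    public String analyzerFor(String field) {
        String a = overrides.get(field);
        return a != null ? a : defaultAnalyzer;
    }
}
```

This is why the wrapper fits the custom-tokenizer problem: only the field holding the government document references needs the custom analyzer; every other field keeps the standard one.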
ss this tokenstream through other filters you are
interested in */
}
}
Krovi.
-----Original Message-----
From: Bill Taylor [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 29, 2006 8:10 PM
To: java-user@lucene.apache.org
Subject: Installing a custom tokenizer
I am indexing documents which are filled with government jargon. As
one would expect, the standard tokenizer has problems with
governmenteese.
In particular, the documents use words such as 310N-P-Q as references
to other documents. The standard tokenizer breaks this "word" at the
dashes so
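The effect being described can be shown in two lines: splitting at the dashes turns one searchable reference into three short terms, so an exact lookup for the reference no longer matches a single token. A standalone illustration (the class name is made up; this mimics the dash-splitting behavior, not StandardTokenizer's actual grammar):

```java
import java.util.Arrays;
import java.util.List;

// Illustration of the reported problem: treating '-' as punctuation
// shatters a document reference like 310N-P-Q into separate terms.
public class DashSplitDemo {
    public static List<String> splitAtDashes(String word) {
        return Arrays.asList(word.split("-"));
    }
}
```

`splitAtDashes("310N-P-Q")` yields the three terms 310N, P, and Q, which is exactly the behavior a custom tokenizer for these documents needs to avoid.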