Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
Because I wanted to use the JavaCC input code from Lucene. 99.99% of what the standard parser did was very good. Having worked with computer-generated compilers in the past, I realized that if I were to modify the parser itself, I would eventually get into real trouble. So I took the time to …

Re: Installing a custom tokenizer

2006-08-29 Thread yueyu lin
Your problem is that StandardTokenizer doesn't fit your requirements. Since you know how to implement a new one, just do it. If you just want to modify StandardTokenizer, you can get the code and rename it to your class, then change whatever you dislike. It's such a simple thing, so why …

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 7:12 PM, Mark Miller wrote: 2. The ParseException that is generated when making the StandardAnalyzer must be killed because there is another ParseException class (maybe in queryparser?) that must be used instead. The Lucene build file excludes the StandardAnalyzer Parse…

Re: Installing a custom tokenizer

2006-08-29 Thread Mark Miller
Bill Taylor wrote: I have copied Lucene's StandardTokenizer.jj into my directory, renamed it, and did a global change of the names to my class name, LogTokenizer. The issue is that the generated LogTokenizer.java does not compile for 2 reasons: 1) in the constructor, this(new FastCharStream( …

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
I have copied Lucene's StandardTokenizer.jj into my directory, renamed it, and did a global change of the names to my class name, LogTokenizer. The issue is that the generated LogTokenizer.java does not compile for 2 reasons: 1) in the constructor, this(new FastCharStream(reader)); fails because …

Re: Installing a custom tokenizer

2006-08-29 Thread Erick Erickson
Tucked away in the contrib section of Lucene (I'm using 2.0) there is org.apache.lucene.index.memory.PatternAnalyzer, which takes a regular expression and tokenizes with it. Would that help? Word of warning... the regex determines what is NOT a token, not what IS a token (as I remember), …
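The split-style behavior Erick warns about can be sketched without Lucene at all: the regex matches the separators between tokens, not the tokens themselves. In this illustrative sketch (the class name and pattern are my own, not PatternAnalyzer's actual API), the dash is deliberately left out of the separator class so a dashed reference like 310N-P-Q survives as one token:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitVsMatch {
    // Separator pattern: any run of characters that is NOT a letter,
    // digit, or dash. Because '-' is excluded from the separator set,
    // "310N-P-Q" is never split.
    private static final Pattern SEPARATOR = Pattern.compile("[^A-Za-z0-9-]+");

    public static List<String> tokenize(String text) {
        // split() matches the separators and returns what lies between them
        return Arrays.asList(SEPARATOR.split(text.trim()));
    }
}
```

Note the inversion: to change what counts as a token here, you edit the description of the gaps between tokens, which is exactly the "NOT a token" trap mentioned above.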

Re: Installing a custom tokenizer

2006-08-29 Thread Mark Miller
Bill Taylor wrote: On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote: I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you right up …

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote: : Have a look at PerFieldAnalyzerWrapper: : http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html ...which can be specified in the constructors for IndexWriter and QueryParser. As I understand …

Re: Installing a custom tokenizer

2006-08-29 Thread Chris Hostetter
: Have a look at PerFieldAnalyzerWrapper: : http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html ...which can be specified in the constructors for IndexWriter and QueryParser. -Hoss --
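Conceptually, PerFieldAnalyzerWrapper routes each field name to its own analyzer and falls back to a default for everything else. A minimal JDK-only sketch of that dispatch idea (the class and method names below are illustrative; the real class lives in org.apache.lucene.analysis and works on TokenStreams, not List<String>):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    public PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a field-specific analyzer, mirroring addAnalyzer(field, analyzer)
    public void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Look up the analyzer for this field, falling back to the default
    public List<String> tokenize(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }
}
```

This is why it slots into the IndexWriter and QueryParser constructors: to the caller it looks like one Analyzer, and the per-field choice happens inside.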

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote: I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you right up. That almost …

Re: Installing a custom tokenizer

2006-08-29 Thread Erick Erickson
I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you right up. Same kind of thing for a Query. Erick On 8/29/06, Bill Taylor <[EMAIL PROTECTED]> …

Re: Installing a custom tokenizer

2006-08-29 Thread Ronnie Kolehmainen
…ss this tokenstream through other filters you are > > interested in */ > > } > > } > > > > Krovi. > > > > -----Original Message----- > > From: Bill Taylor [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, August 29, 2006 8:10 PM > > To: …

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
interested in */ } } Krovi. -----Original Message----- From: Bill Taylor [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 29, 2006 8:10 PM To: java-user@lucene.apache.org Subject: Installing a custom tokenizer I am indexing documents which are filled with government jargon. As one would expect …
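Krovi's snippet ends with "pass this tokenstream through other filters you are interested in", which is the core Lucene analysis pattern: a tokenizer produces a token stream and each filter wraps the previous stage. A JDK-only sketch of that pipeline idea (names and the single stop word are illustrative, not Lucene's TokenFilter API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterChainSketch {
    public static List<String> analyze(String text) {
        // Tokenizer stage: split on whitespace
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        // Filter stages, each transforming the stream from the previous one
        return tokens.stream()
                .map(String::toLowerCase)       // analogous to LowerCaseFilter
                .filter(t -> !t.equals("the"))  // analogous to a stop-word filter
                .collect(Collectors.toList());
    }
}
```

In real Lucene the same shape appears as constructor nesting, e.g. a stop filter wrapping a lowercase filter wrapping the tokenizer, inside the Analyzer's tokenStream method.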

RE: Installing a custom tokenizer

2006-08-29 Thread Krovi, DVSR_Sarma
Subject: Installing a custom tokenizer I am indexing documents which are filled with government jargon. As one would expect, the standard tokenizer has problems with governmenteese. In particular, the documents use words such as 310N-P-Q as references to other documents. The standard tokenizer breaks …

Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
I am indexing documents which are filled with government jargon. As one would expect, the standard tokenizer has problems with governmenteese. In particular, the documents use words such as 310N-P-Q as references to other documents. The standard tokenizer breaks this "word" at the dashes so …
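The fix the thread converges on can be sketched as a match-style token pattern: instead of letting the tokenizer split at dashes, define a token as a run of letters/digits optionally joined by dashes, so 310N-P-Q comes through whole. This is a minimal JDK-only sketch of that rule (class name and pattern are my own, not the LogTokenizer from the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogTokenizerSketch {
    // One token = alphanumeric runs, optionally joined by single dashes,
    // so document references like "310N-P-Q" stay intact.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase()); // lowercase, as analyzers typically do
        }
        return tokens;
    }
}
```

The same rule is what a modified StandardTokenizer.jj grammar would express in JavaCC form; the regex version just makes the intended token shape easy to see and test.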