Re: Tokenizer for Brown Corpus?

2015-02-24 Thread Koji Sekiguchi
Hi Jack,

Thanks! I'll look at it.

Koji

On 2015/02/24 22:29, Jack Krupansky wrote:
> This is the first mention that I have seen for that corpus on this
> list. There seem to be more than a few references when I google for
> '"brown corpus" lucene', such as:
> https://github.com/INL/BlackLab/wiki/Blacklab-query-tool …

Re: Tokenizer for Brown Corpus?

2015-02-24 Thread Jack Krupansky
This is the first mention that I have seen for that corpus on this list. There seem to be more than a few references when I google for '"brown corpus" lucene', such as:
https://github.com/INL/BlackLab/wiki/Blacklab-query-tool

-- Jack Krupansky

On Tue, Feb 24, 2015 at 1:40 AM, Koji Sekiguchi wrote: …
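For context: the Brown Corpus ships its tagged text as word/TAG pairs (e.g. "The/at Fulton/np-tl"), so one simple option is to strip the tag suffix with a CharFilter ahead of a stock tokenizer. Below is a minimal sketch against the Lucene 4.x API; the class name, sample text, and regex are illustrative assumptions, not anything from this thread.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class BrownCorpusTokenizeDemo {
    public static void main(String[] args) throws IOException {
        // Remove the "/tag" suffix from each word/TAG pair before tokenizing.
        Reader raw = new StringReader("The/at Fulton/np-tl County/nn-tl");
        Reader stripped = new PatternReplaceCharFilter(Pattern.compile("/\\S+"), "", raw);

        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_45, stripped);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // The / Fulton / County
        }
        ts.end();
        ts.close();
    }
}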

Re: tokenizer to strip a set of characters

2013-11-21 Thread Jack Krupansky
The word delimiter filter lets you pass a table that specifies the type for each character:
http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analy…
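A minimal sketch of that type-table approach against the Lucene 4.5 API linked above: characters to strip are typed SUBWORD_DELIM, everything else ALPHA, and GENERATE_WORD_PARTS re-emits the remaining pieces. The particular characters ('-', '_', '.') are just an illustration.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StripCharsDemo {
    public static void main(String[] args) throws IOException {
        // 256-entry type table: mark the characters to strip as
        // SUBWORD_DELIM and treat everything else as ALPHA.
        byte[] table = new byte[256];
        for (int i = 0; i < table.length; i++) {
            table[i] = (byte) WordDelimiterFilter.ALPHA;
        }
        for (char c : new char[] {'-', '_', '.'}) {
            table[c] = (byte) WordDelimiterFilter.SUBWORD_DELIM;
        }

        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_45, new StringReader("foo-bar_baz.qux"));
        ts = new WordDelimiterFilter(ts, table, WordDelimiterFilter.GENERATE_WORD_PARTS, null);

        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // foo / bar / baz / qux
        }
        ts.end();
        ts.close();
    }
}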

RE: Tokenizer question: how can I force ? and ! to be separate tokens?

2009-07-17 Thread OBender
Thanks, I think I got it.

-----Original Message-----
From: John Byrne [mailto:john.by...@propylon.com]
Sent: Friday, July 17, 2009 2:43 PM
To: java-user@lucene.apache.org
Subject: Re: Tokenizer question: how can I force ? and ! to be separate tokens?

Yes, you could even use the …

Re: Tokenizer question: how can I force ? and ! to be separate tokens?

2009-07-17 Thread John Byrne
Yes, you could even use the WhitespaceTokenizer and then look for the symbols in a token filter. You would get [you?] as a single token; your job in the token filter is then to store the [?] and return the [you]. The next time the token filter is called for the next token, you return the [?] that you stored …
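A minimal sketch of the buffering filter described here, written against the attribute-based TokenStream API of later Lucene releases (2009-era code would work with Token/TermAttribute instead). The class name is hypothetical and offset bookkeeping is left out.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class PunctuationSplitFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private String pending; // the "?" or "!" stored for the next call

    public PunctuationSplitFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            // Emit the punctuation stored on the previous call.
            termAtt.setEmpty().append(pending);
            posIncAtt.setPositionIncrement(1);
            pending = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        int len = termAtt.length();
        if (len > 1) {
            char last = termAtt.charAt(len - 1);
            if (last == '?' || last == '!') {
                pending = String.valueOf(last); // store the [?]
                termAtt.setLength(len - 1);     // return the [you] now
            }
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
    }
}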

Re: Tokenizer question: how can I force ? and ! to be separate tokens?

2009-07-17 Thread Matthew Hall
I'd think extending WhitespaceTokenizer would be a good place to start. Then create a new Analyzer that exactly mirrors your current Analyzer, with the exception that it uses your new tokenizer instead of WhitespaceTokenizer. (Well... there is of course my assumption that you are using an Analyzer …
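And a sketch of the mirroring Analyzer, assuming the Lucene 4.x createComponents API (a 2009-era Analyzer would override tokenStream() instead). It wires in the hypothetical PunctuationSplitFilter from the previous message rather than a Tokenizer subclass, since a plain CharTokenizer subclass can only include or exclude characters and so cannot emit [?] as a token of its own.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public final class PunctuationAwareAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Mirror a whitespace-based analyzer, adding the splitting filter.
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_45, reader);
        TokenStream result = new PunctuationSplitFilter(source);
        return new TokenStreamComponents(source, result);
    }
}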

RE: Tokenizer

2007-07-30 Thread Ard Schrijvers
Hello,

> I have two questions.
>
> First, is there a tokenizer that takes every word and simply makes a
> token out of it?

org.apache.lucene.analysis.WhitespaceTokenizer

> So it looks for two white spaces and takes the characters between them
> and makes a token out of them?
>
> If this to…
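A minimal sketch of that answer against the 2007-era Lucene 2.x API (the Token/termText() style in use at the time; later versions replaced this with incrementToken() and attributes).

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class WhitespaceDemo {
    public static void main(String[] args) throws IOException {
        // Emits the characters between runs of whitespace as tokens.
        WhitespaceTokenizer tok = new WhitespaceTokenizer(new StringReader("two  white   spaces"));
        for (Token t = tok.next(); t != null; t = tok.next()) {
            System.out.println(t.termText()); // two / white / spaces
        }
    }
}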