[
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546050
]
Michael Busch commented on LUCENE-1058:
---------------------------------------
We need to change the CachingTokenFilter a bit (untested code):
{code:java}
public class CachingTokenFilter extends TokenFilter {
  // protected so subclasses like ProperNounTF below can decide what to cache
  protected List cache;
  private Iterator iterator;

  public CachingTokenFilter(TokenStream input) {
    super(input);
    this.cache = new LinkedList();
  }

  public Token next() throws IOException {
    if (iterator != null) {
      if (!iterator.hasNext()) {
        // the cache is exhausted, return null
        return null;
      }
      return (Token) iterator.next();
    } else {
      // first pass: pull from the underlying stream and record the token
      Token token = input.next();
      addTokenToCache(token);
      return token;
    }
  }

  public void reset() throws IOException {
    // switch to replaying the cached tokens
    iterator = cache.iterator();
  }

  protected void addTokenToCache(Token token) {
    if (token != null) {
      cache.add(token);
    }
  }
}
{code}
Then you can implement the ProperNounTF:
{code:java}
class ProperNounTF extends CachingTokenFilter {
  public ProperNounTF(TokenStream input) {
    super(input);
  }

  protected void addTokenToCache(Token token) {
    // cache only the proper nouns; every token still flows downstream
    if (token != null && isProperNoun(token)) {
      cache.add(token);
    }
  }

  private boolean isProperNoun(Token token) {...}
}
{code}
And then you add everything to Document:
{code:java}
Document d = new Document();
TokenStream properNounTf = new ProperNounTF(new StandardTokenizer(reader));
TokenStream stdTf = new CachingTokenFilter(new StopTokenFilter(properNounTf));
TokenStream lowerCaseTf = new LowerCaseTokenFilter(stdTf);
d.add(new Field("std", stdTf));
d.add(new Field("nouns", properNounTf));
d.add(new Field("lowerCase", lowerCaseTf));
{code}
Again, this is untested, but I believe it should work.
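To make the replay-after-reset behavior concrete, here is a simplified, Lucene-free sketch of the same pattern (plain String tokens instead of Token/TokenStream; the class names here are illustrative, not Lucene API): the first pass returns every token while recording a chosen subset, and reset() switches to replaying only what was cached.
{code:java}
import java.util.*;

/** Lucene-free sketch: first pass records tokens, reset() replays the cache. */
class CachingFilter {
  private final Iterator<String> input;
  protected final List<String> cache = new LinkedList<>();
  private Iterator<String> replay;

  CachingFilter(Iterator<String> input) { this.input = input; }

  public String next() {
    if (replay != null) {
      return replay.hasNext() ? replay.next() : null; // cache exhausted
    }
    String token = input.hasNext() ? input.next() : null;
    addTokenToCache(token);
    return token; // every token still flows downstream on the first pass
  }

  public void reset() { replay = cache.iterator(); }

  protected void addTokenToCache(String token) {
    if (token != null) cache.add(token);
  }
}

/** Mirrors ProperNounTF: only capitalized tokens are recorded for replay. */
class ProperNounFilter extends CachingFilter {
  ProperNounFilter(Iterator<String> input) { super(input); }

  protected void addTokenToCache(String token) {
    if (token != null && Character.isUpperCase(token.charAt(0))) {
      cache.add(token);
    }
  }
}
{code}
So for input "Grant wrote a Lucene patch", the first pass yields all five tokens (feeding the downstream field), and after reset() the replayed stream yields only "Grant" and "Lucene".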
> New Analyzer for buffering tokens
> ---------------------------------
>
> Key: LUCENE-1058
> URL: https://issues.apache.org/jira/browse/LUCENE-1058
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch,
> LUCENE-1058.patch, LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that
> could siphon off certain tokens and store them in a buffer to be used later
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but
> all the other analysis is the same, then you could save off the tokens to be
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how
> it plays with the new reuse API.
> See
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397