reset() is optional, and the TokenStream returned by StandardAnalyzer does not
implement it. Check out CachingTokenFilter and wrap the StandardAnalyzer token stream in it.
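
A rough sketch of the idea, written without Lucene on the classpath so it runs on its own. OneShotTokens and CachingTokens are hypothetical stand-ins for a plain TokenStream and for CachingTokenFilter; they are not Lucene classes, and the real filter's constructor and methods depend on your Lucene version. The point is only that a one-shot stream cannot be rewound, while a caching wrapper records the tokens on the first pass and replays them after reset().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class CachingTokenDemo {

    /** Hypothetical stand-in for a token stream that can be read only once. */
    static class OneShotTokens {
        private final Iterator<String> source;

        OneShotTokens(String text) {
            List<String> parts = new ArrayList<>();
            // Naive tokenization for illustration: lowercase, split on non-word characters.
            for (String p : text.toLowerCase().split("\\W+")) {
                if (!p.isEmpty()) {
                    parts.add(p);
                }
            }
            this.source = parts.iterator();
        }

        /** Returns the next token, or null when the stream is exhausted. */
        String next() {
            return source.hasNext() ? source.next() : null;
        }
    }

    /** Hypothetical stand-in for a caching filter: buffers all tokens, then replays them. */
    static class CachingTokens {
        private final OneShotTokens input;
        private final List<String> cache = new ArrayList<>();
        private boolean filled = false;
        private int pos = 0;

        CachingTokens(OneShotTokens input) {
            this.input = input;
        }

        String next() {
            if (!filled) {
                // First call: drain the wrapped stream into the cache.
                String t;
                while ((t = input.next()) != null) {
                    cache.add(t);
                }
                filled = true;
            }
            return pos < cache.size() ? cache.get(pos++) : null;
        }

        /** Rewinds to the first cached token; the wrapped stream is not touched. */
        void reset() {
            pos = 0;
        }
    }

    /** Counts tokens, then rewinds the stream so a later consumer sees every token. */
    static Map<String, Integer> countTokens(CachingTokens stream) {
        Map<String, Integer> counts = new HashMap<>();
        String t;
        while ((t = stream.next()) != null) {
            counts.merge(t, 1, Integer::sum);
        }
        stream.reset();
        return counts;
    }

    public static void main(String[] args) {
        CachingTokens stream = new CachingTokens(new OneShotTokens("to be or not to be"));
        System.out.println(countTokens(stream));
        // After the counting pass the stream starts over from the first token.
        System.out.println(stream.next());
    }
}
```

In the analyzer from the quoted message, the equivalent would be to wrap the stream in the caching filter before counting and to return the wrapper, not the underlying stream.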
Cool Coder wrote:
Currently I have extended StandardAnalyzer and am counting tokens in the following
way. But the index is not getting created, even though I call tokenStream.reset().
I am not sure whether reset() on the token stream works or not. I am debugging now.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
    // To count tokens and put in a Map
    analyzeTokens(result);
    try {
        result.reset();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return result;
}
public void analyzeTokens(TokenStream result)
{
    try {
        Token token = result.next();
        while (token != null)
        {
            String tokenStr = token.termText();
            if (TokenHolder.tokenMap.get(tokenStr) == null)
            {
                TokenHolder.tokenMap.put(tokenStr, 1);
            }
            else
            {
                TokenHolder.tokenMap.put(tokenStr, Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
            }
            token = result.next();
        }
        // extra reset
        result.reset();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Karl Wettin <[EMAIL PROTECTED]> wrote:
On 1 Nov 2007, at 18:09, Cool Coder wrote:
prior to adding into index
The easiest way out would be to add the document to a temporary index and
extract the term frequency vector. I would recommend using MemoryIndex.
You could also tokenize the document and pass the data to a
TermVectorMapper. You could consider replacing the fields of the
document with cached token streams if you have the RAM to spare and
don't want to waste CPU analyzing the document twice. I would welcome a
TermVectorMappingCachedTokenStreamFactory. Even cooler would be to
pass code down through IndexWriter.addDocument using a command pattern or
something similar, allowing one to extend the document at the time of
analysis.
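
To illustrate the last idea (collecting data at the time of analysis instead of analyzing twice), here is a minimal standalone sketch. CountingTokens and consume are hypothetical names, not Lucene APIs; a plain Iterator<String> stands in for the token stream and consume stands in for the indexer. The wrapper counts each token as the consumer pulls it, so no reset() or second pass is needed.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class CountingPassThroughDemo {

    /** Hypothetical pass-through wrapper: counts each token as it is pulled. */
    static class CountingTokens implements Iterator<String> {
        private final Iterator<String> input;
        // TreeMap keeps the terms sorted, which makes the printed result stable.
        private final Map<String, Integer> counts = new TreeMap<>();

        CountingTokens(Iterator<String> input) {
            this.input = input;
        }

        @Override
        public boolean hasNext() {
            return input.hasNext();
        }

        @Override
        public String next() {
            String token = input.next();
            counts.merge(token, 1, Integer::sum);
            return token; // the token is handed on unchanged
        }

        Map<String, Integer> counts() {
            return counts;
        }
    }

    /** Stand-in for the indexer: drains the stream exactly once. */
    static int consume(Iterator<String> tokens) {
        int n = 0;
        while (tokens.hasNext()) {
            tokens.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        CountingTokens tokens = new CountingTokens(
                Arrays.asList("the", "quick", "the", "fox").iterator());
        int consumed = consume(tokens);
        // prints: 4 tokens, frequencies: {fox=1, quick=1, the=2}
        System.out.println(consumed + " tokens, frequencies: " + tokens.counts());
    }
}
```

The frequencies are only complete once the consumer has drained the stream, so they should be read after the document has been added, not before.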