Re: Best way to count tokens

Mark Miller Thu, 01 Nov 2007 16:04:56 -0800

reset() is optional. StandardAnalyzer does not implement it. Check out CachingTokenFilter and wrap the StandardAnalyzer token stream in it.
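
To illustrate why caching helps, here is a minimal plain-Java sketch of the idea. It is not Lucene code and not how CachingTokenFilter is implemented: the ReplayableTokens class and its next()/reset() methods are hypothetical names chosen for this example. A one-shot source is read once, each token is remembered, and reset() rewinds to the start of the remembered tokens so a second consumer sees the same sequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the caching idea; NOT Lucene's CachingTokenFilter.
class ReplayableTokens {
    private final Iterator<String> source;            // one-shot source that cannot be rewound
    private final List<String> cache = new ArrayList<>();
    private int position = 0;                         // index of the next token to hand out

    ReplayableTokens(Iterator<String> source) {
        this.source = source;
    }

    // Returns the next token, or null when this pass is exhausted.
    String next() {
        if (position < cache.size()) {
            return cache.get(position++);             // replay a token seen on an earlier pass
        }
        if (source.hasNext()) {
            String token = source.next();
            cache.add(token);                         // remember it for later passes
            position++;
            return token;
        }
        return null;
    }

    // Rewinds to the first token; later passes are served from the cache.
    void reset() {
        position = 0;
    }

    public static void main(String[] args) {
        ReplayableTokens tokens =
            new ReplayableTokens(Arrays.asList("to", "be", "or").iterator());
        int firstPass = 0;
        while (tokens.next() != null) firstPass++;    // e.g. the counting pass
        tokens.reset();
        int secondPass = 0;
        while (tokens.next() != null) secondPass++;   // e.g. the indexing pass
        System.out.println(firstPass + " " + secondPass); // prints "3 3"
    }
}
```

Without the cache, the second loop would see nothing, which matches the symptom of an index that never gets any terms.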


Cool Coder wrote:

Currently I have extended StandardAnalyzer and am counting tokens in the following
way. But the index is not getting created, though I call tokenStream.reset().
I am not sure whether reset() on a token stream works or not. I am debugging now.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
    // To count tokens and put in a Map
    analyzeTokens(result);
    try {
        result.reset();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return result;
}

public void analyzeTokens(TokenStream result) {
    try {
        Token token = result.next();
        while (token != null) {
            String tokenStr = token.termText();
            if (TokenHolder.tokenMap.get(tokenStr) == null) {
                TokenHolder.tokenMap.put(tokenStr, 1);
            } else {
                TokenHolder.tokenMap.put(tokenStr,
                    Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
            }
            token = result.next();
        }
        // extra reset
        result.reset();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
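
The counting step above can be written more compactly. The following is a rough sketch in plain Java, assuming Java 8 or later for Map.merge and assuming the tokens are already available as strings; the TokenCounter class is a hypothetical name, not part of the posted code or of Lucene. It avoids the parseInt/toString round trip by keeping Integer values in the map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for this example; not part of the original code or of Lucene.
class TokenCounter {
    // Counts how many times each token occurs.
    static Map<String, Integer> count(Iterable<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            // Inserts 1 for a new token, otherwise adds 1 to the stored count.
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            count(Arrays.asList("to", "be", "or", "not", "to", "be"));
        System.out.println(counts.get("to") + " " + counts.get("or")); // prints "2 1"
    }
}
```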
Karl Wettin <[EMAIL PROTECTED]> wrote:
On 1 Nov 2007, at 18:09, Cool Coder wrote:
prior to adding into index
Easiest way out would be to add the document to a temporary index and extract the term frequency vector. I would recommend using MemoryIndex.
You could also tokenize the document and pass the data to a TermVectorMapper. You could consider replacing the fields of the document with CachedTokenStreams if you got the RAM to spare and don't want to waste CPU analyzing the document twice. I welcome TermVectorMappingChachedTokenStreamFactory. Even cooler would be to pass code down the IndexWriter.addDocument using a command pattern or something, allowing one to extend the document at the time of the analysis.
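
The last idea, running caller-supplied code while the document is being analyzed, can be sketched in plain Java as a pass-through wrapper that notifies a callback for every token. This is only an analogy under stated assumptions: ObservedTokens is a hypothetical class, java.util.Iterator stands in for a token stream, and nothing here is an existing Lucene or IndexWriter API. The point is that counting happens during the single pass the consumer makes anyway, so no reset or second analysis is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical pass-through wrapper; an analogy only, not a Lucene class.
class ObservedTokens implements Iterator<String> {
    private final Iterator<String> delegate;   // the underlying one-shot token source
    private final Consumer<String> observer;   // caller-supplied code run for each token

    ObservedTokens(Iterator<String> delegate, Consumer<String> observer) {
        this.delegate = delegate;
        this.observer = observer;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public String next() {
        String token = delegate.next();
        observer.accept(token);                // observe the token, then pass it on unchanged
        return token;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        Iterator<String> stream = new ObservedTokens(
            Arrays.asList("to", "be", "or", "not", "to", "be").iterator(),
            token -> counts.merge(token, 1, Integer::sum));

        // Stand-in for the indexer: it simply drains the stream once.
        List<String> indexed = new ArrayList<>();
        while (stream.hasNext()) {
            indexed.add(stream.next());
        }
        System.out.println(indexed.size() + " " + counts.get("to")); // prints "6 2"
    }
}
```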


