Currently I have extended StandardAnalyzer and am counting tokens in the following way. However, the index is not getting created, even though I call tokenStream.reset(). I am not sure whether reset() on the token stream actually works. I am debugging it now.

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
        // Count the tokens and put them in a Map
        analyzeTokens(result);
        try {
            result.reset();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return result;
    }

    public void analyzeTokens(TokenStream result) {
        try {
            Token token = result.next();
            while (token != null) {
                String tokenStr = token.termText();
                if (TokenHolder.tokenMap.get(tokenStr) == null) {
                    TokenHolder.tokenMap.put(tokenStr, 1);
                } else {
                    TokenHolder.tokenMap.put(tokenStr,
                        Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
                }
                token = result.next();
            }
            // extra reset
            result.reset();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
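One possible explanation, offered as a guess rather than a diagnosis: a token stream fed by a Reader can usually be consumed only once, so counting the tokens first may leave nothing for the indexer unless the tokens are buffered. The sketch below shows the buffer-and-replay idea behind the CachedTokenStreams suggestion in the quoted message, in plain Java with no Lucene dependency. The class BufferedTokens and its methods from(), counts() and replay() are hypothetical names made up for this illustration, and the tokenization rule (split on anything that is not a letter or digit, then lowercase) is an assumption, not what StandardAnalyzer does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a caching token stream; this is not a Lucene class.
class BufferedTokens {
    private final List<String> tokens = new ArrayList<>();

    private BufferedTokens() {
    }

    // Reads the Reader exactly once and keeps every token in memory.
    static BufferedTokens from(Reader reader) {
        BufferedTokens buffered = new BufferedTokens();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    current.append(Character.toLowerCase((char) c));
                } else if (current.length() > 0) {
                    buffered.tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            buffered.tokens.add(current.toString());
        }
        return buffered;
    }

    // Term frequencies computed from the buffer, without touching the Reader again.
    Map<String, Integer> counts() {
        Map<String, Integer> result = new TreeMap<>();
        for (String token : tokens) {
            result.merge(token, 1, Integer::sum);
        }
        return result;
    }

    // A fresh iterator over the buffered tokens; it can be requested any number of times.
    Iterator<String> replay() {
        return tokens.iterator();
    }
}

class BufferedTokensDemo {
    public static void main(String[] args) {
        BufferedTokens buffered =
            BufferedTokens.from(new StringReader("Lucene counts tokens; Lucene indexes tokens."));
        System.out.println(buffered.counts());
        // The same tokens are still available for a second pass.
        for (Iterator<String> it = buffered.replay(); it.hasNext();) {
            System.out.println(it.next());
        }
    }
}
```

The trade-off is the one mentioned in the quoted message: the whole document's tokens stay in RAM so that the text does not have to be analyzed twice.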
Karl Wettin <[EMAIL PROTECTED]> wrote:

On 1 Nov 2007, at 18:09, Cool Coder wrote:

> prior to adding into index

The easiest way out would be to add the document to a temporary index and extract the term frequency vector. I would recommend using MemoryIndex.

You could also tokenize the document and pass the data to a TermVectorMapper.

You could consider replacing the fields of the document with CachedTokenStreams if you have the RAM to spare and don't want to waste CPU analyzing the document twice.

I welcome TermVectorMappingChachedTokenStreamFactory.

Even cooler would be to pass code down to IndexWriter.addDocument using a command pattern or something, allowing one to extend the document at the time of the analysis.

--
karl