Re: Potential bug in StandardTokenizerImpl

Eugenio Martinez Tue, 27 Nov 2007 05:11:51 -0800

I am the person who raised the question about the acronym/host detection
anomaly in the StandardAnalyzer class.


Thanks to Shai Erera for bringing the discussion over to the developers' list.
I am surprised by Chris Hostetter's response, as this issue was already
discussed by Erik Hatcher on November 22, 2005. I am now going through
Hatcher's superb book, Lucene in Action, looking for a way to work around the
issue, but I can't believe that it hasn't been fixed yet.

As I explained on the users' list, I've found that indexing fails to include
certain email addresses and words that are present in the log files when I run
an IndexWriter over a huge directory of logs. While trying to isolate this bug,
I came across the acronym interpretation issue. There may be more hidden
anomalies in the StandardAnalyzer's behavior under such a huge load.

At this point I can say the behavior is deterministic: I can reproduce it
across successive index and search calls, and it affects the same words and
email addresses every time. Could it be a side effect of document
vectorization, given that the logs are not natural language? Since Lucene
computes whether a token conveys relevant information (as the vector space
model states), could Lucene have decided that these tokens are not relevant?
All of this assuming it works correctly, of course...

Does anyone have an idea about this, or has anyone heard of it before?

Thanks and regards.

Eugenio F. Martínez Pacheco

Fundación Instituto Tecnológico de Galicia - Área TIC

TEL: 981 173 206            FAX: 981 173 223

VIDEOCONFERENCE: 981 173 596

[EMAIL PROTECTED]






       
