I am the one who raised the question about the acronym/host detection anomaly in the StandardAnalyzer class.
Thanks to Shai Erera for moving the discussion to the developers' list. I am surprised by Chris Hostetter's response, as this issue was addressed by Erik Hatcher on November 22, 2005. I am now going through Hatcher's superb book, Lucene in Action, trying to work around this issue, but I can't believe it hasn't been fixed yet.

As I explained on the users' list, I've found that indexing fails to include certain emails and words that are present in the log files when I run an IndexWriter over a huge directory of logs. While trying to isolate this bug, I ran into the acronym interpretation issue. There may be more hidden anomalies in StandardAnalyzer's behavior under such a huge load. At this point I can say the behavior is deterministic: I can reproduce it across subsequent index and search calls, and it affects the same words and emails over and over.

Could it be a side effect of document vectorization, since the logs are not natural language? Lucene computes whether a token conveys relevant information (as the vector space model states), so could Lucene have decided that these tokens are not relevant? All of this supposing it works well, of course...

Any ideas about this, or have you heard of anything similar?

Thanks and regards.

Eugenio F. Martínez Pacheco
Fundación Instituto Tecnológico de Galicia - Área TIC
Tel: 981 173 206
Fax: 981 173 223
Videoconference: 981 173 596
[EMAIL PROTECTED]