I am the one who raised the question about the acronym/host detection anomaly
in the StandardAnalyzer class.
Thanks to Shai Erera for bringing the discussion to the developers' list. I
am surprised by Chris Hostetter's response, as this issue was discussed by
Erik Hatcher on November 22, 2005. I am now reading Hatcher's superb book,
Lucene in Action, looking for a way to work around this issue, but I can't
believe that it hasn't been fixed yet.
As I explained on the users' list, I have found that indexing fails to include
certain email addresses and words that are present in the log files when I run
an IndexWriter over a huge directory of logs. While trying to isolate this bug,
I came across the acronym interpretation issue. There may be more hidden
anomalies in the StandardAnalyzer's behavior under such a huge load.
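To illustrate how I understand the acronym/host ambiguity, here is a minimal
sketch in Python. It does not use Lucene: the two patterns, the longest-match
rule, and the dot-stripping step are my own simplified assumptions about how
such a tokenizer could behave, and the names ACRONYM, HOST, classify, and
normalize are hypothetical. It only shows how a host name followed by a period,
which is common at the end of a log line, could end up typed as an acronym.

```python
import re

# Hypothetical, simplified patterns. They approximate the idea of an
# "acronym" rule and a "host" rule; they are not copied from Lucene.
ACRONYM = re.compile(r"[A-Za-z]+\.(?:[A-Za-z]+\.)+")    # e.g. "I.B.M."
HOST = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+")   # e.g. "www.abc.com"


def classify(text):
    """Return (type, matched_text) for the longest rule match at the start of text."""
    candidates = []
    for name, pattern in (("ACRONYM", ACRONYM), ("HOST", HOST)):
        m = pattern.match(text)
        if m:
            candidates.append((len(m.group()), name, m.group()))
    if not candidates:
        return ("NONE", "")
    # Assumption: the rule that consumes the most characters wins.
    _, name, matched = max(candidates, key=lambda c: c[0])
    return (name, matched)


def normalize(token_type, token):
    """Assumed post-processing: dots are stripped from acronym tokens only."""
    return token.replace(".", "") if token_type == "ACRONYM" else token


if __name__ == "__main__":
    for sample in ("I.B.M.", "www.abc.com", "www.abc.com."):
        token_type, token = classify(sample)
        print(sample, token_type, normalize(token_type, token))
```

Under these assumptions, the trailing period is enough to change the token type,
so a search for the host name as it was written would not match the indexed term.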
At this point I can say the behavior is deterministic: I can reproduce it
across subsequent index and search calls, and it affects the same words and
email addresses every time. Could it be a side effect of document
vectorization, given that the logs are not natural language? Since Lucene
computes whether a token conveys relevant information (as the vector space
model states), could Lucene have decided that these tokens are not relevant?
All of this assuming it works correctly, of course...
Does anyone have an idea about this, or has anyone heard of it before?
Thanks and regards.
Eugenio F. Martínez Pacheco
Fundación Instituto Tecnológico de Galicia - Área TIC
TEL: 981 173 206 FAX: 981 173 223
VIDEOCONFERENCE: 981 173 596
[EMAIL PROTECTED]