I am indexing a huge set of logfiles with Lucene, about 130GB of plain text on
disk so far, and I plan to build a system capable of searching over terabytes
of such data through a kind of meta-index built from a mesh of small indexes,
all of them created and maintained with Lucene.
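For what it's worth, the federation we have in mind looks roughly like this.
It is only a minimal sketch, assuming one small index per shard; the index
paths and the "contents" field are made-up examples, not our real layout:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.TermQuery;

    public class MetaIndexSearch {
        public static void main(String[] args) throws Exception {
            // One small Lucene index per shard (paths are hypothetical).
            Searchable[] shards = {
                new IndexSearcher("/indexes/logs-part1"),
                new IndexSearcher("/indexes/logs-part2"),
            };
            // MultiSearcher federates all shards behind a single Searcher,
            // which is essentially what we mean by a "meta-index".
            MultiSearcher meta = new MultiSearcher(shards);
            Hits hits = meta.search(new TermQuery(new Term("contents", "error")));
            System.out.println(hits.length() + " hits across all shards");
            meta.close();
        }
    }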
The file sizes vary widely, from 1KB to several hundred MB of plain text, and
I have run tests with files of about 2GB, obtaining very good performance in
both indexing and search time. Of course, once we could get search results out
of such a system, we became confident that Lucene was doing its job correctly,
i.e., splitting all the content and indexing every token as expected.
But last week, with our first beta release in our LAN environment, some
problems arose. In certain situations we have found that the analysis stage
"fails", or rather, behaves anomalously. We have isolated one case that can be
reproduced with Luke in its Search window: a URL domain ending with a period,
as in "www.my.domain.es.", becomes a single token with the following text:
"wwwmydomaines".
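The same thing can be reproduced outside Luke. Below is a minimal sketch using
the old-style TokenStream API; it assumes StandardAnalyzer (Luke's default),
and the field name and sample addresses are made up for illustration. Our
suspicion is that the trailing period makes the URL match the tokenizer's
ACRONYM rule (as in "U.S.A."), so StandardFilter strips the inner dots:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenDump {
        public static void main(String[] args) throws Exception {
            // Hypothetical log line; the address is a made-up example.
            String line = "visit www.my.domain.es. or mail user@my.domain.es";
            TokenStream ts = new StandardAnalyzer()
                    .tokenStream("contents", new StringReader(line));
            // Print each token's text and its lexical type.
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText() + "  type=" + t.type());
            }
        }
    }

On our setup the URL with the trailing period comes out as "wwwmydomaines",
while the same URL without it is kept whole as a <HOST> token.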
This behavior may extend to e-mail addresses as well: we are unable to get
search results for some addresses that are definitely present in the logfile
contents, and for some ordinary words too.
Such behavior is not acceptable for anybody, since in natural text it is
perfectly possible to find such URLs at the end of a sentence. Is this an
effect of document vectorization? I ask because the structure of log content
doesn't follow natural-language rules...
Any pointers on this?
We are working on a custom Log Analyzer now (a rough sketch of the direction
appears below), but I'm sure I'm not the only person in the world with this
issue... Has anyone else run into it?
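This is only a minimal sketch of the idea, assuming whitespace is the only
separator that matters in our log lines; "LogAnalyzer" is our own working
name, not a Lucene class:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Split only on whitespace, so URLs and e-mail addresses survive as
    // single tokens (trailing periods included), then lowercase them.
    public class LogAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(new WhitespaceTokenizer(reader));
        }
    }

The trade-off is that "www.my.domain.es." is then indexed with its trailing
period, so the same analyzer has to be used at query time (or the punctuation
trimmed in an extra filter).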
Thanks for your attention.
Eugenio F. Martínez Pacheco
Fundación Instituto Tecnológico de Galicia - Área TIC
Tel: 981 173 206 Fax: 981 173 223
Videoconference: 981 173 596
[EMAIL PROTECTED]