Search is pretty fast, and read-only, so in my case I just created three indexes, and for every three Lucene documents I saved one into each index. Then, on a search, I merge the results from the three smaller indexes. The only thing to consider is to store all parts of a source document in the same index so that boolean queries still work. I have even threaded the searching so that the searches on the three indexes run in parallel.
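A minimal sketch of the three-index setup described above, in plain Java. It uses in-memory maps as stand-ins for the real indexes and no Lucene classes at all; the class name ShardedSearch, the hash-based routing, and the substring matching are assumptions made for illustration, not the code actually used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardedSearch {
    // Each shard maps a source-document id to the parts stored for it.
    // (Hypothetical stand-in for one small index.)
    private final List<Map<String, List<String>>> shards = new ArrayList<>();

    public ShardedSearch(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new TreeMap<>());
        }
    }

    // Route by source id, so every part of one source lands in the same shard.
    int shardFor(String sourceId) {
        return Math.floorMod(sourceId.hashCode(), shards.size());
    }

    public void add(String sourceId, String part) {
        shards.get(shardFor(sourceId))
              .computeIfAbsent(sourceId, k -> new ArrayList<>())
              .add(part);
    }

    // AND query inside one shard: a source matches only if every term occurs
    // somewhere in its parts. Crude substring matching, for illustration only.
    private static List<String> searchShard(Map<String, List<String>> shard,
                                            List<String> terms) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shard.entrySet()) {
            String text = String.join(" ", entry.getValue());
            boolean allPresent = true;
            for (String term : terms) {
                if (!text.contains(term)) {
                    allPresent = false;
                    break;
                }
            }
            if (allPresent) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    // Search all shards in parallel, then merge the per-shard results.
    public List<String> search(List<String> terms) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Map<String, List<String>> shard : shards) {
                Callable<List<String>> task = () -> searchShard(shard, terms);
                futures.add(pool.submit(task));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                merged.addAll(future.get());
            }
            Collections.sort(merged); // a real merge would order by score
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        ShardedSearch index = new ShardedSearch(3);
        index.add("doc-a", "lucene indexes text");
        index.add("doc-a", "searching is fast");
        index.add("doc-b", "lucene merges segments");
        System.out.println(index.search(List.of("lucene", "fast")));
    }
}
```

The query for "lucene" AND "fast" only finds doc-a because both of its parts sit in the same shard; if the parts were spread over different shards, no single shard could satisfy the AND.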
By the way, stop-word filters can also do wonders for an index full of text...
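To illustrate what a stop-word filter buys you, here is a small plain-Java sketch that drops common words before they would reach the index. The class name StopWordSketch and the six-word stop list are made up for the example; a real setup would rely on the analyzer's own stop-word handling rather than this code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Illustrative stop list only; real lists are considerably longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Lowercase the line, split on anything that is not a letter or digit,
    // and keep only the tokens that are not stop words.
    static List<String> indexableTerms(String line) {
        List<String> kept = new ArrayList<>();
        String[] tokens = line.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Eight tokens go in, three terms come out.
        System.out.println(indexableTerms("The speed of the index and the merge"));
    }
}
```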
Best regards, Karl Øie
On 1 Apr 2005, at 11:43, Fabien Le Floc'h wrote:
Hello,
I want to index a 1GB file that contains a list of lines of
approximately 100 characters each, so that I can later get lines
containing some particular text. The natural way of doing it with Lucene
would be to create 1 Lucene Document per line. It works well except it
is too slow for my needs, even after tweaking all possible parameters of
IndexWriter and using the CVS version of Lucene.
I can get 10x the indexing performance by indexing the file as 1 Lucene
Document. Lucene builds a good index with all the terms and I am able to
get the number of terms matching a query but not the absolute position
in the original file (I only get the token relative position). A minor
quirk with this approach is that I need to split the document in order
to avoid an OutOfMemoryError when the document is too big. It would
probably be possible for me to customize Lucene for my needs (create a
more flexible Term class), though that would just be a hack. I was
wondering why there should be such a performance difference.
I see that for each document plenty of work is done, but that seems necessary, and then there is even more work while merging segments. Things could probably be faster if documents were first aggregated and then work done on them. But I think this would imply huge changes in Lucene. Any advice for indexing millions of tiny docs?
Regards,
Fabien.