Hello,
 
I want to index a 1 GB file containing lines of roughly 100 characters
each, so that I can later retrieve the lines containing some particular
text. The natural way to do this with Lucene would be to create one
Lucene Document per line. It works well, except that it is too slow for
my needs, even after tweaking all the possible IndexWriter parameters
and using the CVS version of Lucene.
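For reference, this is roughly the per-line setup I mean, as a minimal
sketch (written against a 1.9/2.x-style API, so the setter names may
differ in the CVS trunk; the field names, analyzer and tuning values are
just placeholders, not my actual code):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LineIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("line-index", new WhitespaceAnalyzer(), true);
        // buffer many documents in RAM and merge segments less often
        writer.setMaxBufferedDocs(10000);
        writer.setMergeFactor(50);

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null) {
            Document doc = new Document();
            // stored-only field: lets a hit be mapped back to its line in the file
            doc.add(new Field("line", Integer.toString(lineNo++), Field.Store.YES, Field.Index.NO));
            // indexed but not stored: the searchable text of the line
            doc.add(new Field("contents", line, Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        in.close();
        writer.optimize();
        writer.close();
    }
}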
 
I can get 10x the indexing performance by indexing the file as a single
Lucene Document. Lucene builds a good index with all the terms, and I
can get the number of terms matching a query, but not their absolute
positions in the original file (I only get the relative token
positions). A minor quirk with this approach is that I need to split
the document to avoid an OutOfMemory exception when it is too big. I
could probably customize Lucene for my needs (create a more flexible
Term class), but that would just be a hack. Mostly I am wondering why
there is such a performance difference.
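To make the comparison concrete, here is roughly what I mean by the
single-Document approach, split into chunks to avoid the memory problem
(again a sketch against a 1.9/2.x-style API; CHUNK_LINES and the field
name are arbitrary):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ChunkIndexer {
    private static final int CHUNK_LINES = 100000; // split point, chosen arbitrarily

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("chunk-index", new WhitespaceAnalyzer(), true);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        StringBuffer chunk = new StringBuffer();
        int lines = 0;
        String line;
        while ((line = in.readLine()) != null) {
            chunk.append(line).append('\n');
            if (++lines % CHUNK_LINES == 0) {
                addChunk(writer, chunk.toString());
                chunk.setLength(0);
            }
        }
        if (chunk.length() > 0) addChunk(writer, chunk.toString());
        in.close();
        writer.optimize();
        writer.close();
    }

    private static void addChunk(IndexWriter writer, String text) throws Exception {
        Document doc = new Document();
        // tokenized but not stored: term positions are relative to this chunk's
        // token stream, not absolute offsets in the original file
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }
}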
 
I see that plenty of work is done for each document, which seems
necessary, and then even more work is done while merging segments.
Things could probably be faster if documents were first aggregated and
the work then done on the batch, but I suspect this would imply huge
changes to Lucene. Any advice for indexing millions of tiny documents?
 
 
 
Regards,
 
Fabien.
