I'm seeing performance problems when indexing a certain set of data, and I'm looking for pointers on how to improve the situation. I've read the very helpful performance advice on the Wiki and I'm continuing to experiment based on it, but I'd also like to ask for comments on whether I'm heading in the right direction.
Basically, I'm indexing a collection of around 500 million mostly very small documents. I'm indexing from scratch, starting with an empty index; the resulting index is around 50 GB on disk. The indexing uses a number of concurrent indexing threads (around 12) writing to a single Lucene index. I'm currently on Lucene 3.6.1, running on Linux.

Watching the process in a profiler, I see that after a while it spends a lot of time in merge threads named "Lucene Merge Thread #NNN". These merges seem to take around 50% of the overall time, during which all the indexing threads are locked out. After running for less than an hour, I'm seeing merge threads numbered up to 270, so the merges are frequent as well as long-running. Even when no merge is happening, there is a lot of contention between the indexing worker threads.

My questions are:

- Is what I'm trying to do reasonable, i.e. this number of documents and overall size in a single index?
- What can I do to reduce the amount of time spent merging segments?
- What can I do to improve the concurrency of indexing?

Any suggestions would be highly appreciated.

Regards,
Jan
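P.S. In case it helps, here is a minimal sketch of how the indexing is set up. The path, class name, and field names are illustrative, and the writer configuration is essentially the 3.6 defaults (ConcurrentMergeScheduler, 16 MB RAM buffer):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingSketch {
    public static void main(String[] args) throws Exception {
        // Single on-disk index; path is illustrative.
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));

        // Near-default config: default merge policy/scheduler, 16 MB RAM buffer.
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, cfg);

        // In the real code, ~12 worker threads share this single writer;
        // each thread repeatedly builds a small document and adds it:
        Document doc = new Document();
        doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);

        writer.close();
    }
}
```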