I'm seeing performance problems when indexing a certain set of data, and I'm looking for pointers on how to improve the situation. I've read the very helpful performance advice on the Wiki and I'm continuing to experiment based on it, but I'd also like to ask for comments on whether I'm heading in the right direction.
Basically, I'm indexing a collection of around 500 million mostly very small documents. I'm indexing from scratch, starting with an empty index; the resulting index is around 50 GB on disk. The indexing uses a number of concurrent indexing threads (around 12) writing to a single Lucene index. I'm currently on Lucene 3.6.1, running on Linux.

Watching the process in a profiler, I see that after a while it spends a lot of time in merge threads named "Lucene Merge Thread #NNN". These merges seem to take around 50% of the overall time, during which all the indexing threads are locked out. After running for less than an hour, I'm seeing merge threads numbered up to 270, so the merges are frequent as well as long-running. Even when no merge is happening, there is a lot of contention between the indexing worker threads.

My questions are:

- Is what I'm trying to do reasonable, i.e. this number of documents and overall size in a single index?
- What can I do to reduce the amount of time spent merging segments?
- What can I do to improve the concurrency of indexing?

Any suggestions would be highly appreciated.

Regards,
Jan
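P.S. In case it helps, here is a minimal sketch of how the indexing is set up. The path, class name, and field names are illustrative, and the writer configuration is essentially the 3.6 defaults (ConcurrentMergeScheduler, 16 MB RAM buffer):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingSketch {
    public static void main(String[] args) throws Exception {
        // Single on-disk index; path is illustrative.
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));

        // Near-default config: default merge policy/scheduler, 16 MB RAM buffer.
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, cfg);

        // In the real code, ~12 worker threads share this single writer;
        // each thread repeatedly builds a small document and adds it:
        Document doc = new Document();
        doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);

        writer.close();
    }
}
```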