Java Programmer wrote:

Hi,
Maybe this question is trivial, but I need to ask it. I have a problem with
indexing a large number of documents, and I am looking for a better solution.
The task is to index about 33GB of CSV text data (each record is about 30kB).
It is of course possible to index this data, but I'm not very happy with the
timing (about 26 hours), so I want to know how I can speed up the process.
First I thought about splitting the CSV file into smaller ones, e.g. 5GB each,
and indexing them on 6 indexing computers. My question is: can I join such
parts into one index after the indexing jobs on each computer have finished?
I saw an example with a RAMDirectory that could be merged with an
FSDirectory, but that example used a single IndexWriter; in my case I need
separate IndexWriters on several computers. So is this possible with
Lucene?

Here are some things you can try. First, look at IndexWriter.mergeFactor and IndexWriter.minMergeDocs. minMergeDocs controls how many indexed documents IndexWriter buffers before it actually writes a batch to disk (and therefore how big each disk piece is), and mergeFactor controls how many disk pieces get merged together at a time. Since each merge is essentially a big read and re-write of indexed documents, the fewer merges you do, the shorter your indexing time. On the other hand, merging less often takes more RAM. In other words, it's another incarnation of the classic tradeoff between space and time. As one data point, one of my applications has documents about 10% the size of yours (about 3K each). Changing minMergeDocs to 70000 and mergeFactor to 70 cut its indexing time by more than half.
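
To get a rough feel for why these two settings matter, here is a small sketch that counts how often documents get written to disk. It is a simplified model of level-by-level merging, not Lucene code: the class MergeCostModel and the method documentWrites are hypothetical names, and the model assumes that every flush writes exactly minMergeDocs documents and that a merge happens whenever mergeFactor equal-sized pieces exist. It ignores RAM use and any final optimize step.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of segment merging; this is not Lucene's code. */
class MergeCostModel {

    /**
     * Counts document writes while indexing totalDocs documents:
     * one write when a document is first flushed to disk, plus one
     * rewrite each time the piece that holds it takes part in a merge.
     * Assumes minMergeDocs >= 1 and mergeFactor >= 2.
     */
    static long documentWrites(long totalDocs, int minMergeDocs, int mergeFactor) {
        // levels.get(i) = number of on-disk pieces currently at merge level i
        List<Integer> levels = new ArrayList<>();
        long writes = 0;
        long fullFlushes = totalDocs / minMergeDocs;

        for (long f = 0; f < fullFlushes; f++) {
            writes += minMergeDocs;            // flush the buffered batch as a new piece
            long pieceSize = minMergeDocs;
            int level = 0;
            while (true) {
                if (levels.size() == level) {
                    levels.add(0);
                }
                levels.set(level, levels.get(level) + 1);
                if (levels.get(level) < mergeFactor) {
                    break;                     // not enough pieces at this level yet
                }
                // mergeFactor pieces at this level: merge them into one bigger piece
                levels.set(level, 0);
                pieceSize *= mergeFactor;
                writes += pieceSize;           // every document in the merged piece is rewritten
                level++;
            }
        }
        writes += totalDocs % minMergeDocs;    // final partial batch, flushed once
        return writes;
    }
}
```

Taking 33GB at 30kB per record as roughly 1,100,000 documents (an estimate, not a measured count), documentWrites(1_100_000, 10, 10) returns 6500000 under this model, while documentWrites(1_100_000, 70000, 70) returns 1100000, because no merge is ever triggered. Real timings depend on much more than this count, so treat it only as an illustration of the trend.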

Another approach is along the lines of what you mentioned: index subsets of the data on several machines, then merge the resulting indexes together at the end with IndexWriter.addIndexes(). The merge step has to be done on one machine.
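
The splitting step that precedes this could look something like the sketch below. It only covers dividing the input, not the indexing or the final addIndexes() merge. The class CsvSplitter, the method split and the part-N.csv file names are hypothetical, and the code assumes one CSV record per line and UTF-8 text; if your records contain quoted line breaks you would need a real CSV parser instead.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: deals the lines of one file out to several part files. */
class CsvSplitter {

    /**
     * Writes the lines of source round-robin into `parts` files named
     * part-0.csv, part-1.csv, ... inside outDir, and returns their paths.
     * Assumes each record occupies exactly one line.
     */
    static List<Path> split(Path source, Path outDir, int parts) throws IOException {
        Files.createDirectories(outDir);
        List<Path> paths = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            for (int i = 0; i < parts; i++) {
                Path part = outDir.resolve("part-" + i + ".csv");
                paths.add(part);
                writers.add(Files.newBufferedWriter(part, StandardCharsets.UTF_8));
            }
            String line;
            long lineNumber = 0;
            while ((line = in.readLine()) != null) {
                // Line k goes to part (k mod parts), so parts end up nearly equal in size.
                BufferedWriter out = writers.get((int) (lineNumber % parts));
                out.write(line);
                out.newLine();
                lineNumber++;
            }
        } finally {
            for (BufferedWriter writer : writers) {
                writer.close();
            }
        }
        return paths;
    }
}
```

Round-robin is used here so that the parts come out close to the same size without knowing the record count in advance; splitting by byte offset would also work but needs care not to cut a record in half.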

Of course, it's also possible that parsing the documents themselves is a big chunk of your time. If you're using your own Analyzer, or your data is unusual in some way, you might look at that too.

Note that these approaches are not mutually exclusive, i.e. you can combine them. Good luck!

--MDC
