Re: Making lucene indexing multi threaded

Erick Erickson Mon, 02 Sep 2013 07:38:41 -0700

Stop. Back up. Test. <G>....

The very _first_ thing I'd do is just comment out the bit that
actually indexes the content. I'm guessing you have some
loop like:

while (more files) {
  read the file
  transform the data
  create a Lucene document
  index the document
}

Just comment out the "index the document" line and see how
long _that_ takes. 9 times out of 10, the bottleneck is in
reading and transforming the data, not in the indexing itself.
As a comparison, I can index 3-4K docs/second on my laptop.
This is using Solr with the Wikipedia dump, so the docs
are several KB each.
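
One way to get those numbers without commenting anything out is to time
the two halves of the loop separately. What follows is only a rough
sketch, not code from the original application: StageTimer, prepare and
index are made-up names, where prepare stands for "read, transform and
build the document" and index stands for whatever call hands the
document to Lucene.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper: accumulates wall-clock time per stage of the loop.
final class StageTimer {

    /** Total nanoseconds spent in each stage. */
    static final class Result {
        final long prepareNanos;
        final long indexNanos;

        Result(long prepareNanos, long indexNanos) {
            this.prepareNanos = prepareNanos;
            this.indexNanos = indexNanos;
        }
    }

    static <I, D> Result run(List<I> inputs,
                             Function<I, D> prepare,
                             Consumer<D> index) {
        long prepareNanos = 0;
        long indexNanos = 0;
        for (I input : inputs) {
            long t0 = System.nanoTime();
            D doc = prepare.apply(input);   // read + transform + create document
            long t1 = System.nanoTime();
            index.accept(doc);              // assumed stand-in for the indexing call
            long t2 = System.nanoTime();
            prepareNanos += t1 - t0;
            indexNanos += t2 - t1;
        }
        return new Result(prepareNanos, indexNanos);
    }
}
```

If prepareNanos dwarfs indexNanos, the time is going into acquiring the
data, and that is the part worth speeding up.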

So, if you're going to multi-thread, you'll probably want to
multi-thread the acquisition of the data and feed that
through a separate thread that actually does the indexing.
You don't want multiple IndexWriters active on the same
index at once.
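
The arrangement described above could be sketched roughly like this:
several worker threads prepare documents and put them on a bounded
queue, and one thread drains the queue and does the indexing. Treat it
as an illustration under assumptions, not a tested recipe.
SingleWriterPipeline, prepare and index are hypothetical names, the
queue size of 1024 is arbitrary, and index is assumed to wrap a call
such as addDocument on a single shared IndexWriter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: many preparing threads, exactly one indexing thread.
final class SingleWriterPipeline {

    /** Returns the number of documents handed to the index callback. */
    static <I, D> int run(List<I> inputs, int producers,
                          Function<I, D> prepare, Consumer<D> index)
            throws InterruptedException, ExecutionException {
        // Bounded queue, so producers cannot run far ahead of the indexer.
        BlockingQueue<Optional<D>> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(producers);
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            // The only thread that ever calls index; an empty Optional ends it.
            Callable<Integer> indexLoop = () -> {
                int count = 0;
                RuntimeException failure = null;
                for (Optional<D> next = queue.take(); next.isPresent(); next = queue.take()) {
                    if (failure != null) {
                        continue;   // keep draining so producers never block forever
                    }
                    try {
                        index.accept(next.get());
                        count++;
                    } catch (RuntimeException e) {
                        failure = e;
                    }
                }
                if (failure != null) {
                    throw failure;
                }
                return count;
            };
            Future<Integer> indexed = writer.submit(indexLoop);

            // Reading and transforming happen in parallel on the pool.
            List<Future<Void>> pending = new ArrayList<>();
            for (I input : inputs) {
                Callable<Void> task = () -> {
                    queue.put(Optional.of(prepare.apply(input)));
                    return null;
                };
                pending.add(pool.submit(task));
            }
            for (Future<Void> f : pending) {
                f.get();                    // rethrows any producer failure
            }
            queue.put(Optional.empty());    // end-of-input marker
            return indexed.get();
        } finally {
            pool.shutdownNow();
            writer.shutdownNow();
        }
    }
}
```

Documents reach the indexer in whatever order the workers finish, not
in input order, which normally doesn't matter for a search index.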

FWIW,
Erick

On Mon, Sep 2, 2013 at 10:13 AM, nischal reddy
<nischal.srini...@gmail.com> wrote:

> Hi,
>
> I am thinking of making my Lucene indexing multi-threaded. Can someone shed
> some light on the best approach for achieving this?
>
> I will give a short gist of what I am trying to do; please suggest the
> best way to tackle this.
>
> What am I trying to do?
>
> I am building an index for files (around 30000 files), and will later use
> this index to search the contents of the files. The usual sequential
> approach works fine but takes a humongous amount of time (around 30
> minutes; is this the expected time, or am I screwing things up somewhere?).
>
> What am I thinking of doing?
>
> So, to improve performance, I am thinking of making my application
> multithreaded.
>
> Need suggestions :)
>
> Please suggest the best ways to do this. Also, how long does Lucene
> normally take to index 30k files?
>
> Please point me to some links or examples (or best practices for
> multithreading Lucene) to make my application more robust.
>
> TIA,
> Nischal Y
>
