I saw you improved the code and got good results! The PR is too big to leave a 
comment on. My guess is that fingerprinting the file is slow and adding the 
results to Lucene is fast. Some more ideas:

1)

Run a code profiler to figure out exactly what is slow and concentrate on 
fixing that.
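
As a quick first cut before setting up a full profiler, you could wrap the two 
phases in Stopwatch calls and log the split. This is only a sketch; 
fingerprintFile and addToIndex are hypothetical stand-ins for the project's 
real fingerprinting call and IndexWriter.AddDocument:

```csharp
using System;
using System.Diagnostics;

static class PhaseTimer
{
    // Times each phase separately and returns the split in milliseconds.
    public static (long fingerprintMs, long indexMs) TimePhases(
        Action fingerprintFile, Action addToIndex)
    {
        var sw = Stopwatch.StartNew();
        fingerprintFile();                  // stand-in: compute subfingerprints
        long fingerprintMs = sw.ElapsedMilliseconds;

        sw.Restart();
        addToIndex();                       // stand-in: IndexWriter.AddDocument
        long indexMs = sw.ElapsedMilliseconds;

        return (fingerprintMs, indexMs);
    }
}
```

If fingerprinting dominates, parallelize that first; if indexing dominates, 
look at IndexWriter settings and batching instead.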

2)

The current implementation submits 500 Tasks; experiment with raising or 
lowering that number. Loading a lot of rows into memory at once may be slow.

3)

Implement the original idea in https://pastebin.com/g0QKhCb1 to experiment with 
setting a limit on the Task pool size based on your hardware. The runtime may 
be scheduling too many things to run at once.
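
The same cap can also be had by gating Task.Run behind a SemaphoreSlim, so at 
most maxWorkers Tasks run at once. A minimal sketch (BoundedRunner and its 
parameter names are mine, not from the pastebin):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class BoundedRunner
{
    // Runs all work items, but never more than maxWorkers at the same time.
    public static async Task RunAll(IEnumerable<Action> work, int maxWorkers)
    {
        using var gate = new SemaphoreSlim(maxWorkers);
        var tasks = new List<Task>();
        foreach (var action in work)
        {
            await gate.WaitAsync();         // blocks while maxWorkers are busy
            tasks.Add(Task.Run(() =>
            {
                try { action(); }
                finally { gate.Release(); } // free a slot for the next item
            }));
        }
        await Task.WhenAll(tasks);
    }
}
```

Start with maxWorkers = 4 like the pseudo-code, then try 8-16 and measure.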

4)

Explore batching things so one Task handles adding 25(?) Documents at a time so 
there's less switching between Tasks:

https://stackoverflow.com/questions/13731796/create-batches-in-linq

// I hope this formats correctly, I'm not replying through a traditional email 
// client
// Cast<DataRow>() is needed because DataRowCollection is non-generic
foreach (var batch in dt.Rows.Cast<DataRow>().Batch(25))
{
    tasks.Add(Task.Run(() =>
    {
        foreach (DataRow row in batch)
        {
            Document doc = new Document();
            // ... add fields, then indexWriter.AddDocument(doc) ...
        }
    }));
}
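
In case Batch() isn't already defined in the project, the extension method 
from that Stack Overflow question looks roughly like this (on .NET 6+ the 
built-in Enumerable.Chunk does the same job):

```csharp
using System.Collections.Generic;
using System.Linq;

static class BatchExtensions
{
    // Splits any sequence into fixed-size chunks; the last chunk may be short.
    public static IEnumerable<List<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        var bucket = new List<T>(size);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket;
                bucket = new List<T>(size);
            }
        }
        if (bucket.Count > 0)
            yield return bucket;            // final partial batch
    }
}
```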

5)

Create a pipeline so you're constantly pulling from the database, then 
fingerprinting, then indexing. The current implementation stops processing each 
time another 500 records are queried from the database. A BlockingCollection 
could be used so one producer extracts rows from the database, many consumers 
fingerprint them, and a few consumers add the results to Lucene.
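
A rough sketch of that pipeline with BlockingCollection, where plain strings 
stand in for database rows and fingerprints. Pipeline, indexDoc, and the 
worker counts are placeholder names and numbers; in the real project the 
stages would be the database reader, the fingerprinter, and 
IndexWriter.AddDocument:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class Pipeline
{
    public static void Run(string[] rows, Action<string> indexDoc,
                           int fingerprintWorkers = 8, int indexWorkers = 2)
    {
        // Bounded queues so no stage can race ahead and fill memory.
        using var fetched = new BlockingCollection<string>(boundedCapacity: 500);
        using var fingerprinted = new BlockingCollection<string>(boundedCapacity: 500);

        // One producer: pull rows from the database.
        var producer = Task.Run(() =>
        {
            foreach (var row in rows) fetched.Add(row);
            fetched.CompleteAdding();
        });

        // Many consumers: fingerprint.
        var fps = new Task[fingerprintWorkers];
        for (int i = 0; i < fingerprintWorkers; i++)
            fps[i] = Task.Run(() =>
            {
                foreach (var row in fetched.GetConsumingEnumerable())
                    fingerprinted.Add("fp:" + row);  // stand-in for fingerprinting
            });

        // A few consumers: add to Lucene.
        var indexers = new Task[indexWorkers];
        for (int i = 0; i < indexWorkers; i++)
            indexers[i] = Task.Run(() =>
            {
                foreach (var fp in fingerprinted.GetConsumingEnumerable())
                    indexDoc(fp);                    // stand-in for AddDocument
            });

        producer.Wait();
        Task.WaitAll(fps);
        fingerprinted.CompleteAdding();  // only after every fingerprinter is done
        Task.WaitAll(indexers);
    }
}
```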

On 2023/01/15 07:08:50 Ron Grabowski wrote:
> Sounds like more of a producer/consumer problem than a Lucene.net problem. 
> Here's some untested pseudo-code showing how to create a Task pool that has a 
> configurable size of 4 workers, though 8-16 might be better on your hardware. 
> Tasks are quickly submitted to the pool, then the pool works on them 4 Tasks 
> at a time until all Tasks complete:
> 
> https://pastebin.com/g0QKhCb1
> 
> According to 
> https://lucenenet.apache.org/docs/3.0.3/class_lucene_1_1_net_1_1_index_1_1_index_writer.html#details
>  IndexWriter is thread-safe. Note that I reduced locking on the counter by 
> only updating it at the end of each small batch, not after each Document was 
> added. Batch size could change from 5000 to 2500 for more frequent status 
> updates.
> 
> On 2023/01/09 01:27:42 BradelSablink wrote:
> > I use Lucene.net 3.0.3 in an audio fingerprinting project and was wondering 
> > how I could improve the indexing speed? It takes ~1 week to make indexes of 
> > subfingerprints for 7+ million songs on a 32 core system with 64GB ram. I 
> > see that only 1 CPU core is doing 100% of the indexing. How can I use 
> > multiple cores to speed up indexing? Or maybe there's a better way to speed 
> > it up? I'm a Lucene.net novice compared to all of you so thank you for any 
> > help. The area in question where indexing is slow: 
> > https://github.com/nelemans1971/AudioFingerprinting/blob/master/CreateInversedFingerprintIndex/Worker.cs#L237
> 
