Improving search performance for forum search

Arjen van der Meijden Mon, 12 Nov 2012 23:36:45 -0800

Hi List,

I'm working on a search engine for our forum using Lucene 4. Since it's a brand new search engine, I can change it as I see fit.

We have about 1.5M topics in the various subforums and on average 20 replies to each topic (i.e. about 33M in total).

For now, I've opted to index all replies to topics, group the best reply-matches based on their topic-id, and only keep the top X (currently at most 5 per topic).
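
The grouping described above (keep only the best few reply matches per topic) could be sketched in plain Java roughly as follows. This is a minimal illustration under assumptions, not the actual collector: `ScoredReply` and `TopicGrouper` are hypothetical names, no Lucene API is involved, and the score is assumed to be already computed for each reply. Each topic gets a small min-heap, so the weakest kept reply is cheap to find and evict.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** One matching reply. Hypothetical holder, not a Lucene type. */
final class ScoredReply {
    final int replyId;
    final float score;

    ScoredReply(int replyId, float score) {
        this.replyId = replyId;
        this.score = score;
    }
}

/** Keeps at most maxPerTopic best-scoring replies for each topic id. */
final class TopicGrouper {
    // Lowest score first, so the heap head is the weakest kept reply.
    private static final Comparator<ScoredReply> BY_SCORE =
            Comparator.comparingDouble(r -> r.score);

    private final int maxPerTopic;
    private final Map<Integer, PriorityQueue<ScoredReply>> byTopic = new HashMap<>();

    TopicGrouper(int maxPerTopic) {
        if (maxPerTopic < 1) {
            throw new IllegalArgumentException("maxPerTopic must be >= 1");
        }
        this.maxPerTopic = maxPerTopic;
    }

    /** Offers one matching reply; it is kept only if it is among the topic's best. */
    void collect(int topicId, int replyId, float score) {
        PriorityQueue<ScoredReply> heap = byTopic.computeIfAbsent(
                topicId, k -> new PriorityQueue<>(maxPerTopic, BY_SCORE));
        if (heap.size() < maxPerTopic) {
            heap.add(new ScoredReply(replyId, score));
        } else if (score > heap.peek().score) {
            heap.poll(); // evict the weakest kept reply
            heap.add(new ScoredReply(replyId, score));
        }
    }

    /** Returns the kept replies for a topic, best score first. */
    List<ScoredReply> topReplies(int topicId) {
        PriorityQueue<ScoredReply> heap = byTopic.get(topicId);
        if (heap == null) {
            return Collections.emptyList();
        }
        List<ScoredReply> out = new ArrayList<>(heap);
        out.sort(BY_SCORE.reversed());
        return out;
    }
}
```

With `new TopicGrouper(5)`, every matching reply is still visited once, which is why the grouping alone does not remove the need to look at all results.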

This works quite well, but the search time is fairly long. It takes about 330ms to achieve a result with a single word that matches about 45k of the topics. The index is on an SSD in my test machine, and the 330ms is after repeated searches and includes several other aspects.

Obviously, with an average of 20 replies per topic, that could actually be up to about 900k actual Documents being matched (I didn't look at the actual count, but it was probably less).

According to YourKit, about 50% of the time is spent in the Scorer and Collector. It mainly breaks down into two aspects: my custom scoring, and the fact that my code is set up to retrieve all results and do further processing. But given the grouping on the topic-id, I doubt I can actually escape that last part...

To enable customized scoring of the documents, I need access to per-reply and per-topic meta-data. The per-topic meta-data is stored in in-memory objects accessible via a HashMap based on the topic's id, and the per-reply meta-data is simply a unix timestamp stored in a binary field.

A fair amount of the time (about 20%, in Reader.document(doc, StoredFieldVisitor)) is spent retrieving the topicId, replyId and that timestamp from the Documents. The topicId and replyId are encoded into a single binary field.

I already use a specialized StoredFieldVisitor that only retrieves those two binary fields from each document.


So now the questions:

- Can I reduce the overhead of retrieving the document's fields even further?
-- Should I use a different Codec (perhaps Pulsing or one of the "load the fielddata in memory"-codecs) to fetch those binary fields?
-- Should I change them to other field types?
-- Should I encode all binary data in a single field, rather than two fields (i.e. going from 9+8 bytes to 17)?
- Should I use a FieldCache to be able to retrieve the required fields quicker (and how do you even use a FieldCache??) once they've been read?
- Is there a way to delay or skip part of the scoring, so I can skip retrieving Documents altogether? This would probably require predicting that the result is intended for a topic which already has 5 very good replies, so that seems a bit far-fetched (although it would yield the most gain).
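
For the single-field option asked about above, the packing itself could look roughly like the sketch below. It only illustrates going from a 9-byte id field plus an 8-byte timestamp field to one 17-byte value; whether that is actually faster to load is exactly the open question. `ReplyRecord` and its methods are hypothetical names, the 9-byte id encoding is treated as an opaque blob because its layout is not described here, and big-endian byte order for the timestamp is an assumption.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Packs the 9-byte id blob and the 8-byte timestamp into one 17-byte value. */
final class ReplyRecord {
    static final int IDS_LENGTH = 9;        // topicId + replyId, layout left opaque
    static final int TIMESTAMP_LENGTH = 8;  // unix timestamp as a long
    static final int RECORD_LENGTH = IDS_LENGTH + TIMESTAMP_LENGTH;

    private ReplyRecord() {
    }

    /** Concatenates the existing id bytes and the timestamp (big-endian, assumed). */
    static byte[] pack(byte[] ids, long unixTimestamp) {
        if (ids.length != IDS_LENGTH) {
            throw new IllegalArgumentException("ids must be " + IDS_LENGTH + " bytes");
        }
        return ByteBuffer.allocate(RECORD_LENGTH)
                .put(ids)
                .putLong(unixTimestamp)
                .array();
    }

    /** Returns a copy of the id bytes from a packed record. */
    static byte[] idsOf(byte[] record) {
        checkLength(record);
        return Arrays.copyOfRange(record, 0, IDS_LENGTH);
    }

    /** Reads the timestamp that follows the id bytes. */
    static long timestampOf(byte[] record) {
        checkLength(record);
        return ByteBuffer.wrap(record, IDS_LENGTH, TIMESTAMP_LENGTH).getLong();
    }

    private static void checkLength(byte[] record) {
        if (record.length != RECORD_LENGTH) {
            throw new IllegalArgumentException("record must be " + RECORD_LENGTH + " bytes");
        }
    }
}
```

A packed value would then be stored in one binary field instead of two, so a field visitor would only need to accept a single field per document.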


Any other tips?

Best regards,

Arjen van der Meijden
Tweakers.net B.V.

