Hi,

I am using Lucene to index a large number of web pages (a few hundred GB), and
the indexing speed is great.

Lately I have been trying to index at the sentence level rather than the
document level. My problem is that indexing speed has dropped dramatically, and
I am wondering whether there is any way to improve it.

When indexing at the sentence level, the overall amount of data stays the same,
but the number of records increases substantially (since there are usually many
sentences per web page).

It seems to me that indexing speed (everything else being equal) depends
largely on the number of Documents inserted into the index, and not so much on
the amount of data within each Document. Is that correct?

I have experimented with the merge factor, with RAMDirectory, and so on, and I
am quite comfortable with our overall configuration, so my guess is that the
configuration is not the issue (and I am QUITE happy with the indexing speed as
long as I index complete pages rather than sentences).
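
For reference, here is roughly the kind of setup I use (a minimal sketch
against the Lucene 2.x API; the path and the tuning values are illustrative
rather than my actual settings):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class IndexSetup {
    public static IndexWriter openWriter(String path) throws IOException {
        // Create a fresh on-disk index, analyzing text with StandardAnalyzer.
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        // Merge fewer, larger segments during bulk indexing.
        writer.setMergeFactor(50);
        // Buffer more documents in RAM before flushing a new segment.
        writer.setMaxBufferedDocs(10000);
        return writer;
    }
}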

Maybe there is a different way of attacking this? My goal is to execute a
query and retrieve the matching sentences as efficiently as possible while
maintaining good indexing speed. I would prefer not to have to scan the
complete document to find the sentence in question.
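
Ideally, searching would look something like the sketch below, where each hit
already is a sentence (a minimal sketch assuming the sentence text is indexed
in a "text" field and a "pageId" field points back to the page record, as in
the scheme I describe below; the field names and path are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SentenceSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // illustrative path
        // Parse the user's query against the sentence text field.
        Query query = new QueryParser("text", new StandardAnalyzer()).parse("some query");
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            // Each hit is a single sentence; "pageId" leads back to the page record.
            System.out.println(hits.doc(i).get("pageId") + ": " + hits.doc(i).get("text"));
        }
        searcher.close();
    }
}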

My current solution is to have one Lucene Document for each page (containing
the URL and other information I require) that does NOT contain the text of the
page. I then have one Lucene Document for each sentence within that page, which
contains the text of that particular sentence plus some identifying information
that references the page's own entry.
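
In code, the scheme looks roughly like this (a sketch; the field names and the
helper method are illustrative):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PageIndexer {
    // Adds one stored-only record for the page and one small record per
    // sentence; the shared "pageId" field ties each sentence back to its page.
    public static void addPage(IndexWriter writer, String pageId, String url,
                               String[] sentences) throws IOException {
        Document pageDoc = new Document();
        pageDoc.add(new Field("type", "page", Field.Store.YES, Field.Index.UN_TOKENIZED));
        pageDoc.add(new Field("pageId", pageId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        pageDoc.add(new Field("url", url, Field.Store.YES, Field.Index.NO)); // stored, not searched
        writer.addDocument(pageDoc);

        for (int i = 0; i < sentences.length; i++) {
            Document sentDoc = new Document();
            sentDoc.add(new Field("type", "sentence", Field.Store.YES, Field.Index.UN_TOKENIZED));
            sentDoc.add(new Field("pageId", pageId, Field.Store.YES, Field.Index.UN_TOKENIZED));
            sentDoc.add(new Field("text", sentences[i], Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(sentDoc);
        }
    }
}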

Any and all suggestions are welcome.

Thanks!
Jochen

